<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Property Prediction on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/</link><description>Recent content in Property Prediction on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/index.xml" rel="self" type="application/rss+xml"/><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
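<p>The 80/10/10 corruption scheme above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name, the <code>[MASK]</code> placeholder string, and the dictionary of labels are assumptions.</p>

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """BERT-style masking: of the ~15% of selected positions, 80% become
    [MASK], 10% become a random vocabulary token, and 10% stay unchanged.
    The loss is computed only at the returned label positions."""
    corrupted = list(tokens)
    labels = {}  # position -> original token to recover
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK_TOKEN
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: token left intact, but the model must still predict it
    return corrupted, labels

# Usage on a tokenized SMILES (aspirin):
tokens = ["C", "C", "(", "=", "O", ")", "O", "c", "1", "c", "c", "c", "c", "c", "1"]
corrupted, labels = mask_tokens(tokens, vocab=["C", "c", "O", "(", ")", "1", "="],
                                rng=random.Random(0))
```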
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
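<p>A tokenizer in this style can be written with a single regular expression. The pattern below is the one commonly used in the chemical language modeling literature, not necessarily the paper's exact expression; note that multi-character atoms must appear before single characters in the alternation.</p>

```python
import re

# Bracketed atoms ([nH], [C@H]) and two-letter elements (Cl, Br) are
# matched before single characters; bonds, branches, and ring digits follow.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: every character must be covered by some token.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens
```

For example, <code>tokenize_smiles("CCCl")</code> yields <code>["C", "C", "Cl"]</code> rather than splitting the chlorine into two tokens.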
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and $\sqrt{d_k}$ is a scaling factor. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
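<p>Written out for a single head in NumPy (an illustrative sketch with arbitrary dimensions, not the authors' implementation):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """O = softmax(Q K^T / sqrt(d_k)) V for one attention head.

    Q, K: (seq, d_k); V: (seq, d_v). Returns the output and the
    attention weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights
```

In the full model, the outputs of all heads are concatenated and linearly projected, as described above.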
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
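<p>One plausible construction of that attention mask is sketched below. This is an interpretation, not the paper's code: only task-token-to-task-token attention is blocked here, and whether SMILES positions are also prevented from attending to task tokens is a detail the description leaves open.</p>

```python
import numpy as np

def build_task_token_mask(n_tasks: int, n_smiles: int) -> np.ndarray:
    """Boolean attention mask for a sequence [T0 .. T{n_tasks-1}, s1 .. s_n].

    True = attention allowed. Every position may attend to the SMILES
    tokens and to itself; distinct task tokens are masked from each other,
    so each task head reads the molecule directly without interference
    from other tasks."""
    n = n_tasks + n_smiles
    mask = np.ones((n, n), dtype=bool)
    for i in range(n_tasks):
        for j in range(n_tasks):
            if i != j:
                mask[i, j] = False  # block task-token <-> task-token attention
    return mask
```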
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
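<p>The unweighted multitask objective is simple enough to state directly (a minimal sketch; the dictionary-based interface and binary cross-entropy form are assumptions for illustration):</p>

```python
import numpy as np

def multitask_loss(preds, targets, task_types):
    """Unweighted sum of per-task losses: binary cross-entropy for
    classification heads, mean squared error for regression heads.

    preds, targets: dict task_name -> np.ndarray of predictions/labels.
    task_types: dict task_name -> "classification" or "regression"."""
    total = 0.0
    for task, y_hat in preds.items():
        y = targets[task]
        if task_types[task] == "classification":
            p = np.clip(y_hat, 1e-7, 1 - 1e-7)  # avoid log(0)
            total += -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        else:
            total += np.mean((y_hat - y) ** 2)
    return total
```

Because the per-task losses are simply summed, tasks with larger label scales implicitly receive more weight, a point the paper's limitations discussion touches on.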
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, and their predictions are averaged (fused) into a more robust final prediction.</li>
</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
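<p>SMILES enumeration itself is a short RDKit routine. The sketch below is an illustration under the assumption that RDKit is available; <code>doRandom=True</code> randomizes the atom traversal order, and the retry cap mirrors the paper's repeated-search limit of 100 attempts.</p>

```python
from rdkit import Chem

def enumerate_smiles(smiles: str, n: int, max_tries: int = 100) -> list[str]:
    """Generate up to n distinct random SMILES for one molecule,
    retrying (up to max_tries total draws) when a randomized traversal
    repeats a string already seen."""
    mol = Chem.MolFromSmiles(smiles)
    seen, tries = [], 0
    while len(seen) < n and tries < max_tries:
        s = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
        tries += 1
        if s not in seen:
            seen.append(s)
    return seen

variants = enumerate_smiles("CC(=O)Oc1ccccc1", n=5)  # aspirin
```

Every variant parses back to the same molecule, so the labels of the original compound carry over unchanged.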
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines on every task except CYP2C19-sub and BACE, where it trailed the best baseline by less than 1.1%.</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by 5% to more than 10%.</li>
<li>Improvements were statistically significant (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For the solubility and lipophilicity tasks (LogS, LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acylchloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation. Repeated search up to 100 times if enumerated SMILES is identical to a previous one.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
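<p>The "reduced duplication" compromise (strategy 4) can be made concrete with a small sketch. This reflects my reading of the strategy, with the generation of random SMILES abstracted away into an input list; the cap of $\sqrt{m}$ copies per repeated string is the paper's $f(m)$.</p>

```python
import math
from collections import Counter

def reduce_duplicates(random_smiles: list[str], m: int) -> list[str]:
    """'Reduced duplication': cap each repeated SMILES at sqrt(m) copies,
    a middle ground between keeping all duplicates (with-duplication)
    and keeping one copy of each (without-duplication)."""
    cap = max(1, round(math.sqrt(m)))
    kept, counts = [], Counter()
    for s in random_smiles:
        if counts[s] < cap:
            counts[s] += 1
            kept.append(s)
    return kept
```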
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
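<p>The test-time ensemble above fits in a few lines (a sketch: the model is a stand-in callable, and the aggregation $A$ is the mean, as in the paper):</p>

```python
import statistics

def predict_with_confidence(model, random_smiles: list[str]):
    """Test-time augmentation: predict on k random SMILES of the same
    compound, average them for the final prediction, and report the
    standard deviation of the per-SMILES predictions as a confidence
    measure (high std = low confidence)."""
    per_smiles = [model(s) for s in random_smiles]
    return statistics.mean(per_smiles), statistics.stdev(per_smiles)
```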
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
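<p>Such a confidence curve can be computed as follows (a sketch of the procedure as I understand it; the paper does not publish this exact code):</p>

```python
import numpy as np

def confidence_curve(errors, uncertainties):
    """Mean absolute error of the retained set as compounds are ranked
    by uncertainty. Entry k-1 is the mean error when keeping only the
    k most confident compounds (lowest per-SMILES std)."""
    order = np.argsort(uncertainties)            # most confident first
    sorted_errors = np.asarray(errors, dtype=float)[order]
    return np.cumsum(sorted_errors) / np.arange(1, len(sorted_errors) + 1)
```

A monotonically increasing curve (lower error at the confident end) indicates that the per-SMILES standard deviation is a useful uncertainty signal.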
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and $R^2$ of 0.712, compared to 1.031 RMSE and 0.494 $R^2$ for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 $R^2$) performs comparably, which the authors note without further explanation.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
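<p>As a rough illustration of the encoding step, the sketch below (an illustrative mock-up, not the Maxsmi source) pads a SMILES string to a fixed maximum length and one-hot encodes it over a character vocabulary. In the actual pipeline, augmented variants are first generated with RDKit's random SMILES enumeration (e.g., <code>Chem.MolToSmiles(mol, doRandom=True)</code>); the padding symbol and toy vocabulary here are assumptions.</p>

```python
PAD = "_"  # padding symbol (choice of symbol is an assumption)

def one_hot_encode(smiles, vocab, max_len):
    """One-hot encode a SMILES string, right-padded to max_len.

    `vocab` must contain every character in `smiles` plus PAD.
    Returns a max_len x len(vocab) matrix of 0/1 ints.
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    padded = smiles.ljust(max_len, PAD)
    matrix = []
    for ch in padded:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix

vocab = [PAD, "C", "O", "(", ")", "=", "1", "c"]
enc = one_hot_encode("CC(=O)O", vocab, max_len=10)  # acetic acid
```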
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,; \text{step} / \text{warmup}) / \max(\text{step},; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
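<p>Written out as code, this is a linear warmup followed by 1/step decay with a floor. The helper below is a sketch of the formula with the paper's constants, not the authors' implementation:</p>

```python
def lr_schedule(step, factor=20.0, warmup=16_000, floor=1e-4):
    """Learning rate: linear warmup to factor/warmup, then 1/step decay,
    clipped below at `floor` (constants taken from the paper)."""
    lam = factor * min(1.0, step / warmup) / max(step, warmup)
    return max(lam, floor)
```

<p>The peak rate at the end of warmup is factor/warmup = 1.25 &times; 10<sup>-3</sup>; the 1/step decay reaches the 10<sup>-4</sup> floor at step = factor/floor = 200,000.</p>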
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
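<p>The reason a convolution-plus-GlobalMaxPool head can consume variable-length embeddings is that each filter reduces to a single scalar regardless of sequence length. The toy sketch below (pure Python, one scalar filter; an illustration rather than the TextCNN code) shows that the pooled response to a motif is unchanged when the sequence is padded or the motif shifts position:</p>

```python
def conv1d_global_max(seq, kernel):
    """Valid-mode 1D convolution of a scalar sequence with a single
    filter, followed by global max pooling -- one TextCNN branch,
    in miniature."""
    k = len(kernel)
    responses = [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]
    return max(responses)

motif = [1.0, 1.0]                           # fires on two adjacent 1s
short = [0.0, 1.0, 1.0, 0.0]
long_ = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # same motif, shifted and padded
```

<p>Both sequences yield the same pooled response (2.0), which is what makes the concatenated max-pooled features a fixed-size vector for the downstream dense layers.</p>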
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
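<p>The following is a minimal epsilon-rule LRP step for a single dense layer, written from the conservation equations above rather than from the authors' code; it makes the bias-absorption term explicit. The epsilon stabilizer and the toy weights are assumptions for illustration.</p>

```python
def lrp_dense(x, W, b, R_out, eps=1e-9):
    """Redistribute output relevance R_out to the inputs of a dense layer
    in proportion to the contributions z_ij = x[i] * W[i][j].

    Biases absorb b[j] / z[j] of each unit's relevance, which is why
    sum(R_in) can fall short of sum(R_out)."""
    n_in, n_out = len(x), len(R_out)
    z = [sum(x[i] * W[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]
    return [
        sum(x[i] * W[i][j] / (z[j] + eps) * R_out[j] for j in range(n_out))
        for i in range(n_in)
    ]

x = [1.0, 2.0]
W = [[0.5, -0.2], [0.3, 0.4]]
b = [0.1, 0.0]
z = [1.2, 0.6]          # pre-activations, used here as the output relevance
R_in = lrp_dense(x, W, b, R_out=z)
```

<p>Here sum(R_in) = 1.7 while sum(R_out) = 1.8; the missing 0.1 is exactly the bias of the first unit, matching the conservation-with-bias equation above.</p>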
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight under 12,600, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
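<p>For reference, the coefficient of determination from the formula above can be computed directly (a generic implementation, equivalent to what scikit-learn's <code>r2_score</code> provides):</p>

```python
def r_squared(y_true, y_pred):
    """r^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

<p>Perfect predictions give 1.0, and always predicting the mean of the targets gives 0.0.</p>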
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES2Vec: Interpretable Chemical Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/smiles2vec-interpretable-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/smiles2vec-interpretable-property-prediction/</guid><description>SMILES2Vec uses a Bayesian-optimized CNN-GRU architecture to predict chemical properties directly from SMILES strings with an interpretable explanation mask.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-rnn-for-chemical-property-prediction-from-smiles">A General-Purpose RNN for Chemical Property Prediction from SMILES</h2>
<p>SMILES2Vec is a <strong>Method</strong> paper that introduces a deep recurrent neural network architecture for predicting chemical properties directly from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> text representations. The primary contributions are: (1) a Bayesian-optimized CNN-<a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> architecture that serves as a general-purpose predictor for diverse chemical properties (toxicity, activity, solubility, <a href="https://en.wikipedia.org/wiki/Solvation">solvation</a> energy), (2) an explanation mask mechanism that provides interpretable predictions by identifying which SMILES characters drive the network&rsquo;s decisions, and (3) evidence that representation learning from raw SMILES can match or outperform models using hand-crafted molecular descriptors.</p>
<h2 id="motivation-beyond-engineered-features-in-chemical-modeling">Motivation: Beyond Engineered Features in Chemical Modeling</h2>
<p>At the time of writing (2017), deep learning models in chemistry relied heavily on engineered <a href="https://en.wikipedia.org/wiki/Molecular_descriptor">molecular descriptors</a> and fingerprints as input features. Over 5,000 molecular descriptors had been developed since the late 1940s, and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a>/QSPR modeling remained the dominant paradigm. The authors identified two key limitations with this approach:</p>
<ol>
<li><strong>Restricted search space</strong>: Engineered features limit the neural network&rsquo;s ability to discover potentially useful representations that domain experts have not anticipated.</li>
<li><strong>Incomplete domain knowledge</strong>: For complex properties where first-principles understanding is incomplete, the lack of appropriate descriptors constrains model performance.</li>
</ol>
<p>In contrast, computer vision and NLP had shown that deep learning models trained on raw data (unaltered images, raw text) could learn powerful representations without feature engineering. The chemical SMILES notation, a text-based encoding of molecular structure that serves as the standard interchange format in cheminformatics, provided a natural analog to text data for NLP-style modeling.</p>
<p>A secondary motivation was interpretability. Most ML and DL models for chemistry operated as black boxes, which posed particular problems for regulated applications like FDA drug approval where mechanistic explanations are required.</p>
<h2 id="core-innovation-cnn-gru-architecture-with-explanation-masks">Core Innovation: CNN-GRU Architecture with Explanation Masks</h2>
<h3 id="architecture-design-via-bayesian-optimization">Architecture Design via <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a></h3>
<p>SMILES2Vec treats SMILES strings as character-level text input. The network processes one-hot encoded characters (padded to length 250, covering 99.9% of the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database) through three stages:</p>
<ol>
<li><strong>Embedding layer</strong>: Maps one-hot character vectors to a learned embedding space (size 50)</li>
<li><strong>1D convolutional layer</strong>: 192 filters with kernel size 3, stride 1</li>
<li><strong>Bidirectional GRU layers</strong>: Two layers with 224 and 384 units respectively</li>
</ol>
<p>The authors explored four architectural classes (GRU, LSTM, CNN-GRU, CNN-LSTM) using Bayesian optimization via SigOpt. Each class was evaluated over 60 trials, optimizing embedding size, convolutional filter count, and RNN layer widths. The CNN-GRU class was selected as the best compromise: CNN-LSTM performed best on classification (Tox21), while GRU-based networks excelled at regression (FreeSolv). The final architecture is summarized by the hyperparameters:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding</td>
          <td>Size</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Conv1D</td>
          <td>Filters</td>
          <td>192</td>
      </tr>
      <tr>
          <td>BiGRU Layer 1</td>
          <td>Units</td>
          <td>224</td>
      </tr>
      <tr>
          <td>BiGRU Layer 2</td>
          <td>Units</td>
          <td>384</td>
      </tr>
  </tbody>
</table>
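<p>As a concrete illustration of the input pipeline, the sketch below one-hot encodes a SMILES string padded to length 250, as the paper describes; the vocabulary here is a hypothetical subset, not the paper's full ChEMBL character set.</p>

```python
import numpy as np

def one_hot_smiles(smiles, vocab, max_len=250):
    """One-hot encode a SMILES string, padded with zeros to max_len."""
    char_to_idx = {c: i for i, c in enumerate(vocab)}
    encoded = np.zeros((max_len, len(vocab)), dtype=np.float32)
    for pos, char in enumerate(smiles[:max_len]):
        encoded[pos, char_to_idx[char]] = 1.0
    return encoded  # positions beyond len(smiles) stay all-zero (padding)

# Illustrative vocabulary only; the paper's character set covers 99.9% of ChEMBL.
vocab = list("CNOclnos()=#123456789[]+-@HSFPBI")
x = one_hot_smiles("CCO", vocab)  # ethanol
```

The resulting `(250, vocab_size)` array is what the embedding layer consumes, mapping each one-hot row to a 50-dimensional learned vector.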
<h3 id="explanation-mask-for-interpretability">Explanation Mask for Interpretability</h3>
<p>The explanation mask is a post-hoc interpretability mechanism. Given a trained (frozen) SMILES2Vec base model, a separate explanation network learns to produce a per-character mask over the input SMILES string. The mask is trained to preserve the base model&rsquo;s output while masking as much input as possible. The loss function for a single sample is:</p>
<p>$$
\text{Loss}_i = \| f(\text{SMILES}_i, \theta) - \text{Sol}(\text{SMILES}_i) \|_2 + 10^{-6} \, \| \text{MASK}_i \|_2 + 0.05 \, H(\text{MASK}_i)
$$</p>
<p>where $f(\text{SMILES}_i, \theta)$ is the base network prediction, $\text{Sol}(\text{SMILES}_i)$ is the ground truth solubility, $H$ is the entropy of the normalized mask, and $\text{MASK}_i$ is the per-character mask vector. The norm penalty on the mask encourages suppressing as much of the input as possible, while the entropy term penalizes uniform attention distributions.</p>
<p>The explanation network itself is a 20-layer residual network with SELU activations, ending in a 1D convolution of length 1, batch normalization, and a softplus activation. The softplus output ranges from 0 (fully masked) to infinity (amplified attention), allowing the mask to both suppress and emphasize specific SMILES characters.</p>
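<p>A minimal numpy sketch of the mask loss above, assuming a scalar base-model prediction; <code>f_pred</code> and <code>sol_true</code> stand in for the frozen base network's output and the ground-truth solubility, and the mask values are invented for illustration.</p>

```python
import numpy as np

def mask_loss(f_pred, sol_true, mask, norm_coef=1e-6, ent_coef=0.05):
    """Per-sample explanation-mask loss: fidelity + mask norm + entropy."""
    fidelity = np.abs(f_pred - sol_true)       # preserve the base model's output
    mask_norm = np.linalg.norm(mask, ord=2)    # discourage large masks
    p = mask / mask.sum()                      # normalize mask to a distribution
    entropy = -np.sum(p * np.log(p + 1e-12))   # penalize uniform attention
    return fidelity + norm_coef * mask_norm + ent_coef * entropy

mask = np.array([0.1, 2.0, 0.1, 0.1])  # hypothetical mask over 4 SMILES characters
loss = mask_loss(f_pred=-3.1, sol_true=-3.0, mask=mask)
```

A uniform mask incurs a higher entropy penalty than a peaked one, which is what pushes the explanation network toward attending to a few decisive characters.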
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The model was evaluated on four datasets from the <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark and the ESOL solubility dataset:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Property</th>
          <th>Task</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>Toxicity</td>
          <td>Multi-task classification</td>
          <td>8,014</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Activity</td>
          <td>Single-task classification</td>
          <td>41,193</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Solvation energy</td>
          <td>Single-task regression</td>
          <td>643</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Solubility</td>
          <td>Single-task regression</td>
          <td>1,128</td>
      </tr>
  </tbody>
</table>
<p>SMILES strings longer than 250 characters were excluded. Classification datasets (Tox21, HIV) used a 1/6 test split with minority-class oversampling; regression datasets (FreeSolv, ESOL) used a 1/10 test split. All experiments used 5-fold cross-validation.</p>
<h3 id="training-protocol">Training Protocol</h3>
<ul>
<li><strong>Optimizer</strong>: RMSprop with learning rate $10^{-3}$, $\rho = 0.9$, $\epsilon = 10^{-8}$</li>
<li><strong>Batch size</strong>: 32</li>
<li><strong>Epochs</strong>: 250 with early stopping (patience of 25 epochs based on validation loss)</li>
<li><strong>Classification loss</strong>: Binary cross-entropy</li>
<li><strong>Regression loss</strong>: Mean absolute error</li>
<li><strong>Metrics</strong>: AUC for classification, RMSE for regression</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>SMILES2Vec was compared against:</p>
<ul>
<li><strong>MLP with engineered features</strong>: Standard multi-layer perceptron using molecular fingerprints (from MoleculeNet)</li>
<li><strong>Molecular graph convolutions</strong>: Graph-based neural network from MoleculeNet</li>
<li><strong>Chemception</strong>: CNN operating on 2D chemical images</li>
</ul>
<h3 id="bayesian-optimization-protocol">Bayesian Optimization Protocol</h3>
<p>Only two datasets were used for architecture optimization: the nr-ahr toxicity task from Tox21 (classification) and FreeSolv (regression). The remaining datasets (full Tox21, HIV, ESOL) served purely for generalization evaluation. A fixed test set was held out during optimization, and the correlation between validation and test metrics (0.54 for Tox21, 0.78 for FreeSolv) suggested limited overfitting to the validation set.</p>
<h2 id="results-competitive-accuracy-with-interpretable-predictions">Results: Competitive Accuracy with Interpretable Predictions</h2>
<h3 id="property-prediction-performance">Property Prediction Performance</h3>
<p>SMILES2Vec achieved the following validation metrics (with a pre-training approach from ChemNet improving performance slightly):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SMILES2Vec</th>
          <th>SMILES2Vec + Pre-training</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>AUC</td>
          <td>0.80</td>
          <td>0.81</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>AUC</td>
          <td>0.78</td>
          <td>0.80</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE (kcal/mol)</td>
          <td>1.4</td>
          <td>1.2</td>
          <td>1.3</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.63</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Exact numbers for MLP and Chemception baselines were reported only in a bar chart (Figure 6) and not as precise values. The paper states that MLP with fingerprints performed worst across all tasks, and Chemception fell between MLP and the graph/SMILES methods.</p>
<p>Key findings:</p>
<ul>
<li>SMILES2Vec outperformed MLP models using engineered features across all tasks, despite using no feature engineering.</li>
<li>Against graph convolutions (the state-of-the-art at the time), SMILES2Vec matched on classification (Tox21: 0.81 vs 0.81, HIV: 0.80 vs 0.80) and outperformed on regression (FreeSolv: 1.2 vs 1.3).</li>
<li>SMILES2Vec outperformed Chemception (2D image CNN) on classification tasks but slightly underperformed on regression, which the authors attributed to SMILES lacking explicit atomic number information.</li>
</ul>
<h3 id="interpretability-evaluation">Interpretability Evaluation</h3>
<p>On the ESOL solubility dataset, the explanation mask was evaluated against first-principles chemical knowledge. The authors separated compounds into soluble (log solubility &gt; 1.0) and insoluble (log solubility &lt; -5.0) categories and defined ground truth: soluble compounds should attend to hydrophilic atoms (O, N) while insoluble compounds should attend to hydrophobic atoms (C, F, Cl, Br, I). The top-3 character accuracy was 88%, confirming that SMILES2Vec learned representations consistent with known functional group chemistry.</p>
<p>Qualitative analysis of the masks showed that for low-solubility molecules, characters corresponding to hydrophobic groups (c, C, Cl) received high attention, while high-solubility molecules showed attention focused on hydrophilic groups (O, N).</p>
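<p>The top-3 evaluation described above reduces to a simple check; this sketch (with an invented mask and target atom sets) counts a hit when any of the three highest-attention characters belongs to the expected hydrophilic or hydrophobic set.</p>

```python
def top3_hit(smiles, mask, target_chars):
    """True if any of the 3 highest-mask characters is in the expected atom set."""
    order = sorted(range(len(smiles)), key=lambda i: mask[i], reverse=True)
    return any(smiles[i] in target_chars for i in order[:3])

# Hypothetical soluble molecule: mask concentrates on the hydrophilic O and N.
hit = top3_hit("CCON", [0.1, 0.2, 0.9, 0.8], {"O", "N"})
# Hydrophobic-only string with no mass on hydrophilic atoms: no hit.
miss = top3_hit("CCCC", [0.4, 0.3, 0.2, 0.1], {"O", "N"})
```

The reported 88% is the fraction of evaluated compounds for which this check succeeds.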
<h3 id="limitations">Limitations</h3>
<ul>
<li>The interpretability evaluation was limited to solubility, a well-understood property with simple first-principles rules. The authors acknowledged that quantifying interpretability for complex properties (toxicity, activity) where no simple ground truth exists is nontrivial.</li>
<li>The Bayesian optimization used only a subset of datasets, so the architecture may not be globally optimal across all chemical tasks.</li>
<li>SMILES strings lack explicit atomic number information, which may limit performance on physical property prediction compared to image or graph representations.</li>
<li>The explanation mask approach requires training a separate 20-layer network per property, adding computational overhead.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Architecture optimization</td>
          <td>Tox21 (nr-ahr task)</td>
          <td>8,014</td>
          <td>Single toxicity task for Bayesian optimization</td>
      </tr>
      <tr>
          <td>Architecture optimization</td>
          <td>FreeSolv</td>
          <td>643</td>
          <td>Solvation free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21 (full, 12 tasks)</td>
          <td>8,014</td>
          <td>Multi-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,193</td>
          <td>Single-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Solubility regression, also used for interpretability</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet. The ESOL dataset is from Delaney (2004).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Bayesian optimization via SigOpt (60 trials per architectural class, 4 classes, 6 manually seeded initial designs per class)</li>
<li>RMSprop optimizer with standard settings</li>
<li>Explanation mask trained with Adam, learning rate annealed from $10^{-2}$ to $10^{-6}$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Final architecture: Embedding(50) -&gt; Conv1D(192, kernel=3, stride=1) -&gt; BiGRU(224) -&gt; BiGRU(384)</li>
<li>Explanation network: 20-layer residual network with SELU activations</li>
<li>No pre-trained weights or code were released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC</td>
          <td>Tox21</td>
          <td>0.81</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>HIV</td>
          <td>0.80</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>FreeSolv</td>
          <td>1.2 kcal/mol</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>ESOL</td>
          <td>0.63</td>
          <td>Base model</td>
      </tr>
      <tr>
          <td>Top-3 accuracy</td>
          <td>ESOL interpretability</td>
          <td>88%</td>
          <td>Explanation mask</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The authors report using TensorFlow with GPU acceleration via NVIDIA cuDNN libraries. Specific GPU models and training times were not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No code, models, or data artifacts were released by the authors. The datasets used are publicly available through MoleculeNet.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Goh, G. B., Hodas, N. O., Siegel, C., &amp; Vishnu, A. (2017). SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. <em>arXiv preprint arXiv:1712.02034</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{goh2017smiles2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Goh, Garrett B. and Hodas, Nathan O. and Siegel, Charles and Vishnu, Abhinav}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.02034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1712.02034}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolPMoFiT: Inductive Transfer Learning for QSAR</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/</guid><description>MolPMoFiT adapts ULMFiT for QSAR by pre-training an LSTM language model on 1M ChEMBL SMILES and fine-tuning on small molecular property datasets.</description><content:encoded><![CDATA[<h2 id="transfer-learning-meets-molecular-property-prediction">Transfer Learning Meets Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces MolPMoFiT (Molecular Prediction Model Fine-Tuning), a transfer learning approach for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSPR/QSAR</a> modeling. The primary contribution is adapting the ULMFiT framework from NLP to molecular property prediction by treating <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a> as a chemical language. A general-purpose molecular structure prediction model (MSPM) is pre-trained on one million <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> molecules via self-supervised next-token prediction, then fine-tuned for specific QSAR endpoints. The approach achieves competitive or superior results to graph neural networks and descriptor-based methods across four benchmark datasets, with particular benefits for small datasets.</p>
<h2 id="the-small-data-problem-in-qsar-modeling">The Small Data Problem in QSAR Modeling</h2>
<p>Deep learning models for molecular property prediction typically require large labeled training sets to learn useful structural representations. While methods like graph convolutional neural networks and SMILES-based models have achieved strong results on well-studied endpoints, they must be trained from scratch for each new task. This poses a particular challenge for the small labeled datasets that remain common in drug discovery for specialized endpoints like <a href="https://en.wikipedia.org/wiki/Allosteric_regulation">allosteric inhibition</a>, renal clearance, and inhibitor residence times.</p>
<p>Transfer learning had already shown transformative impact in computer vision (ImageNet pre-training) and NLP (ELMo, BERT, ULMFiT). In chemistry, prior transfer learning efforts included ChemNet (supervised pre-training on computed descriptors), <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/mol2vec-unsupervised-chemical-intuition/">Mol2vec</a> (unsupervised substructure embeddings), and pre-trained graph neural networks. However, a systematic application of the ULMFiT self-supervised pre-training pipeline to SMILES-based molecular models had not been explored. MolPMoFiT fills this gap by treating the vast corpus of unlabeled molecular structures as a self-supervised training signal, analogous to how language models learn from unlabeled text.</p>
<h2 id="core-innovation-ulmfit-adapted-for-smiles">Core Innovation: ULMFiT Adapted for SMILES</h2>
<p>MolPMoFiT adapts ULMFiT&rsquo;s three-stage transfer learning pipeline to molecular property prediction:</p>
<p><strong>Stage 1: General-Domain MSPM Pre-training.</strong> A molecular structure prediction model is trained on one million curated ChEMBL molecules to predict the next token in a SMILES string. This is purely self-supervised: the SMILES string provides its own labels. The model learns general chemical syntax and structural patterns.</p>
<p><strong>Stage 2: Task-Specific MSPM Fine-tuning (Optional).</strong> The general MSPM is further fine-tuned on the unlabeled SMILES of the target task dataset. This adapts the language model to the specific chemical distribution of interest (e.g., HIV inhibitors vs. general bioactive molecules). Discriminative fine-tuning adjusts learning rates per layer:</p>
<p>$$\eta^{layer-1} = \eta^{layer} / 2.6$$</p>
<p>where higher layers (containing more task-specific features) receive higher learning rates.</p>
<p><strong>Stage 3: QSAR/QSPR Model Fine-tuning.</strong> The embedding and encoder weights from the pre-trained MSPM are transferred to a new model with a task-specific classifier head. Fine-tuning uses three key techniques from ULMFiT:</p>
<ul>
<li><strong>Discriminative fine-tuning</strong>: Different learning rates per layer group</li>
<li><strong>Gradual unfreezing</strong>: Layers are unfrozen sequentially (classifier first, then progressively deeper LSTM layers)</li>
<li><strong>One cycle policy</strong>: Learning rate scheduling following Smith&rsquo;s approach</li>
</ul>
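<p>The discriminative learning-rate rule can be made concrete; this pure-Python sketch (not the authors' fastai code) builds the per-group rates by dividing by 2.6 at each step down the network, so the classifier head trains fastest and the deepest, most general layers train slowest.</p>

```python
def discriminative_lrs(base_lr, n_groups, factor=2.6):
    """Per-layer-group learning rates following eta^{layer-1} = eta^{layer} / 2.6."""
    # Group 0 is the deepest (most general) layers; the last group is the head.
    return [base_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(base_lr=3e-2, n_groups=4)
# head gets 3e-2; the deepest group gets 3e-2 / 2.6**3
```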
<p>The model architecture is AWD-LSTM (ASGD Weight-Dropped LSTM) with an embedding dimension of 400, three LSTM layers with 1152 hidden units, and dropouts applied at every layer (embedding, input, weights, hidden). The QSAR classifier concatenates max pooling, mean pooling, and the last hidden state $h_T$ from the final LSTM layer, feeding this into two feedforward layers.</p>
<p><strong>SMILES Augmentation.</strong> Since multiple valid SMILES can represent the same molecule through different atom orderings, the authors use <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES enumeration</a> as data augmentation. For regression tasks, Gaussian noise ($\sigma_{noise}$) is added to labels of augmented SMILES to simulate experimental error. Test-time augmentation (TTA) averages predictions across the canonical SMILES and four randomized SMILES.</p>
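<p>The label-noise augmentation and TTA steps can be sketched in a few lines of numpy; <code>predict</code> here is a toy stand-in scorer, and a real pipeline would generate the SMILES variants with an enumeration tool such as RDKit.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_labels(y, n_aug, sigma_noise):
    """Replicate each regression label n_aug times with Gaussian noise,
    simulating experimental error on the augmented SMILES."""
    return np.concatenate([y_i + rng.normal(0.0, sigma_noise, size=n_aug) for y_i in y])

def tta_predict(predict, smiles_variants):
    """Test-time augmentation: average predictions over the canonical
    SMILES plus randomized variants of the same molecule."""
    return float(np.mean([predict(s) for s in smiles_variants]))

# Toy predictor keyed on string length, for illustration only.
pred = tta_predict(lambda s: float(len(s)), ["CCO", "OCC", "C(C)O"])
y_aug = augment_labels(np.array([-3.0, 1.2]), n_aug=25, sigma_noise=0.3)
```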
<h2 id="benchmarks-across-four-qsar-datasets">Benchmarks Across Four QSAR Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,200</td>
          <td>Regression (logD)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Solvation">solvation energy</a>)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>Classification (replication inhibition)</td>
          <td>AUROC</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>)</td>
          <td>AUROC</td>
      </tr>
  </tbody>
</table>
<p>All datasets use the same 10 random 80:10:10 splits from <a href="/notes/computational-chemistry/benchmark-problems/systematic-study-molecular-property-prediction/">Yang et al. (2019)</a> for fair comparison. Both random and scaffold splits were evaluated, with scaffold splits representing a more realistic test of generalization to novel chemical scaffolds.</p>
<h3 id="baselines">Baselines</h3>
<p>Models were compared against results reported by Yang et al. (2019): directed message passing neural network (D-MPNN), D-MPNN with RDKit features, random forest on Morgan fingerprints, feed-forward networks on Morgan fingerprints, and feed-forward networks on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> descriptors.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>The same set of fine-tuning hyperparameters was used across all four tasks (tuned on the HIV dataset):</p>
<table>
  <thead>
      <tr>
          <th>Layer Group</th>
          <th>Base Learning Rate</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Linear head only</td>
          <td>3e-2</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final LSTM layer</td>
          <td>5e-3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final two LSTM layers</td>
          <td>5e-4</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Full model</td>
          <td>5e-5</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
<p>Data augmentation settings were task-specific: lipophilicity training SMILES augmented 25x ($\sigma_{noise} = 0.3$); FreeSolv augmented 50x ($\sigma_{noise} = 0.5$); HIV active class augmented 60x and inactive 2x; BBBP positive class 10x and negative 30x.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="benchmark-results">Benchmark Results</h3>
<p><strong>Lipophilicity (random split):</strong> MolPMoFiT achieved RMSE of $0.565 \pm 0.037$ with TTA and $0.625 \pm 0.032$ without, outperforming D-MPNN and other baselines.</p>
<p><strong>FreeSolv (random split):</strong> RMSE of $1.197 \pm 0.127$ with TTA. The small dataset size (642 compounds) led to high variance across splits.</p>
<p><strong>BBBP (random split):</strong> AUROC of $0.950 \pm 0.020$, outperforming all comparison models. Task-specific MSPM fine-tuning showed no clear benefit over the general MSPM.</p>
<p><strong>HIV (random split):</strong> General MolPMoFiT achieved AUROC of $0.828 \pm 0.029$ with TTA. Task-specific fine-tuning yielded a slightly higher $0.834 \pm 0.025$ with TTA.</p>
<p>Scaffold splits consistently produced lower performance than random splits across all datasets, as expected for out-of-distribution generalization.</p>
<h3 id="transfer-learning-impact">Transfer Learning Impact</h3>
<p>Across all four datasets and varying training set sizes, MolPMoFiT consistently outperformed models trained from scratch with the same architecture. The improvement was most pronounced at smaller training set sizes, confirming the utility of pre-trained representations for low-data regimes.</p>
<h3 id="smiles-augmentation-analysis">SMILES Augmentation Analysis</h3>
<p>Training data augmentation provided significant improvements across all tasks. For classification (HIV, BBBP), augmentation improved performance regardless of whether class re-balancing was applied. For regression (lipophilicity, FreeSolv), both SMILES augmentation and label noise were beneficial, with optimal noise levels varying by dataset.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note a fundamental limitation: the model learns mappings from individual SMILES strings to properties rather than from molecular structures to properties. SMILES augmentation acts as a regularization technique to mitigate this, making the model more robust to different SMILES representations of the same molecule. The task-specific MSPM fine-tuning stage did not consistently improve results, requiring further investigation. All hyperparameters were tuned on one dataset (HIV) and applied uniformly, which may not be optimal for all endpoints.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (curated)</td>
          <td>1M molecules</td>
          <td>Filtered: no mixtures, max 50 heavy atoms, standardized with MolVS, canonized with RDKit</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>MoleculeNet benchmark</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>AWD-LSTM architecture with embedding dim 400, three LSTM layers (1152 hidden units), dropouts at all layers</li>
<li>ULMFiT fine-tuning: discriminative learning rates ($\eta^{layer-1} = \eta^{layer}/2.6$), gradual unfreezing, one cycle policy</li>
<li>SMILES character-level tokenization with special handling for two-character tokens (Cl, Br) and bracket-enclosed tokens</li>
<li>SMILES enumeration for data augmentation with optional Gaussian label noise for regression</li>
</ul>
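<p>The tokenization rule above can be expressed as a short regex; this is a sketch of the scheme described (bracketed atoms and the two-character elements Cl/Br kept as single tokens), not the authors' exact implementation.</p>

```python
import re

# Alternation order matters: bracketed atoms first, then Cl/Br,
# then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|.")

def tokenize(smiles):
    """Split a SMILES string into character-level tokens with multi-char atoms intact."""
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize("C[C@H](Cl)Br")
```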
<h3 id="models">Models</h3>
<ul>
<li>General-domain MSPM pre-trained on 1M ChEMBL molecules (10 epochs)</li>
<li>Task-specific MSPMs fine-tuned per dataset (optional stage)</li>
<li>QSAR models fine-tuned with transferred embeddings and encoder</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Split</th>
          <th>Metric</th>
          <th>MolPMoFiT (TTA)</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lipophilicity</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$0.565 \pm 0.037$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$0.635 \pm 0.031$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$1.197 \pm 0.127$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$2.082 \pm 0.460$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.950 \pm 0.020$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.931 \pm 0.025$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.828 \pm 0.029$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.816 \pm 0.022$</td>
          <td>D-MPNN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P4000 GPU (single GPU)</li>
<li>General-domain MSPM pre-training: approximately 1 day</li>
<li>Pre-training needs to be done only once; fine-tuning is fast per task</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>PyTorch + fastai v1 implementation with curated datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2020). Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. <em>Journal of Cheminformatics</em>, 12, 27. <a href="https://doi.org/10.1186/s13321-020-00430-x">https://doi.org/10.1186/s13321-020-00430-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2020molpmofit,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00430-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (T5-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Discarding the decoder roughly halves the network (from ~60M to ~37M parameters), freeing memory to process longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
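<p>Concretely, the prediction path reduces to a linear map applied to the encoder embedding of the first token. A minimal numpy sketch of the shapes involved (the encoder output and head weights here are random placeholders; in LLM-Prop they come from the fine-tuned T5-small encoder and a jointly trained head):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the T5-small encoder output: one d_model-dim vector per token.
# In LLM-Prop the first position holds the prepended [CLS] token.
seq_len, d_model = 888, 512
encoder_out = rng.normal(size=(seq_len, d_model))

# Linear regression head (weights illustrative; learned during fine-tuning).
W = rng.normal(size=(d_model, 1)) / np.sqrt(d_model)
b = np.zeros(1)

cls_embedding = encoder_out[0]      # embedding of the prepended [CLS] token
prediction = cls_embedding @ W + b  # scalar property, e.g. band gap in eV
print(prediction.shape)             # (1,)
```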
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
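<p>The three schemes are one-liners in practice; a small numpy sketch (function names are ours, not from the paper):</p>

```python
import numpy as np

def z_score(y):
    # (Y_i - mu) / sigma
    return (y - y.mean()) / y.std()

def min_max(y):
    # (Y_i - Y_min) / (Y_max - Y_min), scaled into [0, 1]
    return (y - y.min()) / (y.max() - y.min())

def log_norm(y):
    # log(Y_i + 1), useful for heavy-tailed targets such as unit cell volume
    return np.log(y + 1.0)

band_gaps = np.array([0.0, 1.2, 3.4, 5.6])  # illustrative targets in eV
print(min_max(band_gaps))                   # scaled into [0, 1]
```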
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
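<p>The numerical-token replacement can be sketched as a regex pass (illustrative only; the patterns in real Robocrystallographer output are more varied than these):</p>

```python
import re

def mask_quantities(description):
    """Replace bond angles with [ANG] and bond distances with [NUM]."""
    # Angles first, so they are not half-consumed by the distance rule.
    description = re.sub(r"\d+(\.\d+)?\s*(degrees|°)", "[ANG]", description)
    description = re.sub(r"\d+(\.\d+)?\s*(Å|angstroms?)", "[NUM]", description)
    return description

text = "Si is bonded to four O atoms at 1.61 Å; O-Si-O angles are 109.5 degrees."
print(mask_quantities(text))
# Si is bonded to four O atoms at [NUM]; O-Si-O angles are [ANG].
```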
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (Å³/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline (ALIGNN) by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction and 65% on volume prediction, and beat the strongest GNN classifier (CGCNN) by about 3% AUC on band gap classification (Is-gap-direct). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained an advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Runs</strong>: each model evaluated 5 times on the test set; average MAE reported</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Perplexity for Molecule Ranking and CLM Bias Detection</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/perplexity-molecule-ranking-bias-clms/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/perplexity-molecule-ranking-bias-clms/</guid><description>Perplexity scoring enables intrinsic molecule ranking and pretraining bias detection in chemical language models for de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-method-for-intrinsic-scoring-and-bias-detection-in-chemical-language-models">A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces two contributions to the chemical language model (CLM) pipeline for <a href="/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/clms-de-novo-drug-design-review/">de novo molecular design</a>. First, the authors propose using perplexity as a model-intrinsic score to rank generated <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a &ldquo;delta score&rdquo; that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.</p>
<h2 id="the-ranking-and-bias-problem-in-clm-based-molecule-generation">The Ranking and Bias Problem in CLM-Based Molecule Generation</h2>
<p>Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">transfer learning</a> (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce &ldquo;pretraining bias,&rdquo; where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.</p>
<p>Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.</p>
<h2 id="perplexity-scoring-and-the-delta-score-for-bias-estimation">Perplexity Scoring and the Delta Score for Bias Estimation</h2>
<p>The core innovation is the application of <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:</p>
<p>$$
\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_{2}(p_{i})}
$$</p>
<p>Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.</p>
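<p>The score takes only a few lines to compute from the per-character probabilities (a sketch; the function name is ours, using base-2 logarithms so that a uniform per-character probability $p$ yields a perplexity of $1/p$):</p>

```python
import math

def perplexity(char_probs):
    """Length-normalized perplexity of one generated SMILES string.

    char_probs: the probability the CLM assigned to each character,
    in generation order.
    """
    n = len(char_probs)
    avg_log2 = sum(math.log2(p) for p in char_probs) / n
    return 2.0 ** (-avg_log2)

print(perplexity([0.9] * 20))  # ≈ 1.11: a confident model scores low
print(perplexity([0.5] * 20))  # 2.0: less confident, higher perplexity
```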
<p>To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):</p>
<p>$$
\text{delta} = \text{rank}_{ft} - \text{rank}_{pt}
$$</p>
<p>A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.</p>
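<p>A minimal sketch of the delta score (names are ours; ranks are assigned so that a lower-perplexity molecule receives a higher rank value, matching the sign convention above where positive delta means the fine-tuned model favors the molecule):</p>

```python
def rank_by_perplexity(ppl):
    """Rank value per molecule: lowest perplexity gets the highest rank."""
    order = sorted(range(len(ppl)), key=lambda i: ppl[i], reverse=True)
    ranks = [0] * len(ppl)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def delta_scores(ppl_finetuned, ppl_pretrained):
    """delta = rank_ft - rank_pt per molecule; negative flags pretraining bias."""
    rf = rank_by_perplexity(ppl_finetuned)
    rp = rank_by_perplexity(ppl_pretrained)
    return [f - p for f, p in zip(rf, rp)]

# Molecule 0: favored by the fine-tuned model (positive delta).
# Molecule 2: favored by the pretrained model (negative delta, bias flag).
print(delta_scores([1.1, 2.0, 3.0], [3.0, 2.0, 1.1]))  # [2, 0, -2]
```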
<p>The multinomial sampling probability for each character is computed via the softmax function:</p>
<p>$$
p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}}
$$</p>
<p>where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).</p>
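<p>As a quick sketch of the sampling distribution (a generic temperature softmax, not code from the paper):</p>

```python
import numpy as np

def char_probs(logits, T=1.0):
    """Softmax with temperature over the CLM's output logits.

    T = 1 matches the study's setting; T < 1 sharpens the
    distribution toward the argmax, T > 1 flattens it.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]   # one logit per dictionary character
p = char_probs(logits)
print(round(p.sum(), 12))  # 1.0
```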
<h2 id="experimental-setup-10-protein-targets-across-four-data-regimes">Experimental Setup: 10 Protein Targets Across Four Data Regimes</h2>
<p>The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).</p>
<p><strong>Model architecture</strong>: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.</p>
<p><strong>Pretraining</strong>: The model was pretrained on 1,683,181 molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.</p>
<p><strong>Fine-tuning</strong>: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL &gt; 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).</p>
<table>
  <thead>
      <tr>
          <th>CHEMBL ID</th>
          <th>Target</th>
          <th>Protein Classification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CHEMBL1836</td>
          <td>Prostanoid EP4 receptor</td>
          <td><a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">G protein-coupled receptor</a></td>
      </tr>
      <tr>
          <td>CHEMBL1945</td>
          <td>Melatonin receptor 1A</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL1983</td>
          <td>Serotonin 1D (5-HT1D) receptor</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL202</td>
          <td><a href="https://en.wikipedia.org/wiki/Dihydrofolate_reductase">Dihydrofolate reductase</a></td>
          <td>Oxidoreductase</td>
      </tr>
      <tr>
          <td>CHEMBL3522</td>
          <td><a href="https://en.wikipedia.org/wiki/Cytochrome_P450">Cytochrome P450</a> 17A1</td>
          <td>Cytochrome P450</td>
      </tr>
      <tr>
          <td>CHEMBL4029</td>
          <td>Interleukin-8 receptor A</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL5073</td>
          <td>CaM kinase I delta</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5137</td>
          <td>Metabotropic glutamate receptor 2</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL5408</td>
          <td>Serine/threonine-protein kinase TBK1</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5608</td>
          <td>NT-3 growth factor receptor</td>
          <td>Kinase</td>
      </tr>
  </tbody>
</table>
<p><strong>Sampling comparison</strong>: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.</p>
<p><strong>Molecular similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> was computed using Morgan fingerprints (radius 2, length 1024) and 2D <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprints via RDKit (2019.03.2).</p>
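<p>For reference, the Tanimoto coefficient reduces to a set operation over the on-bits of two fingerprints; a library-free sketch (the study itself uses RDKit fingerprints, not python sets):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

a = {1, 5, 9, 42}   # on-bit indices of one Morgan fingerprint
b = {1, 5, 10}
print(tanimoto(a, b))  # 2 shared bits / 5 total bits = 0.4
```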
<h2 id="key-findings-multinomial-sampling-outperforms-beam-search">Key Findings: Multinomial Sampling Outperforms Beam Search</h2>
<p><strong>Perplexity correlates with molecular similarity.</strong> The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.</p>
<p><strong>Multinomial sampling produces better-ranked molecules than beam search.</strong> With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.</p>
<p><strong>Perplexity scoring narrows the quality distribution.</strong> The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.</p>
<p><strong>Pretraining bias is substantial.</strong> The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect &ldquo;generic&rdquo; pretraining rather than task-focused fine-tuning.</p>
<p><strong>Perplexity alone partially mitigates bias.</strong> Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.</p>
<p><strong>SMILES validity remained high.</strong> Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/">SMILES augmentation</a> remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v28</td>
          <td>1,683,181 molecules</td>
          <td>Canonical SMILES, 20-90 characters, salts and duplicates removed</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>ChEMBL v28 (split)</td>
          <td>84,160 molecules</td>
          <td>Random split from pretraining set</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ChEMBL v28 (per target)</td>
          <td>5, 10, 20, or 40 molecules</td>
          <td>pChEMBL &gt; 6, 10 targets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>LSTM-based CLM with character-level SMILES prediction</li>
<li>Multinomial sampling at $T = 1$</li>
<li>Beam search at $k = 10$ and $k = 50$</li>
<li>Perplexity computed per Equation 1; delta score per Equation 2</li>
<li>Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs</li>
</ul>
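<p>The ranking pipeline above can be sketched in a few lines. This is an illustrative reading of the paper&rsquo;s Equations 1 and 2, not its implementation: the per-token log-probability interface and the sign convention of the delta score are assumptions.</p>

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Sketch of Equation 1: PP = exp(-(1/N) * sum_i log p_i)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def delta_score(rank_finetuned: int, rank_pretrained: int) -> int:
    """Sketch of Equation 2: rank shift between models.

    Positive when the fine-tuned model ranks the molecule higher
    (lower rank index) than the pretrained model did, indicating
    task-relevant generation; the exact sign convention is assumed.
    """
    return rank_pretrained - rank_finetuned

def rank_by_perplexity(log_prob_lists: list[list[float]]) -> list[int]:
    """Indices of generated SMILES sorted by ascending perplexity (lower is better)."""
    scores = [perplexity(lp) for lp in log_prob_lists]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```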
<h3 id="models">Models</h3>
<ul>
<li>4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization</li>
<li>5,820,515 parameters total</li>
<li>One-hot encoded SMILES input</li>
<li>Pretrained weights available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity</td>
          <td>Model confidence in generated SMILES</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Delta score</td>
          <td>Rank difference between fine-tuned and pretrained models</td>
          <td>Positive indicates task-relevant generation</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Morgan and pharmacophore fingerprints</td>
          <td>Compared to fine-tuning set</td>
      </tr>
      <tr>
          <td>Pearson correlation</td>
          <td>Perplexity vs. Tanimoto distance</td>
          <td>Stabilizes at ~0.5</td>
      </tr>
      <tr>
          <td>SMILES validity</td>
          <td>Fraction of valid SMILES strings</td>
          <td>Consistently &gt; 90%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ETHmodlab/CLM_perplexity">CLM_perplexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework, pretrained weights, and training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ETHmodlab/molecular_design_with_beam_search">Beam search implementation</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Referenced beam search implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moret, M., Grisoni, F., Katzberger, P., &amp; Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. <em>Journal of Chemical Information and Modeling</em>, 62(5), 1199-1206. <a href="https://doi.org/10.1021/acs.jcim.2c00079">https://doi.org/10.1021/acs.jcim.2c00079</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ETHmodlab/CLM_perplexity">GitHub: CLM_perplexity (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moret2022perplexity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1199--1206}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c00079}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
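<p>The digit tokenization can be sketched as follows. The token naming (<code>v_p</code>) mirrors the example above; the helper itself is illustrative and handles only the simple non-negative case.</p>

```python
def tokenize_float(value: float) -> list[str]:
    """Split a non-negative float into digit tokens t_{v,p}.

    Each token pairs a digit value v with its decimal place p,
    so 12.3 becomes ['1_1', '2_0', '3_-1'].
    """
    text = f"{value:g}"  # compact decimal form, e.g. '12.3'
    int_part, _, frac_part = text.partition(".")
    # Integer digits occupy places 10^p with p >= 0, highest place first.
    tokens = [f"{d}_{len(int_part) - 1 - i}" for i, d in enumerate(int_part)]
    # Fractional digits occupy places 10^p with p < 0.
    tokens += [f"{d}_{-(i + 1)}" for i, d in enumerate(frac_part)]
    return tokens
```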
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
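<p>Put together, training alternates objectives on a fixed schedule and folds the self-consistency term into the generation phase. A minimal sketch of the schedule and loss combination, with toy scalar losses standing in for $\mathcal{L}_P$ and $\mathcal{L}_G$:</p>

```python
def active_objective(step: int, period: int = 50) -> str:
    """Alternate between the two specialized objectives every `period` steps."""
    return "prediction" if (step // period) % 2 == 0 else "generation"

def self_consistency_loss(l_g: float, l_p_on_generated: float, alpha: float) -> float:
    """L_SC = L_G(x) + alpha * L_P(x_hat): the generation loss plus the
    property-prediction loss re-evaluated on the generated molecule."""
    return l_g + alpha * l_p_on_generated
```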
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has less than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Language Models Learn Complex Molecular Distributions</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/lm-complex-molecular-distributions/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/lm-complex-molecular-distributions/</guid><description>RNN language models trained on SMILES and SELFIES outperform graph models at learning complex, multi-modal, and large-scale molecular distributions.</description><content:encoded><![CDATA[<h2 id="rnn-language-models-as-flexible-molecular-generators">RNN Language Models as Flexible Molecular Generators</h2>
<p>This is an <strong>Empirical</strong> paper that investigates the capacity of simple recurrent neural network (RNN) language models to learn complex molecular distributions. The core finding is that LSTM-based models trained on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> (SM-RNN) or <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (SF-RNN) string representations consistently outperform popular graph generative models (JTVAE, CGVAE) across three increasingly challenging generative modeling tasks. The paper positions language models as flexible, scalable alternatives to graph-based approaches for molecular generation.</p>
<h2 id="scaling-beyond-standard-benchmarks">Scaling Beyond Standard Benchmarks</h2>
<p>Most molecular generative models are evaluated on relatively small, drug-like molecules from datasets like <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> or <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. These standard benchmarks do not test whether models can handle larger, more structurally diverse molecules or distributions with complex shapes (multi-modal, heavy-tailed). This gap matters because there is increasing interest in larger, more complex molecules for therapeutics, including peptides and natural products.</p>
<p>Graph generative models like JTVAE and CGVAE impose structural constraints (tree decompositions, valency restrictions) that help with validity but limit their ability to scale. Language models, by contrast, only need to generate a single character sequence, making them inherently more flexible.</p>
<h2 id="three-challenging-generative-modeling-tasks">Three Challenging Generative Modeling Tasks</h2>
<p>The paper introduces three benchmark tasks designed to stress-test generative models:</p>
<h3 id="task-1-penalized-logp-distribution">Task 1: Penalized LogP Distribution</h3>
<p>A dataset of approximately 160K molecules from ZINC15 with penalized <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a> scores exceeding 4.0. The training distribution is sharply peaked around 4.0 to 4.5 with a subtle tail extending above 6.0. Molecules in the tail tend to have long carbon chains and fewer rings. The challenge is learning this skewed distribution rather than just finding individual high-scoring molecules.</p>
<h3 id="task-2-multi-modal-distribution">Task 2: Multi-Modal Distribution</h3>
<p>A composite dataset of approximately 200K molecules drawn from four sources with distinct molecular weight ranges:</p>
<ul>
<li><a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a> (MW $\leq$ 185)</li>
<li>ZINC (185 $\leq$ MW $\leq$ 425)</li>
<li>Harvard Clean Energy Project (460 $\leq$ MW $\leq$ 600)</li>
<li>POLYMERS (MW $&gt;$ 600)</li>
</ul>
<p>Models must learn to generate from all four modes simultaneously, each with very different molecular structures.</p>
<h3 id="task-3-large-scale-molecules">Task 3: Large-Scale Molecules</h3>
<p>The largest molecules in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> with more than 100 heavy atoms, yielding approximately 300K molecules with molecular weights ranging from 1,250 to 5,000. These include small biomolecules, photovoltaics, peptides, and cyclic peptides. This task is particularly challenging because the SMILES/SELFIES strings are very long.</p>
<h2 id="evaluation-by-distributional-fidelity">Evaluation by Distributional Fidelity</h2>
<p>The evaluation framework focuses on how well a model learns the full training distribution rather than generating individual good molecules. The primary quantitative metric is the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> (earth mover&rsquo;s distance) between molecular property distributions of generated and training molecules:</p>
<p>$$W(P, Q) = \inf_{\gamma \in \Gamma(P,Q)} \int |x - y| \, d\gamma(x, y)$$</p>
<p>Properties evaluated include LogP, synthetic accessibility (SA), quantitative estimate of drug-likeness (QED), molecular weight (MW), Bertz complexity (BCT), and natural product likeness (NP). An oracle baseline is computed by measuring the Wasserstein distance between different random samples of the training data itself.</p>
<p>Standard metrics (validity, uniqueness, novelty) are also reported but are secondary to distributional fidelity.</p>
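<p>For equal-sized empirical samples in one dimension, the Wasserstein distance reduces to the mean absolute difference of sorted values, which makes the protocol easy to sketch. The property values here are toy numbers, not the paper&rsquo;s data:</p>

```python
def wasserstein_1d(sample_p: list[float], sample_q: list[float]) -> float:
    """Exact 1-Wasserstein distance between two equal-sized empirical samples."""
    assert len(sample_p) == len(sample_q), "sketch assumes equal sample sizes"
    pairs = zip(sorted(sample_p), sorted(sample_q))
    return sum(abs(x - y) for x, y in pairs) / len(sample_p)

# Oracle baseline: the same distance computed between two random
# subsamples of the training data itself, one per property (LogP, SA, ...).
```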
<h2 id="architecture-lstm-language-models">Architecture: LSTM Language Models</h2>
<p>The language models use standard LSTM architectures trained autoregressively on molecular strings. Two variants are compared:</p>
<ul>
<li><strong>SM-RNN</strong>: Trained on canonical SMILES</li>
<li><strong>SF-RNN</strong>: Trained on SELFIES representations</li>
</ul>
<p>Hyperparameters are tuned via random search over learning rate ($\in [0.0001, 0.001]$), hidden units ($\in [100, 1000]$), layers (1 to 5), and dropout ($\in [0.0, 0.5]$). Model selection uses a combination of standard metrics and Wasserstein distance rankings.</p>
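<p>The quoted search ranges translate directly into a sampler. Whether each range was drawn uniformly or log-uniformly is not stated, so the uniform sampling below is an assumption:</p>

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the ranges quoted above."""
    return {
        "learning_rate": rng.uniform(1e-4, 1e-3),
        "hidden_units": rng.randint(100, 1000),
        "num_layers": rng.randint(1, 5),
        "dropout": rng.uniform(0.0, 0.5),
    }
```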
<p>The graph model baselines include JTVAE (junction tree VAE) and CGVAE (constrained graph VAE), along with several additional baselines (MolGAN, GraphNVP, and others).</p>
<h2 id="results-language-models-outperform-graph-models-across-all-tasks">Results: Language Models Outperform Graph Models Across All Tasks</h2>
<h3 id="penalized-logp">Penalized LogP</h3>
<p>Both RNN models learn the sharp training distribution far better than graph models. The SM-RNN achieves the lowest Wasserstein distances across most properties. The graph models produce substantial out-of-distribution mass around penalized LogP scores of 1.75 to 2.25, failing to capture the peaked nature of the training distribution.</p>
<p>Critically, the RNNs also learn the subtle tail above penalized LogP of 6.0, generating molecules with long carbon chains and fewer rings that match the structural characteristics of high-scoring training molecules. CGVAE and JTVAE almost entirely miss this tail.</p>
<h3 id="multi-modal-distribution">Multi-Modal Distribution</h3>
<p>Both RNN models capture all four modes of the training distribution. JTVAE entirely misses the GDB13 mode and poorly learns the ZINC and CEP modes. CGVAE learns GDB13 but misses the CEP mode. The SM-RNN again achieves the best Wasserstein metrics.</p>
<h3 id="large-scale-molecules">Large-Scale Molecules</h3>
<p>This is the most discriminating task. Both JTVAE and CGVAE completely fail to train on these large molecules. JTVAE&rsquo;s tree decomposition produces a vocabulary of approximately 11,000 substructures, making training intractable. Only the RNN models succeed, with the SF-RNN achieving slightly better distributional match due to SELFIES guaranteeing 100% validity even for very long strings.</p>
<p>Both RNN models also learn the bimodal LogP structure within the large-molecule distribution and can generate molecules with substructures resembling peptides, including backbone chains and standard amino acid side chains.</p>
<h3 id="summary-of-wasserstein-distance-results">Summary of Wasserstein Distance Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>MW</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>SM-RNN</td>
          <td>0.095</td>
          <td>0.031</td>
          <td>0.007</td>
          <td>3.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>SF-RNN</td>
          <td>0.177</td>
          <td>0.290</td>
          <td>0.010</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>JTVAE</td>
          <td>0.536</td>
          <td>0.289</td>
          <td>0.081</td>
          <td>35.9</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>CGVAE</td>
          <td>1.000</td>
          <td>2.120</td>
          <td>0.115</td>
          <td>69.3</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SM-RNN</td>
          <td>0.081</td>
          <td>0.025</td>
          <td>0.006</td>
          <td>5.5</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SF-RNN</td>
          <td>0.286</td>
          <td>0.179</td>
          <td>0.023</td>
          <td>11.4</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>JTVAE</td>
          <td>0.495</td>
          <td>0.274</td>
          <td>0.034</td>
          <td>27.7</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>CGVAE</td>
          <td>1.617</td>
          <td>1.802</td>
          <td>0.076</td>
          <td>30.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SM-RNN</td>
          <td>1.367</td>
          <td>0.213</td>
          <td>0.003</td>
          <td>124.5</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SF-RNN</td>
          <td>1.095</td>
          <td>0.342</td>
          <td>0.010</td>
          <td>67.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>JTVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>CGVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="smiles-vs-selfies-trade-off">SMILES vs. SELFIES Trade-off</h3>
<p>A notable finding is that the SMILES and SELFIES RNNs have complementary strengths. The SF-RNN consistently achieves better standard metrics (validity, uniqueness, novelty) across all tasks, while the SM-RNN achieves better Wasserstein distances. The authors suggest that the SELFIES grammar may reduce memorization of the training data, improving novelty but slightly hurting distributional fidelity.</p>
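<p>The standard metrics have conventional definitions: validity is the fraction of generated strings that parse to a molecule, uniqueness the fraction of valid molecules that are distinct, and novelty the fraction of unique molecules absent from the training set. A minimal sketch, assuming a caller-supplied validity predicate (e.g. RDKit parsing) and canonicalized strings so set comparisons are meaningful:</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty as conventionally defined for
    molecular generators. `is_valid` is a caller-supplied predicate (e.g.
    whether RDKit can parse the string); strings are assumed canonicalized
    so that set equality is meaningful. Illustrative sketch only."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy example with a trivial stand-in validity check:
gen = ["CCO", "CCO", "CCN", "C(("]
print(generation_metrics(gen, {"CCO"}, lambda s: "((" not in s))
```

<p>Note that novelty is computed over unique valid molecules, so a model can trade distributional fidelity for novelty, as the SELFIES results suggest.</p>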
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. Language models cannot account for molecular geometry or 3D information, which is important for many applications. The study evaluates distributional fidelity but does not test downstream utility for specific molecular design tasks (e.g., optimizing for a particular biological target). Additionally, while the graph models (JTVAE, CGVAE) are more interpretable, the language models operate as black boxes over string representations. The comparison is also limited to two specific graph model architectures, and more recent or specialized graph models may close the performance gap. Finally, trained model weights are only available upon request rather than being publicly released.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">danielflamshep/genmoltasks</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>Processed training data and generated samples</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Three custom datasets constructed from ZINC15, <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a>, Harvard Clean Energy Project, POLYMERS, and PubChem. Processed data available at the GitHub repository.</p>
<p><strong>Code</strong>: LSTM networks implemented in PyTorch using the char-rnn code from the <a href="https://github.com/molecularsets/moses">MOSES repository</a>. Baselines use the official <a href="https://github.com/wengong-jin/icml18-jtnn">JTVAE</a> and <a href="https://github.com/microsoft/constrained-graph-variational-autoencoder">CGVAE</a> implementations. No unified training script is provided in the repository.</p>
<p><strong>Evaluation</strong>: Wasserstein distances computed using SciPy. Molecular properties computed using RDKit. 10K molecules generated from each model for evaluation.</p>
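<p>For equally sized, equally weighted samples, the 1-Wasserstein distance reduces to the mean absolute difference between sorted values, which is what <code>scipy.stats.wasserstein_distance</code> computes in that special case. A dependency-free sketch of the metric as applied to per-molecule property values:</p>

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein (earth mover's) distance between two equal-size,
    equally weighted empirical samples: the mean absolute difference of
    the sorted values. Agrees with scipy.stats.wasserstein_distance in
    this special case; the general unequal-size case needs the full
    quantile-function integral."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Illustrative property values (e.g. QED) for training vs. generated molecules:
train = [0.55, 0.60, 0.72, 0.81]
generated = [0.58, 0.61, 0.70, 0.79]
print(wasserstein_1d(train, generated))
```

<p>Lower values mean the generated property distribution sits closer to the training distribution, which is the sense in which the table above ranks the models.</p>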
<p><strong>Hyperparameters</strong>: Task-specific configurations reported. For example, the LogP task SM-RNN uses 2 hidden layers with 400 units, dropout of 0.2, and learning rate of 0.0001.</p>
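<p>A configuration like the reported LogP-task SM-RNN can be captured in a small record. The field names below are illustrative, not taken from the paper's code or the MOSES char-rnn API:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CharRNNConfig:
    """Hypothetical container for the reported LogP-task SM-RNN settings;
    field names are illustrative, not an actual MOSES or paper API."""
    num_layers: int = 2       # reported: 2 hidden LSTM layers
    hidden_size: int = 400    # reported: 400 units per layer
    dropout: float = 0.2      # reported: dropout of 0.2
    learning_rate: float = 1e-4  # reported: learning rate of 0.0001

cfg = CharRNNConfig()
print(cfg)
```
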
<p><strong>Hardware</strong>: Models were trained on Compute Canada computing clusters. Specific GPU types and training times are not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D., Zhu, K., &amp; Aspuru-Guzik, A. (2022). Language models can learn complex molecular distributions. <em>Nature Communications</em>, 13, 3293. <a href="https://doi.org/10.1038/s41467-022-30839-x">https://doi.org/10.1038/s41467-022-30839-x</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/danielflamshep/genmoltasks">GitHub: danielflamshep/genmoltasks</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2022language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can learn complex molecular distributions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Zhu, Kevin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3293}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-022-30839-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>