<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Encoders on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/</link><description>Recent content in Molecular Encoders on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/index.xml" rel="self" type="application/rss+xml"/><item><title>Mol2vec: Unsupervised ML with Chemical Intuition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/mol2vec-unsupervised-chemical-intuition/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/mol2vec-unsupervised-chemical-intuition/</guid><description>Mol2vec applies Word2vec to Morgan substructures, learning dense vector representations of molecules that capture chemical similarity for property prediction.</description><content:encoded><![CDATA[<h2 id="word2vec-meets-cheminformatics">Word2vec Meets Cheminformatics</h2>
<p>Mol2vec is a <strong>Method</strong> paper that introduces an unsupervised approach for learning dense vector representations of molecular substructures. The core idea is a direct analogy to <a href="/notes/machine-learning/classic-papers/distributed-representations/">Word2vec</a> from natural language processing: molecular substructures (derived from the Morgan algorithm) are treated as &ldquo;words,&rdquo; and entire molecules are treated as &ldquo;sentences.&rdquo; By training on a large unlabeled corpus of 19.9 million compounds, Mol2vec produces embeddings where chemically related substructures occupy nearby regions of vector space. Compound-level vectors are then obtained by summing constituent substructure vectors, and these can serve as features for downstream supervised learning tasks.</p>
<h2 id="sparse-fingerprints-and-their-limitations">Sparse Fingerprints and Their Limitations</h2>
<p>Molecular fingerprints, particularly Morgan fingerprints (extended-connectivity fingerprints, ECFP), are among the most widely used molecular representations in cheminformatics. They perform well for similarity searching, virtual screening, and activity prediction. However, they suffer from several practical drawbacks:</p>
<ul>
<li><strong>High dimensionality and sparsity</strong>: Morgan fingerprints are typically hashed to fixed-length binary vectors (e.g., 2048 or 4096 bits), resulting in very sparse representations.</li>
<li><strong>Bit collisions</strong>: The hashing step can map distinct substructures to the same bit position, losing structural information.</li>
<li><strong>No learned relationships</strong>: Each bit is independent, so the representation does not encode any notion of chemical similarity between substructures.</li>
</ul>
<p>At the time of this work (2017), NLP techniques had started to appear in cheminformatics. The <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> method had been applied to Morgan fingerprints for compound-protein interaction prediction, and <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> had been used for chemical topic modeling. The Word2vec concept had been adapted for protein sequences (ProtVec) but had not yet been applied to small molecules. Mol2vec fills this gap.</p>
<h2 id="from-substructure-identifiers-to-dense-embeddings">From Substructure Identifiers to Dense Embeddings</h2>
<p>The central insight of Mol2vec is that the Morgan algorithm already produces a natural &ldquo;vocabulary&rdquo; of molecular substructures, and the order in which these substructures appear in a molecule provides local context, analogous to word order in a sentence.</p>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>The training corpus was assembled from <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> v15 and <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v23, merged and deduplicated, then filtered by molecular weight (12-600), heavy atom count (3-50), clogP (-5 to 7), and allowed elements (H, B, C, N, O, F, P, S, Cl, Br). This yielded 19.9 million compounds.</p>
<h3 id="sentence-generation">Sentence Generation</h3>
<p>For each molecule, the Morgan algorithm generates atom identifiers at radius 0 and radius 1. Each atom contributes two identifiers (one per radius), ordered according to the atom order in the canonical <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>. This sequence of identifiers forms a &ldquo;sentence&rdquo; for Word2vec training.</p>
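<p>As a toy illustration of this sentence assembly (the integer identifiers below are made up; in practice they come from the Morgan algorithm, e.g., via RDKit's fingerprint <code>bitInfo</code> output), the interleaving logic might be sketched as:</p>

```python
# Toy sketch of Mol2vec "sentence" construction. The identifier values
# are purely illustrative stand-ins for real Morgan identifiers.

def mol_to_sentence(atom_identifiers):
    """Interleave radius-0 and radius-1 identifiers in canonical atom order.

    atom_identifiers: list of (radius0_id, radius1_id) tuples, one per atom,
    already ordered by the atom order of the canonical SMILES.
    """
    sentence = []
    for r0, r1 in atom_identifiers:
        sentence.append(str(r0))  # radius-0 identifier for this atom
        sentence.append(str(r1))  # radius-1 identifier for this atom
    return sentence

# Hypothetical identifiers for a three-atom fragment:
ids = [(2246728737, 3542456614), (2245384272, 1506563592), (864942730, 2599973650)]
print(mol_to_sentence(ids))
```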
<h3 id="word2vec-training">Word2vec Training</h3>
<p>The model was trained using the gensim implementation of Word2vec. After evaluating both CBOW and Skip-gram architectures with window sizes of 5, 10, and 20, and embedding dimensions of 100 and 300, the best configuration was:</p>
<ul>
<li><strong>Architecture</strong>: Skip-gram</li>
<li><strong>Window size</strong>: 10</li>
<li><strong>Embedding dimension</strong>: 300</li>
</ul>
<p>Rare identifiers appearing fewer than 3 times in the corpus were replaced with a special &ldquo;UNSEEN&rdquo; token, which learns a near-zero vector. This allows the model to handle novel substructures at inference time.</p>
<h3 id="compound-vector-generation">Compound Vector Generation</h3>
<p>The final vector for a molecule is the sum of all its substructure vectors:</p>
<p>$$\mathbf{v}_{\text{mol}} = \sum_{i=1}^{N} \mathbf{v}_{s_i}$$</p>
<p>where $\mathbf{v}_{s_i}$ is the 300-dimensional embedding of the $i$-th substructure identifier in the molecule. Because each occurrence contributes a term to the sum, the compound vector implicitly encodes substructure counts through its magnitude.</p>
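<p>A minimal sketch of the summation, including the UNSEEN fallback for out-of-vocabulary identifiers (the embedding values and 3-dimensional size are illustrative assumptions; the real model uses 300 dimensions):</p>

```python
import numpy as np

# Assumed toy embedding table; identifiers missing from the vocabulary
# fall back to the near-zero "UNSEEN" vector.
embeddings = {
    "2246728737": np.array([0.5, -0.2, 0.1]),
    "864942730":  np.array([0.3,  0.4, -0.6]),
    "UNSEEN":     np.array([0.0,  0.0, 0.0]),
}

def compound_vector(sentence, embeddings):
    """Sum the substructure vectors of a molecule's identifier sentence."""
    return sum(embeddings.get(tok, embeddings["UNSEEN"]) for tok in sentence)

vec = compound_vector(["2246728737", "864942730", "999"], embeddings)
print(vec)  # approximately [0.8, 0.2, -0.5]; "999" maps to UNSEEN
```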
<h2 id="benchmarking-across-regression-and-classification-tasks">Benchmarking Across Regression and Classification Tasks</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors evaluated Mol2vec on four datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>Regression</td>
          <td>1,144</td>
          <td>Aqueous solubility prediction</td>
      </tr>
      <tr>
          <td>Ames</td>
          <td>Classification</td>
          <td>6,511</td>
          <td><a href="https://en.wikipedia.org/wiki/Mutagen">Mutagenicity</a> (balanced: 3,481 positive, 2,990 negative)</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Classification</td>
          <td>8,192</td>
          <td>12 human toxicity targets (imbalanced)</td>
      </tr>
      <tr>
          <td>Kinase</td>
          <td>Classification</td>
          <td>284 kinases</td>
          <td>Bioactivity from ChEMBL v23</td>
      </tr>
  </tbody>
</table>
<h3 id="machine-learning-methods">Machine Learning Methods</h3>
<p>Three ML methods were compared using both Mol2vec and Morgan FP features:</p>
<ul>
<li><strong>Random Forest (RF)</strong>: scikit-learn, 500 estimators</li>
<li><strong>Gradient Boosting Machine (GBM)</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>Deep Neural Network (DNN)</strong>: Keras/TensorFlow, 4 hidden layers with 2000 neurons each for Mol2vec; 1 hidden layer with 512 neurons for Morgan FP</li>
</ul>
<p>All models were validated with 20 repetitions of 5-fold cross-validation, using the <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for statistical comparison.</p>
<h3 id="esol-regression-results">ESOL Regression Results</h3>
<table>
  <thead>
      <tr>
          <th>Features</th>
          <th>Method</th>
          <th>$R^2_{\text{ext}}$</th>
          <th>MSE</th>
          <th>MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>MLR</td>
          <td>0.81 +/- 0.01</td>
          <td>0.82</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>Molecular Graph</td>
          <td>CNN</td>
          <td>0.93</td>
          <td>0.31 +/- 0.03</td>
          <td>0.40 +/- 0.00</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>GBM</td>
          <td>0.66 +/- 0.00</td>
          <td>1.43 +/- 0.00</td>
          <td>0.88 +/- 0.00</td>
      </tr>
      <tr>
          <td>Mol2vec</td>
          <td>GBM</td>
          <td>0.86 +/- 0.00</td>
          <td>0.62 +/- 0.00</td>
          <td>0.60 +/- 0.00</td>
      </tr>
  </tbody>
</table>
<p>Mol2vec substantially outperformed Morgan FP ($R^2_{\text{ext}}$ 0.86 vs. 0.66) but did not match the best graph convolution methods ($R^2_{\text{ext}}$ ~0.93).</p>
<h3 id="classification-results-ames-and-tox21">Classification Results (Ames and Tox21)</h3>
<p>On the Ames dataset, Mol2vec and Morgan FP performed comparably (AUC 0.87 vs. 0.88), both matching or exceeding prior SVM and Naive Bayes results. On Tox21, both achieved an average AUC of 0.83, outperforming literature results from graph convolution (0.71) and DNN/SVM approaches (0.71-0.72).</p>
<h3 id="proteochemometric-pcm-extension">Proteochemometric (PCM) Extension</h3>
<p>Mol2vec was combined with ProtVec (protein sequence embeddings using the same Word2vec approach on 3-grams) by concatenating vectors, forming PCM2vec. This was evaluated using a rigorous 4-level cross-validation scheme:</p>
<ul>
<li><strong>CV1</strong>: New compound-target pairs</li>
<li><strong>CV2</strong>: New targets</li>
<li><strong>CV3</strong>: New compounds</li>
<li><strong>CV4</strong>: New compounds and targets</li>
</ul>
<p>On Tox21, PCM2vec improved predictions for new compound-target pairs (CV1: AUC 0.87 vs. 0.79 for Morgan FP) and new compounds (CV3: AUC 0.85 vs. 0.78). On the kinase dataset, PCM2vec approached the performance of classical PCM (Morgan + z-scales) while being alignment-independent, meaning it can be applied to proteins with low sequence similarity.</p>
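<p>The PCM2vec feature construction amounts to a simple concatenation per compound-target pair. A sketch under assumed dimensions (300 for Mol2vec; 100 for ProtVec, matching the original ProtVec paper) with random placeholder vectors:</p>

```python
import numpy as np

# Illustrative stand-ins for a summed Mol2vec compound vector and a
# summed ProtVec protein vector; real features would come from the
# trained embedding models.
rng = np.random.default_rng(0)
mol_vec = rng.normal(size=300)   # compound representation
prot_vec = rng.normal(size=100)  # protein representation

# One PCM2vec feature vector for a single compound-target pair.
pcm_features = np.concatenate([mol_vec, prot_vec])
print(pcm_features.shape)  # (400,)
```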
<h2 id="chemical-intuition-and-practical-value">Chemical Intuition and Practical Value</h2>
<h3 id="embedding-quality">Embedding Quality</h3>
<p>The learned substructure embeddings capture meaningful chemical relationships. Hierarchical clustering of the 25 most common substructures shows expected groupings: aromatic carbons cluster together, aliphatic ring carbons form a separate group, and carbonyl carbons and oxygens are closely related. Similarly, t-SNE projections of amino acid vectors encoded by Mol2vec reproduce known amino acid relationships (e.g., similar distances between Glu/Gln and Asp/Asn pairs, reflecting the carboxylic acid to amide transition).</p>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Skip-gram with 300-dimensional embeddings</strong> provides the best Mol2vec representations, consistent with NLP best practices.</li>
<li><strong>Mol2vec excels at regression tasks</strong>, substantially outperforming Morgan FP on ESOL solubility prediction ($R^2_{\text{ext}}$ 0.86 vs. 0.66).</li>
<li><strong>Classification performance is competitive</strong> with Morgan FP across Ames and Tox21 datasets.</li>
<li><strong>PCM2vec enables alignment-independent proteochemometrics</strong>, extending PCM approaches to diverse protein families with low sequence similarity.</li>
<li><strong>Tree-based methods (RF, GBM) outperformed DNNs</strong> on these tasks, though the authors note further DNN tuning could help.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The compound vector is a simple sum of substructure vectors, which discards information about substructure arrangement and molecular topology.</li>
<li>Only Morgan identifiers at radii 0 and 1 were used. Larger radii might capture more context but would increase vocabulary size.</li>
<li>DNN architectures were not extensively optimized, leaving open the question of how well Mol2vec pairs with deep learning.</li>
<li>The approach was benchmarked against Morgan FP but not against other learned representations such as graph neural networks in a controlled comparison.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC v15 + ChEMBL v23</td>
          <td>19.9M compounds</td>
          <td>Filtered by MW, atom count, clogP, element types</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,144 compounds</td>
          <td>Aqueous solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Ames</td>
          <td>6,511 compounds</td>
          <td>Mutagenicity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>8,192 compounds</td>
          <td>12 toxicity targets, retrieved via DeepChem</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Kinase (ChEMBL v23)</td>
          <td>284 kinases</td>
          <td>IC50/Kd/Ki binding assays</td>
      </tr>
      <tr>
          <td>Protein corpus</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a></td>
          <td>554,241 sequences</td>
          <td>For ProtVec training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Word2vec</strong>: Skip-gram, window size 10, 300-dimensional embeddings, min count 3</li>
<li><strong>Morgan algorithm</strong>: Radii 0 and 1 (119 and 19,831 unique identifiers respectively)</li>
<li><strong>UNSEEN token</strong>: Replaces identifiers occurring fewer than 3 times</li>
<li><strong>Compound vector</strong>: Sum of all substructure vectors</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RF</strong>: scikit-learn, 500 estimators, sqrt features, balanced class weights</li>
<li><strong>GBM</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>DNN</strong>: Keras/TensorFlow, 4 layers x 2000 neurons (Mol2vec) or 1 layer x 512 neurons (Morgan FP), ReLU activation, dropout 0.1</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Mol2vec Best</th>
          <th>Morgan FP Best</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2_{\text{ext}}$</td>
          <td>0.86 (GBM)</td>
          <td>0.66 (GBM)</td>
          <td>ESOL regression</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.87 (RF)</td>
          <td>0.88 (RF)</td>
          <td>Ames classification</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.83 (RF)</td>
          <td>0.83 (RF)</td>
          <td>Tox21 classification</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/samoturk/mol2vec">mol2vec</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Python package with pre-trained model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jaeger, S., Fulle, S., &amp; Turk, S. (2018). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. <em>Journal of Chemical Information and Modeling</em>, 58(1), 27-35. <a href="https://doi.org/10.1021/acs.jcim.7b00616">https://doi.org/10.1021/acs.jcim.7b00616</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jaeger2018mol2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jaeger, Sabrina and Fulle, Simone and Turk, Samo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27--35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00616}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>X-MOL: Pre-training on 1.1B Molecules for SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/x-mol-pretraining-molecular-understanding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/x-mol-pretraining-molecular-understanding/</guid><description>X-MOL pre-trains a shared encoder-decoder Transformer on 1.1 billion molecules, then fine-tunes for property prediction, reaction analysis, and generation.</description><content:encoded><![CDATA[<h2 id="a-unified-molecular-pre-training-framework">A Unified Molecular Pre-training Framework</h2>
<p>X-MOL is a <strong>Method</strong> paper that introduces a large-scale pre-training framework for <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC15</a>, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, <a href="https://en.wikipedia.org/wiki/Drug_interaction">drug-drug interaction</a> (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.</p>
<h2 id="bridging-scale-and-understanding-in-molecular-smiles">Bridging Scale and Understanding in Molecular SMILES</h2>
<p>Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, T5). Two challenges motivated this work:</p>
<ol>
<li><strong>SMILES sacrifices structural information for simplicity.</strong> While SMILES is a convenient linear representation, molecular topology is encoded only implicitly in the string, making it harder for models to recover structural (let alone 3D) information from the input.</li>
<li><strong>Labelled molecular data is scarce.</strong> Most benchmark datasets (<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.</li>
</ol>
<p>The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.</p>
<h2 id="generative-pre-training-with-random-smiles">Generative Pre-training with Random SMILES</h2>
<p>The core innovation in X-MOL is a <strong>generative pre-training strategy</strong> that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (<a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">random SMILES</a>), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:</p>
<ol>
<li>Reconstruct the molecular structure from the input SMILES</li>
<li>Generate a valid output SMILES following SMILES grammar rules</li>
</ol>
<p>The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.</p>
<p>The self-attention mechanism computes attention for each character $i$ as:</p>
<p>$$
Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V
$$</p>
<p>where $Q_{i}$, $K$, and $V$ are the query, key, and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.</p>
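<p>A minimal NumPy sketch of this attention formula (single head, no output masking, toy dimensions and random data) might look like:</p>

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: SoftMax(Q K^T / sqrt(D)) V."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                   # (L, L) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # one output row per position

L, D = 4, 8  # toy sequence length and feature dimension
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, D))
Z = attention(Q, K, V)
print(Z.shape)  # (4, 8)
```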
<h3 id="model-architecture">Model Architecture</h3>
<ul>
<li>12 Transformer encoder layers</li>
<li>768-dimensional hidden units</li>
<li>12 attention heads</li>
<li>Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])</li>
<li>Characters within square brackets (e.g., charged or isotopic atoms) and two-digit ring-closure numbers preceded by &ldquo;%&rdquo; are treated as single tokens</li>
</ul>
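<p>The bracket and &ldquo;%&rdquo; tokenization rules can be approximated with a short regular expression (the exact X-MOL vocabulary is not public, so this is a sketch; multi-character element symbols such as Cl or Br would need additional handling):</p>

```python
import re

# Bracket atoms like [NH3+] and "%"-prefixed two-digit ring closures
# become single tokens; everything else is split per character.
TOKEN_RE = re.compile(r"\[[^\]]*\]|%\d{2}|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("c1ccccc1[NH3+]%12"))
# ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', '[NH3+]', '%12']
```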
<h3 id="data-augmentation-in-pre-training">Data Augmentation in Pre-training</h3>
<p>Because a molecule has many valid random SMILES, a generated output can be chemically correct yet differ from the single predefined target string. To mitigate this, X-MOL generates multiple training samples per molecule with the same input SMILES but different output random SMILES, and places these in the same mini-batch.</p>
<h2 id="experimental-setup-across-five-tasks">Experimental Setup Across Five Tasks</h2>
<p>X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.</p>
<h3 id="prediction-tasks">Prediction Tasks</h3>
<p>For prediction tasks, the [CLS] token&rsquo;s output representation is passed through a fully connected network to produce predictions. The input format varies by task:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Format</th>
          <th>Loss Function</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Property prediction (classification)</td>
          <td>Single SMILES</td>
          <td>Cross-entropy</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Property prediction (regression)</td>
          <td>Single SMILES</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Reaction productivity prediction</td>
          <td>Four SMILES (reactant, additive, base, ligand)</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Two SMILES (drug pair)</td>
          <td>Cross-entropy</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecular Property Prediction (Classification):</strong> Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBBP</a> (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.</p>
<p><strong>Molecular Property Prediction (Regression):</strong> Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.</p>
<p><strong>Chemical Reaction Productivity Prediction:</strong> The <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">C-N cross-coupling</a> dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.</p>
<p><strong>DDI Prediction:</strong> The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as benchmark.</p>
<h3 id="generation-tasks">Generation Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Generation Source</th>
          <th>Sampling Strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution learning (DL) generation</td>
          <td>Fixed initial symbol ([CLS])</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Goal-directed (GD) generation</td>
          <td>Unfixed initial symbol</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Molecule optimization</td>
          <td>Input molecule</td>
          <td>Beam search (beam size = 4)</td>
      </tr>
  </tbody>
</table>
<p><strong>DL-based Generation:</strong> Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.</p>
<p><strong>GD Generation:</strong> Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.</p>
<p><strong>Molecule Optimization:</strong> Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.</p>
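<p>The pair-selection rule can be sketched in plain Python (the fingerprint bit sets and QED values below are made-up stand-ins for fingerprints and scores computed with a cheminformatics toolkit):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b)

def make_pair(mol_a, mol_b, lo=0.6, hi=0.8):
    """Return an (input, target) pair ordered by QED, or None if the
    Tanimoto similarity falls outside [lo, hi]."""
    sim = tanimoto(mol_a["bits"], mol_b["bits"])
    if not (lo <= sim <= hi):
        return None
    # Lower-QED molecule is the input; higher-QED molecule is the target.
    return tuple(sorted((mol_a, mol_b), key=lambda m: m["qed"]))

a = {"name": "A", "bits": {1, 2, 3, 4}, "qed": 0.61}
b = {"name": "B", "bits": {2, 3, 4, 5}, "qed": 0.83}
pair = make_pair(a, b)  # similarity 3/5 = 0.6, so the pair is kept
print(pair[0]["name"], "->", pair[1]["name"])  # A -> B
```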
<h3 id="key-results">Key Results</h3>
<p><strong>Classification (ROC-AUC, higher is better):</strong> X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.</p>
<p><strong>Regression (RMSE, lower is better):</strong> X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.</p>
<p><strong>Reaction Productivity:</strong> X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.</p>
<p><strong>DDI Prediction:</strong> X-MOL achieved accuracy of 0.952, improving over DeepDDI&rsquo;s 0.924.</p>
<p><strong>DL-based Generation:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>20%</td>
          <td>99.97%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>MRNN</td>
          <td>65%</td>
          <td>99.89%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GraphAF</td>
          <td>68%</td>
          <td>99.10%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td><strong>X-MOL</strong></td>
          <td><strong>85.28%</strong></td>
          <td><strong>99.91%</strong></td>
          <td><strong>100%</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>GD Generation:</strong> X-MOL generated all top-3 molecules with QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.</p>
<h3 id="knowledge-embedding-ablation">Knowledge Embedding Ablation</h3>
<p>The paper tested three additional embedding strategies to inject structural information into the model:</p>
<ul>
<li><strong>Link embedding:</strong> Encodes connection information between atoms (position of the previous connected atom)</li>
<li><strong>Ring embedding:</strong> Encodes ring structure information from SMILES number pairs</li>
<li><strong>Type embedding:</strong> Categorizes characters into 9 types (atoms, bonds, structural symbols)</li>
</ul>
<p>None of these additional embeddings improved performance on the HIV or DDI tasks, whether with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts this information, a finding they label &ldquo;SMILES is all you need.&rdquo;</p>
<h3 id="attention-visualization">Attention Visualization</h3>
<p>The authors provide attention heatmap analysis demonstrating that:</p>
<ul>
<li>Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures</li>
<li>Later layers abstract higher-level features for property prediction</li>
<li>In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)</li>
<li>In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)</li>
</ul>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:</p>
<ol>
<li><strong>Scale enables SMILES understanding.</strong> Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.</li>
<li><strong>Unified framework.</strong> A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.</li>
<li><strong>SMILES is sufficient.</strong> Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.</li>
<li><strong>Interpretable attention.</strong> Attention visualization confirms that the model reconstructs molecular structure internally.</li>
</ol>
<p><strong>Limitations</strong> (observed):</p>
<ul>
<li>Evaluation covers relatively few datasets, all drawn from MoleculeNet. All splits are random; no scaffold or temporal splits are used, which can overestimate performance on structurally novel compounds.</li>
<li>Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.</li>
<li>The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.</li>
<li>No code or model weights have been publicly released, limiting independent verification.</li>
<li>The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.</li>
</ul>
<p><strong>Future directions</strong> proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC15</td>
          <td>1.1 billion molecules</td>
          <td>Random SMILES augmentation</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV (MoleculeNet)</td>
          <td>41,127</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,484</td>
          <td>Two sub-datasets, averaged</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128</td>
          <td>Water solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv (MoleculeNet)</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200</td>
          <td>logD at pH 7.4</td>
      </tr>
      <tr>
          <td>Reaction</td>
          <td>C-N cross-coupling</td>
          <td>3,956</td>
          <td>From Ahneman et al. (2018)</td>
      </tr>
      <tr>
          <td>DDI</td>
          <td>DeepDDI</td>
          <td>192,284 DDI pairs</td>
          <td>86 interaction types</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>ZINC250K</td>
          <td>249,456</td>
          <td>For DL, GD, and optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer</li>
<li>Fine-tuning prediction tasks: [CLS] token passed through fully connected layers</li>
<li>Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)</li>
<li>Data augmentation: Random SMILES augmentation for regression tasks</li>
<li>Repeated training: 20 random splits with averaged results for classification/regression</li>
<li>10-fold cross-validation for reaction productivity</li>
</ul>
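<p>The prediction-task fine-tuning step above (passing the [CLS] representation through fully connected layers) can be sketched as follows. This is a minimal NumPy illustration, not the unreleased X-MOL implementation; the 128-dim hidden layer and two-class output are illustrative choices, while the 768-dim [CLS] vector matches the reported model width.</p>

```python
import numpy as np

def prediction_head(cls_vec, W1, b1, W2, b2):
    """Hypothetical two-layer head: the [CLS] representation is passed
    through fully connected layers to produce task logits."""
    h = np.maximum(0.0, cls_vec @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                        # linear output layer

rng = np.random.default_rng(0)
d, hidden, n_classes = 768, 128, 2
cls_vec = rng.normal(size=d)                  # [CLS] output of the encoder
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_classes)), np.zeros(n_classes)
logits = prediction_head(cls_vec, W1, b1, W2, b2)
print(logits.shape)  # (2,)
```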
<h3 id="models">Models</h3>
<ul>
<li>12-layer Transformer, 768 hidden dimensions, 12 attention heads</li>
<li>Character-level tokenization: 108 chemical characters + 5 special tokens</li>
<li>Implemented in PaddlePaddle framework</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>X-MOL</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BACE (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BBBP (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ClinTox (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ESOL (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>FreeSolv (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>Lipophilicity (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>C-N coupling</td>
          <td>RMSE</td>
          <td>0.0626</td>
          <td>0.078 (random forest)</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Accuracy</td>
          <td>0.952</td>
          <td>0.924 (DeepDDI)</td>
      </tr>
      <tr>
          <td>DL generation</td>
          <td>Validity</td>
          <td>85.28%</td>
          <td>68% (GraphAF)</td>
      </tr>
      <tr>
          <td>GD generation</td>
          <td>Top-3 QED</td>
          <td>All 0.948</td>
          <td>0.948/0.948/0.947 (GraphAF)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 to 16 Tesla P40 GPUs (24 GB each), approximately 4 days</li>
<li>Data pre-processing: Over 1,000 CPUs with Hadoop</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu&rsquo;s PaddlePaddle framework, but no repository is available.</p>
<p><strong>Reproducibility status: Closed.</strong> While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., &amp; Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. <em>bioRxiv</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xue2020xmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2020.12.23.424259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Cold Spring Harbor Laboratory}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES-BERT: BERT-Style Pre-Training for Molecules</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/</guid><description>SMILES-BERT applies BERT-style masked pre-training to SMILES strings for molecular property prediction, using Transformer encoders fine-tuned on labeled data.</description><content:encoded><![CDATA[<h2 id="pre-training-transformers-on-smiles-for-molecular-properties">Pre-Training Transformers on SMILES for Molecular Properties</h2>
<p>SMILES-BERT is a <strong>Method</strong> paper that introduces a BERT-inspired pre-training and fine-tuning framework for molecular property prediction. The primary contribution is adapting the masked language model paradigm from NLP to <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a>, enabling a Transformer encoder to learn molecular representations from large-scale unlabeled data before fine-tuning on smaller labeled datasets.</p>
<h2 id="limited-labels-in-molecular-property-prediction">Limited Labels in Molecular Property Prediction</h2>
<p>Molecular property prediction is central to drug discovery and chemical design, but obtaining labeled data requires expensive biological assays. Deep learning methods for this task fall into three categories: manually designed fingerprints (e.g., ECFP), graph-based methods (GCNs operating on molecular graphs), and sequence-based methods (RNNs or CNNs operating on SMILES strings).</p>
<p>Prior unsupervised approaches like <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Seq2seq Fingerprint</a> used an encoder-decoder architecture to learn representations from unlabeled SMILES, but the decoder acts as scaffolding that consumes GPU memory during pre-training without contributing to downstream prediction. The semi-supervised Seq3seq Fingerprint improved on this by incorporating labeled data, but retained the encoder-decoder inefficiency. RNN-based methods also suffer from difficulty in parallel training and require careful tuning (gradient clipping, early stopping) to converge.</p>
<p>The authors identify two motivations: (1) building a semi-supervised model that effectively leverages large pools of unlabeled SMILES to improve prediction with limited labels, and (2) designing an architecture where the entire pre-trained model participates in fine-tuning (no wasted decoder parameters) and naturally supports parallel training.</p>
<h2 id="masked-smiles-recovery-with-transformer-encoders">Masked SMILES Recovery with Transformer Encoders</h2>
<p>The core innovation is the Masked SMILES Recovery pre-training task, directly analogous to BERT&rsquo;s masked language modeling. The model architecture is a stack of Transformer encoder layers, making it fully attention-based (no recurrence) and readily parallelizable.</p>
<h3 id="architecture">Architecture</h3>
<p>SMILES-BERT uses 6 Transformer encoder layers, each with 4-head multi-head self-attention and feed-forward dimension of 1024. Each Transformer layer contains three components: a pre-attention feed-forward network, a self-attention layer, and a post-attention feed-forward network, all followed by layer normalization with residual connections.</p>
<p>The self-attention mechanism uses scaled dot-product attention:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^{Q})(XW^{K})^{T}}{\sqrt{d_{k}}}\right) XW^{V}
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{M \times d_{k}}$ are the query, key, and value weight matrices, and $\sqrt{d_{k}}$ is the scaling factor.</p>
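<p>A minimal NumPy sketch of the single-head attention in the equation above, using toy dimensions (the variable names and sizes here are illustrative, not taken from the paper):</p>

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Z = softmax((X Wq)(X Wk)^T / sqrt(d_k)) (X Wv)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Wk.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax with max-subtraction for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, M, d_k = 5, 8, 4                 # 5 tokens, 8-dim inputs, 4-dim head
X = rng.normal(size=(N, M))
Wq, Wk, Wv = (rng.normal(size=(M, d_k)) for _ in range(3))
Z = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (5, 4)
```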
<p>Input SMILES are tokenized at the character level with token embeddings and positional embeddings. A special <code>&lt;GO&gt;</code> token is prepended to each SMILES, and its output representation is used for downstream classification/regression after fine-tuning.</p>
<h3 id="pre-training-masked-smiles-recovery">Pre-training: Masked SMILES Recovery</h3>
<p>Following BERT&rsquo;s masking strategy, 15% of tokens in each SMILES are selected for masking (minimum one per SMILES). Of the selected tokens:</p>
<ul>
<li>85% are replaced with a <code>&lt;MASK&gt;</code> token</li>
<li>10% are replaced with a random token from the vocabulary</li>
<li>5% are kept unchanged</li>
</ul>
<p>The model is trained to recover the original tokens at masked positions. The loss is computed only on the masked token outputs.</p>
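<p>The masking procedure described above can be sketched like so. This is a hedged reconstruction from the stated percentages (15% selection; 85/10/5 corruption split), not the authors&rsquo; unreleased code; the function and variable names are hypothetical.</p>

```python
import random

MASK = "<MASK>"

def mask_smiles(tokens, vocab, rng):
    """Select 15% of tokens (minimum one); of those, replace 85% with
    <MASK>, 10% with a random vocabulary token, and keep 5% unchanged.
    Returns the corrupted sequence plus (position, original) targets."""
    n_select = max(1, round(0.15 * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted, targets = list(tokens), []
    for pos in positions:
        targets.append((pos, tokens[pos]))
        r = rng.random()
        if r < 0.85:
            corrupted[pos] = MASK
        elif r < 0.95:
            corrupted[pos] = rng.choice(vocab)
        # else: token kept unchanged (5% of selections)
    return corrupted, targets

rng = random.Random(0)
tokens = list("CC(=O)Oc1ccccc1C(=O)O")    # aspirin, character-level
corrupted, targets = mask_smiles(tokens, vocab=list("CNO=()1c"), rng=rng)
```

The loss would then be computed only at the positions recorded in <code>targets</code>, matching the recovery objective described above.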
<h3 id="fine-tuning">Fine-tuning</h3>
<p>After pre-training, a classifier or regressor head is added to the <code>&lt;GO&gt;</code> token output. The entire model (all Transformer layers plus the new head) is fine-tuned on the labeled dataset.</p>
<p>Key differences from the original BERT:</p>
<ol>
<li>Only the Masked SMILES Recovery task is used (BERT&rsquo;s next sentence prediction is dropped since SMILES have no consecutive-sentence structure)</li>
<li>Segment embeddings are removed</li>
<li>The architecture is smaller (6 layers, 4 heads, 1024 FFN dim) since SMILES have a much smaller vocabulary and shorter sequences than natural language</li>
</ol>
<p>The authors compared this configuration against a larger BERT-base setup (12 layers, 12 heads, 3072 FFN dim) and found no meaningful performance difference, confirming that the smaller model is sufficient for SMILES.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>SMILES-BERT was pre-trained on the <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC database</a> with 18,671,355 training SMILES, 10,000 for validation, and 10,000 for evaluation. Pre-training ran for 10 epochs using the Adam optimizer with a warm-up strategy (learning rate from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay). Batch size was 256 and dropout was 0.1. The pre-training masked SMILES exact recovery rate reached 82.85% on the validation set.</p>
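<p>One plausible reading of the warm-up schedule above (linear ramp from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay) is the following sketch; the paper does not give the exact formula, so this is an assumption in the Noam-schedule style:</p>

```python
import math

WARMUP, LR_MAX, LR_MIN = 4000, 1e-4, 1e-9

def learning_rate(step):
    """Linear warm-up to LR_MAX, then decay proportional to 1/sqrt(step)."""
    if step <= WARMUP:
        return LR_MIN + (LR_MAX - LR_MIN) * step / WARMUP
    return LR_MAX * math.sqrt(WARMUP / step)

print(learning_rate(4000))   # 1e-4 at the end of warm-up
print(learning_rate(16000))  # 5e-05 after 4x more steps
```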
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a></td>
          <td>NCATS/NIH</td>
          <td>10,850</td>
          <td>Classification (threshold 1.88)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PM2</td>
          <td>NCATS/NIH</td>
          <td>323,242</td>
          <td>Classification (threshold 0.024896)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PCBA-686978</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></td>
          <td>302,175</td>
          <td>Classification</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p>All datasets were split 80/10/10 for train/validation/test. Fine-tuning used Adam with a fixed learning rate for 50 epochs, selecting the best model on validation data.</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Circular Fingerprint (CircularFP)</strong>: Manually designed hash-based fingerprint (ECFP family)</li>
<li><strong>Neural Fingerprint (NeuralFP)</strong>: Graph-based neural network replacing hash functions with learned layers</li>
<li><strong>Seq2seq Fingerprint (Seq2seqFP)</strong>: Unsupervised encoder-decoder model on SMILES</li>
<li><strong>Seq3seq Fingerprint (Seq3seqFP)</strong>: Semi-supervised encoder-decoder model on SMILES</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP</th>
          <th>PM2</th>
          <th>PCBA-686978</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CircularFP</td>
          <td>~0.90</td>
          <td>0.6858</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>NeuralFP</td>
          <td>~0.90</td>
          <td>0.6802</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>Seq2seqFP</td>
          <td>~0.87</td>
          <td>0.6112</td>
          <td>~0.80</td>
      </tr>
      <tr>
          <td>Seq3seqFP</td>
          <td>~0.90</td>
          <td>0.7038</td>
          <td>~0.84</td>
      </tr>
      <tr>
          <td><strong>SMILES-BERT</strong></td>
          <td><strong>0.9154</strong></td>
          <td><strong>0.7589</strong></td>
          <td><strong>0.8784</strong></td>
      </tr>
  </tbody>
</table>
<p>SMILES-BERT outperformed all baselines on all three datasets. The improvement over Seq3seqFP was approximately 2% on LogP, 5.5% on PM2, and 3.8% on PCBA-686978. The results on PM2 (the largest labeled dataset) show that pre-training benefits persist even with substantial labeled data.</p>
<h3 id="structure-study">Structure Study</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>FFN Dim</th>
          <th>LogP Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES-BERT</td>
          <td>6</td>
          <td>4</td>
          <td>1024</td>
          <td>0.9154</td>
      </tr>
      <tr>
          <td>SMILES-BERT (large)</td>
          <td>12</td>
          <td>12</td>
          <td>3072</td>
          <td>0.9147</td>
      </tr>
  </tbody>
</table>
<p>The larger configuration provided no improvement, supporting the choice of the smaller, more efficient architecture.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>SMILES-BERT demonstrated that BERT-style masked pre-training on SMILES strings produces transferable molecular representations that improve property prediction across datasets of varying sizes and property types.</p>
<p>Key findings:</p>
<ul>
<li>The Masked SMILES Recovery pre-training task transfers effectively to molecular property prediction</li>
<li>The full model participates in fine-tuning (no wasted decoder), making SMILES-BERT more parameter-efficient than encoder-decoder alternatives</li>
<li>A smaller Transformer configuration (6 layers, 4 heads) matches the performance of a BERT-base-sized model for SMILES data</li>
<li>Pre-training on ~18.7M SMILES from ZINC provides robust initialization across different downstream tasks</li>
</ul>
<p><strong>Limitations</strong>: The evaluation uses only classification accuracy as the metric, without reporting AUC-ROC, F1, or other metrics common in molecular property prediction. The comparison is limited to four baselines, and two of the three evaluation datasets (LogP, PM2) are non-public NIH datasets. The paper does not explore different pre-training dataset sizes or ablate the masking strategy. Only classification tasks are evaluated, though the architecture supports regression.</p>
<p><strong>Future work</strong>: The authors propose incorporating Quantitative Estimate of Druglikeness (QED) prediction as an additional pre-training task to warm up the model&rsquo;s classification capability, analogous to BERT&rsquo;s next sentence prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC</td>
          <td>18,671,355 SMILES</td>
          <td>Publicly available database</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LogP</td>
          <td>10,850</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PM2</td>
          <td>323,242</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PCBA-686978</td>
          <td>302,175</td>
          <td>Public, from PubChem BioAssay</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Adam optimizer, warm-up for 4,000 steps ($10^{-9}$ to $10^{-4}$), inverse-square-root LR schedule, batch size 256, dropout 0.1, 10 epochs</li>
<li>Fine-tuning: Adam optimizer, fixed LR (insensitive to choice among $10^{-5}$, $10^{-6}$, $10^{-7}$), 50 epochs, best model on validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>6 Transformer encoder layers, 4-head multi-head attention, FFN dim 1024</li>
<li>Token embedding + positional embedding, <code>&lt;GO&gt;</code> special token</li>
<li>Implemented with FairSeq (Facebook AI Research Sequence-to-Sequence Toolkit)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILES-BERT</th>
          <th>Best Baseline (Seq3seqFP)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP Accuracy</td>
          <td>0.9154</td>
          <td>~0.90</td>
          <td>~2% improvement</td>
      </tr>
      <tr>
          <td>PM2 Accuracy</td>
          <td>0.7589</td>
          <td>0.7038</td>
          <td>~5.5% improvement</td>
      </tr>
      <tr>
          <td>PCBA Accuracy</td>
          <td>0.8784</td>
          <td>~0.84</td>
          <td>~3.8% improvement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions GPU training and an NVIDIA GPU donation in the acknowledgments but does not specify the exact GPU model; it notes only that pre-training for 10 epochs on a single GPU takes over a week.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No public code or model release identified</td>
          <td>-</td>
          <td>-</td>
          <td>Paper does not provide a GitHub link or model checkpoint</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The ZINC pre-training data is public and the architecture is described in detail, but no code or pre-trained weights are released. Two of three evaluation datasets (LogP, PM2) are non-public.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, S., Guo, Y., Wang, Y., Sun, H., &amp; Huang, J. (2019). SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In <em>Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB &lsquo;19)</em>, 429-436. <a href="https://doi.org/10.1145/3307339.3342186">https://doi.org/10.1145/3307339.3342186</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2019smilesbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Sheng and Guo, Yuzhi and Wang, Yuhong and Sun, Hongmao and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{429--436}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3307339.3342186}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Transformer: Low-Data Molecular Fingerprints</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-transformer/</guid><description>SMILES Transformer uses unsupervised Transformer pre-training on SMILES strings to produce molecular fingerprints that excel in low-data drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="a-transformer-approach-to-learned-molecular-fingerprints">A Transformer Approach to Learned Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Transformer (ST), a Transformer-based sequence-to-sequence model pre-trained on unlabeled SMILES strings to produce continuous, data-driven molecular fingerprints. The primary contribution is demonstrating that unsupervised pre-training on chemical text representations yields fingerprints that generalize well under low-data conditions, outperforming both rule-based fingerprints (ECFP) and graph convolution models on several <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. A secondary contribution is the Data Efficiency Metric (DEM), a scalar metric for evaluating model performance across varying training set sizes.</p>
<h2 id="the-low-data-problem-in-molecular-property-prediction">The Low-Data Problem in Molecular Property Prediction</h2>
<p>Machine learning for drug discovery depends on molecular representations, but labeled datasets of experimentally validated properties are typically small. Conventional approaches fall into two camps: rule-based fingerprints like ECFP that hash substructures into sparse binary vectors, and graph-based methods like GraphConv that learn representations end-to-end. Rule-based fingerprints perform poorly with shallow models or limited data, while graph-based methods are designed for large fully-labeled settings.</p>
<p>Pre-training on unlabeled data had shown strong results in NLP (ELMo, BERT, XLNet), and prior work in cheminformatics had explored RNN-based and VAE-based pre-training on SMILES (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Seq2Seq fingerprints</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>, heteroencoders). However, none of these studies systematically evaluated performance in small-data settings. Honda et al. fill this gap by applying Transformer-based pre-training to SMILES and measuring data efficiency explicitly.</p>
<h2 id="transformer-pre-training-on-smiles-with-pooled-fingerprint-extraction">Transformer Pre-training on SMILES with Pooled Fingerprint Extraction</h2>
<p>The core innovation is a Transformer encoder-decoder architecture pre-trained as an autoencoder on SMILES strings, with a specific fingerprint extraction strategy that pools the encoder outputs into a fixed-length vector.</p>
<h3 id="architecture">Architecture</h3>
<p>The model uses 4 Transformer blocks for both the encoder and decoder, each with 4-head attention and 256 embedding dimensions plus 2 linear layers. Input SMILES are tokenized at the symbol level (e.g., &lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;) and one-hot encoded. Following Vaswani et al. (2017), the input uses the sum of token encoding and positional encoding.</p>
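<p>Symbol-level tokenization means multi-character atom symbols such as <code>Br</code> and <code>Cl</code> are kept as single tokens. A common regex-based sketch (my assumption; the paper does not publish its tokenizer, and a full pattern would enumerate more multi-character symbols) looks like this:</p>

```python
import re

# Multi-character symbols must precede the catch-all '.' so that
# 'Br' is matched as one token rather than split into 'B' and 'r'.
TOKEN_RE = re.compile(r"Cl|Br|\[[^\]]*\]|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("CC(=O)Br"))  # ['C', 'C', '(', '=', 'O', ')', 'Br']
```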
<h3 id="pre-training">Pre-training</h3>
<p>The model is pre-trained on 861,000 unlabeled SMILES sampled from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL24</a> to minimize cross-entropy between input and output SMILES (i.e., reconstruction). <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES enumeration</a> (Bjerrum, 2017) randomly generates non-canonical SMILES at each epoch to reduce representation bias. Training runs for 5 epochs with Adam optimization, reaching a perplexity of 1.0 (perfect decoding).</p>
<h3 id="fingerprint-extraction">Fingerprint Extraction</h3>
<p>Since the Transformer outputs symbol-level (atom-level) representations, a pooling strategy produces molecule-level fingerprints. Four vectors are concatenated:</p>
<ol>
<li>Mean-pooled output of the last encoder layer</li>
<li>Max-pooled output of the last encoder layer</li>
<li>First output token of the last encoder layer</li>
<li>First output token of the penultimate encoder layer</li>
</ol>
<p>This produces a 1024-dimensional fingerprint, matching the dimensionality of ECFP for fair comparison.</p>
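<p>The four-way pooling above can be sketched in NumPy. With the paper&rsquo;s 256-dim encoder, concatenating four 256-dim vectors yields the 1024-dim fingerprint; the random arrays here merely stand in for real encoder outputs.</p>

```python
import numpy as np

def extract_fingerprint(last, penultimate):
    """Concatenate: mean-pool and max-pool of the last encoder layer,
    plus the first output token of the last and penultimate layers."""
    return np.concatenate([
        last.mean(axis=0),
        last.max(axis=0),
        last[0],
        penultimate[0],
    ])

rng = np.random.default_rng(0)
seq_len, d = 30, 256                      # 30 SMILES symbols, 256-dim model
last = rng.normal(size=(seq_len, d))      # stand-in encoder outputs
penultimate = rng.normal(size=(seq_len, d))
fp = extract_fingerprint(last, penultimate)
print(fp.shape)  # (1024,)
```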
<h3 id="data-efficiency-metric">Data Efficiency Metric</h3>
<p>The paper proposes DEM to measure how well a model performs across different training set sizes:</p>
<p>$$
M_{DE}(f, m) = \frac{1}{|I|} \sum_{i \in I} m(f_i, X_i, Y_i)
$$</p>
<p>where $f_i$ is the model trained on the fraction $i$ of training data, $m$ is the task metric, and $I = {0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}$ doubles the training percentage at each step. This captures average performance across a range of data availability, giving a single scalar that balances accuracy and data efficiency.</p>
<h2 id="benchmarking-across-moleculenet-with-data-efficiency-focus">Benchmarking Across MoleculeNet with Data Efficiency Focus</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation uses 10 datasets from MoleculeNet spanning three categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Molecules</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>Regression</td>
          <td>1,128</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>Regression</td>
          <td>643</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>1</td>
          <td>Regression</td>
          <td>4,200</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>MUV</td>
          <td>17</td>
          <td>Classification</td>
          <td>93,127</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>HIV</td>
          <td>1</td>
          <td>Classification</td>
          <td>41,913</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>BACE</td>
          <td>1</td>
          <td>Classification</td>
          <td>1,522</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>Classification</td>
          <td>2,053</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>Tox21</td>
          <td>12</td>
          <td>Classification</td>
          <td>8,014</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>SIDER</td>
          <td>27</td>
          <td>Classification</td>
          <td>1,427</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>ClinTox</td>
          <td>2</td>
          <td>Classification</td>
          <td>1,491</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>ECFP4</strong>: Rule-based extended-connectivity fingerprint with 1024 dimensions</li>
<li><strong>RNNS2S</strong>: RNN-based Seq2Seq pre-trained fingerprint (3-layer bidirectional GRU, same pre-training data as ST)</li>
<li><strong>GraphConv</strong>: Graph convolution network trained end-to-end on labeled data</li>
</ul>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>All fingerprint methods use a simple MLP classifier/regressor from scikit-learn with default hyperparameters, isolating fingerprint quality from model capacity. Datasets are randomly split (stratified for classification), and results are averaged over 20 trials. Note that random splits, rather than scaffold splits, are used for the data-efficiency metric (DEM) experiments.</p>
<h3 id="data-efficiency-results-dem">Data Efficiency Results (DEM)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ST+MLP</th>
          <th>ECFP+MLP</th>
          <th>RNNS2S+MLP</th>
          <th>GraphConv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL (RMSE, lower is better)</td>
          <td><strong>1.144</strong></td>
          <td>1.741</td>
          <td>1.317</td>
          <td>1.673</td>
      </tr>
      <tr>
          <td>FreeSolv (RMSE, lower is better)</td>
          <td><strong>2.246</strong></td>
          <td>3.043</td>
          <td>2.987</td>
          <td>3.476</td>
      </tr>
      <tr>
          <td>Lipophilicity (RMSE, lower is better)</td>
          <td>1.169</td>
          <td>1.090</td>
          <td>1.219</td>
          <td><strong>1.062</strong></td>
      </tr>
      <tr>
          <td>MUV (PRC-AUC, higher is better)</td>
          <td>0.009</td>
          <td><strong>0.036</strong></td>
          <td>0.010</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>HIV (ROC-AUC, higher is better)</td>
          <td>0.683</td>
          <td>0.697</td>
          <td>0.682</td>
          <td><strong>0.723</strong></td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC, higher is better)</td>
          <td>0.719</td>
          <td><strong>0.769</strong></td>
          <td>0.717</td>
          <td>0.744</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC, higher is better)</td>
          <td><strong>0.900</strong></td>
          <td>0.760</td>
          <td>0.884</td>
          <td>0.795</td>
      </tr>
      <tr>
          <td>Tox21 (ROC-AUC, higher is better)</td>
          <td><strong>0.706</strong></td>
          <td>0.616</td>
          <td>0.702</td>
          <td>0.687</td>
      </tr>
      <tr>
          <td>SIDER (ROC-AUC, higher is better)</td>
          <td>0.559</td>
          <td><strong>0.588</strong></td>
          <td>0.558</td>
          <td>0.557</td>
      </tr>
      <tr>
          <td>ClinTox (ROC-AUC, higher is better)</td>
          <td><strong>0.963</strong></td>
          <td>0.515</td>
          <td>0.904</td>
          <td>0.936</td>
      </tr>
  </tbody>
</table>
<p>ST achieves the best DEM in 5 of 10 datasets (ESOL, FreeSolv, BBBP, Tox21, ClinTox), with particularly strong margins on ClinTox (+0.027 over GraphConv) and BBBP (+0.016 over RNNS2S).</p>
<h3 id="linear-model-experiments">Linear Model Experiments</h3>
<p>To further isolate fingerprint quality, the authors replace the MLP with ridge regression (for regression tasks) and L2-penalized logistic regression (for classification tasks). On 8 datasets (excluding MUV and SIDER due to class imbalance), ST achieves the best DEM on 5 of 8, confirming that the fingerprint quality holds regardless of the downstream model.</p>
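<p>The linear-probe idea, measuring fingerprint quality through a model with almost no capacity of its own, can be illustrated with closed-form ridge regression on precomputed features (synthetic stand-in data; all names here are ours, not from the paper):</p>

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Stand-in "fingerprints" and a linear property they encode well.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # 200 molecules, 16-dim fingerprints
w_true = rng.normal(size=16)
y = X @ w_true + 0.01 * rng.normal(size=200)
w = ridge_fit(X, y, alpha=1e-3)         # recovers w_true almost exactly
```

<p>If a fingerprint encodes a property (nearly) linearly, even this probe recovers it; a fingerprint that needs a deep network to be useful will score poorly here.</p>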
<h3 id="stratified-analysis-by-molecule-size">Stratified Analysis by Molecule Size</h3>
<p>On BBBP stratified by SMILES length, ST&rsquo;s ROC-AUC increases with longer SMILES, similar to RNNS2S but unlike GraphConv, which is stable across lengths. This suggests text-based models extract richer information from longer sequences.</p>
<h3 id="comparison-with-record-scores-large-data">Comparison with Record Scores (Large Data)</h3>
<p>Under the large-data setting (80/10/10 train/val/test split with hyperparameter tuning via Optuna), ST achieves first place only in ClinTox (0.954) but performs comparably to ECFP and graph-based models on the other datasets. This confirms that ST&rsquo;s main advantage is in the low-data regime.</p>
<h2 id="strong-low-data-performance-with-caveats-on-scalability">Strong Low-Data Performance with Caveats on Scalability</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Transformer-based unsupervised pre-training on SMILES produces fingerprints that excel in low-data molecular property prediction, achieving best data efficiency on 5 of 10 MoleculeNet tasks.</li>
<li>The advantage is most pronounced on small datasets (ESOL with 1,128 molecules, FreeSolv with 643, BBBP with 2,053, ClinTox with 1,491) where pre-training enables good generalization.</li>
<li>With sufficient labeled data and hyperparameter tuning, ST fingerprints perform comparably to (but do not surpass) graph-based methods.</li>
<li>Longer SMILES provide richer information for text-based models, as shown by the stratified analysis on BBBP.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Random splits are used for most DEM experiments rather than scaffold splits, which may inflate performance estimates for drug discovery applications where training and test molecules are structurally distinct.</li>
<li>The pre-training corpus (861K SMILES from ChEMBL24) is relatively small by modern standards.</li>
<li>MUV performance is poor across all methods (PRC-AUC near zero), suggesting the DEM framework may not be informative for extremely imbalanced or noisy datasets.</li>
<li>No comparison with BERT-style masked language model pre-training, which later work (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>) would show as a viable alternative.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose three directions: (1) replacing the Transformer with Transformer-XL to handle longer SMILES, (2) multi-task pre-training that jointly predicts molecular descriptors (e.g., molecular weight, <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>) alongside SMILES reconstruction, and (3) better exploitation of enumerated SMILES to constrain the latent space.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL24</td>
          <td>861,000 SMILES</td>
          <td>Unlabeled, randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (10 datasets)</td>
          <td>643 to 93,127 molecules</td>
          <td>See Table 1 for per-dataset details</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder: 4 blocks each, 4-head attention, 256 embedding dimensions</li>
<li>Pre-training: 5 epochs, Adam optimizer, cross-entropy loss, SMILES enumeration for augmentation</li>
<li>Fingerprint: 1024 dimensions from concatenated mean pool, max pool, and first-token outputs</li>
<li>Downstream: scikit-learn MLP (default hyperparameters) for DEM experiments; ridge/logistic regression for linear model experiments; Optuna for hyperparameter search in large-data comparison</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DSPsleeporg/smiles-transformer">smiles-transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Jupyter notebooks)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>DEM averaged over 7 training fractions (1.25% to 80%), 20 trials each</li>
<li>Random splits for DEM; scaffold splits for HIV, BACE, BBBP in large-data comparison</li>
<li>Metrics: RMSE (regression), ROC-AUC or PRC-AUC (classification) per MoleculeNet conventions</li>
</ul>
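<p>As a concrete reading of the protocol above, the DEM reduces to a mean over all fraction/trial runs. A minimal sketch (pure Python; the doubling schedule of fractions and all names are our illustration, not taken from the paper):</p>

```python
def data_efficiency_metric(evaluate,
                           fractions=(0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8),
                           n_trials=20):
    """Average a scoring function over 7 training fractions x 20 random trials.

    `evaluate(fraction, seed)` is assumed to train on that fraction of the
    training data and return the held-out metric (ROC-AUC, PRC-AUC, or RMSE).
    """
    scores = [evaluate(f, seed) for f in fractions for seed in range(n_trials)]
    return sum(scores) / len(scores)

# Toy scorer whose held-out metric improves with more training data.
dem = data_efficiency_metric(lambda frac, seed: 0.5 + 0.4 * frac)
```

<p>A single scalar per dataset then summarizes low-data behavior, which is how the DEM tables above compare methods.</p>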
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU type or training time for the pre-training phase.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Honda, S., Shi, S., &amp; Ueda, H. R. (2019). SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. <em>arXiv preprint arXiv:1911.04738</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{honda2019smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Honda, Shion and Shi, Shoi and Ueda, Hiroki R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1911.04738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI-TED: Encoder-Decoder Foundation Models for Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smi-ted-encoder-decoder-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smi-ted-encoder-decoder-chemistry/</guid><description>SMI-TED is a family of encoder-decoder transformer models pre-trained on 91M PubChem molecules for molecular property prediction and generation.</description><content:encoded><![CDATA[<h2 id="an-encoder-decoder-chemical-foundation-model-family">An Encoder-Decoder Chemical Foundation Model Family</h2>
<p>SMI-TED is a <strong>Method</strong> paper that introduces a family of encoder-decoder transformer-based foundation models for chemistry. The primary contribution is the SMI-TED289M architecture, a 289-million parameter model pre-trained on 91 million curated SMILES from PubChem, along with a Mixture-of-Experts variant (MoE-OSMI) that scales to 8x289M parameters. The models support molecular property prediction, molecule reconstruction, reaction yield prediction, and few-shot reasoning over molecular embeddings. All model weights and code are open-sourced under an Apache 2.0 license.</p>
<h2 id="bridging-encoding-and-decoding-for-molecular-representations">Bridging Encoding and Decoding for Molecular Representations</h2>
<p>Chemical language models based on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> have gained traction for molecular property prediction and generation. Most existing models, such as <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, are encoder-only architectures that produce molecular embeddings through mean pooling. While effective for downstream classification and regression, this encoder-only approach has a limitation: mean pooling has no natural inverse, meaning the model cannot reconstruct the input molecule from its latent representation. This restricts the model&rsquo;s utility for generative tasks and limits the interpretability of the learned latent space.</p>
<p>The authors argue that adding a decoder with a reconstruction objective forces the model to encode a more complete set of structural features. Prior work has shown that the quality of pre-training data matters more than the choice of SMILES vs. <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, and that large-scale pre-training can yield useful chemical representations. SMI-TED builds on these observations by combining an encoder-decoder architecture with a carefully curated 91-million molecule dataset from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>.</p>
<h2 id="invertible-pooling-and-two-phase-pre-training">Invertible Pooling and Two-Phase Pre-Training</h2>
<p>The core architectural innovation in SMI-TED is a learned pooling mechanism that replaces standard mean or max pooling with an invertible projection. Given token embeddings $\mathbf{x} \in \mathbb{R}^{D \times L}$ (where $D = 202$ is the maximum token count and $L = 768$ is the embedding dimension), the submersion into the latent space $\mathbf{z} \in \mathbb{R}^{L}$ is computed as:</p>
<p>$$
\mathbf{z} = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{W}_1^T \mathbf{x} + \mathbf{b}_1\right)\right)\right) \mathbf{W}_2
$$</p>
<p>where $\mathbf{W}_1 \in \mathbb{R}^{D \times L}$, $\mathbf{b}_1 \in \mathbb{R}^{L}$, and $\mathbf{W}_2 \in \mathbb{R}^{L \times L}$. The immersion (inverse mapping) back to the token space is:</p>
<p>$$
\tilde{\mathbf{x}}^T = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{z} \mathbf{W}_3 + \mathbf{b}_3\right)\right)\right) \mathbf{W}_4
$$</p>
<p>where $\mathbf{W}_3 \in \mathbb{R}^{L \times L}$, $\mathbf{b}_3 \in \mathbb{R}^{L}$, and $\mathbf{W}_4 \in \mathbb{R}^{L \times D}$. A decoder language model then predicts the next token from $\tilde{\mathbf{x}}$.</p>
<p>The encoder uses a modified RoFormer attention mechanism with rotary position embeddings:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>where $R_m$ are position-dependent rotation matrices and $\varphi$ is a random feature map.</p>
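<p>Dropping the rotary matrices $R_m$ and substituting the common elu+1 feature map for the paper&rsquo;s random feature map $\varphi$ (both simplifications are ours), the kernelized attention above can be sketched in numpy; the point is that it is linear, not quadratic, in sequence length:</p>

```python
import numpy as np

def phi(a):
    # Positive feature map (elu + 1); a stand-in for the paper's random features.
    return np.where(a > 0, a + 1.0, np.exp(a))

def linear_attention(Q, K, V):
    """out_m = sum_n <phi(q_m), phi(k_n)> v_n / sum_n <phi(q_m), phi(k_n)>."""
    Qf, Kf = phi(Q), phi(K)                 # (N, d) feature-mapped queries/keys
    KV = Kf.T @ V                           # (d, d_v): shared across all queries
    Z = Kf.sum(axis=0)                      # (d,): normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]    # O(N * d * d_v), linear in N
```

<p>Because the feature map is strictly positive, each output row is a convex combination of the value rows, mirroring standard softmax attention.</p>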
<p><strong>Two-phase pre-training strategy:</strong></p>
<ul>
<li><strong>Phase 1</strong>: The token encoder is pre-trained on 95% of the data using masked language modeling (15% token selection, of which 80% masked, 10% random, 10% unchanged). The remaining 5% trains the encoder-decoder layer, preventing convergence issues from unstable early embeddings.</li>
<li><strong>Phase 2</strong>: After the token embeddings converge, both the encoder and decoder train on 100% of the data jointly.</li>
</ul>
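<p>The Phase 1 corruption scheme is standard BERT-style masking. A minimal pure-Python sketch (the toy vocabulary and function name are ours):</p>

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, p_select=0.15, seed=0):
    """Select 15% of tokens; of those, 80% -> [MASK], 10% -> random vocab
    token, 10% -> left unchanged. Returns (corrupted, labels) where labels
    hold the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    vocab = vocab or ["C", "N", "O", "c", "1", "(", ")", "="]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p_select:
            labels.append(tok)            # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

<p>The loss is computed only at positions with a non-None label, so unselected tokens contribute nothing to the MLM objective.</p>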
<p><strong><a href="https://en.wikipedia.org/wiki/Mixture_of_experts">Mixture-of-Experts</a> (MoE-OSMI):</strong> The MoE variant composes 8 fine-tuned SMI-TED289M expert models with a gating network. Given an input embedding $x$, the output is:</p>
<p>$$
y = \sum_{i=1}^{n} G(x)_i E_i(\hat{x})
$$</p>
<p>where $G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$ selects the top $k = 2$ experts per input, setting all other gate values to zero.</p>
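<p>A minimal numpy sketch of this sparse gating (softmax restricted to the top-$k$ logits, all other gates zero; names are illustrative):</p>

```python
import numpy as np

def moe_forward(x, experts, W_g, k=2):
    """Sparse mixture-of-experts: route x through the top-k experts only."""
    logits = x @ W_g                       # one gate logit per expert
    top = np.argsort(logits)[-k:]          # indices of the k largest logits
    gates = np.zeros_like(logits)
    z = np.exp(logits[top] - logits[top].max())
    gates[top] = z / z.sum()               # softmax over the selected experts
    return sum(gates[i] * experts[i](x) for i in top)
```

<p>Because only $k = 2$ of the 8 experts run per input, inference cost grows far more slowly than the total parameter count.</p>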
<h2 id="benchmarks-across-property-prediction-generation-and-reaction-yield">Benchmarks Across Property Prediction, Generation, and Reaction Yield</h2>
<h3 id="moleculenet-classification-6-datasets-roc-auc"><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification (6 datasets, ROC-AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BBBP</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
          <th>Tox21</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>73.6 +/- 0.8</td>
          <td>91.2 +/- 1.4</td>
          <td>80.5 +/- 1.65</td>
          <td>86.3 +/- 0.6</td>
          <td>65.5 +/- 0.2</td>
          <td>80.46 +/- 0.2</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>72.9 +/- 0.6</td>
          <td>91.9 +/- 1.8</td>
          <td>80.8 +/- 0.3</td>
          <td>85.7 +/- 0.2</td>
          <td>65.9 +/- 1.3</td>
          <td>79.6 +/- 0.5</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>72.4 +/- 0.4</td>
          <td>90.1 +/- 1.3</td>
          <td>80.6 +/- 0.9</td>
          <td>85.6 +/- 1.1</td>
          <td>67.2 +/- 0.4</td>
          <td>78.1 +/- 0.1</td>
      </tr>
      <tr>
          <td>SMI-TED289M (pre-trained)</td>
          <td>91.46 +/- 0.47</td>
          <td>93.49 +/- 0.85</td>
          <td>80.51 +/- 1.34</td>
          <td>85.58 +/- 0.92</td>
          <td>66.01 +/- 0.88</td>
          <td>81.53 +/- 0.45</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>92.26 +/- 0.57</strong></td>
          <td><strong>94.27 +/- 1.83</strong></td>
          <td>76.85 +/- 0.89</td>
          <td><strong>88.24 +/- 0.50</strong></td>
          <td>65.68 +/- 0.45</td>
          <td><strong>81.85 +/- 1.42</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED achieves the best results in 4 of 6 classification tasks. Notably, the pre-trained version (without fine-tuning) already matches or exceeds many baselines on BBBP, ClinTox, and Tox21.</p>
<h3 id="moleculenet-regression-5-datasets-mae-for-qm9qm8-rmse-for-esolfreesolvlipophilicity">MoleculeNet regression (5 datasets, MAE for QM9/QM8, RMSE for ESOL/FreeSolv/Lipophilicity)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>1.5894</td>
          <td>0.0102</td>
          <td>0.880</td>
          <td>2.342</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>3.241</td>
          <td>0.0143</td>
          <td>0.98</td>
          <td>2.18</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>1.3246</strong></td>
          <td><strong>0.0095</strong></td>
          <td><strong>0.6112</strong></td>
          <td><strong>1.2233</strong></td>
          <td><strong>0.5522</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED289M achieves the best results across all 5 regression tasks when fine-tuned. The improvements are substantial on ESOL (0.611 vs. 0.880 for MoLFormer) and FreeSolv (1.223 vs. 2.18 for D-MPNN).</p>
<h3 id="reaction-yield-prediction-buchwald-hartwig-c-n-cross-coupling">Reaction yield prediction (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling)</h3>
<p>The model was tested on Pd-catalyzed Buchwald-Hartwig reactions with 3,955 reactions across varying train/test splits. Selected $R^2$ results:</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Yield-BERT (Aug)</th>
          <th>DRFP</th>
          <th>SMI-TED289M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>70/30</td>
          <td>0.97</td>
          <td>0.95</td>
          <td><strong>0.984</strong></td>
      </tr>
      <tr>
          <td>10/90</td>
          <td>0.81</td>
          <td>0.81</td>
          <td><strong>0.961</strong></td>
      </tr>
      <tr>
          <td>2.5/97.5</td>
          <td>0.61</td>
          <td>0.62</td>
          <td><strong>0.875</strong></td>
      </tr>
      <tr>
          <td>Test 1-4 avg</td>
          <td>0.58</td>
          <td>0.71</td>
          <td><strong>0.983</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED shows particularly strong performance in low-data regimes. With only 2.5% training data, it achieves $R^2 = 0.875$, compared to 0.61-0.62 for competing methods.</p>
<h3 id="moses-molecular-generation-benchmarks"><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> molecular generation benchmarks</h3>
<p>SMI-TED is competitive with baselines including CharRNN, SMILES VAE, JT-VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/">LIMO</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen-7b</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/">GP-MoLFormer</a> on standard metrics (validity, uniqueness, novelty, FCD, internal diversity). It achieves superior scaffold cosine similarity (Scaf) and nearest-neighbor similarity (SNN) scores.</p>
<h3 id="latent-space-compositionality">Latent space compositionality</h3>
<p>Using six families of carbon chains ($\mathcal{F} = \{CC, CO, CN, CS, CF, CP\}$), the authors test whether the embedding space respects hierarchical distance structures. A linear regression on SMI-TED embeddings yields $R^2 = 0.99$ and $MSE = 0.002$, compared to $R^2 = 0.55$ and $MSE = 0.237$ for MoLFormer. This indicates that the SMI-TED latent space captures compositional chemical relationships far more faithfully.</p>
<p>For structure-property analysis on QM9, nitrogen-containing molecules represent 9.10% of the dataset but account for 32.81% of the top 10% by HOMO energy. In the SMI-TED latent space, these molecules cluster distinctly (<a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> of 2.82 vs. 4.28 for MoLFormer), suggesting the decoder objective encourages encoding of functional group information.</p>
<h2 id="strong-performance-with-a-compositional-latent-space">Strong Performance with a Compositional Latent Space</h2>
<p>SMI-TED289M demonstrates competitive or superior performance across molecular property prediction, reaction yield prediction, and molecular generation benchmarks. The key findings include:</p>
<ol>
<li><strong>Broad applicability</strong>: The single pre-trained model achieves strong results across classification (4/6 best), regression (5/5 best), reaction yield, and generation tasks.</li>
<li><strong>Low-data robustness</strong>: The pre-training on 91M molecules provides chemical knowledge that transfers well to small training sets, as shown by the reaction yield experiments where SMI-TED maintains $R^2 = 0.875$ with only 2.5% of the training data.</li>
<li><strong>Compositional embeddings</strong>: The encoder-decoder architecture produces a latent space where molecular similarity follows chemical intuition, with near-perfect linear relationships between functional group families ($R^2 = 0.99$).</li>
<li><strong>Structure-property capture</strong>: The reconstruction objective appears to enforce encoding of chemically meaningful features like nitrogen substituent effects on <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO</a> energy, outperforming encoder-only models in latent space organization.</li>
</ol>
<p><strong>Limitations</strong>: The paper evaluates on MoleculeNet benchmarks, which are well-studied but may not reflect performance on more diverse chemical tasks. The BBBP classification result (92.26) shows a large jump from prior methods (73.6 for MoLFormer), which is worth scrutinizing. The MoE variant is evaluated only in supplementary materials, and scaling behavior beyond 8 experts is not explored.</p>
<p><strong>Future directions</strong>: The authors note that compositionality of the learned representations suggests potential for reasoning applications, though they acknowledge that stronger claims require further studies following compositionality analysis methodologies from natural language processing. The model has been integrated into the dZiner agent for inverse molecular design.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (curated)</td>
          <td>91M molecules, 4B tokens</td>
          <td>Deduplicated, canonicalized, validity-checked</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (BBBP, ClinTox, HIV, BACE, SIDER, Tox21)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (QM9, QM8, ESOL, FreeSolv, Lipophilicity)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>MOSES</td>
          <td>1.94M molecules</td>
          <td>Train/test/scaffold test splits</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig HTE</td>
          <td>3,955 reactions</td>
          <td>3x 1536-well plates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Masked language modeling for token encoder (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li>Two-phase pre-training (95/5 split then 100% joint training)</li>
<li>RoFormer attention with rotary position embeddings</li>
<li>Vocabulary: 2,993 tokens (2,988 molecular + 5 special)</li>
<li>Maximum sequence length: 202 tokens (covers 99.4% of PubChem)</li>
<li>Learning rate: 1.6e-4, batch size: 288 molecules</li>
<li>40 epochs over the full PubChem corpus</li>
<li>10 random seeds per experiment for robustness</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Parameters</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMI-TED289M base</td>
          <td>289M</td>
          <td>47M</td>
          <td>242M</td>
          <td>12 layers, 12 attention heads, hidden size 768, dropout 0.2</td>
      </tr>
      <tr>
          <td>MoE-OSMI</td>
          <td>8x289M</td>
          <td>-</td>
          <td>-</td>
          <td>8 experts, top-k=2 routing, gating network</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC</li>
<li>Regression: MAE (QM9, QM8), RMSE (ESOL, FreeSolv, Lipophilicity)</li>
<li>Reaction yield: $R^2$</li>
<li>Generation: Validity, uniqueness, novelty, FCD, IntDiv, Scaf, SNN (MOSES metrics)</li>
<li>Latent space: Linear regression $R^2$, MSE, Davies-Bouldin index, t-SNE visualization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>24 NVIDIA V100 GPUs (16GB)</li>
<li>4 nodes with DDP (Distributed Data Parallel)</li>
<li>Pre-training: 40 epochs on 91M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/materials/tree/main/models/smi_ted">IBM/materials (smi_ted)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Training, fine-tuning scripts, Jupyter notebooks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/materials.smi-ted">ibm/materials.smi-ted</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.15603701">Zenodo archive</a></td>
          <td>Code + Data</td>
          <td>Apache-2.0</td>
          <td>Archival copy of scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., &amp; Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. <em>Communications Chemistry</em>, 8(1). <a href="https://doi.org/10.1038/s42004-025-01585-0">https://doi.org/10.1038/s42004-025-01585-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{soares2025smited,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{An open-source family of large encoder-decoder foundation models for chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Communications Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42004-025-01585-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Seq2seq Fingerprint: Unsupervised Molecular Embedding</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/</guid><description>Seq2seq fingerprint uses a GRU encoder-decoder trained on SMILES self-translation to produce unsupervised molecular embeddings for property prediction.</description><content:encoded><![CDATA[<h2 id="an-unsupervised-seq2seq-method-for-molecular-fingerprints">An Unsupervised Seq2seq Method for Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces seq2seq fingerprint, an unsupervised molecular embedding approach based on sequence-to-sequence learning. The core idea is to train a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> encoder-decoder network to translate <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings to themselves, then extract the intermediate fixed-length vector as a molecular fingerprint. These fingerprints are then used with standard supervised classifiers for downstream property prediction tasks such as solubility classification and promiscuity prediction.</p>
<h2 id="the-labeled-data-bottleneck-in-drug-discovery">The Labeled Data Bottleneck in Drug Discovery</h2>
<p>Machine learning approaches to molecular property prediction depend on fixed-length feature vectors as inputs. Traditional molecular fingerprints fall into two categories: hash-based methods like Extended-Connectivity Fingerprints (ECFP) that are fast but lossy and non-invertible, and biologist-guided local-feature fingerprints that require domain expertise and are task-specific. Supervised deep learning fingerprints (e.g., neural fingerprints) can learn representations from data but require large amounts of labeled data, which is expensive to obtain in drug discovery due to the cost of biological experiments.</p>
<p>The authors identify three limitations of existing approaches:</p>
<ol>
<li>Hash-based fingerprints discard information during the hashing process and cannot reconstruct the original molecule</li>
<li>Local-feature fingerprints require expert knowledge and generalize poorly across tasks</li>
<li>Supervised deep learning fingerprints are data-hungry and fail when labeled data is limited</li>
</ol>
<h2 id="self-translation-as-unsupervised-molecular-encoding">Self-Translation as Unsupervised Molecular Encoding</h2>
<p>The key insight is to adapt the <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> learning framework from machine translation (originally English-to-French) to molecular representation learning by setting both the input and output to the same SMILES string. Since the intermediate vector must contain enough information to reconstruct the original SMILES, it serves as a rich, task-agnostic molecular fingerprint.</p>
<p>The architecture consists of two components:</p>
<ul>
<li><strong>Perceiver network</strong>: A multi-layer GRU encoder that reads the SMILES string and compresses it into a fixed-length vector</li>
<li><strong>Interpreter network</strong>: A multi-layer GRU decoder that reconstructs the original SMILES from the fingerprint vector</li>
</ul>
<p>The GRU cell computes a sequence of outputs $(s_1, \ldots, s_T)$ from input sequences $(x_1, \ldots, x_T)$ by iterating:</p>
<p>$$
z_t = \sigma_g(W_z x_t + U_z s_{t-1} + b_z)
$$</p>
<p>$$
r_t = \sigma_g(W_r x_t + U_r s_{t-1} + b_r)
$$</p>
<p>$$
h_t = \tanh(W_h x_t + U_h(s_{t-1} \circ r_t))
$$</p>
<p>$$
s_t = (1 - z_t) \circ h_t + z_t \circ s_{t-1}
$$</p>
<p>where $\sigma_g$ is the logistic sigmoid, $z_t$ is the update gate, $r_t$ is the reset gate, $h_t$ is the candidate state, $\circ$ denotes element-wise multiplication, and the $W$, $U$, $b$ matrices are trainable parameters.</p>
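<p>A minimal scalar sketch of one GRU update step, following the standard equations above (the dictionary-of-scalars parameterization is purely illustrative, not the paper's implementation):</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, s_prev, W, U, b):
    # Update gate z, reset gate r, candidate state h, then the blended
    # new state: (1 - z) * h + z * s_prev (scalar case for clarity).
    z = sigmoid(W["z"] * x_t + U["z"] * s_prev + b["z"])
    r = sigmoid(W["r"] * x_t + U["r"] * s_prev + b["r"])
    h = math.tanh(W["h"] * x_t + U["h"] * (s_prev * r))
    return (1 - z) * h + z * s_prev
```

With all weights at zero, both gates open halfway ($z = r = 0.5$) and the candidate is zero, so the new state is simply half the previous state.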
<p>Several adaptations to the original seq2seq framework make this work for molecular data:</p>
<ol>
<li><strong>GRU instead of LSTM</strong>: GRU provides comparable performance with faster training, which is important given the large training data pool</li>
<li><strong>Attention mechanism</strong>: Establishes a stronger connection between the perceiver and interpreter networks via soft alignment, addressing the challenge of passing information through hidden memory for long sequences (SMILES can be up to 250 characters)</li>
<li><strong>Dropout layers</strong>: Added to input and output gates (but not hidden memory transfer) following the approach of Zaremba et al. to combat overfitting when training on large datasets</li>
<li><strong>Fingerprint extraction layer</strong>: A fixed-unit fully connected layer combined with a GRU cell state concatenation layer is inserted between encoder and decoder to explicitly output the fingerprint vector</li>
<li><strong>Reverse target sequence</strong>: Following Sutskever et al., the target sequence is reversed to improve SGD optimization</li>
<li><strong>Bucket training</strong>: Sequences are distributed into buckets by length and padded to enable GPU parallelization</li>
</ol>
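<p>Adaptations 5 and 6 can be sketched together as a tiny example-construction step (bucket sizes and the special tokens are assumptions for illustration; the paper does not specify its exact bucket boundaries):</p>

```python
def make_example(smiles, bucket_sizes=(40, 80, 160, 250),
                 go="<GO>", eos="<EOS>", pad="<PAD>"):
    """Build one self-translation example: the encoder input is the
    SMILES characters; the decoder target is the reversed sequence
    (adaptation 5), and both are padded to the smallest fitting
    length bucket (adaptation 6)."""
    src = list(smiles)
    tgt = [go] + list(reversed(src)) + [eos]
    size = next(b for b in bucket_sizes if b >= len(src))
    src = src + [pad] * (size - len(src))
    tgt = tgt + [pad] * (size + 2 - len(tgt))  # +2 for <GO>/<EOS>
    return src, tgt
```

Bucketing lets same-length sequences be batched together on the GPU instead of padding everything to the 250-character maximum.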
<h2 id="classification-experiments-on-logp-and-pm2-datasets">Classification Experiments on LogP and PM2 Datasets</h2>
<h3 id="training-setup">Training Setup</h3>
<p>The unsupervised training used 334,092 valid SMILES representations from combined LogP and PM2-full datasets obtained from the National Center for Advancing Translational Sciences (NCATS) at NIH. Three model variants were trained with fingerprint dimensions of 512, 768, and 1024, differing in the number of GRU layers (2, 3, and 4 respectively) while keeping the latent dimension at 256. Each model was trained for 24 hours on a workstation with an Intel i7-6700K CPU, 16 GB RAM, and an NVIDIA GTX 1080 GPU.</p>
<h3 id="reconstruction-performance">Reconstruction Performance</h3>
<p>The models were evaluated on their ability to reconstruct SMILES strings from their fingerprints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>GRU Layers</th>
          <th>Latent Dim</th>
          <th>Perplexity</th>
          <th>Exact Match Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>seq2seq-512</td>
          <td>2</td>
          <td>256</td>
          <td>1.00897</td>
          <td>94.24%</td>
      </tr>
      <tr>
          <td>seq2seq-768</td>
          <td>3</td>
          <td>256</td>
          <td>1.00949</td>
          <td>92.92%</td>
      </tr>
      <tr>
          <td>seq2seq-1024</td>
          <td>4</td>
          <td>256</td>
          <td>1.01472</td>
          <td>90.26%</td>
      </tr>
  </tbody>
</table>
<p>Deeper models showed lower reconstruction accuracy, possibly because larger fingerprint spaces contain more unused (null) dimensions and require longer training to converge.</p>
<h3 id="classification-results">Classification Results</h3>
<p>Two labeled datasets were used for downstream classification:</p>
<ul>
<li><strong>LogP</strong>: 10,850 samples with <a href="https://en.wikipedia.org/wiki/Partition_coefficient">water-octanol partition coefficient</a> values, binarized at a threshold of 1.88</li>
<li><strong>PM2-10k</strong>: 10,000 samples with binary promiscuity class labels</li>
</ul>
<p>The seq2seq fingerprints were evaluated with three ensemble classifiers (<a href="https://en.wikipedia.org/wiki/AdaBoost">AdaBoost</a>, <a href="https://en.wikipedia.org/wiki/Gradient_boosting">GradientBoost</a>, RandomForest) against circular fingerprints (ECFP) and neural fingerprints. Results are 100-run averages of 5-fold cross-validation accuracy.</p>
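<p>The evaluation protocol (5-fold cross-validation, repeated with reshuffling and averaged) can be sketched as an index generator; the seeded shuffle per run is an assumption about how the 100 runs were varied:</p>

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for one shuffled k-fold split.
    Repeating with different seeds and averaging accuracy gives the
    100-run protocol described above."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```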
<p><strong>LogP classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3674</td>
          <td>0.0074</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.6080</td>
          <td>0.0135</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.7664</strong></td>
          <td>0.0043</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.7342</td>
          <td>0.0042</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.7350</td>
          <td>0.0060</td>
      </tr>
  </tbody>
</table>
<p><strong>PM2-10k classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3938</td>
          <td>0.0114</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.5227</td>
          <td>0.0112</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.6206</strong></td>
          <td>0.0198</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.6036</td>
          <td>0.0147</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.5741</td>
          <td>0.0086</td>
      </tr>
  </tbody>
</table>
<p>The seq2seq fingerprint outperformed both baselines across all configurations. Despite the seq2seq-1024 model having lower reconstruction accuracy, it provided the best classification performance, suggesting that the longer fingerprint captures more discriminative information for downstream tasks even if the reconstruction is less exact.</p>
<h2 id="unsupervised-transfer-learning-for-molecular-properties">Unsupervised Transfer Learning for Molecular Properties</h2>
<p>The results demonstrate that unsupervised pretraining on large unlabeled molecular datasets can produce fingerprints that transfer well to supervised property prediction with limited labels. The key advantages confirmed by the experiments are:</p>
<ol>
<li><strong>Label-free training</strong>: The unsupervised approach uses essentially unlimited SMILES data, avoiding the expensive label collection process</li>
<li><strong>Task-agnostic representations</strong>: The same fingerprints work across different classification tasks (lipophilicity and promiscuity) without retraining</li>
<li><strong>Invertibility</strong>: The fingerprints contain enough information to reconstruct the original SMILES (up to 94.24% exact match), unlike hash-based methods</li>
</ol>
<p><strong>Limitations</strong> acknowledged by the authors include:</p>
<ul>
<li>Long training times (24 hours per model variant), motivating future work on distributed training</li>
<li>The relationship between fingerprint dimensionality and downstream performance is non-monotonic (768-dim underperforms 512-dim on some tasks), suggesting sensitivity to hyperparameter choices</li>
<li>Only classification tasks were evaluated; regression performance was not assessed</li>
<li>The comparison baselines are limited to ECFP and neural fingerprints from 2015</li>
</ul>
<p><strong>Future directions</strong> proposed include distributed training strategies, hyperparameter optimization methods, and semi-supervised extensions that incorporate label information into the fingerprint training.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unsupervised training</td>
          <td>LogP + PM2-full (combined)</td>
          <td>334,092 SMILES</td>
          <td>Obtained from NCATS at NIH</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>LogP</td>
          <td>10,850 samples</td>
          <td>Binary labels at LogP threshold 1.88</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PM2-10k</td>
          <td>10,000 samples</td>
          <td>Binary promiscuity labels</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder-decoder: Multi-layer GRU with attention mechanism and dropout</li>
<li>Fingerprint dimensions: 512, 768, 1024 (with 2, 3, 4 GRU layers respectively)</li>
<li>Latent dimension: 256 for all variants</li>
<li>Downstream classifiers: AdaBoost, GradientBoost, RandomForest</li>
<li>Evaluation: 5-fold cross-validation, 100-run averages</li>
<li>Baselines: ECFP via RDKit, Neural Fingerprint from HIPS/neural-fingerprint</li>
</ul>
<h3 id="models">Models</h3>
<p>Three model variants trained for 24 hours each. The paper states code would become publicly available after acceptance, but no public repository has been confirmed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Task</th>
          <th>Configuration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification accuracy</td>
          <td>0.7664</td>
          <td>LogP</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Classification accuracy</td>
          <td>0.6206</td>
          <td>PM2-10k</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Exact match reconstruction</td>
          <td>94.24%</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>1.00897</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Intel i7-6700K @ 4.00 GHz, 16 GB RAM, NVIDIA GTX 1080 GPU</li>
<li>Hyperparameter search and classifier training: TACC Lonestar 5 cluster</li>
<li>Training time: 24 hours per model variant</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HIPS/neural-fingerprint">Neural Fingerprint (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline comparison code</td>
      </tr>
  </tbody>
</table>
<p>The authors indicated the seq2seq fingerprint code would be released after acceptance, but no public repository has been found as of this writing. The datasets were sourced from NCATS/NIH.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Wang, S., Zhu, F., &amp; Huang, J. (2017). Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. <em>Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB &rsquo;17)</em>, 285-294. <a href="https://doi.org/10.1145/3107411.3107424">https://doi.org/10.1145/3107411.3107424</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xu2017seq2seq,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Zheng and Wang, Sheng and Zhu, Feiyun and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{285--294}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3107411.3107424}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolBERT: Auxiliary Tasks for Molecular BERT Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/</guid><description>MolBERT applies BERT to SMILES with domain-relevant auxiliary tasks like physicochemical property prediction, improving virtual screening and QSAR.</description><content:encoded><![CDATA[<h2 id="bert-based-molecular-representations-with-auxiliary-pre-training-tasks">BERT-Based Molecular Representations with Auxiliary Pre-Training Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces MolBERT, a bidirectional Transformer (BERT) architecture applied to SMILES-based molecular representations for drug discovery. The primary contribution is a systematic study of how different domain-relevant self-supervised pre-training tasks affect the quality of learned molecular embeddings, paired with a model that achieves state-of-the-art performance on <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-activity relationship (QSAR)</a> benchmarks.</p>
<h2 id="why-domain-relevant-pre-training-matters-for-molecular-language-models">Why Domain-Relevant Pre-Training Matters for Molecular Language Models</h2>
<p>Molecular representations are foundational for predictive, generative, and analytical tasks in drug discovery. Language models applied to text-based molecular representations like SMILES have demonstrated strong performance across property prediction, reaction prediction, and molecular generation. However, several open questions remained at the time of this work:</p>
<ol>
<li><strong>Task selection for pre-training</strong>: Prior work explored masked token prediction, input translation, and property concatenation, but there was no systematic comparison of how different self-supervised tasks affect downstream performance.</li>
<li><strong>SMILES ambiguity</strong>: The same molecule can be encoded as many different SMILES strings depending on how the molecular graph is traversed. Canonicalization algorithms address this but introduce their own artifacts that may distract the model.</li>
<li><strong>Domain knowledge integration</strong>: Standard NLP pre-training objectives (e.g., masked language modeling) do not explicitly encode chemical knowledge. It was unclear whether incorporating chemistry-specific supervision during pre-training could improve representation quality.</li>
</ol>
<p>MolBERT addresses these gaps by evaluating three pre-training tasks, including a novel physicochemical property prediction objective, and measuring their individual and combined effects on downstream drug discovery benchmarks.</p>
<h2 id="three-auxiliary-tasks-for-chemistry-aware-pre-training">Three Auxiliary Tasks for Chemistry-Aware Pre-Training</h2>
<p>MolBERT uses the BERT-Base architecture (12 attention heads, 12 layers, 768-dimensional hidden states, approximately 85M parameters) and explores three self-supervised pre-training tasks:</p>
<p><strong>Masked Language Modeling (MaskedLM)</strong>: The standard BERT objective where 15% of input tokens are masked and the model predicts their identity. The loss is cross-entropy between predicted and true tokens.</p>
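<p>The 15% masking step can be sketched as follows (the mask token string and per-token Bernoulli sampling are illustrative simplifications; BERT's 80/10/10 replacement scheme is omitted):</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_tok="[MASK]", seed=0):
    """Mask ~15% of tokens; labels hold the original token at masked
    positions and None elsewhere, so the loss is computed only on
    masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for t in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_tok)
            labels.append(t)
        else:
            masked.append(t)
            labels.append(None)
    return masked, labels
```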
<p><strong>SMILES Equivalence (SMILES-Eq)</strong>: A binary classification task where the model receives two SMILES strings and predicts whether they represent the same molecule. The second string is either a random permutation of the first (same molecule, different traversal) or a randomly sampled molecule. This is optimized with cross-entropy loss.</p>
<p><strong>Physicochemical Property Prediction (PhysChemPred)</strong>: Using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, 200 real-valued molecular descriptors are computed for each molecule. The model predicts these normalized descriptors from the SMILES input using mean squared error:</p>
<p>$$\mathcal{L}_{\text{PhysChemPred}} = \frac{1}{D} \sum_{d=1}^{D} (y_d - \hat{y}_d)^2$$</p>
<p>where $D = 200$ is the number of descriptors, $y_d$ is the true normalized descriptor value, and $\hat{y}_d$ is the model&rsquo;s prediction.</p>
<p>The final training loss is the arithmetic mean of all active task losses:</p>
<p>$$\mathcal{L}_{\text{total}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{L}_t$$</p>
<p>where $\mathcal{T}$ is the set of active pre-training tasks.</p>
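<p>Both loss formulas above reduce to a few lines; a sketch (task names are illustrative):</p>

```python
def physchem_loss(y_true, y_pred):
    """Mean squared error over the D normalized descriptors,
    matching the PhysChemPred loss above."""
    assert len(y_true) == len(y_pred)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def total_loss(task_losses):
    """Arithmetic mean over whichever pre-training tasks are active
    (inactive tasks contribute nothing, not zero)."""
    active = [v for v in task_losses.values() if v is not None]
    return sum(active) / len(active)
```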
<p>Additionally, MolBERT supports SMILES permutation augmentation during training, where each input molecule is represented by a randomly sampled non-canonical SMILES string rather than the canonical form. The model uses a fixed vocabulary of 42 tokens, a sequence length of 128, and relative positional embeddings (from Transformer-XL) to support arbitrary-length SMILES at inference time.</p>
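<p>MolBERT's exact 42-token vocabulary is not reproduced here, but a regex-based SMILES tokenizer in the style commonly used for chemical language models (an assumption, not MolBERT's published tokenizer) illustrates how multi-character atoms like Br and Cl and bracketed atoms are kept as single tokens:</p>

```python
import re

# Common SMILES tokenization pattern: bracket atoms, two-letter
# halogens, organic-subset atoms, bonds, ring closures, and digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens
```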
<h2 id="ablation-study-and-benchmark-evaluation">Ablation Study and Benchmark Evaluation</h2>
<h3 id="pre-training-setup">Pre-Training Setup</h3>
<p>All models were pre-trained on the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark dataset</a>, consisting of approximately 1.6M compounds curated from ChEMBL, using an 80%/5% train/validation split. Training used the Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 20 epochs (ablation) or 100 epochs (final model).</p>
<h3 id="ablation-impact-of-task-combinations-on-virtual-screening">Ablation: Impact of Task Combinations on Virtual Screening</h3>
<p>The ablation study evaluated all seven possible task combinations on the RDKit virtual screening benchmark (69 datasets, 5 query molecules per target). Results measured by AUROC and BEDROC20 (an early enrichment metric with $\alpha = 20$):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: center">MaskedLM</th>
          <th style="text-align: center">PhysChemPred</th>
          <th style="text-align: center">SMILES-Eq</th>
          <th style="text-align: center">AUROC (w/ perm)</th>
          <th style="text-align: center">BEDROC20 (w/ perm)</th>
          <th style="text-align: center">AUROC (w/o perm)</th>
          <th style="text-align: center">BEDROC20 (w/o perm)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.685 +/- 0.069</td>
          <td style="text-align: center">0.246 +/- 0.041</td>
          <td style="text-align: center">0.707 +/- 0.059</td>
          <td style="text-align: center">0.280 +/- 0.042</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.738 +/- 0.060</td>
          <td style="text-align: center">0.323 +/- 0.071</td>
          <td style="text-align: center">0.740 +/- 0.066</td>
          <td style="text-align: center">0.322 +/- 0.065</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.483 +/- 0.092</td>
          <td style="text-align: center">0.092 +/- 0.069</td>
          <td style="text-align: center">0.493 +/- 0.068</td>
          <td style="text-align: center">0.108 +/- 0.070</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.476 +/- 0.077</td>
          <td style="text-align: center">0.064 +/- 0.034</td>
          <td style="text-align: center">0.514 +/- 0.165</td>
          <td style="text-align: center">0.084 +/- 0.014</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.696 +/- 0.058</td>
          <td style="text-align: center">0.283 +/- 0.077</td>
          <td style="text-align: center">0.676 +/- 0.060</td>
          <td style="text-align: center">0.250 +/- 0.073</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.719 +/- 0.057</td>
          <td style="text-align: center">0.293 +/- 0.071</td>
          <td style="text-align: center">0.716 +/- 0.061</td>
          <td style="text-align: center">0.290 +/- 0.076</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.129 +/- 0.067</td>
          <td style="text-align: center">0.005 +/- 0.037</td>
          <td style="text-align: center">0.508 +/- 0.068</td>
          <td style="text-align: center">0.048 +/- 0.035</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the ablation:</p>
<ul>
<li>PhysChemPred had the highest individual impact (average BEDROC20 of 0.292 alone vs. 0.266 for MaskedLM alone).</li>
<li>Combining MaskedLM + PhysChemPred achieved the best performance (BEDROC20 of 0.323), though the additive gain from MaskedLM was modest (+0.031).</li>
<li>The SMILES-Eq task consistently decreased performance when added to other task combinations.</li>
</ul>
<p>A further sub-ablation on PhysChemPred descriptor groups showed that surface descriptors alone (49 of 200 descriptors) achieved nearly the same performance as the full set, suggesting molecular surface properties provide particularly informative supervision.</p>
<h3 id="virtual-screening-results">Virtual Screening Results</h3>
<p>Using the best task combination (MaskedLM + PhysChemPred) trained for 100 epochs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>AUROC</th>
          <th>BEDROC20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolBERT (100 epochs)</td>
          <td>0.743 +/- 0.062</td>
          <td>0.344 +/- 0.062</td>
      </tr>
      <tr>
          <td>CDDD</td>
          <td>0.725 +/- 0.057</td>
          <td>0.310 +/- 0.080</td>
      </tr>
      <tr>
          <td>RDKit descriptors</td>
          <td>0.633 +/- 0.027</td>
          <td>0.217 +/- 0.000</td>
      </tr>
      <tr>
          <td>ECFC4</td>
          <td>0.603 +/- 0.056</td>
          <td>0.170 +/- 0.079</td>
      </tr>
  </tbody>
</table>
<p>MolBERT outperformed all baselines including <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a> (the prior state of the art), RDKit calculated descriptors, and extended-connectivity fingerprints (ECFC4).</p>
<h3 id="qsar-results">QSAR Results</h3>
<p>On <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> regression tasks (RMSE, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td style="text-align: center">0.687 +/- 0.08</td>
          <td style="text-align: center">0.902 +/- 0.06</td>
          <td style="text-align: center">0.567 +/- 0.06</td>
          <td style="text-align: center">0.552 +/- 0.07</td>
          <td style="text-align: center"><strong>0.531 +/- 0.04</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td style="text-align: center">1.671 +/- 0.45</td>
          <td style="text-align: center">2.876 +/- 0.38</td>
          <td style="text-align: center">1.456 +/- 0.43</td>
          <td style="text-align: center">1.523 +/- 0.66</td>
          <td style="text-align: center"><strong>0.948 +/- 0.33</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td style="text-align: center">0.738 +/- 0.04</td>
          <td style="text-align: center">0.770 +/- 0.03</td>
          <td style="text-align: center">0.669 +/- 0.02</td>
          <td style="text-align: center">0.602 +/- 0.01</td>
          <td style="text-align: center"><strong>0.561 +/- 0.03</strong></td>
      </tr>
  </tbody>
</table>
<p>On MoleculeNet classification tasks (AUROC, higher is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BACE</td>
          <td style="text-align: center">0.831</td>
          <td style="text-align: center">0.845</td>
          <td style="text-align: center">0.833</td>
          <td style="text-align: center">0.849</td>
          <td style="text-align: center"><strong>0.866</strong></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td style="text-align: center">0.696</td>
          <td style="text-align: center">0.678</td>
          <td style="text-align: center">0.761</td>
          <td style="text-align: center">0.750</td>
          <td style="text-align: center"><strong>0.762</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td style="text-align: center">0.708</td>
          <td style="text-align: center">0.714</td>
          <td style="text-align: center">0.753</td>
          <td style="text-align: center">0.747</td>
          <td style="text-align: center"><strong>0.783</strong></td>
      </tr>
  </tbody>
</table>
<p>Fine-tuned MolBERT achieved the best performance on all six QSAR datasets. When used as a fixed feature extractor with an SVM, MolBERT embeddings outperformed other representations on three of six tasks.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Pre-training task selection matters significantly.</strong> The choice of auxiliary tasks during pre-training has a large effect on downstream performance. PhysChemPred provides the strongest individual signal.</li>
<li><strong>Domain-relevant auxiliary tasks improve representation quality.</strong> Predicting physicochemical properties during pre-training encodes chemical knowledge directly into the embeddings, outperforming purely linguistic objectives.</li>
<li><strong>The SMILES equivalence task hurts performance.</strong> Despite being chemically motivated, the SMILES-Eq task consistently degraded results, suggesting it may introduce conflicting learning signals.</li>
<li><strong>PhysChemPred organizes the embedding space.</strong> Analysis of pairwise cosine similarities showed that models trained with PhysChemPred assign high similarity to permutations of the same molecule and low similarity to different molecules, creating a more semantically meaningful representation space.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The paper evaluates only SMILES-based representations, inheriting all limitations of string-based molecular encodings (inability to capture 3D structure, sensitivity to tokenization).</li>
<li>The virtual screening evaluation uses a fixed number of query molecules ($n = 5$), which may not reflect realistic screening scenarios.</li>
<li>Cross-validation splits from ChemBench were used for QSAR evaluation rather than scaffold splits, which may overestimate performance on structurally novel compounds.</li>
<li>The model&rsquo;s 128-token sequence length limit may truncate larger molecules, though relative positional embeddings partially address this at inference time.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending MolBERT to learn representations for other biological entities such as proteins, and developing more advanced pre-training strategies.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>GuacaMol (ChEMBL)</td>
          <td>~1.6M compounds</td>
          <td>80% train / 5% validation split</td>
      </tr>
      <tr>
          <td>Virtual Screening</td>
          <td>RDKit benchmark v1.2</td>
          <td>69 target datasets</td>
          <td>Filtered subset with active/decoy compounds</td>
      </tr>
      <tr>
          <td>QSAR (Regression)</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
      <tr>
          <td>QSAR (Classification)</td>
          <td>BACE, BBBP, HIV</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BERT-Base (12 heads, 12 layers, 768-dim hidden, ~85M params)</li>
<li>Optimizer: Adam, learning rate $3 \times 10^{-5}$</li>
<li>Vocabulary: 42 tokens, sequence length 128</li>
<li>Masking: 15% of tokenized input</li>
<li>Positional encoding: relative positional embeddings (Transformer-XL)</li>
<li>Fine-tuning SVM: $C = 5.0$, RBF kernel (from Winter et al.)</li>
<li>Fine-tuning head: single linear layer on pooled output</li>
<li>Embeddings: pooled output (or average sequence output when only MaskedLM is used)</li>
</ul>
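<p>The 15% masking step listed above can be sketched in isolation. This is an illustrative simplification, not the authors&rsquo; code: full BERT masking also replaces some selected tokens with random tokens or leaves them unchanged (80/10/10), and the token and mask IDs here are made up.</p>

```python
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Simplified BERT-style masking: hide ~15% of positions behind a
    [MASK] id; the hidden originals become the MaskedLM targets."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    targets = {}  # position -> original token the model must recover
    for i in range(len(corrupted)):
        if rng.random() < mask_prob:
            targets[i] = corrupted[i]
            corrupted[i] = mask_id
    return corrupted, targets

tokens = list(range(1, 129))  # a dummy 128-token sequence
corrupted, targets = mask_tokens(tokens, mask_id=0)
```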
<h3 id="models">Models</h3>
<ul>
<li>BERT-Base with ~85M parameters</li>
<li>Pre-trained weights available at <a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>Virtual Screening, Classification QSAR</td>
          <td>Standard area under ROC curve</td>
      </tr>
      <tr>
          <td>BEDROC20</td>
          <td>Virtual Screening</td>
          <td>Early enrichment metric, $\alpha = 20$</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression QSAR</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
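<p>The BEDROC20 metric in the table is not derived in the paper itself; the standard Truchon&ndash;Bayly definition, with $\alpha = 20$ matching the benchmark setting, can be sketched as:</p>

```python
import math

def bedroc(ranks, n_total, alpha=20.0):
    """BEDROC early-enrichment metric (standard Truchon-Bayly formula).

    ranks:   1-indexed positions of the actives in the ranked list
    n_total: total number of ranked compounds (actives + decoys)
    """
    n_actives = len(ranks)
    ra = n_actives / n_total
    # Robust Initial Enhancement (RIE): exponentially weighted enrichment
    numerator = sum(math.exp(-alpha * r / n_total) for r in ranks)
    denominator = ra * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = numerator / denominator
    # Affine map of RIE onto [0, 1]
    factor = ra * math.sinh(alpha / 2) / (
        math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra)
    )
    constant = 1 / (1 - math.exp(alpha * (1 - ra)))
    return rie * factor + constant

perfect = bedroc(list(range(1, 11)), n_total=100)   # actives ranked 1-10
worst = bedroc(list(range(91, 101)), n_total=100)   # actives ranked last
```

<p>With $\alpha = 20$, roughly 80% of the score is contributed by the top ~8% of the ranked list, which is why BEDROC20 rewards early enrichment much more than plain AUROC.</p>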
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 GPUs, 16 CPUs</li>
<li>Pre-training time: ~40 hours (20 epochs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Official implementation with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., &amp; Ahmed, M. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks. <em>arXiv preprint arXiv:2011.13230</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fabian2020molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular representation learning with language models and domain-relevant auxiliary tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fabian, Benedek and Edlich, Thomas and Gaspar, H{\&#39;e}l{\&#39;e}na and Segler, Marwin and Meyers, Joshua and Fiscato, Marco and Ahmed, Mohamed}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2011.13230}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints like ECFPs, which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
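<p>Training pairs for the best-performing task (randomized SMILES &rarr; canonical SMILES) are straightforward to generate with RDKit. The sketch below is illustrative, not the authors&rsquo; data pipeline:</p>

```python
from rdkit import Chem

def translation_pair(smiles):
    """Build one (source, target) pair: a randomized SMILES as the source
    sequence, the canonical SMILES of the same molecule as the target."""
    mol = Chem.MolFromSmiles(smiles)
    source = Chem.MolToSmiles(mol, doRandom=True)  # random atom traversal
    target = Chem.MolToSmiles(mol)                 # canonical form
    return source, target

src, tgt = translation_pair("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

<p>Because <code>doRandom=True</code> randomizes the atom traversal order, the source and target differ syntactically while denoting the same molecule, which is exactly the semantic signal the translation objective exploits.</p>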
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
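<p>The two input regularizers can be sketched in isolation (illustrative numpy code with a dummy embedding table; the authors&rsquo; TensorFlow implementation will differ in detail):</p>

```python
import numpy as np

def corrupt_input(char_ids, embed_table, p_drop=0.15, noise_std=0.05, seed=0):
    """Apply the two regularizers described above:
    1. character-level dropout: each input character dropped with prob 0.15
    2. Gaussian noise (std 0.05) added to the embedded sequence
    """
    rng = np.random.default_rng(seed)
    keep_mask = rng.random(len(char_ids)) >= p_drop
    kept = [c for c, keep in zip(char_ids, keep_mask) if keep]
    vecs = np.stack([embed_table[c] for c in kept])   # (seq_len, dim)
    noisy = vecs + rng.normal(0.0, noise_std, size=vecs.shape)
    return kept, noisy

embed_table = {i: np.zeros(8) for i in range(40)}  # dummy 8-dim embeddings
kept, noisy = corrupt_input(list(range(40)), embed_table)
```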
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
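<p>The cluster-split assignment can be sketched with a plain Lloyd k-means over bit vectors (random bits stand in for real MACCS keys here; the paper&rsquo;s exact clustering setup may differ):</p>

```python
import numpy as np

def kmeans_labels(fps, k=5, iters=20, seed=0):
    """Assign each fingerprint to one of k clusters (Lloyd's algorithm).
    Cluster-split CV then holds out one cluster per fold, so test
    compounds are structurally dissimilar to the training set."""
    rng = np.random.default_rng(seed)
    X = fps.astype(float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

fps = np.random.default_rng(1).integers(0, 2, size=(200, 167))  # MACCS-sized
labels = kmeans_labels(fps, k=5)
folds = [np.where(labels == j)[0] for j in range(5)]  # one held-out cluster each
```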
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
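<p>One screening repetition reduces to ranking the library by similarity to the query actives and scoring the ranking. A minimal sketch with synthetic embeddings (ROC-AUC computed via the Mann&ndash;Whitney statistic; the actual protocol follows Riniker et al. with 50 repeats per target):</p>

```python
import numpy as np

def screen(query_vecs, library_vecs, library_labels):
    """Rank a library by max cosine similarity to any query active, then
    score the ranking with ROC-AUC (fraction of active/decoy pairs
    ordered correctly, ties counted half)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    scores = (lib @ q.T).max(axis=1)  # best similarity to any query
    pos = scores[library_labels == 1]
    neg = scores[library_labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# Synthetic example: actives cluster near the query embeddings
rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 16))
actives = queries[rng.integers(0, 5, size=20)] + 0.05 * rng.normal(size=(20, 16))
decoys = rng.normal(size=(100, 16))
library = np.vstack([actives, decoys])
labels = np.array([1] * 20 + [0] * 100)
auc = screen(queries, library, labels)
```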
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+5.0 ROC-AUC points over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The additional task of predicting molecular properties from the latent vector improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely due to InChI&rsquo;s complex syntax (counting, arithmetic). The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary property network: 3 FC layers (512, 128, 9) predicting nine molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>CDDD descriptor extraction on a GPU is comparable in speed to RDKit fingerprint calculation on a CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AMORE: Testing ChemLLM Robustness to SMILES Variants</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/amore-smiles-robustness-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/amore-smiles-robustness-framework/</guid><description>AMORE is a zero-shot framework testing whether chemical language models recognize equivalent SMILES of the same molecule via embedding retrieval.</description><content:encoded><![CDATA[<h2 id="an-empirical-framework-for-probing-chemical-understanding">An Empirical Framework for Probing Chemical Understanding</h2>
<p>This is an <strong>Empirical</strong> paper that introduces Augmented Molecular Retrieval (AMORE), a zero-shot evaluation framework for chemical language models (ChemLMs). The primary contribution is a method to assess whether ChemLMs have learned genuine molecular semantics or simply memorize textual patterns. Rather than relying on traditional NLP metrics like BLEU and ROUGE, AMORE tests whether a model&rsquo;s embedding space treats chemically equivalent <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> representations as similar. The authors evaluate 12 models across multiple architectures (encoder-only, encoder-decoder, decoder-only) on two datasets and five augmentation types, and extend the analysis to downstream MoleculeNet tasks.</p>
<h2 id="why-standard-nlp-metrics-fail-for-chemical-evaluation">Why Standard NLP Metrics Fail for Chemical Evaluation</h2>
<p>Chemical language models are typically evaluated using text-based metrics from NLP (BLEU, ROUGE, METEOR) on tasks like molecule captioning. These metrics compare word overlap and sentence fluency but cannot detect whether a model truly understands molecular structure. A SMILES string like <code>C(=O)O</code> and its canonicalized or kekulized form represent the same molecule, yet text-based metrics would penalize valid reformulations. Embedding-based metrics like BERTScore are also insufficient because they were trained on general text, not chemical notation.</p>
<p>The core research question is direct: do evaluation metrics used on ChemLMs reflect actual chemical knowledge, or do the models simply imitate understanding by learning textual features? This question has practical consequences in pharmaceuticals and healthcare, where missteps in chemical reasoning carry serious risks.</p>
<h2 id="embedding-based-retrieval-as-a-chemical-litmus-test">Embedding-Based Retrieval as a Chemical Litmus Test</h2>
<p>AMORE exploits a fundamental property of molecular representations: a single molecule can be written as multiple valid SMILES strings that are chemically identical. These serve as &ldquo;total synonyms,&rdquo; a concept without a true analogue in natural language.</p>
<p>The framework works in four steps:</p>
<ol>
<li>Take a set $X = (x_1, x_2, \ldots, x_n)$ of $n$ molecular representations.</li>
<li>Apply a transformation $f$ to obtain augmented representations $X' = (x'_1, x'_2, \ldots, x'_n)$, where $x'_i = f(x_i)$. The constraint is that $f$ must not change the underlying molecule.</li>
<li>Obtain vectorized embeddings $e(x_i)$ and $e(x'_j)$ from the model for each original and augmented SMILES.</li>
<li>Evaluate in a retrieval task: given $e(x_i)$, retrieve $e(x'_i)$ from the augmented set.</li>
</ol>
<p>The evaluation metrics are top-$k$ accuracy (whether the correct augmented SMILES ranks at position $\leq k$) and <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR). Retrieval uses <a href="https://en.wikipedia.org/wiki/FAISS">FAISS</a> for efficient nearest-neighbor search. The key insight is that if a model truly understands molecular structure, it should embed different SMILES representations of the same molecule close together.</p>
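<p>The scoring loop itself is compact. A minimal version with synthetic embeddings (brute-force numpy search in place of FAISS, cosine similarity assumed as the ranking function):</p>

```python
import numpy as np

def retrieval_metrics(orig, augm, k=5):
    """AMORE-style scoring: for each original embedding, rank all
    augmented embeddings by cosine similarity and record where the true
    counterpart lands. Returns (Acc@k, MRR)."""
    a = orig / np.linalg.norm(orig, axis=1, keepdims=True)
    b = augm / np.linalg.norm(augm, axis=1, keepdims=True)
    sims = a @ b.T                      # (n, n) similarity matrix
    order = np.argsort(-sims, axis=1)   # best match first
    # 1-indexed rank of the true counterpart i for each query i
    ranks = 1 + np.argmax(order == np.arange(len(a))[:, None], axis=1)
    acc_at_k = float((ranks <= k).mean())
    mrr = float((1.0 / ranks).mean())
    return acc_at_k, mrr

orig = np.random.default_rng(0).normal(size=(200, 32))
augm = orig + 0.01 * np.random.default_rng(1).normal(size=orig.shape)
acc1, mrr = retrieval_metrics(orig, augm, k=1)
```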
<h3 id="five-smiles-augmentation-types">Five SMILES Augmentation Types</h3>
<p>The framework uses five identity-preserving augmentations, all executed through <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>:</p>
<ol>
<li><strong>Canonicalization</strong>: Transform SMILES to the standardized RDKit canonical form.</li>
<li><strong>Hydrogen addition</strong>: Explicitly add hydrogen atoms that are normally implied (e.g., <code>C</code> becomes <code>[CH4]</code>). This dramatically increases string length.</li>
<li><strong>Kekulization</strong>: Convert aromatic ring notation to explicit alternating double bonds.</li>
<li><strong>Cycle renumbering</strong>: Replace ring-closure digit identifiers with random valid alternatives.</li>
<li><strong>Random atom order</strong>: Randomize the atom traversal order used to generate the SMILES string.</li>
</ol>
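<p>Four of the five augmentations map directly onto RDKit calls (a sketch, not the paper&rsquo;s code; cycle renumbering requires string-level manipulation of ring-closure digits and is omitted here):</p>

```python
from rdkit import Chem

def augmentations(smiles):
    """Generate identity-preserving SMILES variants of one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    kek = Chem.Mol(mol)  # work on a copy before destroying aromatic flags
    Chem.Kekulize(kek, clearAromaticFlags=True)
    return {
        "canonical": Chem.MolToSmiles(mol),
        "hydrogens": Chem.MolToSmiles(Chem.AddHs(mol)),      # explicit H atoms
        "kekulized": Chem.MolToSmiles(kek, kekuleSmiles=True),
        "random_order": Chem.MolToSmiles(mol, doRandom=True), # random traversal
    }

variants = augmentations("c1ccccc1O")  # phenol
```

<p>Parsing any of these variants back with RDKit and re-canonicalizing recovers the same canonical SMILES, which is precisely the identity the retrieval test exploits.</p>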
<h2 id="twelve-models-two-datasets-five-augmentations">Twelve Models, Two Datasets, Five Augmentations</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>The authors test 12 publicly available Transformer-based models spanning three architecture families:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Domain</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text+Chem T5-standard</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>Text+Chem T5-augm</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-large</td>
          <td>Cross-modal</td>
          <td>770M</td>
      </tr>
      <tr>
          <td>SciFive</td>
          <td>Text-only</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>PubChemDeBERTa</td>
          <td>Chemical</td>
          <td>86M</td>
      </tr>
      <tr>
          <td>ChemBERT-ChEMBL</td>
          <td>Chemical</td>
          <td>6M</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a></td>
          <td>Chemical</td>
          <td>125M</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/">BARTSmiles</a></td>
          <td>Chemical</td>
          <td>400M</td>
      </tr>
      <tr>
          <td>ZINC-RoBERTa</td>
          <td>Chemical</td>
          <td>102M</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/multimodal-molecular/nach0-multimodal-chemical-language-model/">nach0</a></td>
          <td>Chemical</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>ZINC-GPT</td>
          <td>Chemical</td>
          <td>87M</td>
      </tr>
  </tbody>
</table>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>ChEBI-20 test set</strong>: ~3,300 molecule-description pairs, used for both AMORE retrieval and molecule captioning comparisons.</li>
<li><strong>Isomers</strong> (QM9 subset): 918 molecules that are all isomers of C9H12N2O, making retrieval harder because all molecules share the same molecular formula.</li>
</ul>
<h3 id="key-results-on-chebi-20">Key Results on ChEBI-20</h3>
<p>On the ChEBI-20 dataset (Table 2 from the paper), top-1 accuracy varies enormously by augmentation type. Cycle renumbering is easiest (up to 98.48% Acc@1 for SciFive), while hydrogen addition is hardest (no model exceeds 5.97% Acc@1).</p>
<p>For the cross-modal Text+Chem T5-standard model:</p>
<table>
  <thead>
      <tr>
          <th>Augmentation</th>
          <th>Acc@1</th>
          <th>Acc@5</th>
          <th>MRR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>63.03</td>
          <td>82.76</td>
          <td>72.4</td>
      </tr>
      <tr>
          <td>Hydrogen</td>
          <td>5.46</td>
          <td>10.85</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>Kekulization</td>
          <td>76.76</td>
          <td>92.03</td>
          <td>83.8</td>
      </tr>
      <tr>
          <td>Cycle</td>
          <td>96.70</td>
          <td>99.82</td>
          <td>98.2</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>46.94</td>
          <td>74.18</td>
          <td>59.33</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-on-isomers">Key Results on Isomers</h3>
<p>Performance drops substantially on the Isomers dataset, where all molecules share the same formula. The best Acc@1 for hydrogen augmentation is just 1.53% (MolT5-large). Even for the relatively easy cycle augmentation, top scores drop from the high 90s to the low 90s for most models, and some models (BARTSmiles: 41.83%) struggle considerably.</p>
<h3 id="downstream-moleculenet-impact">Downstream MoleculeNet Impact</h3>
<p>The authors also fine-tuned models on original <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> training data and tested on augmented test sets across 9 tasks (regression, binary classification, multilabel classification). Results confirm that augmentations degrade downstream performance. For example, on ESOL regression, RMSE increased from 0.87 to 7.93 with hydrogen addition. Rankings computed using the Vote&rsquo;n&rsquo;Rank framework (using the <a href="https://en.wikipedia.org/wiki/Copeland%27s_method">Copeland rule</a>) show that hydrogen augmentation is the only one that substantially reshuffles model rankings; other augmentations preserve the original ordering.</p>
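<p>The Copeland aggregation behind those rankings can be sketched in a few lines. This is an illustrative pure-Python version, not the Vote&rsquo;n&rsquo;Rank implementation; model names and scores are placeholders:</p>

```python
from itertools import combinations

def copeland_ranking(scores):
    """Rank models by the Copeland rule.

    scores: dict mapping model name -> list of per-task scores
            (higher is better on every task).
    Each model earns +1 for every pairwise "election" it wins
    (beats the other model on a majority of tasks), -1 for each
    loss, and 0 for a tie; models are sorted by this total.
    """
    models = list(scores)
    copeland = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        wins_a = sum(sa > sb for sa, sb in zip(scores[a], scores[b]))
        wins_b = sum(sb > sa for sa, sb in zip(scores[a], scores[b]))
        if wins_a > wins_b:
            copeland[a] += 1
            copeland[b] -= 1
        elif wins_b > wins_a:
            copeland[b] += 1
            copeland[a] -= 1
    return sorted(models, key=lambda m: copeland[m], reverse=True)
```

<p>Because only pairwise majorities matter, a single catastrophic task (like hydrogen augmentation) can flip many pairwise elections at once, which is why it reshuffles the ranking while milder augmentations do not.</p>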
<h3 id="correlation-between-amore-and-captioning-metrics">Correlation Between AMORE and Captioning Metrics</h3>
<p>The differences in ROUGE/METEOR between original and augmented SMILES correlate with AMORE retrieval accuracy (Spearman correlation &gt; 0.7 with p-value = 0.003 for Acc@1). This validates AMORE as a proxy for predicting how augmentations will affect generation quality, without requiring labeled captioning data.</p>
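<p>The reported correlation is a plain Spearman rank correlation; for tie-free data it reduces to the classic closed form, sketched here without the tie correction a full implementation (e.g. SciPy&rsquo;s) would apply:</p>

```python
def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data: Pearson
    correlation of the ranks, computed via the closed form
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```
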
<h2 id="current-chemlms-learn-syntax-not-chemistry">Current ChemLMs Learn Syntax, Not Chemistry</h2>
<p>The central finding is that existing ChemLMs are not robust to identity-preserving SMILES augmentations. Several specific conclusions emerge:</p>
<ol>
<li>
<p><strong>Hydrogen augmentation is catastrophic</strong>: All models fail (&lt; 6% Acc@1 on ChEBI-20, &lt; 2% on Isomers). The authors attribute this to the near-complete absence of explicit hydrogen in pretraining data, creating a distribution shift.</p>
</li>
<li>
<p><strong>Cross-modal models outperform unimodal ones</strong>: Models trained on both text and SMILES (Text+Chem T5, MolT5) consistently achieve higher retrieval accuracy on four of five augmentations.</p>
</li>
<li>
<p><strong>Augmentation difficulty follows a consistent order</strong>: For all models, hydrogen is hardest, followed by random ordering, canonicalization, kekulization, and cycle renumbering (easiest).</p>
</li>
<li>
<p><strong>Layer-wise analysis reveals instability</strong>: Retrieval accuracy across Transformer layers is correlated across augmentation types, suggesting that representations degrade at the same layers regardless of augmentation.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> partially explains difficulty</strong>: Hydrogen augmentation produces strings ~2x longer than originals (Levenshtein ratio of 1.49), but the low correlation between Levenshtein ratio and downstream metrics (ROUGE1 correlation of -0.05 for hydrogen) suggests string length alone does not explain the failure.</p>
</li>
</ol>
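<p>Levenshtein distance itself is a standard dynamic program; the ratio below normalizes by the original string length, which is one plausible reading of the paper&rsquo;s &ldquo;Levenshtein ratio&rdquo; (the exact normalization is an assumption here):</p>

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via the classic
    dynamic program, keeping only one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(original, augmented):
    """Distance normalized by the original length -- an assumed
    normalization for comparing augmentation severity."""
    return levenshtein(original, augmented) / len(original)
```
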
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations. Only publicly available HuggingFace models were evaluated, excluding models like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">Molformer</a> that lack HF checkpoints. The study focuses exclusively on SMILES sequences, not 3D molecular structures or other formats like <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>. The augmentation types, while representative, do not cover all possible identity transformations.</p>
<p>The authors suggest that AMORE could serve as a regularization tool during training, for example by using metric learning to encourage models to embed SMILES variants of the same molecule close together.</p>
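<p>One minimal way to realize that suggestion is a triplet-margin term over embeddings of SMILES variants. This is a sketch of the general metric-learning idea, not anything the paper implements; all names are illustrative:</p>

```python
import numpy as np

def variant_triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor, positive: embeddings of two SMILES variants of the
    SAME molecule; negative: embedding of a different molecule.
    Penalizes the model when same-molecule variants sit farther
    apart than a different molecule, up to a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```
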
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval evaluation</td>
          <td>ChEBI-20 test set</td>
          <td>3,300 molecules</td>
          <td>Standard benchmark for molecule captioning</td>
      </tr>
      <tr>
          <td>Retrieval evaluation</td>
          <td>Isomers (QM9 subset)</td>
          <td>918 molecules</td>
          <td>All isomers of C9H12N2O</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>Varies</td>
          <td>ESOL, Lipophilicity, FreeSolv, HIV, BBBP, BACE, Tox21, ToxCast, SIDER</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES augmentations via RDKit (canonicalization, hydrogen addition, kekulization, cycle renumbering, random atom ordering)</li>
<li>Nearest-neighbor retrieval using FAISS with L2, cosine, inner product, and HNSW metrics</li>
<li>Model ranking via Vote&rsquo;n&rsquo;Rank (Copeland rule) on MoleculeNet tasks</li>
</ul>
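<p>Stripped of FAISS, the retrieval step reduces to a nearest-neighbor lookup: embed original and augmented SMILES, then check whether each augmented embedding&rsquo;s closest original (here under cosine similarity) belongs to the same molecule. Embeddings below are placeholders for model hidden states:</p>

```python
import numpy as np

def amore_acc_at_1(orig_emb, aug_emb):
    """orig_emb, aug_emb: (n_molecules, dim) arrays where row i
    embeds molecule i. Returns top-1 retrieval accuracy under
    cosine similarity (exact search; FAISS adds approximate
    indices such as HNSW for large corpora)."""
    o = orig_emb / np.linalg.norm(orig_emb, axis=1, keepdims=True)
    a = aug_emb / np.linalg.norm(aug_emb, axis=1, keepdims=True)
    sims = a @ o.T                    # (n, n) cosine similarities
    nearest = sims.argmax(axis=1)     # index of closest original
    return float((nearest == np.arange(len(a))).mean())
```
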
<h3 id="models">Models</h3>
<p>All 12 evaluated models are publicly available on HuggingFace. No custom model training was performed for the AMORE retrieval experiments. MoleculeNet experiments used standard fine-tuning on original training splits.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acc@1</td>
          <td>Top-1 retrieval accuracy</td>
          <td>Primary AMORE metric</td>
      </tr>
      <tr>
          <td>Acc@5</td>
          <td>Top-5 retrieval accuracy</td>
          <td>Secondary AMORE metric</td>
      </tr>
      <tr>
          <td>MRR</td>
          <td>Mean Reciprocal Rank</td>
          <td>Mean of reciprocal ranks of the correct match</td>
      </tr>
      <tr>
          <td>ROUGE-2</td>
          <td>Bigram overlap for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MT evaluation metric for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
  </tbody>
</table>
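<p>The three retrieval metrics are simple to compute from ranked candidate lists; a minimal sketch (inputs are placeholders, and a missing match contributes 0 to MRR by convention here):</p>

```python
def retrieval_metrics(ranked_ids, true_ids, ks=(1, 5)):
    """ranked_ids: one ranked list of candidate ids per query.
    true_ids: the correct id for each query.
    Returns {k: Acc@k} and MRR (mean of 1/rank of the match)."""
    n = len(true_ids)
    acc = {k: sum(t in r[:k] for r, t in zip(ranked_ids, true_ids)) / n
           for k in ks}
    reciprocal = [1.0 / (r.index(t) + 1) if t in r else 0.0
                  for r, t in zip(ranked_ids, true_ids)]
    return acc, sum(reciprocal) / n
```
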
<h3 id="hardware">Hardware</h3>
<p>Computational resources were provided by HPC facilities at HSE University. Specific GPU types and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChemistryLLMs/AMORE">AMORE GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Framework code and evaluation data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ganeeva, V., Khrabrov, K., Kadurin, A., &amp; Tutubalina, E. (2025). Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. <em>Journal of Cheminformatics</em>, 17(1). <a href="https://doi.org/10.1186/s13321-025-01079-0">https://doi.org/10.1186/s13321-025-01079-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ganeeva2025measuring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ganeeva, Veronika and Khrabrov, Kuzma and Kadurin, Artur and Tutubalina, Elena}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-025-01079-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BARTSmiles: BART Pre-Training for Molecular SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/</guid><description>BARTSmiles applies BART-style denoising pre-training to 1.7B SMILES from ZINC20, achieving top results on 11 molecular property and reaction tasks.</description><content:encoded><![CDATA[<h2 id="a-bart-based-method-for-molecular-self-supervised-learning">A BART-Based Method for Molecular Self-Supervised Learning</h2>
<p>BARTSmiles is a <strong>Method</strong> paper. It introduces a self-supervised pre-training approach for molecular representations based on the BART (Bidirectional and Auto-Regressive Transformers) architecture from Lewis et al. (2019). The primary contribution is a pre-training strategy, discovered through systematic ablations, that trains a BART-large model on 1.7 billion deduplicated <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a> from the <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC20 dataset</a>. BARTSmiles achieves the best reported results on 11 tasks spanning molecular property classification, regression, and chemical reaction generation.</p>
<h2 id="scaling-self-supervised-molecular-representations-beyond-prior-work">Scaling Self-Supervised Molecular Representations Beyond Prior Work</h2>
<p>At the time of publication, large-scale self-supervised representation learning had produced significant improvements in NLP, computer vision, and speech, but molecular representation learning had not benefited from comparable scale. Previous SMILES-based pre-trained models such as <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> (Chithrananda et al., 2020) and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) used encoder-only or encoder-decoder architectures with substantially less compute. Chemformer, the most closely related prior work, also trained a BART-like model but with a fraction of the compute and data.</p>
<p>The paper argues that three gaps needed to be addressed:</p>
<ol>
<li><strong>Scale</strong>: Prior molecular pre-training used orders of magnitude less compute than NLP pre-training.</li>
<li><strong>Architecture choice</strong>: Encoder-only models like ChemBERTa cannot perform generative fine-tuning (retrosynthesis, reaction prediction), limiting their applicability.</li>
<li><strong>Pre-training recipe</strong>: Standard BART hyperparameters (e.g., 30% mask token budget) were tuned for natural language and had not been validated for molecular SMILES strings.</li>
</ol>
<h2 id="core-innovation-ablation-driven-pre-training-recipe-for-smiles">Core Innovation: Ablation-Driven Pre-Training Recipe for SMILES</h2>
<p>The key insight of BARTSmiles is that the BART denoising objective, when carefully tuned for the molecular domain, learns representations that implicitly encode downstream task information. The authors discover this through a systematic three-stage ablation:</p>
<h3 id="tokenization">Tokenization</h3>
<p>Rather than using hand-crafted tokenization rules that separate individual atoms (C, N, H) and bond symbols (#, =), BARTSmiles uses a learned SentencePiece unigram tokenizer trained on 10 million random SMILES with a vocabulary size of 1,021. On matched compute budgets, learned tokenization achieves 0.801 average AUC-ROC vs. 0.779 for hand-crafted tokenization on the ablation benchmark (HIV, BBBP, ClinTox).</p>
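<p>For contrast, a hand-crafted tokenizer of the kind BARTSmiles ablates against typically splits SMILES with a regex over bracket atoms, two-letter elements, single atoms, and bond symbols. The pattern below follows common SMILES-tokenization practice and is an illustration, not the paper&rsquo;s exact rules:</p>

```python
import re

# One token per bracket atom, two-letter element, single atom,
# bond/branch symbol, or ring-closure digit.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]"
    r"|[=#\-\+\(\)/\\.%:~]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: tokenization must cover the whole string.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```

<p>A learned unigram tokenizer instead discovers multi-character substructure tokens from data, which is what the ablation found to work better on matched compute.</p>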
<h3 id="masking-strategy">Masking Strategy</h3>
<p>The BART denoising objective has three main hyperparameters: the mask token budget (fraction of tokens masked), random mask probability, and the Poisson $\lambda$ controlling mask span length. The ablation results show:</p>
<ul>
<li><strong>Mask token budget</strong>: The standard BART value of 0.30 is suboptimal for molecules. A budget of 0.20 performs best (0.821 AUC-ROC), with performance degrading at both lower (0.10: 0.753) and higher (0.40: 0.701) budgets.</li>
<li><strong>Span masking</strong>: The choice of random mask probability and $\lambda$ has a minor effect once the budget is set to 0.20. Values of random mask = 0.10 and $\lambda$ = 2.5 or 3.5 all yield 0.821.</li>
<li><strong>Token randomization</strong>: Disabling the randomize-tokens noise (where some tokens are replaced with random tokens rather than masked) improves performance from 0.821 to 0.835.</li>
</ul>
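<p>The noising step can be sketched with numpy: sample span lengths from Poisson($\lambda$), replace each span with a single mask token, and stop once the token budget is consumed. This is a simplified version of BART-style text infilling under the ablated settings; exact BARTSmiles details may differ:</p>

```python
import numpy as np

def span_mask(tokens, budget=0.20, lam=2.5, mask="<mask>", seed=0):
    """Mask Poisson-length spans until ~budget of tokens are
    consumed; each span collapses to one mask token."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    to_mask = int(budget * n)
    out, i, masked = [], 0, 0
    while i < n:
        if masked < to_mask and rng.random() < 0.15:
            span = max(1, rng.poisson(lam))          # span length >= 1
            span = min(span, to_mask - masked, n - i)  # respect budget
            out.append(mask)                          # span -> one token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out
```
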
<h3 id="scale">Scale</h3>
<p>Training on the full 1.7 billion molecule ZINC20 dataset (20 hours on 1,024 A100 GPUs, totaling 20,480 A100 GPU-hours) improves performance by 5 absolute AUC-ROC points over the same model trained on 100 million samples. The previous most compute-intensive molecular pre-training used 3,330 V100-hours (Ross et al., 2021).</p>
<h3 id="implicit-task-encoding">Implicit Task Encoding</h3>
<p>The paper provides a quantitative demonstration that frozen BARTSmiles representations encode task-specific information. Using L1-regularized logistic regression on frozen 1,024-dimensional mean-pooled representations, just 7 neurons are sufficient to achieve 0.987 AUC-ROC on ClinTox (within 2 percentage points of full fine-tuning). Even a single neuron achieves 0.77 AUC-ROC on ClinTox subtask 1.</p>
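<p>The sparse-probe experiment amounts to L1-regularized logistic regression on frozen embeddings. A minimal proximal-gradient version is sketched below for illustration (the paper presumably used a standard solver); the L1 soft-threshold is what drives most weights to exactly zero, exposing the handful of informative neurons:</p>

```python
import numpy as np

def l1_logistic_probe(X, y, lam=0.05, lr=0.5, steps=3000):
    """Fit w, b minimizing logistic loss + lam * ||w||_1 via
    proximal gradient descent (gradient step + soft-threshold)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n)
        b -= lr * float(np.mean(p - y))
        # Proximal operator of the L1 penalty: soft-thresholding.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b
```
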
<h2 id="experimental-setup-moleculenet-toxicology-and-generative-benchmarks">Experimental Setup: MoleculeNet, Toxicology, and Generative Benchmarks</h2>
<h3 id="classification-tasks">Classification Tasks</h3>
<p>BARTSmiles is evaluated on 7 classification datasets from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (SIDER, ClinTox, Tox21, ToxCast, HIV, BACE, BBBP) plus 2 toxicology datasets (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames</a>, <a href="https://en.wikipedia.org/wiki/Micronucleus_test">Micronucleus Assay</a>). All classification tasks use AUC-ROC. Baselines include both supervised graph models (D-MPNN, Attentive FP, 3D InfoMax) and self-supervised methods (ChemBERTa, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MolFormer-XL</a>, GROVER-large, MolCLR, iMolCLR).</p>
<p>Selected classification results (AUC-ROC):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td><strong>0.997</strong></td>
          <td>0.954</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td><strong>0.825</strong></td>
          <td>0.805</td>
          <td>Attentive FP</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td><strong>0.705</strong></td>
          <td>0.699</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>0.851</td>
          <td>0.858</td>
          <td>Attentive FP</td>
      </tr>
  </tbody>
</table>
<p>The authors note that three scaffold-split datasets (HIV, BACE, BBBP) are highly sensitive to the specific split used, and they suspect some baseline results use different or random splits. These results are marked with caveats in the paper.</p>
<h3 id="regression-tasks">Regression Tasks</h3>
<p>All three MoleculeNet regression tasks (ESOL, FreeSolv, Lipophilicity) are evaluated using RMSE:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td><strong>0.095</strong></td>
          <td>0.279</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td><strong>0.114</strong></td>
          <td>0.231</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td><strong>0.292</strong></td>
          <td>0.529</td>
          <td>MoLFormer-XL</td>
      </tr>
  </tbody>
</table>
<p>BARTSmiles achieves substantial improvements on all three regression tasks.</p>
<h3 id="generative-tasks">Generative Tasks</h3>
<p><strong><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></strong> (USPTO-50k): BARTSmiles achieves 55.6% Top-1 accuracy using a sample-128 + perplexity re-ranking strategy, compared to 55.3% for Dual-TF and 54.3% for Chemformer. Top-5 and Top-10 results are 74.2% and 80.9%, respectively.</p>
<p><strong>Chemical Reaction Prediction</strong> (USPTO MIT/LEF/STEREO): BARTSmiles with beam search outperforms the <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a> baseline across all six evaluation settings. On USPTO-MIT (split), BARTSmiles achieves 91.8% vs. 90.4% for the Transformer baseline.</p>
<h3 id="fine-tuning-recipe">Fine-Tuning Recipe</h3>
<p>The fine-tuning approach is designed to minimize hyperparameter tuning:</p>
<ul>
<li>Batch size 16, 10 epochs, polynomial decay learning rate schedule with warmup at 16% of training</li>
<li>Grid search over dropout (0.1, 0.2, 0.3) and learning rate ($5 \times 10^{-6}$, $1 \times 10^{-5}$, $3 \times 10^{-5}$)</li>
<li>Stochastic Weight Averaging (SWA) over three sets of four checkpoints</li>
<li>For generative tasks: R3F regularization (Aghajanyan et al., 2020a) and full fp32 precision</li>
<li>For generation: beam search (beam size 10) or sample 128 sequences with perplexity re-ranking</li>
</ul>
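<p>The SWA step in the recipe is just an element-wise mean over checkpoint parameters. Sketched here over plain numpy state dicts for clarity (frameworks like PyTorch provide SWA utilities natively):</p>

```python
import numpy as np

def swa_average(checkpoints):
    """checkpoints: list of dicts mapping parameter name ->
    np.ndarray. Returns the element-wise average, as used to
    combine each set of four checkpoints."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in keys}
```
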
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Scale matters for molecular pre-training</strong>: Training on 1.7B molecules with 20,480 A100 GPU-hours yields 5 absolute points of AUC-ROC improvement over training on 100M molecules.</li>
<li><strong>Domain-specific ablation is necessary</strong>: The optimal BART masking configuration for molecules (20% budget, no token randomization) differs from the standard NLP configuration (30% budget, with randomization).</li>
<li><strong>Frozen representations capture task structure</strong>: A small number of neurons from the frozen model can nearly match full fine-tuning performance on certain tasks, suggesting the pre-training objective implicitly encodes molecular properties.</li>
<li><strong>Interpretability aligns with domain knowledge</strong>: Integrated Gradients attribution on fine-tuned BARTSmiles highlights known structural alerts (e.g., <a href="https://en.wikipedia.org/wiki/Nitro_compound">nitro groups</a> in mutagenic compounds, hydroxyl groups in soluble compounds).</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Scaffold split sensitivity</strong>: Results on HIV, BACE, and BBBP are sensitive to the specific scaffold split, making direct comparison with baselines difficult.</li>
<li><strong>Pre-training data distribution</strong>: The <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a> analysis shows that some downstream datasets (BBBP, SIDER) are far from ZINC20 in representation space, which may explain weaker performance on those tasks.</li>
<li><strong>Fingerprints carry complementary information</strong>: On the Ames and Micronucleus Assay datasets, BARTSmiles alone does not beat fingerprint-based baselines. Combining BARTSmiles with ECFP4 fingerprints closes the gap, implying that SMILES-based pre-training does not fully capture all structural information.</li>
<li><strong>Compute requirements</strong>: Pre-training requires 1,024 A100 GPUs, which limits accessibility.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest investigating the impact of pre-training data composition, noting that ZINC20 contains over a billion molecules but its distribution may be irrelevant for many downstream tasks. They also propose further collaboration between ML and chemistry experts to discover new molecular substructure-property relationships.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pre-training, fine-tuning, and evaluation scripts with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20 (deduplicated)</td>
          <td>~1.7B molecules</td>
          <td>Canonicalized SMILES, 10K validation holdout</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (7 datasets)</td>
          <td>1,427-41,127 compounds</td>
          <td>AUC-ROC metric</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (3 datasets)</td>
          <td>642-4,200 compounds</td>
          <td>RMSE metric</td>
      </tr>
      <tr>
          <td>Toxicology</td>
          <td>Ames, MN Assay</td>
          <td>6,512 / 641 compounds</td>
          <td>Cross-validation for Ames; external test for MN</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50k</td>
          <td>Standard split</td>
          <td>Top-K accuracy</td>
      </tr>
      <tr>
          <td>Reaction prediction</td>
          <td>USPTO (MIT/LEF/STEREO)</td>
          <td>Standard splits</td>
          <td>Top-1 accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BART-Large (pre-layer norm Transformer encoder-decoder)</li>
<li>Tokenizer: SentencePiece unigram, vocabulary size 1,021, max sequence length 128</li>
<li>Pre-training objective: BART denoising (mask token budget 0.20, Poisson span masking with $\lambda$ = 2.5, no token randomization)</li>
<li>Fine-tuning: polynomial decay LR, SWA, grid search over dropout and LR</li>
<li>Generative fine-tuning: R3F regularization, fp32 precision, Adam initialized from pre-training moving averages</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BART-Large architecture (exact parameter count not specified in paper)</li>
<li>Pre-trained checkpoint released on GitHub</li>
<li>Maximum sequence length: 128 tokens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>BARTSmiles</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td>AUC-ROC</td>
          <td>0.997</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>AUC-ROC</td>
          <td>0.825</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.095</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>0.114</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>RMSE</td>
          <td>0.292</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>USPTO-50k Retro (Top-1)</td>
          <td>Accuracy</td>
          <td>55.6%</td>
          <td>New SOTA (sample + re-rank)</td>
      </tr>
      <tr>
          <td>USPTO-MIT Rxn (Split)</td>
          <td>Accuracy</td>
          <td>91.8%</td>
          <td>New SOTA (beam-10)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 1,024 NVIDIA A100 GPUs for 20 hours (20,480 A100 GPU-hours)</li>
<li>Ablation runs: 128 A100 GPUs per run</li>
<li>Framework: FairSeq with FairScale (fully sharded data parallel), automatic mixed precision</li>
<li>Experiment tracking: Aim</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chilingaryan, G., Tamoyan, H., Tevosyan, A., Babayan, N., Khondkaryan, L., Hambardzumyan, K., Navoyan, Z., Khachatrian, H., &amp; Aghajanyan, A. (2024). BARTSmiles: Generative Masked Language Models for Molecular Representations. <em>Journal of Chemical Information and Modeling</em>, 64(15), 5832-5843. <a href="https://doi.org/10.1021/acs.jcim.4c00512">https://doi.org/10.1021/acs.jcim.4c00512</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2024 (preprint: arXiv 2022)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles GitHub Repository (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chilingaryan2024bartsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BARTSmiles: Generative Masked Language Models for Molecular Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5832--5843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00512}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFormer: A SELFIES-Based Molecular Language Model</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/selformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/selformer/</guid><description>A SELFIES-based RoBERTa model pretrained on 2M ChEMBL molecules for molecular property prediction on MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-chemical-language-model">A SELFIES-Based Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$) with a secondary <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>SELFormer applies the RoBERTa transformer architecture to <a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">SELFIES</a> molecular string representations instead of the <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> and fine-tuned for molecular property prediction tasks on <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.</p>
<h2 id="why-selfies-over-smiles-for-pretraining">Why SELFIES Over SMILES for Pretraining?</h2>
<p>Existing chemical language models, including <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MolBERT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MolFormer</a>, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.</p>
<p><a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">SELFIES</a> addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES&rsquo; growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.</p>
<h2 id="masked-language-modeling-on-guaranteed-valid-molecular-strings">Masked Language Modeling on Guaranteed-Valid Molecular Strings</h2>
<p>SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.</p>
<p>The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder&rsquo;s output embedding.</p>
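<p>The MLM loss above translates directly into numpy for a toy vocabulary; logits, targets, and mask indices below are stand-ins for model outputs:</p>

```python
import numpy as np

def mlm_loss(logits, targets, masked_idx):
    """Cross-entropy averaged over masked positions only.
    logits: (seq_len, vocab) unnormalized scores,
    targets: (seq_len,) true token ids,
    masked_idx: indices of the masked set M."""
    z = logits[masked_idx]
    z = z - z.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(masked_idx)), targets[masked_idx]]
    return -picked.mean()
```
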
<p>Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors&rsquo; hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>SELFormer-Lite</th>
          <th>SELFormer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Attention Heads</td>
          <td>12</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Hidden Layers</td>
          <td>8</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Batch Size</td>
          <td>16</td>
          <td>16</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>5e-5</td>
          <td>5e-5</td>
      </tr>
      <tr>
          <td>Weight Decay</td>
          <td>0.01</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>Pretraining Epochs</td>
          <td>100</td>
          <td>100</td>
      </tr>
      <tr>
          <td>Parameters</td>
          <td>58.3M</td>
          <td>86.7M</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarking-against-smiles-transformers-and-graph-models">Benchmarking Against SMILES Transformers and Graph Models</h2>
<p>SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.</p>
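<p>Scaffold splitting groups molecules by their Bemis-Murcko scaffold so that structurally related compounds never straddle the train/test boundary. A library-free sketch of the grouping logic (real pipelines such as Chemprop derive the scaffold key with RDKit; the string keys below are stand-ins):</p>

```python
from collections import defaultdict

def scaffold_split(mols, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups, largest first, to train/valid/test."""
    groups = defaultdict(list)
    for m in mols:
        groups[scaffold_of(m)].append(m)
    train, valid, test = [], [], []
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        if len(train) + len(members) <= frac_train * len(mols):
            train += members
        elif len(valid) + len(members) <= frac_valid * len(mols):
            valid += members
        else:
            test += members
    return train, valid, test

# Toy molecules tagged with a fake scaffold label (sizes 6, 2, 1, 1).
mols = [("m%d" % i,
         "scafA" if i < 6 else "scafB" if i < 8 else
         "scafC" if i < 9 else "scafD")
        for i in range(10)]
train, valid, test = scaffold_split(mols, scaffold_of=lambda m: m[1])
```

<p>Because whole groups move together, a scaffold present in the test set is guaranteed absent from training, which is what makes scaffold splits harder than random splits.</p>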
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BACE</th>
          <th>BBBP</th>
          <th>HIV</th>
          <th>Tox21</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td>0.832</td>
          <td><strong>0.902</strong></td>
          <td>0.681</td>
          <td>0.653</td>
          <td><strong>0.745</strong></td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>0.799</td>
          <td>0.728</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolBERT</td>
          <td><strong>0.866</strong></td>
          <td>0.762</td>
          <td><strong>0.783</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.809</td>
          <td>0.710</td>
          <td>0.771</td>
          <td>0.759</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td><strong>0.890</strong></td>
          <td>0.736</td>
          <td><strong>0.806</strong></td>
          <td><strong>0.787</strong></td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.856</td>
          <td>0.724</td>
          <td><strong>0.806</strong></td>
          <td>0.781</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.855</td>
          <td><strong>0.908</strong></td>
          <td>-</td>
          <td><strong>0.848</strong></td>
          <td>0.649</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE, scaffold split, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>PDBbind</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td><strong>0.682</strong></td>
          <td>2.797</td>
          <td>0.735</td>
          <td>1.488</td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>-</td>
          <td>-</td>
          <td>0.986</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>1.050</td>
          <td><strong>2.082</strong></td>
          <td><strong>0.683</strong></td>
          <td><strong>1.397</strong></td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.798</td>
          <td><strong>1.877</strong></td>
          <td>0.660</td>
          <td>-</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.803</td>
          <td>2.121</td>
          <td><strong>0.600</strong></td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.</p>
<h2 id="strong-classification-performance-with-compact-pretraining">Strong Classification Performance with Compact Pretraining</h2>
<p>SELFormer&rsquo;s strongest results come on classification tasks where molecular substructure matters:</p>
<ul>
<li><strong>SIDER</strong>: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES&rsquo; ability to capture subtle structural differences relevant to drug side effects.</li>
<li><strong>BBBP</strong>: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.</li>
<li><strong>BACE/HIV vs. ChemBERTa-2</strong>: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.</li>
<li><strong>ESOL regression</strong>: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.</li>
</ul>
<p>Limitations are also apparent:</p>
<ul>
<li><strong>HIV and Tox21</strong>: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.</li>
<li><strong>FreeSolv and Lipophilicity regression</strong>: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.</li>
<li><strong>Small pretraining corpus</strong>: At 2M molecules, SELFormer&rsquo;s corpus is orders of magnitude smaller than MolFormer&rsquo;s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), which the authors read as evidence of a representational advantage for SELFIES.</li>
<li><strong>Single-task ablation scope</strong>: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v30</td>
          <td>2,084,725 compounds (2,084,472 after SELFIES conversion)</td>
          <td>Drug-like bioactive small molecules</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>1,513</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase 1</a> inhibitor binding</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td><a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">Blood-brain barrier</a> permeability</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>HIV replication inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER</td>
          <td>1,427</td>
          <td>Drug side effects (27 classes)</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Tox21</td>
          <td>7,831</td>
          <td>Toxicity (12 targets)</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>Octanol/water distribution coefficient</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PDBbind</td>
          <td>11,908</td>
          <td>Binding affinity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (MLM), 15% token masking</li>
<li><strong>Tokenization</strong>: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings</li>
<li><strong>SMILES to SELFIES conversion</strong>: SELFIES API with Pandaral.lel for parallelization</li>
<li><strong>Splitting</strong>: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)</li>
<li><strong>Fine-tuning</strong>: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search</li>
</ul>
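<p>The 15% masking step can be sketched as follows. This version replaces every selected token with a mask symbol; whether SELFormer also uses BERT's 80/10/10 corruption mix is not stated here, so that refinement is omitted:</p>

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, rng=None):
    """Return (corrupted, labels): labels hold the true token at masked
    positions and None elsewhere."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

# Toy SELFIES-style token stream.
toks = ["[C]", "[C]", "[O]", "[Branch1]", "[C]", "[N]"] * 10
corrupted, labels = mask_tokens(toks)
```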
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa (HuggingFace Transformers)</li>
<li><strong>SELFormer</strong>: 12 hidden layers, 4 attention heads, 86.7M parameters</li>
<li><strong>SELFormer-Lite</strong>: 8 hidden layers, 12 attention heads, 58.3M parameters</li>
<li><strong>Hyperparameter search</strong>: Sequential search over ~100 configurations on 100K molecule subset</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Area under receiver operating characteristic curve</td>
      </tr>
      <tr>
          <td>PRC-AUC</td>
          <td>Classification</td>
          <td>Area under precision-recall curve (reported for random splits)</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<p>Results reported on scaffold split and random split datasets.</p>
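<p>ROC-AUC equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. A dependency-free sketch using that rank formulation (the scores below are illustrative, not model outputs):</p>

```python
def roc_auc(labels, scores):
    """Probability a positive outranks a negative; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # one positive (0.6) loses to one negative (0.7)
```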
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 2x NVIDIA A5000 GPUs</li>
<li><strong>Hyperparameter optimization time</strong>: ~11 days</li>
<li><strong>Full pretraining</strong>: 100 epochs on 2.08M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HUBioDataLab/SELFormer">SELFormer GitHub</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/HUBioDataLab/SELFormer">SELFormer on HuggingFace</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretrained SELFormer weights</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v30</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source pretraining data</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Benchmark</td>
          <td>Unknown</td>
          <td>Downstream evaluation tasks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yüksel, A., Ulusoy, E., Ünlü, A., &amp; Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. <em>Machine Learning: Science and Technology</em>, 4(2), 025035. <a href="https://doi.org/10.1088/2632-2153/acdb30">https://doi.org/10.1088/2632-2153/acdb30</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/HUBioDataLab/SELFormer">GitHub Repository (SELFormer)</a></li>
<li><a href="https://huggingface.co/HUBioDataLab/SELFormer">HuggingFace Model Hub (SELFormer)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yuksel2023selformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Y{\&#34;u}ksel, Atakan and Ulusoy, Erva and {\&#34;U}nl{\&#34;u}, Atabey and Do{\u{g}}an, Tunca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{025035}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/acdb30}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoLFormer: Large-Scale Chemical Language Representations</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/</guid><description>A linear-attention transformer pretrained on 1.1B SMILES from PubChem and ZINC for molecular property prediction across MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-billion-scale-chemical-language-model">A Billion-Scale Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>MoLFormer is a transformer encoder pretrained via masked language modeling on 1.1 billion <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a>. The key architectural choices are linear attention (for $O(N)$ complexity instead of $O(N^2)$) and rotary positional embeddings (RoPE). The resulting model, MoLFormer-XL, produces molecular embeddings that outperform or match GNN baselines across a wide range of <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification and regression tasks, including quantum-chemical property prediction from SMILES alone.</p>
<h2 id="bridging-the-gap-between-molecular-languages-and-graph-neural-networks">Bridging the Gap Between Molecular Languages and Graph Neural Networks</h2>
<p>Prior chemical language models like <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> were pretrained on relatively small datasets (10M-77M molecules) and generally underperformed GNNs on molecular property prediction. The core question: does a transformer trained on a sufficiently large SMILES corpus learn enough chemical structure to compete with graph-based methods that have explicit topological inductive biases?</p>
<p>Two specific challenges motivated this work:</p>
<ul>
<li><strong>Scale</strong>: Chemical space contains an estimated $10^{60}$ to $10^{100}$ plausible molecules, yet labeled property data is scarce. Self-supervised pretraining on the ~1.1B unlabeled molecules available in public databases could provide a general-purpose representation.</li>
<li><strong>Efficiency</strong>: Standard transformer attention is $O(N^2)$ in sequence length, making billion-scale pretraining impractical without architectural modifications.</li>
</ul>
<h2 id="linear-attention-with-rotary-positional-embeddings">Linear Attention with Rotary Positional Embeddings</h2>
<p>MoLFormer&rsquo;s two key architectural choices are its attention mechanism and positional encoding scheme.</p>
<p><strong>Standard attention</strong> computes:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle) v_n}{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle)}
$$</p>
<p>MoLFormer replaces this with <strong>linear attention</strong> using a generalized feature map $\varphi$, combined with <strong>rotary positional embeddings</strong> $R_m$ applied before the feature map:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>This differs from the original RoFormer formulation, which applies the rotation after the feature map. The authors found that rotating the raw queries and keys before projection led to faster convergence and lower validation loss. The combination of linear attention and adaptive sequence-length bucketing reduces GPU requirements from ~1000 to 16 for training on the full 1.1B corpus.</p>
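<p>The ordering detail (rotate the raw queries/keys, then apply the feature map) can be made concrete with a small numpy sketch. The elu(x)+1 feature map and the simplified rotation below are assumptions standing in for the paper's generalized feature map, but the rotate-before-$\varphi$ structure matches the formula above:</p>

```python
import numpy as np

def rope(x):
    """Simplified rotary embedding: rotate dim pairs by position-dependent angles."""
    n, d = x.shape
    half = d // 2
    ang = np.arange(n)[:, None] * (10000.0 ** (-np.arange(half) / half))[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def phi(x):
    """elu(x) + 1: a positive feature map commonly used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # Rotation applied to raw q/k BEFORE the feature map (the paper's variant).
    qf, kf = phi(rope(q)), phi(rope(k))
    kv = kf.T @ v        # (d, d_v): shared across all queries -> O(N) overall
    z = kf.sum(axis=0)   # normalizer terms
    return (qf @ kv) / (qf @ z)[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 3))
out = linear_attention(q, k, v)  # shape (8, 3)
```

<p>The key point is that <code>kf.T @ v</code> and <code>kf.sum(axis=0)</code> are computed once and reused for every query, which is what removes the $O(N^2)$ pairwise term.</p>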
<p>The model uses masked language modeling (15% token masking, following BERT conventions) with a vocabulary of 2,362 SMILES tokens. Sequence length is capped at 202 tokens, covering 99.4% of all molecules.</p>
<h2 id="broad-moleculenet-benchmarking-with-scaling-ablations">Broad MoleculeNet Benchmarking with Scaling Ablations</h2>
<p>MoLFormer-XL was evaluated on 11 MoleculeNet tasks against supervised GNNs, self-supervised GNNs, and prior language models.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split; values reported as percentages in the original paper, converted to proportions here for consistency):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.847</strong></td>
          <td><strong>0.948</strong></td>
          <td>0.822</td>
          <td>0.882</td>
          <td><strong>0.690</strong></td>
      </tr>
      <tr>
          <td>N-Gram</td>
          <td>0.912</td>
          <td>0.769</td>
          <td>0.855</td>
          <td>0.830</td>
          <td>0.876</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td>0.736</td>
          <td>0.798</td>
          <td>0.932</td>
          <td>0.806</td>
          <td><strong>0.890</strong></td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.724</td>
          <td>0.781</td>
          <td>0.901</td>
          <td>0.806</td>
          <td>0.856</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>Hu et al.</td>
          <td>0.708</td>
          <td>0.787</td>
          <td>0.789</td>
          <td>0.802</td>
          <td>0.859</td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GeomGCL</td>
          <td>-</td>
          <td>0.850</td>
          <td>0.919</td>
          <td>-</td>
          <td>-</td>
          <td>0.648</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>0.643</td>
          <td>-</td>
          <td>0.906</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE for ESOL/FreeSolv/Lipophilicity, avg MAE for QM9/QM8):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>1.5894</strong></td>
          <td><strong>0.0102</strong></td>
          <td><strong>0.2787</strong></td>
          <td><strong>0.2308</strong></td>
          <td><strong>0.5289</strong></td>
      </tr>
      <tr>
          <td>A-FP</td>
          <td>2.6355</td>
          <td>0.0282</td>
          <td>0.5030</td>
          <td>0.736</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>MPNN</td>
          <td>3.1898</td>
          <td>0.0143</td>
          <td>0.58</td>
          <td>1.150</td>
          <td>0.7190</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>4.3536</td>
          <td>0.0148</td>
          <td>0.970</td>
          <td>1.40</td>
          <td>0.655</td>
      </tr>
  </tbody>
</table>
<p>MoLFormer-XL also outperforms geometry-aware GNNs (DimeNet, GeomGCL, GEM) on ESOL (0.279 vs 0.575), FreeSolv (0.231 vs 0.866), and Lipophilicity (0.529 vs 0.541).</p>
<p><strong>Key ablation findings</strong>:</p>
<ul>
<li><strong>Data scale matters</strong>: Performance improves monotonically from 10% subsets through the full 1.1B corpus. Training on 100% ZINC alone performed worst, likely due to its smaller vocabulary and less diverse molecule lengths.</li>
<li><strong>Model depth matters</strong>: MoLFormer-Base (6 layers) underperforms MoLFormer-XL (12 layers) on most tasks.</li>
<li><strong>Fine-tuning &raquo; frozen</strong>: Fine-tuning the full encoder consistently outperforms using frozen embeddings with a downstream classifier.</li>
<li><strong>Rotary &gt; absolute at scale</strong>: Rotary embeddings underperform absolute embeddings on smaller pretraining sets but overtake them once the corpus exceeds 1B molecules.</li>
</ul>
<h2 id="smiles-transformers-learn-molecular-geometry">SMILES Transformers Learn Molecular Geometry</h2>
<p>The most striking finding is that MoLFormer&rsquo;s attention patterns correlate with 3D interatomic distances, despite training only on 1D SMILES strings.</p>
<p>Using QM9 molecules with known 3D geometries, the authors computed cosine similarity between attention maps and spatial distance matrices across three distance categories:</p>
<table>
  <thead>
      <tr>
          <th>Distance Category</th>
          <th>Range</th>
          <th>Linear Attention (Rotary)</th>
          <th>Full Attention (Rotary)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Short</td>
          <td>$\leq$ 2 Å</td>
          <td>0.594-0.602</td>
          <td>0.598-0.615</td>
      </tr>
      <tr>
          <td>Medium</td>
          <td>2-4 Å</td>
          <td>0.724-0.730</td>
          <td>0.716-0.727</td>
      </tr>
      <tr>
          <td>Long</td>
          <td>4-10 Å</td>
          <td>0.209-0.211</td>
          <td>0.204-0.210</td>
      </tr>
  </tbody>
</table>
<p>The strong correlation in the short and medium categories indicates the model captures covalent bond connectivity and near-neighbor spatial relationships. Linear attention shows marginally higher cosine similarity than full attention on medium-range distances (0.724-0.730 vs 0.716-0.727), though the differences are small.</p>
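<p>The reported numbers are cosine similarities between flattened attention maps and distance information restricted to a band. A minimal sketch of one way to realize that comparison (toy matrices, not QM9 data; the paper's exact binning may differ):</p>

```python
import numpy as np

def band_cosine(attn, dist, lo, hi):
    """Cosine similarity between a flattened attention map and the indicator
    of a distance band (entries with lo < d <= hi)."""
    band = ((dist > lo) & (dist <= hi)).astype(float)
    a, b = attn.ravel(), band.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-atom molecule: symmetric pairwise distances in angstroms.
dist = np.array([[0.0, 1.5, 2.5, 4.5],
                 [1.5, 0.0, 1.4, 3.0],
                 [2.5, 1.4, 0.0, 1.6],
                 [4.5, 3.0, 1.6, 0.0]])
attn = np.full((4, 4), 0.25)               # uniform attention baseline
short = band_cosine(attn, dist, 0.0, 2.0)  # covalent-range band
```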
<p>MoLFormer-XL embeddings also correlate more strongly with molecular fingerprint similarity (0.64 vs 0.48 for ChemBERTa) and maximum common subgraph size (-0.60 vs -0.44), confirming that the representations encode structural information.</p>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Quantum-chemical energies</strong>: SchNet and DimeNet (which encode explicit 3D geometry) outperform MoLFormer-XL on QM9 atomization energy tasks, with DimeNet achieving roughly 10x lower MAE on U0_atom (0.008 vs 0.083 eV). 3D information remains important for these properties.</li>
<li><strong>Sequence length cap</strong>: The 202-token limit excludes 0.6% of molecules, potentially limiting applicability to larger structures.</li>
<li><strong>SMILES canonicalization</strong>: The model depends on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> canonical SMILES; sensitivity to non-canonical forms is not evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>PubChem</td>
          <td>111M molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC</td>
          <td>~1B molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining (combined)</td>
          <td>PubChem + ZINC</td>
          <td>~1.1B molecules</td>
          <td>MoLFormer-XL training set</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP, Tox21, ClinTox, HIV, BACE, SIDER</td>
          <td>1,427-41,127</td>
          <td>MoleculeNet scaffold splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QM9, QM8, ESOL, FreeSolv, Lipophilicity</td>
          <td>642-133,885</td>
          <td>MoleculeNet random splits (QM9/QM8), scaffold (others)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li><strong>Tokenization</strong>: SMILES tokenizer from Schwaller et al., vocabulary of 2,362 tokens</li>
<li><strong>Sequence length</strong>: 1-202 tokens (99.4% coverage)</li>
<li><strong>Optimizer</strong>: Fused LAMB (via APEX), chosen for stability with large batch sizes and no need for learning rate warm-up</li>
<li><strong>Adaptive bucketing</strong>: Sequences grouped by length into buckets to minimize padding waste</li>
</ul>
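<p>The bucketing idea above can be sketched as sorting sequences by length and batching neighbors, so each batch pads only to its local maximum rather than the global cap (the bucket granularity here is illustrative):</p>

```python
def bucket_batches(seqs, batch_size):
    """Group length-sorted sequences into batches to minimize padding."""
    ordered = sorted(seqs, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        width = len(batch[-1])  # local max length within this batch
        batches.append([s + ["<pad>"] * (width - len(s)) for s in batch])
    return batches

# Toy token sequences of lengths 3, 10, 4, 9, 2, 8.
seqs = [["C"] * n for n in (3, 10, 4, 9, 2, 8)]
batches = bucket_batches(seqs, batch_size=3)
```

<p>Here the short sequences pad to width 4 instead of 10, so padding work scales with each batch's own lengths rather than the longest molecule in the corpus.</p>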
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Transformer encoder with linear attention and rotary positional embeddings</li>
<li><strong>MoLFormer-XL</strong>: 12 layers, 12 attention heads, hidden size 768</li>
<li><strong>MoLFormer-Base</strong>: 6 layers (ablation only)</li>
<li><strong>Feature map size</strong>: 32 (generalized feature map for linear attention)</li>
<li><strong>Frozen head</strong>: Fully connected model with hyperparameter sweep (learning rate, batch size, hidden dim, number of layers)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Scaffold splits per MoleculeNet</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (ESOL, FreeSolv, Lipophilicity)</td>
          <td>Scaffold splits</td>
      </tr>
      <tr>
          <td>Avg MAE</td>
          <td>Regression (QM9, QM8)</td>
          <td>Random splits per MoleculeNet</td>
      </tr>
  </tbody>
</table>
<p>QM9 results also reported with 5-fold cross-validation for robustness.</p>
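<p>Both regression metrics are simple aggregates; a dependency-free sketch with toy values (the averaged-MAE form mirrors how multi-target QM9/QM8 scores are summarized):</p>

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over one target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def avg_mae(y_true_cols, y_pred_cols):
    """Mean absolute error per target, averaged across targets."""
    maes = [sum(abs(t - p) for t, p in zip(tc, pc)) / len(tc)
            for tc, pc in zip(y_true_cols, y_pred_cols)]
    return sum(maes) / len(maes)

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # errors 0, 0, 2
```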
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: GPU cluster with nodes containing either 8 NVIDIA Tesla V100 (32GB) or 8 Ampere A100 (40GB) GPUs connected via NVLink and InfiniBand</li>
<li><strong>GPU reduction</strong>: Linear attention + bucketing reduced GPU requirements from ~1000 to 16</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/molformer">IBM/molformer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Pretraining, fine-tuning, and attention visualization</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">MoLFormer-XL (HuggingFace)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pretrained weights (46.8M parameters)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>111M molecules</td>
      </tr>
      <tr>
          <td><a href="https://zinc.docking.org/">ZINC</a></td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>~1B molecules</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., &amp; Das, P. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. <em>Nature Machine Intelligence</em>, 4, 1256-1264. <a href="https://doi.org/10.1038/s42256-022-00580-7">https://doi.org/10.1038/s42256-022-00580-7</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/molformer">GitHub Repository (MoLFormer)</a></li>
<li><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">HuggingFace Models</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2022molformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large-Scale Chemical Language Representations Capture Molecular Structure and Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1256--1264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00580-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-3: Open Source Chemical Foundation Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-3/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-3/</guid><description>An open-source framework integrating DeepChem and Ray for training and benchmarking chemical foundation models like MoLFormer and GROVER at scale.</description><content:encoded><![CDATA[<h2 id="core-contribution-an-open-source-framework">Core Contribution: An Open-Source Framework</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> contributions.</p>
<ul>
<li><strong>Resource Basis</strong>: The core contribution is &ldquo;ChemBERTa-3,&rdquo; an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.</li>
<li><strong>Method Basis</strong>: It trains models like &ldquo;c3-MoLFormer&rdquo; to reproduce and validate the infrastructure.</li>
</ul>
<h2 id="the-pretraining-scalability-challenge">The Pretraining Scalability Challenge</h2>
<ul>
<li><strong>Scalability Challenges</strong>: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.</li>
<li><strong>Proprietary Barriers</strong>: Many high-performing chemical foundation models (e.g., the full <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer-XL</a>) are partially closed-source or difficult to reproduce.</li>
<li><strong>Benchmarking Inconsistencies</strong>: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results produced with differing scaffold splitting algorithms, making those comparisons unreliable.</li>
</ul>
<h2 id="unified-infrastructure--standardized-benchmarking">Unified Infrastructure &amp; Standardized Benchmarking</h2>
<ul>
<li><strong>Unified Infrastructure</strong>: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.</li>
<li><strong>Standardized Benchmarking</strong>: Identification that MoLFormer&rsquo;s scaffold splitting algorithm differs from the standard DeepChem/<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> splitter, and the subsequent standardization of these benchmarks for fair comparison.</li>
<li><strong>New DeepChem Tools</strong>: Introduction of the <code>ModularTorchModel</code> class for flexible loss computation and <code>HuggingFaceModel</code> wrappers to bridge ecosystems.</li>
</ul>
<h2 id="benchmarking-transformers-vs-graph-models">Benchmarking Transformers vs. Graph Models</h2>
<ul>
<li><strong>Architecture Comparison</strong>: Benchmarked Transformers (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a>) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).</li>
<li><strong>Pretraining Scale Disparity</strong>:
<ul>
<li>Transformers were pretrained on ZINC20 subsets ranging from 10M to 1.1B molecules (combining ZINC and PubChem).</li>
<li>Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.</li>
</ul>
</li>
<li><strong>Reproducibility Validation</strong>: Trained &ldquo;c3-MoLFormer&rdquo; (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.</li>
<li><strong>Scaffold Split Analysis</strong>: Compared performance metrics using &ldquo;DeepChem scaffold splits&rdquo; vs. &ldquo;MoLFormer scaffold splits&rdquo; to quantify the impact of data leakage/overlap.</li>
</ul>
<h2 id="overcoming-scaffold-splitting-inconsistencies">Overcoming Scaffold Splitting Inconsistencies</h2>
<ul>
<li><strong>Scaling Transformers vs. Graphs</strong>: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.</li>
<li><strong>Benchmarking sensitivity</strong>: MoLFormer&rsquo;s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap between train and test sets (yielding a lower <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a>, generally quantified via $1 - \frac{|A \cap B|}{|A \cup B|}$) than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.</li>
<li><strong>Infrastructure Viability</strong>: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.</li>
<li><strong>Open Source Release</strong>: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.</li>
</ul>
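<p>As a concrete illustration (not code from the paper), the Tanimoto distance used to quantify train/test structural overlap reduces to a one-line set computation. A real pipeline would compare RDKit fingerprint bit sets; plain Python sets capture the same arithmetic:</p>

```python
def tanimoto_distance(a: set, b: set) -> float:
    """Tanimoto (Jaccard) distance: 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy "fingerprints" as sets of substructure identifiers.
train_fp = {1, 2, 3, 4}
test_fp_close = {1, 2, 3, 5}   # high overlap -> low distance
test_fp_far = {7, 8, 9}        # no overlap  -> maximal distance

print(tanimoto_distance(train_fp, test_fp_close))  # 0.4
print(tanimoto_distance(train_fp, test_fp_far))    # 1.0
```

<p>Splits with lower average train–test Tanimoto distance leak more structural information, which is the paper&rsquo;s explanation for MoLFormer&rsquo;s inflated baseline gap.</p>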
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pretraining</strong>:
<ul>
<li><strong>Source</strong>: <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC20</a> (1.4B compounds) and PubChem.</li>
<li><strong>Scale</strong>: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.</li>
</ul>
</li>
<li><strong>Fine-tuning</strong>:
<ul>
<li><strong>Suite</strong>: MoleculeNet.</li>
<li><strong>Tasks</strong>: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).</li>
<li><strong>Splits</strong>: Critical distinction made between &ldquo;DeepChem scaffold splits&rdquo; (80/10/10) and &ldquo;MoLFormer scaffold splits&rdquo; (which can be downloaded from <a href="https://ibm.ent.box.com/v/MoLFormer-data"><code>https://ibm.ent.box.com/v/MoLFormer-data</code></a>). The paper notes these algorithms differ.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (<code>pip install --pre deepchem</code>) and specific dependencies found within the <code>requirements.txt</code>. Pretraining scripts are available in the <code>chemberta3_benchmarking/pretraining</code> directory of the repository.</li>
<li><strong>Data Preparation</strong>: Featurization workflows (e.g., <code>CircularFingerprint</code>, <code>RDKitConformer</code>) are documented under <code>chemberta3_benchmarking/data/data_preprocessing/</code> in the codebase.</li>
<li><strong>Modular Training</strong>: Uses <code>ModularTorchModel</code> to allow loss computation from intermediate values and flexible component connection.</li>
<li><strong>Training Brittleness</strong>:
<ul>
<li><strong>Optimizer</strong>: Linear learning rate scheduler with warmup.</li>
<li><strong>Instability Handling</strong>: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.</li>
<li><strong>Numerical Issues</strong>: Addressed NaN values by pretraining on a small dataset with low LR before scaling up.</li>
</ul>
</li>
</ul>
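<p>The checkpoint-and-restart mitigation described above can be sketched as a generic training-loop wrapper. This is a hypothetical illustration, not the ChemBERTa-3 code; <code>loss_fn</code>, <code>spike_factor</code>, and the integer &ldquo;state&rdquo; are invented stand-ins for a real optimizer step and model checkpoint:</p>

```python
def train_with_spike_recovery(steps, loss_fn, spike_factor=2.0):
    """Checkpoint every accepted step; on a loss spike, roll back to the
    last stable state and retry. `loss_fn(state)` advances one step and
    returns (new_state, loss). All names are illustrative only."""
    state, checkpoint = 0, 0
    prev_loss = float("inf")
    for _ in range(steps):
        new_state, loss = loss_fn(state)
        if loss > spike_factor * prev_loss:
            state = checkpoint  # restart from last stable checkpoint
            continue
        state, checkpoint, prev_loss = new_state, new_state, loss
    return state

# Simulated loss trace with one spike at the third step; the retry succeeds.
losses = iter([1.0, 0.9, 5.0, 0.85, 0.8])
final_state = train_with_spike_recovery(5, lambda s: (s + 1, next(losses)))
print(final_state)  # 4
```

<p>In practice the &ldquo;retry&rdquo; differs because data shuffling and optimizer state change between attempts, which is why the authors describe the behavior as persistent brittleness rather than a solved problem.</p>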
<h3 id="models">Models</h3>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a></strong>: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM"><code>DeepChem/ChemBERTa-100M-MLM</code></a>) are hosted on Hugging Face so researchers can pull them directly via the <code>transformers</code> library. The core pretraining objective minimized the standard MLM loss:
$$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$
where $\mathcal{M}$ represents the set of masked SMILES token indices, and $\hat{y}_{i}$ is the model&rsquo;s predicted probability for the correct token given the corrupted sequence context.</li>
<li><strong>MoLFormer (c3-MoLFormer)</strong>: Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B"><code>DeepChem/MoLFormer-c3-1.1B</code></a>) are similarly available on Hugging Face.
<ul>
<li>Tokenizer: <code>ibm/MoLFormer-XL-both-10pct</code> tokenizer.</li>
</ul>
</li>
<li><strong>Graph Models</strong>:
<ul>
<li><strong>GROVER</strong>: Graph Transformer with node/edge/graph level self-supervision.</li>
<li><strong>InfoGraph</strong>: Maximizes mutual information between graph-level and substructure representations.</li>
<li><strong>InfoMax3D</strong>: Incorporates 3D conformer data (via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> ETKDGv2) into contrastive pretraining.</li>
<li><strong>DMPNN</strong>: Directed Message Passing Neural Network (Chemprop variant).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> for classification; RMSE for regression (MAE for QM9).</li>
<li><strong>Baselines</strong>: Random Forest, GCN, DMPNN trained on fine-tuning splits only.</li>
<li><strong>Protocol</strong>: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under <code>chemberta3_benchmarking/models_benchmarking/</code> and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.</li>
<li><strong>Key Results</strong>:
<ul>
<li><em>c3-MoLFormer-1.1B</em> achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.</li>
<li>When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Cloud (AWS)</strong>:
<ul>
<li><strong>Compute</strong>: 40 NVIDIA T4 GPUs (<code>g4dn.12xlarge</code> spot instances for pretraining, <code>g4dn.2xlarge</code> for benchmarking).</li>
<li><strong>Cost</strong>: ~$4000 for MoLFormer 1.1B pretraining.</li>
<li><strong>Time</strong>: ~10 days (260 hours) for 1.1B model pretraining.</li>
<li><strong>Setup</strong>: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository&rsquo;s <code>infra/</code> and <code>spot/</code> folders.</li>
</ul>
</li>
<li><strong>On-Premise HPC</strong>:
<ul>
<li><strong>Compute</strong>: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.</li>
<li><strong>Environment</strong>: Ray multi-node multi-GPU framework.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, fine-tuning, and benchmarking framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B">DeepChem/MoLFormer-c3-1.1B</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer re-implementation pretrained on 1.1B molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM">DeepChem/ChemBERTa-100M-MLM</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>ChemBERTa pretrained on 100M ZINC molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-100M">DeepChem/MoLFormer-c3-100M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 100M molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-550M">DeepChem/MoLFormer-c3-550M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 550M molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. <em>Digital Discovery</em>, 5, 662-685. <a href="https://doi.org/10.1039/D5DD00348B">https://doi.org/10.1039/D5DD00348B</a></p>
<p><strong>Publication</strong>: Digital Discovery 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></li>
<li><a href="https://deepchem.io/">DeepChem Project</a></li>
<li><a href="https://huggingface.co/DeepChem">DeepChem Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{singhChemBERTa3OpenSource2026,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-3}}: an open source training framework for chemical foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{662-685}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{The Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D5DD00348B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1039/D5DD00348B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa-1</a>) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a>).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
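<p>Both objectives reduce to simple averages, which a stdlib-only sketch with toy numbers (not the actual training code, which operates on RoBERTa logits) makes concrete:</p>

```python
import math

def mlm_loss(predicted_probs):
    """Cross-entropy over masked positions: mean of -log P(correct token)."""
    return -sum(math.log(p) for p in predicted_probs) / len(predicted_probs)

def mtr_loss(preds, targets):
    """MSE averaged over all properties and molecules.

    preds/targets: one list per molecule of (mean-normalized) property values.
    """
    n_mols, n_props = len(preds), len(preds[0])
    total = sum((p - t) ** 2
                for mol_p, mol_t in zip(preds, targets)
                for p, t in zip(mol_p, mol_t))
    return total / (n_mols * n_props)

# Two masked tokens, each predicted with probability 0.5 -> loss = ln 2.
print(mlm_loss([0.5, 0.5]))                  # ≈ 0.6931
# One molecule, two properties, errors of 1 and 0 -> MSE = 0.5.
print(mtr_loss([[1.0, 2.0]], [[0.0, 2.0]]))  # 0.5
```

<p>The mean-normalization of MTR targets matters because the 200 RDKit properties span wildly different numeric scales; without it, a few large-magnitude properties would dominate the summed squared error.</p>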
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
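<p>A scaffold split assigns whole scaffold groups, never individual molecules, to train/valid/test. The sketch below assumes Bemis-Murcko scaffold strings are already computed (in practice via RDKit inside DeepChem&rsquo;s <code>ScaffoldSplitter</code>); the greedy largest-group-first fill is a simplification of DeepChem&rsquo;s behavior, not its exact algorithm:</p>

```python
from collections import defaultdict

def scaffold_split(smiles, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test, largest groups first.

    `scaffolds[i]` is the Bemis-Murcko scaffold string for `smiles[i]`,
    assumed precomputed here. Returns index lists for each partition.
    """
    groups = defaultdict(list)
    for i, scaf in enumerate(scaffolds):
        groups[scaf].append(i)
    # Fill train with the largest scaffold groups, then valid, then test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test

# Ten molecules across four scaffolds -> an 8/1/1 split.
mols = [f"m{i}" for i in range(10)]
scafs = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = scaffold_split(mols, scafs)
print(len(train), len(valid), len(test))  # 8 1 1
```

<p>Because every molecule sharing a scaffold lands in the same partition, the test set contains only unseen scaffolds, which is what makes scaffold splits a harder (and fairer) generalization test than random splits.</p>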
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though the margins demonstrate that task-specific baselines remain notably robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early stopping set to one pass through the dataset to ensure full coverage</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to handle CSV parsing efficiency, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input where a subset of basic tokens, denoted by $\mathcal{M}$, are masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the original token at masked position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context with the positions in $\mathcal{M}$ masked out, and $\theta$ denotes the network parameters.</p>
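<p>As a concrete (toy) illustration of this average: given the model&rsquo;s predicted probability for the original token at each masked position, the loss is simply the mean negative log-probability over $\mathcal{M}$. A minimal stdlib-only sketch; the positions and probabilities below are invented for illustration, not taken from the paper:</p>

```python
import math

# Hypothetical probabilities P(x_i | x_{\M}; theta) that the model assigns
# to the ORIGINAL token at each masked position of one SMILES string.
masked_token_probs = {3: 0.70, 8: 0.25, 14: 0.90}  # position -> P(correct token)

# L_MLM = -(1/|M|) * sum_{i in M} log P(x_i | x_{\M}; theta)
mlm_loss = -sum(math.log(p) for p in masked_token_probs.values()) / len(masked_token_probs)

print(round(mlm_loss, 4))
```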
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, the researchers had to halt pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking may not pose a sufficiently difficult learning problem for large-scale chemical representation learning.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads organically learn to track chemically relevant substructures (like specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong> to ensure chemical diversity between splits.</li>
</ul>
</li>
</ul>
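<p>The scaffold splitter can be sketched as a greedy assignment: group molecules by scaffold, then fill train, valid, and test so that no scaffold ever crosses a split boundary. A minimal sketch under the assumption that scaffold keys are already computed (in practice DeepChem derives Bemis&ndash;Murcko scaffolds with RDKit); the single-letter scaffold keys and the ten-molecule toy set below are purely illustrative:</p>

```python
from collections import defaultdict

def scaffold_split(mols, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: each scaffold's molecules land in exactly
    one split, with larger scaffold groups placed first."""
    groups = defaultdict(list)
    for idx, scaffold in mols:          # mols: (index, scaffold_key) pairs
        groups[scaffold].append(idx)
    # Place scaffold sets from largest to smallest.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Ten molecules spread over four hypothetical scaffolds A-D.
mols = list(enumerate("AAAAABBBCD"))
train, valid, test = scaffold_split(mols)
print(len(train), len(valid), len(test))  # -> 8 1 1
```

<p>Because whole scaffold groups move together, test-set molecules are structurally novel relative to training, which is what makes scaffold splits harder (and more realistic) than random splits.</p>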
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
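<p>The SmilesTokenizer&rsquo;s regex-based approach can be sketched with the widely used SMILES tokenization pattern, which splits on chemically meaningful units (bracket atoms, two-letter halogens such as Cl and Br, ring-closure digits, bond symbols) rather than learned byte pairs. A rough stdlib-only sketch; the production tokenizer additionally manages a fixed vocabulary and special tokens:</p>

```python
import re

# Regex over chemically meaningful SMILES units: bracket atoms ([NH4+], [C@@H]),
# two-letter halogens (Cl, Br), aromatic atoms, bonds, ring closures, digits.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_REGEX.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize_smiles("CCl"))                    # ['C', 'Cl'], not ['C', 'C', 'l']
```

<p>Unlike BPE, this guarantees that a bracket atom or a chlorine never gets split into chemically meaningless fragments.</p>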
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 per layer (72 attention heads across the 6 layers).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/22_Transfer_Learning_With_HuggingFace_tox21.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured using dual metrics to account for class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
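<p>The need for both metrics can be seen with a toy ranking: with one positive among twenty examples, ranking it third still yields a high ROC-AUC, while average precision (a common PRC-AUC estimate) collapses. A stdlib-only sketch with invented scores; scikit-learn&rsquo;s <code>roc_auc_score</code> and <code>average_precision_score</code> are the production equivalents:</p>

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """PRC-AUC estimated as average precision over the ranked positives."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total, ap = 0, sum(y_true), 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / k / total
    return ap

# One positive among 20, ranked third by strictly decreasing scores.
y = [0, 0, 1] + [0] * 17
s = [round(1 - 0.05 * i, 2) for i in range(20)]
print(round(roc_auc(y, s), 3), round(average_precision(y, s), 3))  # -> 0.895 0.333
```

<p>The same dissociation appears in the results above: ChemBERTa&rsquo;s Tox21 ROC-AUC (0.728) beats D-MPNN&rsquo;s (0.688) while its PRC-AUC (0.207) trails badly (0.429).</p>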
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>