<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Encoders on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/</link><description>Recent content in Molecular Encoders on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/index.xml" rel="self" type="application/rss+xml"/><item><title>Mol2vec: Unsupervised ML with Chemical Intuition</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/mol2vec-unsupervised-chemical-intuition/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/mol2vec-unsupervised-chemical-intuition/</guid><description>Mol2vec applies Word2vec to Morgan substructures, learning dense vector representations of molecules that capture chemical similarity for property prediction.</description><content:encoded><![CDATA[<h2 id="word2vec-meets-cheminformatics">Word2vec Meets Cheminformatics</h2>
<p>Mol2vec is a <strong>Method</strong> paper that introduces an unsupervised approach for learning dense vector representations of molecular substructures. The core idea is a direct analogy to <a href="/notes/machine-learning/classic-papers/distributed-representations/">Word2vec</a> from natural language processing: molecular substructures (derived from the Morgan algorithm) are treated as &ldquo;words,&rdquo; and entire molecules are treated as &ldquo;sentences.&rdquo; By training on a large unlabeled corpus of 19.9 million compounds, Mol2vec produces embeddings where chemically related substructures occupy nearby regions of vector space. Compound-level vectors are then obtained by summing constituent substructure vectors, and these can serve as features for downstream supervised learning tasks.</p>
<h2 id="sparse-fingerprints-and-their-limitations">Sparse Fingerprints and Their Limitations</h2>
<p>Molecular fingerprints, particularly Morgan fingerprints (extended-connectivity fingerprints, ECFP), are among the most widely used molecular representations in cheminformatics. They perform well for similarity searching, virtual screening, and activity prediction. However, they suffer from several practical drawbacks:</p>
<ul>
<li><strong>High dimensionality and sparsity</strong>: Morgan fingerprints are typically hashed to fixed-length binary vectors (e.g., 2048 or 4096 bits), resulting in very sparse representations.</li>
<li><strong>Bit collisions</strong>: The hashing step can map distinct substructures to the same bit position, losing structural information.</li>
<li><strong>No learned relationships</strong>: Each bit is independent, so the representation does not encode any notion of chemical similarity between substructures.</li>
</ul>
<p>At the time of this work (2017), NLP techniques had started to appear in cheminformatics. The <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">tf-idf</a> method had been applied to Morgan fingerprints for compound-protein interaction prediction, and <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> had been used for chemical topic modeling. The Word2vec concept had been adapted for protein sequences (ProtVec) but had not yet been applied to small molecules. Mol2vec fills this gap.</p>
<h2 id="from-substructure-identifiers-to-dense-embeddings">From Substructure Identifiers to Dense Embeddings</h2>
<p>The central insight of Mol2vec is that the Morgan algorithm already produces a natural &ldquo;vocabulary&rdquo; of molecular substructures, and the order in which these substructures appear in a molecule provides local context, analogous to word order in a sentence.</p>
<h3 id="corpus-construction">Corpus Construction</h3>
<p>The training corpus was assembled from <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> v15 and <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v23, merged and deduplicated, then filtered by molecular weight (12-600), heavy atom count (3-50), clogP (-5 to 7), and allowed elements (H, B, C, N, O, F, P, S, Cl, Br). This yielded 19.9 million compounds.</p>
<h3 id="sentence-generation">Sentence Generation</h3>
<p>For each molecule, the Morgan algorithm generates atom identifiers at radius 0 and radius 1. Each atom contributes two identifiers (one per radius), ordered according to the atom order in the canonical <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>. This sequence of identifiers forms a &ldquo;sentence&rdquo; for Word2vec training.</p>
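<p>As a toy illustration of this sentence assembly (the integer identifiers below are made up; in practice they come from the Morgan algorithm, e.g., via RDKit's fingerprint <code>bitInfo</code> output), the interleaving logic might be sketched as:</p>

```python
# Toy sketch of Mol2vec "sentence" construction. The identifier values
# are purely illustrative stand-ins for real Morgan identifiers.

def mol_to_sentence(atom_identifiers):
    """Interleave radius-0 and radius-1 identifiers in canonical atom order.

    atom_identifiers: list of (radius0_id, radius1_id) tuples, one per atom,
    already ordered by the atom order of the canonical SMILES.
    """
    sentence = []
    for r0, r1 in atom_identifiers:
        sentence.append(str(r0))  # radius-0 identifier for this atom
        sentence.append(str(r1))  # radius-1 identifier for this atom
    return sentence

# Hypothetical identifiers for a three-atom fragment:
ids = [(2246728737, 3542456614), (2245384272, 1506563592), (864942730, 2599973650)]
print(mol_to_sentence(ids))
```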
<h3 id="word2vec-training">Word2vec Training</h3>
<p>The model was trained using the gensim implementation of Word2vec. After evaluating both CBOW and Skip-gram architectures with window sizes of 5, 10, and 20, and embedding dimensions of 100 and 300, the best configuration was:</p>
<ul>
<li><strong>Architecture</strong>: Skip-gram</li>
<li><strong>Window size</strong>: 10</li>
<li><strong>Embedding dimension</strong>: 300</li>
</ul>
<p>Rare identifiers appearing fewer than 3 times in the corpus were replaced with a special &ldquo;UNSEEN&rdquo; token, which learns a near-zero vector. This allows the model to handle novel substructures at inference time.</p>
<h3 id="compound-vector-generation">Compound Vector Generation</h3>
<p>The final vector for a molecule is the sum of all its substructure vectors:</p>
<p>$$\mathbf{v}_{\text{mol}} = \sum_{i=1}^{N} \mathbf{v}_{s_i}$$</p>
<p>where $\mathbf{v}_{s_i}$ is the 300-dimensional embedding of the $i$-th substructure identifier in the molecule. Because each occurrence contributes a term to the sum, the compound vector implicitly encodes substructure counts through its magnitude.</p>
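<p>A minimal sketch of the summation, including the UNSEEN fallback for out-of-vocabulary identifiers (the embedding values and 3-dimensional size are illustrative assumptions; the real model uses 300 dimensions):</p>

```python
import numpy as np

# Assumed toy embedding table; identifiers missing from the vocabulary
# fall back to the near-zero "UNSEEN" vector.
embeddings = {
    "2246728737": np.array([0.5, -0.2, 0.1]),
    "864942730":  np.array([0.3,  0.4, -0.6]),
    "UNSEEN":     np.array([0.0,  0.0, 0.0]),
}

def compound_vector(sentence, embeddings):
    """Sum the substructure vectors of a molecule's identifier sentence."""
    return sum(embeddings.get(tok, embeddings["UNSEEN"]) for tok in sentence)

vec = compound_vector(["2246728737", "864942730", "999"], embeddings)
print(vec)  # approximately [0.8, 0.2, -0.5]; "999" maps to UNSEEN
```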
<h2 id="benchmarking-across-regression-and-classification-tasks">Benchmarking Across Regression and Classification Tasks</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors evaluated Mol2vec on four datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>Regression</td>
          <td>1,144</td>
          <td>Aqueous solubility prediction</td>
      </tr>
      <tr>
          <td>Ames</td>
          <td>Classification</td>
          <td>6,511</td>
          <td><a href="https://en.wikipedia.org/wiki/Mutagen">Mutagenicity</a> (balanced: 3,481 positive, 2,990 negative)</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>Classification</td>
          <td>8,192</td>
          <td>12 human toxicity targets (imbalanced)</td>
      </tr>
      <tr>
          <td>Kinase</td>
          <td>Classification</td>
          <td>284 kinases</td>
          <td>Bioactivity from ChEMBL v23</td>
      </tr>
  </tbody>
</table>
<h3 id="machine-learning-methods">Machine Learning Methods</h3>
<p>Three ML methods were compared using both Mol2vec and Morgan FP features:</p>
<ul>
<li><strong>Random Forest (RF)</strong>: scikit-learn, 500 estimators</li>
<li><strong>Gradient Boosting Machine (GBM)</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>Deep Neural Network (DNN)</strong>: Keras/TensorFlow, 4 hidden layers with 2000 neurons each for Mol2vec; 1 hidden layer with 512 neurons for Morgan FP</li>
</ul>
<p>All models were validated with 20 repetitions of 5-fold cross-validation, using the <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for statistical comparison.</p>
<h3 id="esol-regression-results">ESOL Regression Results</h3>
<table>
  <thead>
      <tr>
          <th>Features</th>
          <th>Method</th>
          <th>$R^2_{\text{ext}}$</th>
          <th>MSE</th>
          <th>MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>MLR</td>
          <td>0.81 +/- 0.01</td>
          <td>0.82</td>
          <td>0.69</td>
      </tr>
      <tr>
          <td>Molecular Graph</td>
          <td>CNN</td>
          <td>0.93</td>
          <td>0.31 +/- 0.03</td>
          <td>0.40 +/- 0.00</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>GBM</td>
          <td>0.66 +/- 0.00</td>
          <td>1.43 +/- 0.00</td>
          <td>0.88 +/- 0.00</td>
      </tr>
      <tr>
          <td>Mol2vec</td>
          <td>GBM</td>
          <td>0.86 +/- 0.00</td>
          <td>0.62 +/- 0.00</td>
          <td>0.60 +/- 0.00</td>
      </tr>
  </tbody>
</table>
<p>Mol2vec substantially outperformed Morgan FP ($R^2_{\text{ext}}$ 0.86 vs. 0.66) but did not match the best graph convolution methods ($R^2_{\text{ext}}$ ~0.93).</p>
<h3 id="classification-results-ames-and-tox21">Classification Results (Ames and Tox21)</h3>
<p>On the Ames dataset, Mol2vec and Morgan FP performed comparably (AUC 0.87 vs. 0.88), both matching or exceeding prior SVM and Naive Bayes results. On Tox21, both achieved an average AUC of 0.83, outperforming literature results from graph convolution (0.71) and DNN/SVM approaches (0.71-0.72).</p>
<h3 id="proteochemometric-pcm-extension">Proteochemometric (PCM) Extension</h3>
<p>Mol2vec was combined with ProtVec (protein sequence embeddings using the same Word2vec approach on 3-grams) by concatenating vectors, forming PCM2vec. This was evaluated using a rigorous 4-level cross-validation scheme:</p>
<ul>
<li><strong>CV1</strong>: New compound-target pairs</li>
<li><strong>CV2</strong>: New targets</li>
<li><strong>CV3</strong>: New compounds</li>
<li><strong>CV4</strong>: New compounds and targets</li>
</ul>
<p>On Tox21, PCM2vec improved predictions for new compound-target pairs (CV1: AUC 0.87 vs. 0.79 for Morgan FP) and new compounds (CV3: AUC 0.85 vs. 0.78). On the kinase dataset, PCM2vec approached the performance of classical PCM (Morgan + z-scales) while being alignment-independent, meaning it can be applied to proteins with low sequence similarity.</p>
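<p>The PCM2vec feature construction amounts to a simple concatenation per compound-target pair. A sketch under assumed dimensions (300 for Mol2vec; 100 for ProtVec, matching the original ProtVec paper) with random placeholder vectors:</p>

```python
import numpy as np

# Illustrative stand-ins for a summed Mol2vec compound vector and a
# summed ProtVec protein vector; real features would come from the
# trained embedding models.
rng = np.random.default_rng(0)
mol_vec = rng.normal(size=300)   # compound representation
prot_vec = rng.normal(size=100)  # protein representation

# One PCM2vec feature vector for a single compound-target pair.
pcm_features = np.concatenate([mol_vec, prot_vec])
print(pcm_features.shape)  # (400,)
```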
<h2 id="chemical-intuition-and-practical-value">Chemical Intuition and Practical Value</h2>
<h3 id="embedding-quality">Embedding Quality</h3>
<p>The learned substructure embeddings capture meaningful chemical relationships. Hierarchical clustering of the 25 most common substructures shows expected groupings: aromatic carbons cluster together, aliphatic ring carbons form a separate group, and carbonyl carbons and oxygens are closely related. Similarly, t-SNE projections of amino acid vectors encoded by Mol2vec reproduce known amino acid relationships (e.g., similar distances between Glu/Gln and Asp/Asn pairs, reflecting the carboxylic acid to amide transition).</p>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Skip-gram with 300-dimensional embeddings</strong> provides the best Mol2vec representations, consistent with NLP best practices.</li>
<li><strong>Mol2vec excels at regression tasks</strong>, substantially outperforming Morgan FP on ESOL solubility prediction ($R^2_{\text{ext}}$ 0.86 vs. 0.66).</li>
<li><strong>Classification performance is competitive</strong> with Morgan FP across Ames and Tox21 datasets.</li>
<li><strong>PCM2vec enables alignment-independent proteochemometrics</strong>, extending PCM approaches to diverse protein families with low sequence similarity.</li>
<li><strong>Tree-based methods (RF, GBM) outperformed DNNs</strong> on these tasks, though the authors note further DNN tuning could help.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The compound vector is a simple sum of substructure vectors, which discards information about substructure arrangement and molecular topology.</li>
<li>Only Morgan identifiers at radii 0 and 1 were used. Larger radii might capture more context but would increase vocabulary size.</li>
<li>DNN architectures were not extensively optimized, leaving open the question of how well Mol2vec pairs with deep learning.</li>
<li>The approach was benchmarked against Morgan FP but not against other learned representations such as graph neural networks in a controlled comparison.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC v15 + ChEMBL v23</td>
          <td>19.9M compounds</td>
          <td>Filtered by MW, atom count, clogP, element types</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,144 compounds</td>
          <td>Aqueous solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Ames</td>
          <td>6,511 compounds</td>
          <td>Mutagenicity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21</td>
          <td>8,192 compounds</td>
          <td>12 toxicity targets, retrieved via DeepChem</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Kinase (ChEMBL v23)</td>
          <td>284 kinases</td>
          <td>IC50/Kd/Ki binding assays</td>
      </tr>
      <tr>
          <td>Protein corpus</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a></td>
          <td>554,241 sequences</td>
          <td>For ProtVec training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Word2vec</strong>: Skip-gram, window size 10, 300-dimensional embeddings, min count 3</li>
<li><strong>Morgan algorithm</strong>: Radii 0 and 1 (119 and 19,831 unique identifiers respectively)</li>
<li><strong>UNSEEN token</strong>: Replaces identifiers occurring fewer than 3 times</li>
<li><strong>Compound vector</strong>: Sum of all substructure vectors</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RF</strong>: scikit-learn, 500 estimators, sqrt features, balanced class weights</li>
<li><strong>GBM</strong>: XGBoost, 2000 estimators, max depth 3, learning rate 0.1</li>
<li><strong>DNN</strong>: Keras/TensorFlow, 4 layers x 2000 neurons (Mol2vec) or 1 layer x 512 neurons (Morgan FP), ReLU activation, dropout 0.1</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Mol2vec Best</th>
          <th>Morgan FP Best</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2_{\text{ext}}$</td>
          <td>0.86 (GBM)</td>
          <td>0.66 (GBM)</td>
          <td>ESOL regression</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.87 (RF)</td>
          <td>0.88 (RF)</td>
          <td>Ames classification</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>0.83 (RF)</td>
          <td>0.83 (RF)</td>
          <td>Tox21 classification</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/samoturk/mol2vec">mol2vec</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>Python package with pre-trained model</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jaeger, S., Fulle, S., &amp; Turk, S. (2018). Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. <em>Journal of Chemical Information and Modeling</em>, 58(1), 27-35. <a href="https://doi.org/10.1021/acs.jcim.7b00616">https://doi.org/10.1021/acs.jcim.7b00616</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jaeger2018mol2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jaeger, Sabrina and Fulle, Simone and Turk, Samo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27--35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.7b00616}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>X-MOL: Pre-training on 1.1B Molecules for SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/x-mol-pretraining-molecular-understanding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/x-mol-pretraining-molecular-understanding/</guid><description>X-MOL pre-trains a shared encoder-decoder Transformer on 1.1 billion molecules, then fine-tunes for property prediction, reaction analysis, and generation.</description><content:encoded><![CDATA[<h2 id="a-unified-molecular-pre-training-framework">A Unified Molecular Pre-training Framework</h2>
<p>X-MOL is a <strong>Method</strong> paper that introduces a large-scale pre-training framework for <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecular understanding. The primary contribution is a Transformer encoder-decoder model pre-trained on 1.1 billion molecules from <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC15</a>, which is then fine-tuned across five distinct molecular analysis tasks: molecular property prediction (classification and regression), chemical reaction productivity prediction, <a href="https://en.wikipedia.org/wiki/Drug_interaction">drug-drug interaction</a> (DDI) prediction, de novo molecule generation (distribution learning and goal-directed), and molecule optimization. The paper demonstrates that a single pre-trained model can serve as a universal foundation for diverse downstream chemistry tasks.</p>
<h2 id="bridging-scale-and-understanding-in-molecular-smiles">Bridging Scale and Understanding in Molecular SMILES</h2>
<p>Prior to X-MOL, most molecular analysis tasks were investigated individually with task-specific models. SMILES-based deep learning methods existed but lacked the benefit of large-scale pre-training that had proven transformative in NLP (BERT, RoBERTa, ERNIE, XLNet, T5). Two challenges motivated this work:</p>
<ol>
<li><strong>SMILES sacrifices structural information for simplicity.</strong> While SMILES is a convenient linear representation, molecular topology is encoded only implicitly in the string, making it harder for models to recover structural (let alone 3D) information from the input.</li>
<li><strong>Labelled molecular data is scarce.</strong> Most benchmark datasets (<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>) contain only thousands of labelled examples, making it difficult to train large models from scratch without overfitting.</li>
</ol>
<p>The authors hypothesized that massive-scale pre-training on unlabelled SMILES could teach a model the grammar rules and implicit structural information in SMILES, providing a strong initialization for multiple downstream tasks.</p>
<h2 id="generative-pre-training-with-random-smiles">Generative Pre-training with Random SMILES</h2>
<p>The core innovation in X-MOL is a <strong>generative pre-training strategy</strong> that exploits the non-uniqueness of SMILES. A single molecule can be represented by many valid SMILES strings (<a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">random SMILES</a>), depending on the starting atom, main chain selection, and ring-opening position. X-MOL trains the model to generate a valid alternative SMILES given an input SMILES of the same molecule, forcing the model to:</p>
<ol>
<li>Reconstruct the molecular structure from the input SMILES</li>
<li>Generate a valid output SMILES following SMILES grammar rules</li>
</ol>
<p>The architecture uses a shared-parameter encoder-decoder based on the Transformer. Unlike standard encoder-decoder models (e.g., for machine translation), X-MOL shares all parameters between encoder and decoder, forcing both encoding and decoding to occur in the same semantic space. The output SMILES is fully masked during training, and only unidirectional attention is permitted within the output sequence.</p>
<p>The self-attention mechanism computes attention for each character $i$ as:</p>
<p>$$
Z_{i} = \text{SoftMax}\left(\frac{Q_{i} \cdot K^{T}}{\sqrt{D}}\right) \cdot V
$$</p>
<p>where $Q_{i}$, $K$, and $V$ are the query, key, and value matrices, and $D$ is the feature dimension. The model uses 12 attention heads to capture different relational patterns.</p>
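<p>A minimal NumPy sketch of this attention formula (single head, no output masking, toy dimensions and random data) might look like:</p>

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: SoftMax(Q K^T / sqrt(D)) V."""
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)                   # (L, L) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # one output row per position

L, D = 4, 8  # toy sequence length and feature dimension
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, L, D))
Z = attention(Q, K, V)
print(Z.shape)  # (4, 8)
```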
<h3 id="model-architecture">Model Architecture</h3>
<ul>
<li>12 Transformer encoder layers</li>
<li>768-dimensional hidden units</li>
<li>12 attention heads</li>
<li>Character-level SMILES tokenization (108 chemical characters plus 5 special tokens: [PAD], [CLS], [SEP], [MASK], [UNK])</li>
<li>Characters within square brackets (e.g., charged or isotopic atoms) and two-digit ring-closure numbers preceded by &ldquo;%&rdquo; are treated as single tokens</li>
</ul>
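<p>The bracket and &ldquo;%&rdquo; tokenization rules can be approximated with a short regular expression (the exact X-MOL vocabulary is not public, so this is a sketch; multi-character element symbols such as Cl or Br would need additional handling):</p>

```python
import re

# Bracket atoms like [NH3+] and "%"-prefixed two-digit ring closures
# become single tokens; everything else is split per character.
TOKEN_RE = re.compile(r"\[[^\]]*\]|%\d{2}|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("c1ccccc1[NH3+]%12"))
# ['c', '1', 'c', 'c', 'c', 'c', 'c', '1', '[NH3+]', '%12']
```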
<h3 id="data-augmentation-in-pre-training">Data Augmentation in Pre-training</h3>
<p>Because a molecule has many valid random SMILES, a generated output can be chemically correct yet differ from the single predefined target string. To mitigate this, X-MOL generates multiple training samples per molecule with the same input SMILES but different output random SMILES, and places these in the same mini-batch.</p>
<h2 id="experimental-setup-across-five-tasks">Experimental Setup Across Five Tasks</h2>
<p>X-MOL is fine-tuned with task-specific strategies organized into two categories: prediction tasks and generation tasks.</p>
<h3 id="prediction-tasks">Prediction Tasks</h3>
<p>For prediction tasks, the [CLS] token&rsquo;s output representation is passed through a fully connected network to produce predictions. The input format varies by task:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Input Format</th>
          <th>Loss Function</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Property prediction (classification)</td>
          <td>Single SMILES</td>
          <td>Cross-entropy</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Property prediction (regression)</td>
          <td>Single SMILES</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Reaction productivity prediction</td>
          <td>Four SMILES (reactant, additive, base, ligand)</td>
          <td>MSE</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Two SMILES (drug pair)</td>
          <td>Cross-entropy</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p><strong>Molecular Property Prediction (Classification):</strong> Four MoleculeNet benchmarks were used: HIV (41,127 compounds), BACE (1,513), <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBBP</a> (2,039), and ClinTox (1,484). Data were randomly split 20 times, and average ROC-AUC is reported.</p>
<p><strong>Molecular Property Prediction (Regression):</strong> Three MoleculeNet benchmarks: ESOL (1,128), FreeSolv (642), and Lipophilicity (4,200). Data augmentation with random SMILES was applied to the training set. Average RMSE over 20 random splits is reported.</p>
<p><strong>Chemical Reaction Productivity Prediction:</strong> The <a href="https://en.wikipedia.org/wiki/Cross-coupling_reaction">C-N cross-coupling</a> dataset (3,956 reactions) from Ahneman et al. was used with 10-fold cross-validation.</p>
<p><strong>DDI Prediction:</strong> The DeepDDI dataset (192,284 DDI pairs, 86 interaction types) was used as benchmark.</p>
<h3 id="generation-tasks">Generation Tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Generation Source</th>
          <th>Sampling Strategy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Distribution learning (DL) generation</td>
          <td>Fixed initial symbol ([CLS])</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Goal-directed (GD) generation</td>
          <td>Unfixed initial symbol</td>
          <td>Random sampling</td>
      </tr>
      <tr>
          <td>Molecule optimization</td>
          <td>Input molecule</td>
          <td>Beam search (beam size = 4)</td>
      </tr>
  </tbody>
</table>
<p><strong>DL-based Generation:</strong> Evaluated on ZINC250K (249,456 molecules) using validity, uniqueness, and novelty.</p>
<p><strong>GD Generation:</strong> Also on ZINC250K, using QED as the goal property with target QED = 0.948 (the dataset maximum). 10,000 molecules were generated for evaluation.</p>
<p><strong>Molecule Optimization:</strong> Evaluated on ZINC250K with QED as the optimization goal. Molecular pairs were constructed by selecting pairs with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> in [0.6, 0.8], where the lower-QED molecule serves as input and the higher-QED molecule as target.</p>
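<p>The pair-selection rule can be sketched in plain Python (the fingerprint bit sets and QED values below are made-up stand-ins for fingerprints and scores computed with a cheminformatics toolkit):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b)

def make_pair(mol_a, mol_b, lo=0.6, hi=0.8):
    """Return an (input, target) pair ordered by QED, or None if the
    Tanimoto similarity falls outside [lo, hi]."""
    sim = tanimoto(mol_a["bits"], mol_b["bits"])
    if not (lo <= sim <= hi):
        return None
    # Lower-QED molecule is the input; higher-QED molecule is the target.
    return tuple(sorted((mol_a, mol_b), key=lambda m: m["qed"]))

a = {"name": "A", "bits": {1, 2, 3, 4}, "qed": 0.61}
b = {"name": "B", "bits": {2, 3, 4, 5}, "qed": 0.83}
pair = make_pair(a, b)  # similarity 3/5 = 0.6, so the pair is kept
print(pair[0]["name"], "->", pair[1]["name"])  # A -> B
```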
<h3 id="key-results">Key Results</h3>
<p><strong>Classification (ROC-AUC, higher is better):</strong> X-MOL achieved state-of-the-art on all four datasets, outperforming both shallow learning methods and deep learning baselines including graph convolutional models.</p>
<p><strong>Regression (RMSE, lower is better):</strong> X-MOL achieved the best RMSE on ESOL, FreeSolv, and Lipophilicity.</p>
<p><strong>Reaction Productivity:</strong> X-MOL obtained an average RMSE of 0.0626, compared to the random forest baseline of 0.078.</p>
<p><strong>DDI Prediction:</strong> X-MOL achieved accuracy of 0.952, improving over DeepDDI&rsquo;s 0.924.</p>
<p><strong>DL-based Generation:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>20%</td>
          <td>99.97%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>MRNN</td>
          <td>65%</td>
          <td>99.89%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>GraphAF</td>
          <td>68%</td>
          <td>99.10%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td><strong>X-MOL</strong></td>
          <td><strong>85.28%</strong></td>
          <td><strong>99.91%</strong></td>
          <td><strong>100%</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>GD Generation:</strong> X-MOL generated all top-3 molecules with QED = 0.948, matching the dataset maximum. GraphAF reached 0.948/0.948/0.947, while JT-VAE and MRNN fell further behind.</p>
<h3 id="knowledge-embedding-ablation">Knowledge Embedding Ablation</h3>
<p>The paper tested three additional embedding strategies to inject structural information into the model:</p>
<ul>
<li><strong>Link embedding:</strong> Encodes connection information between atoms (position of the previous connected atom)</li>
<li><strong>Ring embedding:</strong> Encodes ring structure information from SMILES number pairs</li>
<li><strong>Type embedding:</strong> Categorizes characters into 9 types (atoms, bonds, structural symbols)</li>
</ul>
<p>None of these additional embeddings improved performance on the HIV or DDI tasks, whether with or without pre-training. The authors conclude that SMILES already contains sufficient information for molecular understanding and that pre-training effectively extracts this information, a finding they label &ldquo;SMILES is all you need.&rdquo;</p>
<h3 id="attention-visualization">Attention Visualization</h3>
<p>The authors provide attention heatmap analysis demonstrating that:</p>
<ul>
<li>Middle layers (e.g., layer 9) reconstruct molecular structure by correctly identifying atom connectivity and ring closures</li>
<li>Later layers abstract higher-level features for property prediction</li>
<li>In multi-input prediction tasks (reaction productivity), attention reveals which reaction components are most important (e.g., the ligand receives highest cross-attention)</li>
<li>In generation tasks, attention patterns differ between DL (self-focused), GD (source-constrained), and optimization (gradual shift from input to output)</li>
</ul>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>X-MOL demonstrates that large-scale pre-training on SMILES can produce a single model that achieves competitive or state-of-the-art performance across five distinct molecular analysis tasks. The key findings are:</p>
<ol>
<li><strong>Scale enables SMILES understanding.</strong> Pre-training on 1.1 billion molecules allows the model to learn SMILES grammar rules well enough to outperform graph-based methods on molecule generation validity.</li>
<li><strong>Unified framework.</strong> A single pre-trained backbone serves classification, regression, reaction prediction, DDI prediction, and generative tasks through different fine-tuning strategies.</li>
<li><strong>SMILES is sufficient.</strong> Additional knowledge embeddings (link, ring, type) do not improve performance, suggesting pre-training extracts the necessary structural information from SMILES alone.</li>
<li><strong>Interpretable attention.</strong> Attention visualization confirms that the model reconstructs molecular structure internally.</li>
</ol>
<p><strong>Limitations</strong> (observed):</p>
<ul>
<li>Evaluation covers relatively few datasets, all drawn from MoleculeNet. All splits are random; no scaffold or temporal splits are used, which can overestimate performance on structurally novel compounds.</li>
<li>Comparison baselines are somewhat dated (2018-2019 era methods), and the paper does not compare against concurrent SMILES pre-training methods.</li>
<li>The molecule generation validity (85.28%) is much higher than graph baselines like GCPN (20%), but later work achieved near 100% validity with constrained SMILES grammars.</li>
<li>No code or model weights have been publicly released, limiting independent verification.</li>
<li>The paper remains a bioRxiv preprint and has not been published in a peer-reviewed venue.</li>
</ul>
<p><strong>Future directions</strong> proposed by the authors include: better pre-training strategies, extension to graph-based representations, and fine-tuning on additional downstream tasks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC15</td>
          <td>1.1 billion molecules</td>
          <td>Random SMILES augmentation</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV (MoleculeNet)</td>
          <td>41,127</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE (MoleculeNet)</td>
          <td>1,513</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP (MoleculeNet)</td>
          <td>2,039</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>ClinTox (MoleculeNet)</td>
          <td>1,484</td>
          <td>Two sub-datasets, averaged</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL (MoleculeNet)</td>
          <td>1,128</td>
          <td>Water solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv (MoleculeNet)</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity (MoleculeNet)</td>
          <td>4,200</td>
          <td>logD at pH 7.4</td>
      </tr>
      <tr>
          <td>Reaction</td>
          <td>C-N cross-coupling</td>
          <td>3,956</td>
          <td>From Ahneman et al. (2018)</td>
      </tr>
      <tr>
          <td>DDI</td>
          <td>DeepDDI</td>
          <td>192,284 DDI pairs</td>
          <td>86 interaction types</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>ZINC250K</td>
          <td>249,456</td>
          <td>For DL, GD, and optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Generative SMILES-to-SMILES with shared encoder-decoder Transformer</li>
<li>Fine-tuning prediction tasks: [CLS] token passed through fully connected layers</li>
<li>Fine-tuning generation tasks: Autoregressive generation with random sampling (DL, GD) or beam search (optimization)</li>
<li>Data augmentation: Random SMILES augmentation for regression tasks</li>
<li>Repeated training: 20 random splits with averaged results for classification/regression</li>
<li>10-fold cross-validation for reaction productivity</li>
</ul>
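<p>The prediction-task fine-tuning step above (passing the [CLS] representation through fully connected layers) can be sketched as follows. This is a minimal NumPy illustration, not the unreleased X-MOL implementation; the 128-dim hidden layer and two-class output are illustrative choices, while the 768-dim [CLS] vector matches the reported model width.</p>

```python
import numpy as np

def prediction_head(cls_vec, W1, b1, W2, b2):
    """Hypothetical two-layer head: the [CLS] representation is passed
    through fully connected layers to produce task logits."""
    h = np.maximum(0.0, cls_vec @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                        # linear output layer

rng = np.random.default_rng(0)
d, hidden, n_classes = 768, 128, 2
cls_vec = rng.normal(size=d)                  # [CLS] output of the encoder
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n_classes)), np.zeros(n_classes)
logits = prediction_head(cls_vec, W1, b1, W2, b2)
print(logits.shape)  # (2,)
```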
<h3 id="models">Models</h3>
<ul>
<li>12-layer Transformer, 768 hidden dimensions, 12 attention heads</li>
<li>Character-level tokenization: 108 chemical characters + 5 special tokens</li>
<li>Implemented in PaddlePaddle framework</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>X-MOL</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BACE (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>BBBP (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ClinTox (classification)</td>
          <td>ROC-AUC</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>ESOL (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>FreeSolv (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>Lipophilicity (regression)</td>
          <td>RMSE</td>
          <td>State-of-the-art</td>
          <td>Previous best (various)</td>
      </tr>
      <tr>
          <td>C-N coupling</td>
          <td>RMSE</td>
          <td>0.0626</td>
          <td>0.078 (random forest)</td>
      </tr>
      <tr>
          <td>DDI prediction</td>
          <td>Accuracy</td>
          <td>0.952</td>
          <td>0.924 (DeepDDI)</td>
      </tr>
      <tr>
          <td>DL generation</td>
          <td>Validity</td>
          <td>85.28%</td>
          <td>68% (GraphAF)</td>
      </tr>
      <tr>
          <td>GD generation</td>
          <td>Top-3 QED</td>
          <td>All 0.948</td>
          <td>0.948/0.948/0.947 (GraphAF)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 to 16 Tesla P40 GPUs (24 GB each), approximately 4 days</li>
<li>Data pre-processing: Over 1,000 CPUs with Hadoop</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No code, model weights, or pre-trained checkpoints have been publicly released. The model was implemented in Baidu&rsquo;s PaddlePaddle framework, but no repository is available.</p>
<p><strong>Reproducibility status: Closed.</strong> While the datasets are all publicly available (ZINC15, MoleculeNet, ZINC250K, DeepDDI, C-N coupling), the model implementation, pre-trained weights, and fine-tuning code are not released. The computational requirements (1,000+ CPUs for data processing, 8-16 GPUs for 4 days of pre-training) are substantial.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xue, D., Zhang, H., Xiao, D., Gong, Y., Chuai, G., Sun, Y., Tian, H., Wu, H., Li, Y., &amp; Liu, Q. (2020). X-MOL: Large-scale pre-training for molecular understanding and diverse molecular analysis. <em>bioRxiv</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xue2020xmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xue, Dongyu and Zhang, Han and Xiao, Dongling and Gong, Yukang and Chuai, Guohui and Sun, Yu and Tian, Hao and Wu, Hua and Li, Yukun and Liu, Qi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2020.12.23.424259}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Cold Spring Harbor Laboratory}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES-BERT: BERT-Style Pre-Training for Molecules</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/</guid><description>SMILES-BERT applies BERT-style masked pre-training to SMILES strings for molecular property prediction, using Transformer encoders fine-tuned on labeled data.</description><content:encoded><![CDATA[<h2 id="pre-training-transformers-on-smiles-for-molecular-properties">Pre-Training Transformers on SMILES for Molecular Properties</h2>
<p>SMILES-BERT is a <strong>Method</strong> paper that introduces a BERT-inspired pre-training and fine-tuning framework for molecular property prediction. The primary contribution is adapting the masked language model paradigm from NLP to <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a>, enabling a Transformer encoder to learn molecular representations from large-scale unlabeled data before fine-tuning on smaller labeled datasets.</p>
<h2 id="limited-labels-in-molecular-property-prediction">Limited Labels in Molecular Property Prediction</h2>
<p>Molecular property prediction is central to drug discovery and chemical design, but obtaining labeled data requires expensive biological assays. Deep learning methods for this task fall into three categories: manually designed fingerprints (e.g., ECFP), graph-based methods (GCNs operating on molecular graphs), and sequence-based methods (RNNs or CNNs operating on SMILES strings).</p>
<p>Prior unsupervised approaches like <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Seq2seq Fingerprint</a> used an encoder-decoder architecture to learn representations from unlabeled SMILES, but the decoder acts as scaffolding that consumes GPU memory during pre-training without contributing to downstream prediction. The semi-supervised Seq3seq Fingerprint improved on this by incorporating labeled data, but retained the encoder-decoder inefficiency. RNN-based methods also suffer from difficulty in parallel training and require careful tuning (gradient clipping, early stopping) to converge.</p>
<p>The authors identify two motivations: (1) building a semi-supervised model that effectively leverages large pools of unlabeled SMILES to improve prediction with limited labels, and (2) designing an architecture where the entire pre-trained model participates in fine-tuning (no wasted decoder parameters) and naturally supports parallel training.</p>
<h2 id="masked-smiles-recovery-with-transformer-encoders">Masked SMILES Recovery with Transformer Encoders</h2>
<p>The core innovation is the Masked SMILES Recovery pre-training task, directly analogous to BERT&rsquo;s masked language modeling. The model architecture is a stack of Transformer encoder layers, making it fully attention-based (no recurrence) and readily parallelizable.</p>
<h3 id="architecture">Architecture</h3>
<p>SMILES-BERT uses 6 Transformer encoder layers, each with 4-head multi-head self-attention and feed-forward dimension of 1024. Each Transformer layer contains three components: a pre-attention feed-forward network, a self-attention layer, and a post-attention feed-forward network, all followed by layer normalization with residual connections.</p>
<p>The self-attention mechanism uses scaled dot-product attention:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^{Q})(XW^{K})^{T}}{\sqrt{d_{k}}}\right) XW^{V}
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^{Q}$, $W^{K}$, $W^{V} \in \mathbb{R}^{M \times d_{k}}$ are the query, key, and value weight matrices, and $\sqrt{d_{k}}$ is the scaling factor.</p>
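<p>A minimal NumPy sketch of the single-head attention in the equation above, using toy dimensions (the variable names and sizes here are illustrative, not taken from the paper):</p>

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Z = softmax((X Wq)(X Wk)^T / sqrt(d_k)) (X Wv)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Wk.shape[1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax with max-subtraction for numerical stability
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
N, M, d_k = 5, 8, 4                 # 5 tokens, 8-dim inputs, 4-dim head
X = rng.normal(size=(N, M))
Wq, Wk, Wv = (rng.normal(size=(M, d_k)) for _ in range(3))
Z = scaled_dot_product_attention(X, Wq, Wk, Wv)
print(Z.shape)  # (5, 4)
```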
<p>Input SMILES are tokenized at the character level with token embeddings and positional embeddings. A special <code>&lt;GO&gt;</code> token is prepended to each SMILES, and its output representation is used for downstream classification/regression after fine-tuning.</p>
<h3 id="pre-training-masked-smiles-recovery">Pre-training: Masked SMILES Recovery</h3>
<p>Following BERT&rsquo;s masking strategy, 15% of tokens in each SMILES are selected for masking (minimum one per SMILES). Of the selected tokens:</p>
<ul>
<li>85% are replaced with a <code>&lt;MASK&gt;</code> token</li>
<li>10% are replaced with a random token from the vocabulary</li>
<li>5% are kept unchanged</li>
</ul>
<p>The model is trained to recover the original tokens at masked positions. The loss is computed only on the masked token outputs.</p>
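<p>The masking procedure described above can be sketched like so. This is a hedged reconstruction from the stated percentages (15% selection; 85/10/5 corruption split), not the authors&rsquo; unreleased code; the function and variable names are hypothetical.</p>

```python
import random

MASK = "<MASK>"

def mask_smiles(tokens, vocab, rng):
    """Select 15% of tokens (minimum one); of those, replace 85% with
    <MASK>, 10% with a random vocabulary token, and keep 5% unchanged.
    Returns the corrupted sequence plus (position, original) targets."""
    n_select = max(1, round(0.15 * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_select)
    corrupted, targets = list(tokens), []
    for pos in positions:
        targets.append((pos, tokens[pos]))
        r = rng.random()
        if r < 0.85:
            corrupted[pos] = MASK
        elif r < 0.95:
            corrupted[pos] = rng.choice(vocab)
        # else: token kept unchanged (5% of selections)
    return corrupted, targets

rng = random.Random(0)
tokens = list("CC(=O)Oc1ccccc1C(=O)O")    # aspirin, character-level
corrupted, targets = mask_smiles(tokens, vocab=list("CNO=()1c"), rng=rng)
```

The loss would then be computed only at the positions recorded in <code>targets</code>, matching the recovery objective described above.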
<h3 id="fine-tuning">Fine-tuning</h3>
<p>After pre-training, a classifier or regressor head is added to the <code>&lt;GO&gt;</code> token output. The entire model (all Transformer layers plus the new head) is fine-tuned on the labeled dataset.</p>
<p>Key differences from the original BERT:</p>
<ol>
<li>Only the Masked SMILES Recovery task is used (BERT&rsquo;s next sentence prediction is dropped since SMILES have no consecutive-sentence structure)</li>
<li>Segment embeddings are removed</li>
<li>The architecture is smaller (6 layers, 4 heads, 1024 FFN dim) since SMILES have a much smaller vocabulary and shorter sequences than natural language</li>
</ol>
<p>The authors compared this configuration against a larger BERT-base setup (12 layers, 12 heads, 3072 FFN dim) and found no meaningful performance difference, confirming that the smaller model is sufficient for SMILES.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>SMILES-BERT was pre-trained on the <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC database</a> with 18,671,355 training SMILES, 10,000 for validation, and 10,000 for evaluation. Pre-training ran for 10 epochs using the Adam optimizer with a warm-up strategy (learning rate from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay). Batch size was 256 and dropout was 0.1. The pre-training masked SMILES exact recovery rate reached 82.85% on the validation set.</p>
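<p>One plausible reading of the warm-up schedule above (linear ramp from $10^{-9}$ to $10^{-4}$ over 4,000 steps, then inverse-square-root decay) is the following sketch; the paper does not give the exact formula, so this is an assumption in the Noam-schedule style:</p>

```python
import math

WARMUP, LR_MAX, LR_MIN = 4000, 1e-4, 1e-9

def learning_rate(step):
    """Linear warm-up to LR_MAX, then decay proportional to 1/sqrt(step)."""
    if step <= WARMUP:
        return LR_MIN + (LR_MAX - LR_MIN) * step / WARMUP
    return LR_MAX * math.sqrt(WARMUP / step)

print(learning_rate(4000))   # 1e-4 at the end of warm-up
print(learning_rate(16000))  # 5e-05 after 4x more steps
```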
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Source</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a></td>
          <td>NCATS/NIH</td>
          <td>10,850</td>
          <td>Classification (threshold 1.88)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PM2</td>
          <td>NCATS/NIH</td>
          <td>323,242</td>
          <td>Classification (threshold 0.024896)</td>
          <td>Accuracy</td>
      </tr>
      <tr>
          <td>PCBA-686978</td>
          <td><a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></td>
          <td>302,175</td>
          <td>Classification</td>
          <td>Accuracy</td>
      </tr>
  </tbody>
</table>
<p>All datasets were split 80/10/10 for train/validation/test. Fine-tuning used Adam with a fixed learning rate for 50 epochs, selecting the best model on validation data.</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>Circular Fingerprint (CircularFP)</strong>: Manually designed hash-based fingerprint (ECFP family)</li>
<li><strong>Neural Fingerprint (NeuralFP)</strong>: Graph-based neural network replacing hash functions with learned layers</li>
<li><strong>Seq2seq Fingerprint (Seq2seqFP)</strong>: Unsupervised encoder-decoder model on SMILES</li>
<li><strong>Seq3seq Fingerprint (Seq3seqFP)</strong>: Semi-supervised encoder-decoder model on SMILES</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP</th>
          <th>PM2</th>
          <th>PCBA-686978</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CircularFP</td>
          <td>~0.90</td>
          <td>0.6858</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>NeuralFP</td>
          <td>~0.90</td>
          <td>0.6802</td>
          <td>~0.82</td>
      </tr>
      <tr>
          <td>Seq2seqFP</td>
          <td>~0.87</td>
          <td>0.6112</td>
          <td>~0.80</td>
      </tr>
      <tr>
          <td>Seq3seqFP</td>
          <td>~0.90</td>
          <td>0.7038</td>
          <td>~0.84</td>
      </tr>
      <tr>
          <td><strong>SMILES-BERT</strong></td>
          <td><strong>0.9154</strong></td>
          <td><strong>0.7589</strong></td>
          <td><strong>0.8784</strong></td>
      </tr>
  </tbody>
</table>
<p>SMILES-BERT outperformed all baselines on all three datasets. The improvement over Seq3seqFP was approximately 2% on LogP, 5.5% on PM2, and 3.8% on PCBA-686978. The results on PM2 (the largest labeled dataset) show that pre-training benefits persist even with substantial labeled data.</p>
<h3 id="structure-study">Structure Study</h3>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>FFN Dim</th>
          <th>LogP Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMILES-BERT</td>
          <td>6</td>
          <td>4</td>
          <td>1024</td>
          <td>0.9154</td>
      </tr>
      <tr>
          <td>SMILES-BERT (large)</td>
          <td>12</td>
          <td>12</td>
          <td>3072</td>
          <td>0.9147</td>
      </tr>
  </tbody>
</table>
<p>The larger configuration provided no improvement, supporting the choice of the smaller, more efficient architecture.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>SMILES-BERT demonstrated that BERT-style masked pre-training on SMILES strings produces transferable molecular representations that improve property prediction across datasets of varying sizes and property types.</p>
<p>Key findings:</p>
<ul>
<li>The Masked SMILES Recovery pre-training task transfers effectively to molecular property prediction</li>
<li>The full model participates in fine-tuning (no wasted decoder), making SMILES-BERT more parameter-efficient than encoder-decoder alternatives</li>
<li>A smaller Transformer configuration (6 layers, 4 heads) matches the performance of a BERT-base-sized model for SMILES data</li>
<li>Pre-training on ~18.7M SMILES from ZINC provides robust initialization across different downstream tasks</li>
</ul>
<p><strong>Limitations</strong>: The evaluation uses only classification accuracy as the metric, without reporting AUC-ROC, F1, or other metrics common in molecular property prediction. The comparison is limited to four baselines, and two of the three evaluation datasets (LogP, PM2) are non-public NIH datasets. The paper does not explore different pre-training dataset sizes or ablate the masking strategy. Only classification tasks are evaluated, though the architecture supports regression.</p>
<p><strong>Future work</strong>: The authors propose incorporating Quantitative Estimate of Druglikeness (QED) prediction as an additional pre-training task to warm up the model&rsquo;s classification capability, analogous to BERT&rsquo;s next sentence prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC</td>
          <td>18,671,355 SMILES</td>
          <td>Publicly available database</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LogP</td>
          <td>10,850</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PM2</td>
          <td>323,242</td>
          <td>Non-public, from NCATS/NIH</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PCBA-686978</td>
          <td>302,175</td>
          <td>Public, from PubChem BioAssay</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: Adam optimizer, warm-up for 4,000 steps ($10^{-9}$ to $10^{-4}$), inverse-square-root LR schedule, batch size 256, dropout 0.1, 10 epochs</li>
<li>Fine-tuning: Adam optimizer, fixed LR (insensitive to choice among $10^{-5}$, $10^{-6}$, $10^{-7}$), 50 epochs, best model on validation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>6 Transformer encoder layers, 4-head multi-head attention, FFN dim 1024</li>
<li>Token embedding + positional embedding, <code>&lt;GO&gt;</code> special token</li>
<li>Implemented with FairSeq (Facebook AI Research Sequence-to-Sequence Toolkit)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>SMILES-BERT</th>
          <th>Best Baseline (Seq3seqFP)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP Accuracy</td>
          <td>0.9154</td>
          <td>~0.90</td>
          <td>~2% improvement</td>
      </tr>
      <tr>
          <td>PM2 Accuracy</td>
          <td>0.7589</td>
          <td>0.7038</td>
          <td>~5.5% improvement</td>
      </tr>
      <tr>
          <td>PCBA Accuracy</td>
          <td>0.8784</td>
          <td>~0.84</td>
          <td>~3.8% improvement</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper mentions GPU training and an NVIDIA GPU donation in the acknowledgments but does not specify the exact GPU model; it notes only that pre-training for 10 epochs on a single GPU takes over a week.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No public code or model release identified</td>
          <td>-</td>
          <td>-</td>
          <td>Paper does not provide a GitHub link or model checkpoint</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. The ZINC pre-training data is public and the architecture is described in detail, but no code or pre-trained weights are released. Two of three evaluation datasets (LogP, PM2) are non-public.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, S., Guo, Y., Wang, Y., Sun, H., &amp; Huang, J. (2019). SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction. In <em>Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (ACM-BCB &lsquo;19)</em>, 429-436. <a href="https://doi.org/10.1145/3307339.3342186">https://doi.org/10.1145/3307339.3342186</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2019smilesbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Sheng and Guo, Yuzhi and Wang, Yuhong and Sun, Hongmao and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{429--436}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3307339.3342186}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES Transformer: Low-Data Molecular Fingerprints</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-transformer/</guid><description>SMILES Transformer uses unsupervised Transformer pre-training on SMILES strings to produce molecular fingerprints that excel in low-data drug discovery tasks.</description><content:encoded><![CDATA[<h2 id="a-transformer-approach-to-learned-molecular-fingerprints">A Transformer Approach to Learned Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces SMILES Transformer (ST), a Transformer-based sequence-to-sequence model pre-trained on unlabeled SMILES strings to produce continuous, data-driven molecular fingerprints. The primary contribution is demonstrating that unsupervised pre-training on chemical text representations yields fingerprints that generalize well under low-data conditions, outperforming both rule-based fingerprints (ECFP) and graph convolution models on several <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. A secondary contribution is the Data Efficiency Metric (DEM), a scalar metric for evaluating model performance across varying training set sizes.</p>
<h2 id="the-low-data-problem-in-molecular-property-prediction">The Low-Data Problem in Molecular Property Prediction</h2>
<p>Machine learning for drug discovery depends on molecular representations, but labeled datasets of experimentally validated properties are typically small. Conventional approaches fall into two camps: rule-based fingerprints like ECFP that hash substructures into sparse binary vectors, and graph-based methods like GraphConv that learn representations end-to-end. Rule-based fingerprints perform poorly with shallow models or limited data, while graph-based methods are designed for large fully-labeled settings.</p>
<p>Pre-training on unlabeled data had shown strong results in NLP (ELMo, BERT, XLNet), and prior work in cheminformatics had explored RNN-based and VAE-based pre-training on SMILES (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Seq2Seq fingerprints</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a>, heteroencoders). However, none of these studies systematically evaluated performance in small-data settings. Honda et al. fill this gap by applying Transformer-based pre-training to SMILES and measuring data efficiency explicitly.</p>
<h2 id="transformer-pre-training-on-smiles-with-pooled-fingerprint-extraction">Transformer Pre-training on SMILES with Pooled Fingerprint Extraction</h2>
<p>The core innovation is a Transformer encoder-decoder architecture pre-trained as an autoencoder on SMILES strings, with a specific fingerprint extraction strategy that pools the encoder outputs into a fixed-length vector.</p>
<h3 id="architecture">Architecture</h3>
<p>The model uses 4 Transformer blocks for both the encoder and decoder, each with 4-head attention and 256 embedding dimensions plus 2 linear layers. Input SMILES are tokenized at the symbol level (e.g., &lsquo;c&rsquo;, &lsquo;Br&rsquo;, &lsquo;=&rsquo;, &lsquo;(&rsquo;, &lsquo;2&rsquo;) and one-hot encoded. Following Vaswani et al. (2017), the input uses the sum of token encoding and positional encoding.</p>
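<p>Symbol-level tokenization means multi-character atom symbols such as <code>Br</code> and <code>Cl</code> are kept as single tokens. A common regex-based sketch (my assumption; the paper does not publish its tokenizer, and a full pattern would enumerate more multi-character symbols) looks like this:</p>

```python
import re

# Multi-character symbols must precede the catch-all '.' so that
# 'Br' is matched as one token rather than split into 'B' and 'r'.
TOKEN_RE = re.compile(r"Cl|Br|\[[^\]]*\]|.")

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

print(tokenize("CC(=O)Br"))  # ['C', 'C', '(', '=', 'O', ')', 'Br']
```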
<h3 id="pre-training">Pre-training</h3>
<p>The model is pre-trained on 861,000 unlabeled SMILES sampled from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL24</a> to minimize cross-entropy between input and output SMILES (i.e., reconstruction). <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES enumeration</a> (Bjerrum, 2017) randomly generates non-canonical SMILES at each epoch to reduce representation bias. Training runs for 5 epochs with Adam optimization, reaching a perplexity of 1.0 (perfect decoding).</p>
<h3 id="fingerprint-extraction">Fingerprint Extraction</h3>
<p>Since the Transformer outputs symbol-level (atom-level) representations, a pooling strategy produces molecule-level fingerprints. Four vectors are concatenated:</p>
<ol>
<li>Mean-pooled output of the last encoder layer</li>
<li>Max-pooled output of the last encoder layer</li>
<li>First output token of the last encoder layer</li>
<li>First output token of the penultimate encoder layer</li>
</ol>
<p>This produces a 1024-dimensional fingerprint, matching the dimensionality of ECFP for fair comparison.</p>
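<p>The four-way pooling above can be sketched in NumPy. With the paper&rsquo;s 256-dim encoder, concatenating four 256-dim vectors yields the 1024-dim fingerprint; the random arrays here merely stand in for real encoder outputs.</p>

```python
import numpy as np

def extract_fingerprint(last, penultimate):
    """Concatenate: mean-pool and max-pool of the last encoder layer,
    plus the first output token of the last and penultimate layers."""
    return np.concatenate([
        last.mean(axis=0),
        last.max(axis=0),
        last[0],
        penultimate[0],
    ])

rng = np.random.default_rng(0)
seq_len, d = 30, 256                      # 30 SMILES symbols, 256-dim model
last = rng.normal(size=(seq_len, d))      # stand-in encoder outputs
penultimate = rng.normal(size=(seq_len, d))
fp = extract_fingerprint(last, penultimate)
print(fp.shape)  # (1024,)
```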
<h3 id="data-efficiency-metric">Data Efficiency Metric</h3>
<p>The paper proposes DEM to measure how well a model performs across different training set sizes:</p>
<p>$$
M_{DE}(f, m) = \frac{1}{|I|} \sum_{i \in I} m(f_i, X_i, Y_i)
$$</p>
<p>where $f_i$ is the model trained on the fraction $i$ of training data, $m$ is the task metric, and $I = {0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}$ doubles the training percentage at each step. This captures average performance across a range of data availability, giving a single scalar that balances accuracy and data efficiency.</p>
<h2 id="benchmarking-across-moleculenet-with-data-efficiency-focus">Benchmarking Across MoleculeNet with Data Efficiency Focus</h2>
<h3 id="datasets">Datasets</h3>
<p>The evaluation uses 10 datasets from MoleculeNet spanning three categories:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Molecules</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>Regression</td>
          <td>1,128</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>Regression</td>
          <td>643</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>1</td>
          <td>Regression</td>
          <td>4,200</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>MUV</td>
          <td>17</td>
          <td>Classification</td>
          <td>93,127</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>HIV</td>
          <td>1</td>
          <td>Classification</td>
          <td>41,913</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>BACE</td>
          <td>1</td>
          <td>Classification</td>
          <td>1,522</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>Classification</td>
          <td>2,053</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>Tox21</td>
          <td>12</td>
          <td>Classification</td>
          <td>8,014</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>SIDER</td>
          <td>27</td>
          <td>Classification</td>
          <td>1,427</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>ClinTox</td>
          <td>2</td>
          <td>Classification</td>
          <td>1,491</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>ECFP4</strong>: Rule-based extended-connectivity fingerprint with 1024 dimensions</li>
<li><strong>RNNS2S</strong>: RNN-based Seq2Seq pre-trained fingerprint (3-layer bidirectional GRU, same pre-training data as ST)</li>
<li><strong>GraphConv</strong>: Graph convolution network trained end-to-end on labeled data</li>
</ul>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>All fingerprint methods use a simple MLP classifier/regressor from scikit-learn with default hyperparameters, isolating fingerprint quality from model capacity. Datasets are randomly split (stratified for classification), and results are averaged over 20 trials. Note that random splits, rather than scaffold splits, are used for the data-efficiency metric (DEM) experiments.</p>
<h3 id="data-efficiency-results-dem">Data Efficiency Results (DEM)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ST+MLP</th>
          <th>ECFP+MLP</th>
          <th>RNNS2S+MLP</th>
          <th>GraphConv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL (RMSE, lower is better)</td>
          <td><strong>1.144</strong></td>
          <td>1.741</td>
          <td>1.317</td>
          <td>1.673</td>
      </tr>
      <tr>
          <td>FreeSolv (RMSE, lower is better)</td>
          <td><strong>2.246</strong></td>
          <td>3.043</td>
          <td>2.987</td>
          <td>3.476</td>
      </tr>
      <tr>
          <td>Lipophilicity (RMSE, lower is better)</td>
          <td>1.169</td>
          <td>1.090</td>
          <td>1.219</td>
          <td><strong>1.062</strong></td>
      </tr>
      <tr>
          <td>MUV (PRC-AUC, higher is better)</td>
          <td>0.009</td>
          <td><strong>0.036</strong></td>
          <td>0.010</td>
          <td>0.004</td>
      </tr>
      <tr>
          <td>HIV (ROC-AUC, higher is better)</td>
          <td>0.683</td>
          <td>0.697</td>
          <td>0.682</td>
          <td><strong>0.723</strong></td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC, higher is better)</td>
          <td>0.719</td>
          <td><strong>0.769</strong></td>
          <td>0.717</td>
          <td>0.744</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC, higher is better)</td>
          <td><strong>0.900</strong></td>
          <td>0.760</td>
          <td>0.884</td>
          <td>0.795</td>
      </tr>
      <tr>
          <td>Tox21 (ROC-AUC, higher is better)</td>
          <td><strong>0.706</strong></td>
          <td>0.616</td>
          <td>0.702</td>
          <td>0.687</td>
      </tr>
      <tr>
          <td>SIDER (ROC-AUC, higher is better)</td>
          <td>0.559</td>
          <td><strong>0.588</strong></td>
          <td>0.558</td>
          <td>0.557</td>
      </tr>
      <tr>
          <td>ClinTox (ROC-AUC, higher is better)</td>
          <td><strong>0.963</strong></td>
          <td>0.515</td>
          <td>0.904</td>
          <td>0.936</td>
      </tr>
  </tbody>
</table>
<p>ST achieves the best DEM in 5 of 10 datasets (ESOL, FreeSolv, BBBP, Tox21, ClinTox), with particularly strong margins on ClinTox (+0.027 over GraphConv) and BBBP (+0.016 over RNNS2S).</p>
<h3 id="linear-model-experiments">Linear Model Experiments</h3>
<p>To further isolate fingerprint quality, the authors replace the MLP with ridge regression (for regression tasks) and L2-penalized logistic regression (for classification tasks). On 8 datasets (excluding MUV and SIDER due to class imbalance), ST achieves the best DEM on 5 of 8, confirming that the fingerprint quality holds regardless of the downstream model.</p>
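<p>The linear-probe idea, measuring fingerprint quality through a model with almost no capacity of its own, can be illustrated with closed-form ridge regression on precomputed features (synthetic stand-in data; all names here are ours, not from the paper):</p>

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Stand-in "fingerprints" and a linear property they encode well.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))          # 200 molecules, 16-dim fingerprints
w_true = rng.normal(size=16)
y = X @ w_true + 0.01 * rng.normal(size=200)
w = ridge_fit(X, y, alpha=1e-3)         # recovers w_true almost exactly
```

<p>If a fingerprint encodes a property (nearly) linearly, even this probe recovers it; a fingerprint that needs a deep network to be useful will score poorly here.</p>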
<h3 id="stratified-analysis-by-molecule-size">Stratified Analysis by Molecule Size</h3>
<p>On BBBP stratified by SMILES length, ST&rsquo;s ROC-AUC increases with longer SMILES, similar to RNNS2S but unlike GraphConv, which is stable across lengths. This suggests text-based models extract richer information from longer sequences.</p>
<h3 id="comparison-with-record-scores-large-data">Comparison with Record Scores (Large Data)</h3>
<p>Under the large-data setting (80/10/10 train/val/test split with hyperparameter tuning via Optuna), ST achieves first place only in ClinTox (0.954) but performs comparably to ECFP and graph-based models on the other datasets. This confirms that ST&rsquo;s main advantage is in the low-data regime.</p>
<h2 id="strong-low-data-performance-with-caveats-on-scalability">Strong Low-Data Performance with Caveats on Scalability</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Transformer-based unsupervised pre-training on SMILES produces fingerprints that excel in low-data molecular property prediction, achieving best data efficiency on 5 of 10 MoleculeNet tasks.</li>
<li>The advantage is most pronounced on small datasets (ESOL with 1,128 molecules, FreeSolv with 643, BBBP with 2,053, ClinTox with 1,491) where pre-training enables good generalization.</li>
<li>With sufficient labeled data and hyperparameter tuning, ST fingerprints perform comparably to (but do not surpass) graph-based methods.</li>
<li>Longer SMILES provide richer information for text-based models, as shown by the stratified analysis on BBBP.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Random splits are used for most DEM experiments rather than scaffold splits, which may inflate performance estimates for drug discovery applications where training and test molecules are structurally distinct.</li>
<li>The pre-training corpus (861K SMILES from ChEMBL24) is relatively small by modern standards.</li>
<li>MUV performance is poor across all methods (PRC-AUC near zero), suggesting the DEM framework may not be informative for extremely imbalanced or noisy datasets.</li>
<li>No comparison with BERT-style masked language model pre-training, which later work (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>) would show as a viable alternative.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose three directions: (1) replacing the Transformer with Transformer-XL to handle longer SMILES, (2) multi-task pre-training that jointly predicts molecular descriptors (e.g., molecular weight, <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>) alongside SMILES reconstruction, and (3) better exploitation of enumerated SMILES to constrain the latent space.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL24</td>
          <td>861,000 SMILES</td>
          <td>Unlabeled, randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MoleculeNet (10 datasets)</td>
          <td>643 to 93,127 molecules</td>
          <td>See Table 1 for per-dataset details</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer encoder-decoder: 4 blocks each, 4-head attention, 256 embedding dimensions</li>
<li>Pre-training: 5 epochs, Adam optimizer, cross-entropy loss, SMILES enumeration for augmentation</li>
<li>Fingerprint: 1024 dimensions from concatenated mean pool, max pool, and first-token outputs</li>
<li>Downstream: scikit-learn MLP (default hyperparameters) for DEM experiments; ridge/logistic regression for linear model experiments; Optuna for hyperparameter search in large-data comparison</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/DSPsleeporg/smiles-transformer">smiles-transformer</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Jupyter notebooks)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>DEM averaged over 7 training fractions (1.25% to 80%), 20 trials each</li>
<li>Random splits for DEM; scaffold splits for HIV, BACE, BBBP in large-data comparison</li>
<li>Metrics: RMSE (regression), ROC-AUC or PRC-AUC (classification) per MoleculeNet conventions</li>
</ul>
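<p>As a concrete reading of the protocol above, the DEM reduces to a mean over all fraction/trial runs. A minimal sketch (pure Python; the doubling schedule of fractions and all names are our illustration, not taken from the paper):</p>

```python
def data_efficiency_metric(evaluate,
                           fractions=(0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8),
                           n_trials=20):
    """Average a scoring function over 7 training fractions x 20 random trials.

    `evaluate(fraction, seed)` is assumed to train on that fraction of the
    training data and return the held-out metric (ROC-AUC, PRC-AUC, or RMSE).
    """
    scores = [evaluate(f, seed) for f in fractions for seed in range(n_trials)]
    return sum(scores) / len(scores)

# Toy scorer whose held-out metric improves with more training data.
dem = data_efficiency_metric(lambda frac, seed: 0.5 + 0.4 * frac)
```

<p>A single scalar per dataset then summarizes low-data behavior, which is how the DEM tables above compare methods.</p>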
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU type or training time for the pre-training phase.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Honda, S., Shi, S., &amp; Ueda, H. R. (2019). SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery. <em>arXiv preprint arXiv:1911.04738</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{honda2019smiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Honda, Shion and Shi, Shoi and Ueda, Hiroki R.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1911.04738}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMI-TED: Encoder-Decoder Foundation Models for Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smi-ted-encoder-decoder-chemistry/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/smi-ted-encoder-decoder-chemistry/</guid><description>SMI-TED is a family of encoder-decoder transformer models pre-trained on 91M PubChem molecules for molecular property prediction and generation.</description><content:encoded><![CDATA[<h2 id="an-encoder-decoder-chemical-foundation-model-family">An Encoder-Decoder Chemical Foundation Model Family</h2>
<p>SMI-TED is a <strong>Method</strong> paper that introduces a family of encoder-decoder transformer-based foundation models for chemistry. The primary contribution is the SMI-TED289M architecture, a 289-million parameter model pre-trained on 91 million curated SMILES from PubChem, along with a Mixture-of-Experts variant (MoE-OSMI) that scales to 8x289M parameters. The models support molecular property prediction, molecule reconstruction, reaction yield prediction, and few-shot reasoning over molecular embeddings. All model weights and code are open-sourced under an Apache 2.0 license.</p>
<h2 id="bridging-encoding-and-decoding-for-molecular-representations">Bridging Encoding and Decoding for Molecular Representations</h2>
<p>Chemical language models based on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> have gained traction for molecular property prediction and generation. Most existing models, such as <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, are encoder-only architectures that produce molecular embeddings through mean pooling. While effective for downstream classification and regression, this encoder-only approach has a limitation: mean pooling has no natural inverse, meaning the model cannot reconstruct the input molecule from its latent representation. This restricts the model&rsquo;s utility for generative tasks and limits the interpretability of the learned latent space.</p>
<p>The authors argue that adding a decoder with a reconstruction objective forces the model to encode a more complete set of structural features. Prior work has shown that the quality of pre-training data matters more than the choice of SMILES vs. <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, and that large-scale pre-training can yield useful chemical representations. SMI-TED builds on these observations by combining an encoder-decoder architecture with a carefully curated 91-million molecule dataset from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>.</p>
<h2 id="invertible-pooling-and-two-phase-pre-training">Invertible Pooling and Two-Phase Pre-Training</h2>
<p>The core architectural innovation in SMI-TED is a learned pooling mechanism that replaces standard mean or max pooling with an invertible projection. Given token embeddings $\mathbf{x} \in \mathbb{R}^{D \times L}$ (where $D = 202$ is the maximum token count and $L = 768$ is the embedding dimension), the submersion into the latent space $\mathbf{z} \in \mathbb{R}^{L}$ is computed as:</p>
<p>$$
\mathbf{z} = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{W}_1^T \mathbf{x} + \mathbf{b}_1\right)\right)\right) \mathbf{W}_2
$$</p>
<p>where $\mathbf{W}_1 \in \mathbb{R}^{D \times L}$, $\mathbf{b}_1 \in \mathbb{R}^{L}$, and $\mathbf{W}_2 \in \mathbb{R}^{L \times L}$. The immersion (inverse mapping) back to the token space is:</p>
<p>$$
\tilde{\mathbf{x}}^T = \left(\text{LayerNorm}\left(\text{GELU}\left(\mathbf{z} \mathbf{W}_3 + \mathbf{b}_3\right)\right)\right) \mathbf{W}_4
$$</p>
<p>where $\mathbf{W}_3 \in \mathbb{R}^{L \times L}$, $\mathbf{b}_3 \in \mathbb{R}^{L}$, and $\mathbf{W}_4 \in \mathbb{R}^{L \times D}$. A decoder language model then predicts the next token from $\tilde{\mathbf{x}}$.</p>
<p>The encoder uses a modified RoFormer attention mechanism with rotary position embeddings:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>where $R_m$ are position-dependent rotation matrices and $\varphi$ is a random feature map.</p>
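<p>Dropping the rotary matrices $R_m$ and substituting the common elu+1 feature map for the paper&rsquo;s random feature map $\varphi$ (both simplifications are ours), the kernelized attention above can be sketched in numpy; the point is that it is linear, not quadratic, in sequence length:</p>

```python
import numpy as np

def phi(a):
    # Positive feature map (elu + 1); a stand-in for the paper's random features.
    return np.where(a > 0, a + 1.0, np.exp(a))

def linear_attention(Q, K, V):
    """out_m = sum_n <phi(q_m), phi(k_n)> v_n / sum_n <phi(q_m), phi(k_n)>."""
    Qf, Kf = phi(Q), phi(K)                 # (N, d) feature-mapped queries/keys
    KV = Kf.T @ V                           # (d, d_v): shared across all queries
    Z = Kf.sum(axis=0)                      # (d,): normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]    # O(N * d * d_v), linear in N
```

<p>Because the feature map is strictly positive, each output row is a convex combination of the value rows, mirroring standard softmax attention.</p>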
<p><strong>Two-phase pre-training strategy:</strong></p>
<ul>
<li><strong>Phase 1</strong>: The token encoder is pre-trained on 95% of the data using masked language modeling (15% token selection, of which 80% masked, 10% random, 10% unchanged). The remaining 5% trains the encoder-decoder layer, preventing convergence issues from unstable early embeddings.</li>
<li><strong>Phase 2</strong>: After the token embeddings converge, both the encoder and decoder train on 100% of the data jointly.</li>
</ul>
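<p>The Phase 1 corruption scheme is standard BERT-style masking. A minimal pure-Python sketch (the toy vocabulary and function name are ours):</p>

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", vocab=None, p_select=0.15, seed=0):
    """Select 15% of tokens; of those, 80% -> [MASK], 10% -> random vocab
    token, 10% -> left unchanged. Returns (corrupted, labels) where labels
    hold the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    vocab = vocab or ["C", "N", "O", "c", "1", "(", ")", "="]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < p_select:
            labels.append(tok)            # the model must predict this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels
```

<p>The loss is computed only at positions with a non-None label, so unselected tokens contribute nothing to the MLM objective.</p>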
<p><strong><a href="https://en.wikipedia.org/wiki/Mixture_of_experts">Mixture-of-Experts</a> (MoE-OSMI):</strong> The MoE variant composes 8 fine-tuned SMI-TED289M expert models with a gating network. Given an input embedding $x$, the output is:</p>
<p>$$
y = \sum_{i=1}^{n} G(x)_i E_i(\hat{x})
$$</p>
<p>where $G(x) = \text{Softmax}(\text{TopK}(x \cdot W_g))$ selects the top $k = 2$ experts per input, setting all other gate values to zero.</p>
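<p>A minimal numpy sketch of this sparse gating (softmax restricted to the top-$k$ logits, all other gates zero; names are illustrative):</p>

```python
import numpy as np

def moe_forward(x, experts, W_g, k=2):
    """Sparse mixture-of-experts: route x through the top-k experts only."""
    logits = x @ W_g                       # one gate logit per expert
    top = np.argsort(logits)[-k:]          # indices of the k largest logits
    gates = np.zeros_like(logits)
    z = np.exp(logits[top] - logits[top].max())
    gates[top] = z / z.sum()               # softmax over the selected experts
    return sum(gates[i] * experts[i](x) for i in top)
```

<p>Because only $k = 2$ of the 8 experts run per input, inference cost grows far more slowly than the total parameter count.</p>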
<h2 id="benchmarks-across-property-prediction-generation-and-reaction-yield">Benchmarks Across Property Prediction, Generation, and Reaction Yield</h2>
<h3 id="moleculenet-classification-6-datasets-roc-auc"><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification (6 datasets, ROC-AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BBBP</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
          <th>Tox21</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>73.6 +/- 0.8</td>
          <td>91.2 +/- 1.4</td>
          <td>80.5 +/- 1.65</td>
          <td>86.3 +/- 0.6</td>
          <td>65.5 +/- 0.2</td>
          <td>80.46 +/- 0.2</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>72.9 +/- 0.6</td>
          <td>91.9 +/- 1.8</td>
          <td>80.8 +/- 0.3</td>
          <td>85.7 +/- 0.2</td>
          <td>65.9 +/- 1.3</td>
          <td>79.6 +/- 0.5</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>72.4 +/- 0.4</td>
          <td>90.1 +/- 1.3</td>
          <td>80.6 +/- 0.9</td>
          <td>85.6 +/- 1.1</td>
          <td>67.2 +/- 0.4</td>
          <td>78.1 +/- 0.1</td>
      </tr>
      <tr>
          <td>SMI-TED289M (pre-trained)</td>
          <td>91.46 +/- 0.47</td>
          <td>93.49 +/- 0.85</td>
          <td>80.51 +/- 1.34</td>
          <td>85.58 +/- 0.92</td>
          <td>66.01 +/- 0.88</td>
          <td>81.53 +/- 0.45</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>92.26 +/- 0.57</strong></td>
          <td><strong>94.27 +/- 1.83</strong></td>
          <td>76.85 +/- 0.89</td>
          <td><strong>88.24 +/- 0.50</strong></td>
          <td>65.68 +/- 0.45</td>
          <td><strong>81.85 +/- 1.42</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED achieves the best results in 4 of 6 classification tasks. Notably, the pre-trained version (without fine-tuning) already matches or exceeds many baselines on BBBP, ClinTox, and Tox21.</p>
<h3 id="moleculenet-regression-5-datasets-mae-for-qm9qm8-rmse-for-esolfreesolvlipophilicity">MoleculeNet regression (5 datasets, MAE for QM9/QM8, RMSE for ESOL/FreeSolv/Lipophilicity)</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer</td>
          <td>1.5894</td>
          <td>0.0102</td>
          <td>0.880</td>
          <td>2.342</td>
          <td>0.700</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>3.241</td>
          <td>0.0143</td>
          <td>0.98</td>
          <td>2.18</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>SMI-TED289M (fine-tuned)</td>
          <td><strong>1.3246</strong></td>
          <td><strong>0.0095</strong></td>
          <td><strong>0.6112</strong></td>
          <td><strong>1.2233</strong></td>
          <td><strong>0.5522</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED289M achieves the best results across all 5 regression tasks when fine-tuned. The improvements are substantial on ESOL (0.611 vs. 0.880 for MoLFormer) and FreeSolv (1.223 vs. 2.18 for D-MPNN).</p>
<h3 id="reaction-yield-prediction-buchwald-hartwig-c-n-cross-coupling">Reaction yield prediction (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a> C-N cross-coupling)</h3>
<p>The model was tested on Pd-catalyzed Buchwald-Hartwig reactions with 3,955 reactions across varying train/test splits. Selected $R^2$ results:</p>
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Yield-BERT (Aug)</th>
          <th>DRFP</th>
          <th>SMI-TED289M</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>70/30</td>
          <td>0.97</td>
          <td>0.95</td>
          <td><strong>0.984</strong></td>
      </tr>
      <tr>
          <td>10/90</td>
          <td>0.81</td>
          <td>0.81</td>
          <td><strong>0.961</strong></td>
      </tr>
      <tr>
          <td>2.5/97.5</td>
          <td>0.61</td>
          <td>0.62</td>
          <td><strong>0.875</strong></td>
      </tr>
      <tr>
          <td>Test 1-4 avg</td>
          <td>0.58</td>
          <td>0.71</td>
          <td><strong>0.983</strong></td>
      </tr>
  </tbody>
</table>
<p>SMI-TED shows particularly strong performance in low-data regimes. With only 2.5% training data, it achieves $R^2 = 0.875$, compared to 0.61-0.62 for competing methods.</p>
<h3 id="moses-molecular-generation-benchmarks"><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> molecular generation benchmarks</h3>
<p>SMI-TED is competitive with baselines including CharRNN, SMILES VAE, JT-VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/">LIMO</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen-7b</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/">GP-MoLFormer</a> on standard metrics (validity, uniqueness, novelty, FCD, internal diversity). It achieves superior scaffold cosine similarity (Scaf) and nearest-neighbor similarity (SNN) scores.</p>
<h3 id="latent-space-compositionality">Latent space compositionality</h3>
<p>Using six families of carbon chains ($\mathcal{F} = \{CC, CO, CN, CS, CF, CP\}$), the authors test whether the embedding space respects hierarchical distance structures. A linear regression on SMI-TED embeddings yields $R^2 = 0.99$ and $MSE = 0.002$, compared to $R^2 = 0.55$ and $MSE = 0.237$ for MoLFormer. This indicates that the SMI-TED latent space captures compositional chemical relationships far more faithfully.</p>
<p>For structure-property analysis on QM9, nitrogen-containing molecules represent 9.10% of the dataset but account for 32.81% of the top 10% by HOMO energy. In the SMI-TED latent space, these molecules cluster distinctly (<a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> of 2.82 vs. 4.28 for MoLFormer), suggesting the decoder objective encourages encoding of functional group information.</p>
<h2 id="strong-performance-with-a-compositional-latent-space">Strong Performance with a Compositional Latent Space</h2>
<p>SMI-TED289M demonstrates competitive or superior performance across molecular property prediction, reaction yield prediction, and molecular generation benchmarks. The key findings include:</p>
<ol>
<li><strong>Broad applicability</strong>: The single pre-trained model achieves strong results across classification (4/6 best), regression (5/5 best), reaction yield, and generation tasks.</li>
<li><strong>Low-data robustness</strong>: The pre-training on 91M molecules provides chemical knowledge that transfers well to small training sets, as shown by the reaction yield experiments where SMI-TED maintains $R^2 = 0.875$ with only 2.5% of the training data.</li>
<li><strong>Compositional embeddings</strong>: The encoder-decoder architecture produces a latent space where molecular similarity follows chemical intuition, with near-perfect linear relationships between functional group families ($R^2 = 0.99$).</li>
<li><strong>Structure-property capture</strong>: The reconstruction objective appears to enforce encoding of chemically meaningful features like nitrogen substituent effects on <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO</a> energy, outperforming encoder-only models in latent space organization.</li>
</ol>
<p><strong>Limitations</strong>: The paper evaluates on MoleculeNet benchmarks, which are well-studied but may not reflect performance on more diverse chemical tasks. The BBBP classification result (92.26) shows a large jump from prior methods (73.6 for MoLFormer), which is worth scrutinizing. The MoE variant is evaluated only in supplementary materials, and scaling behavior beyond 8 experts is not explored.</p>
<p><strong>Future directions</strong>: The authors note that compositionality of the learned representations suggests potential for reasoning applications, though they acknowledge that stronger claims require further studies following compositionality analysis methodologies from natural language processing. The model has been integrated into the dZiner agent for inverse molecular design.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (curated)</td>
          <td>91M molecules, 4B tokens</td>
          <td>Deduplicated, canonicalized, validity-checked</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (BBBP, ClinTox, HIV, BACE, SIDER, Tox21)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (QM9, QM8, ESOL, FreeSolv, Lipophilicity)</td>
          <td>Varies</td>
          <td>Original benchmark splits</td>
      </tr>
      <tr>
          <td>Generation</td>
          <td>MOSES</td>
          <td>1.94M molecules</td>
          <td>Train/test/scaffold test splits</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig HTE</td>
          <td>3,955 reactions</td>
          <td>3x 1536-well plates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Masked language modeling for token encoder (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li>Two-phase pre-training (95/5 split then 100% joint training)</li>
<li>RoFormer attention with rotary position embeddings</li>
<li>Vocabulary: 2,993 tokens (2,988 molecular + 5 special)</li>
<li>Maximum sequence length: 202 tokens (covers 99.4% of PubChem)</li>
<li>Learning rate: 1.6e-4, batch size: 288 molecules</li>
<li>40 epochs over the full PubChem corpus</li>
<li>10 random seeds per experiment for robustness</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Parameters</th>
          <th>Encoder</th>
          <th>Decoder</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SMI-TED289M base</td>
          <td>289M</td>
          <td>47M</td>
          <td>242M</td>
          <td>12 layers, 12 attention heads, hidden size 768, dropout 0.2</td>
      </tr>
      <tr>
          <td>MoE-OSMI</td>
          <td>8x289M</td>
          <td>-</td>
          <td>-</td>
          <td>8 experts, top-k=2 routing, gating network</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC</li>
<li>Regression: MAE (QM9, QM8), RMSE (ESOL, FreeSolv, Lipophilicity)</li>
<li>Reaction yield: $R^2$</li>
<li>Generation: Validity, uniqueness, novelty, FCD, IntDiv, Scaf, SNN (MOSES metrics)</li>
<li>Latent space: Linear regression $R^2$, MSE, Davies-Bouldin index, t-SNE visualization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>24 NVIDIA V100 GPUs (16GB)</li>
<li>4 nodes with DDP (Distributed Data Parallel)</li>
<li>Pre-training: 40 epochs on 91M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/materials/tree/main/models/smi_ted">IBM/materials (smi_ted)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Training, fine-tuning scripts, Jupyter notebooks</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/materials.smi-ted">ibm/materials.smi-ted</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pre-trained model weights</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.15603701">Zenodo archive</a></td>
          <td>Code + Data</td>
          <td>Apache-2.0</td>
          <td>Archival copy of scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Soares, E., Vital Brazil, E., Shirasuna, V., Zubarev, D., Cerqueira, R., &amp; Schmidt, K. (2025). An open-source family of large encoder-decoder foundation models for chemistry. <em>Communications Chemistry</em>, 8(1). <a href="https://doi.org/10.1038/s42004-025-01585-0">https://doi.org/10.1038/s42004-025-01585-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{soares2025smited,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{An open-source family of large encoder-decoder foundation models for chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Soares, Eduardo and Vital Brazil, Emilio and Shirasuna, Victor and Zubarev, Dmitry and Cerqueira, Renato and Schmidt, Kristin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Communications Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{8}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42004-025-01585-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Seq2seq Fingerprint: Unsupervised Molecular Embedding</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/</guid><description>Seq2seq fingerprint uses a GRU encoder-decoder trained on SMILES self-translation to produce unsupervised molecular embeddings for property prediction.</description><content:encoded><![CDATA[<h2 id="an-unsupervised-seq2seq-method-for-molecular-fingerprints">An Unsupervised Seq2seq Method for Molecular Fingerprints</h2>
<p>This is a <strong>Method</strong> paper that introduces seq2seq fingerprint, an unsupervised molecular embedding approach based on sequence-to-sequence learning. The core idea is to train a <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> encoder-decoder network to translate <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings to themselves, then extract the intermediate fixed-length vector as a molecular fingerprint. These fingerprints are then used with standard supervised classifiers for downstream property prediction tasks such as solubility classification and promiscuity prediction.</p>
<h2 id="the-labeled-data-bottleneck-in-drug-discovery">The Labeled Data Bottleneck in Drug Discovery</h2>
<p>Machine learning approaches to molecular property prediction depend on fixed-length feature vectors as inputs. Traditional molecular fingerprints fall into two categories: hash-based methods like Extended-Connectivity Fingerprints (ECFP) that are fast but lossy and non-invertible, and biologist-guided local-feature fingerprints that require domain expertise and are task-specific. Supervised deep learning fingerprints (e.g., neural fingerprints) can learn representations from data but require large amounts of labeled data, which is expensive to obtain in drug discovery due to the cost of biological experiments.</p>
<p>The authors identify three limitations of existing approaches:</p>
<ol>
<li>Hash-based fingerprints discard information during the hashing process and cannot reconstruct the original molecule</li>
<li>Local-feature fingerprints require expert knowledge and generalize poorly across tasks</li>
<li>Supervised deep learning fingerprints are data-hungry and fail when labeled data is limited</li>
</ol>
<h2 id="self-translation-as-unsupervised-molecular-encoding">Self-Translation as Unsupervised Molecular Encoding</h2>
<p>The key insight is to adapt the <a href="https://en.wikipedia.org/wiki/Seq2seq">sequence-to-sequence</a> learning framework from machine translation (originally English-to-French) to molecular representation learning by setting both the input and output to the same SMILES string. Since the intermediate vector must contain enough information to reconstruct the original SMILES, it serves as a rich, task-agnostic molecular fingerprint.</p>
<p>The architecture consists of two components:</p>
<ul>
<li><strong>Perceiver network</strong>: A multi-layer GRU encoder that reads the SMILES string and compresses it into a fixed-length vector</li>
<li><strong>Interpreter network</strong>: A multi-layer GRU decoder that reconstructs the original SMILES from the fingerprint vector</li>
</ul>
<p>The GRU cell computes a sequence of outputs $(s_1, \ldots, s_T)$ from input sequences $(x_1, \ldots, x_T)$ by iterating:</p>
<p>$$
z_t = \sigma_g(W_z x_t + U_z s_{t-1} + b_z)
$$</p>
<p>$$
r_t = \sigma_g(W_r x_t + U_r s_{t-1} + b_r)
$$</p>
<p>$$
h_t = \tanh(W_h x_t + U_h(s_{t-1} \circ r_t))
$$</p>
<p>$$
s_t = (1 - z_t) \circ h_t + z_t \circ s_{t-1}
$$</p>
<p>where $\sigma_g$ is the logistic sigmoid, $z_t$ is the update gate, $r_t$ is the reset gate, $h_t$ is the candidate state, $\circ$ denotes element-wise multiplication, and the $W$, $U$, $b$ matrices are trainable parameters.</p>
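<p>A minimal scalar sketch of one GRU update step, following the standard equations above (the dictionary-of-scalars parameterization is purely illustrative, not the paper's implementation):</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, s_prev, W, U, b):
    # Update gate z, reset gate r, candidate state h, then the blended
    # new state: (1 - z) * h + z * s_prev (scalar case for clarity).
    z = sigmoid(W["z"] * x_t + U["z"] * s_prev + b["z"])
    r = sigmoid(W["r"] * x_t + U["r"] * s_prev + b["r"])
    h = math.tanh(W["h"] * x_t + U["h"] * (s_prev * r))
    return (1 - z) * h + z * s_prev
```

With all weights at zero, both gates open halfway ($z = r = 0.5$) and the candidate is zero, so the new state is simply half the previous state.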
<p>Several adaptations to the original seq2seq framework make this work for molecular data:</p>
<ol>
<li><strong>GRU instead of LSTM</strong>: GRU provides comparable performance with faster training, which is important given the large training data pool</li>
<li><strong>Attention mechanism</strong>: Establishes a stronger connection between the perceiver and interpreter networks via soft alignment, addressing the challenge of passing information through hidden memory for long sequences (SMILES can be up to 250 characters)</li>
<li><strong>Dropout layers</strong>: Added to input and output gates (but not hidden memory transfer) following the approach of Zaremba et al. to combat overfitting when training on large datasets</li>
<li><strong>Fingerprint extraction layer</strong>: A fixed-unit fully connected layer combined with a GRU cell state concatenation layer is inserted between encoder and decoder to explicitly output the fingerprint vector</li>
<li><strong>Reverse target sequence</strong>: Following Sutskever et al., the target sequence is reversed to improve SGD optimization</li>
<li><strong>Bucket training</strong>: Sequences are distributed into buckets by length and padded to enable GPU parallelization</li>
</ol>
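<p>Adaptations 5 and 6 can be sketched together as a tiny example-construction step (bucket sizes and the special tokens are assumptions for illustration; the paper does not specify its exact bucket boundaries):</p>

```python
def make_example(smiles, bucket_sizes=(40, 80, 160, 250),
                 go="<GO>", eos="<EOS>", pad="<PAD>"):
    """Build one self-translation example: the encoder input is the
    SMILES characters; the decoder target is the reversed sequence
    (adaptation 5), and both are padded to the smallest fitting
    length bucket (adaptation 6)."""
    src = list(smiles)
    tgt = [go] + list(reversed(src)) + [eos]
    size = next(b for b in bucket_sizes if b >= len(src))
    src = src + [pad] * (size - len(src))
    tgt = tgt + [pad] * (size + 2 - len(tgt))  # +2 for <GO>/<EOS>
    return src, tgt
```

Bucketing lets same-length sequences be batched together on the GPU instead of padding everything to the 250-character maximum.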
<h2 id="classification-experiments-on-logp-and-pm2-datasets">Classification Experiments on LogP and PM2 Datasets</h2>
<h3 id="training-setup">Training Setup</h3>
<p>The unsupervised training used 334,092 valid SMILES representations from combined LogP and PM2-full datasets obtained from the National Center for Advancing Translational Sciences (NCATS) at NIH. Three model variants were trained with fingerprint dimensions of 512, 768, and 1024, differing in the number of GRU layers (2, 3, and 4 respectively) while keeping the latent dimension at 256. Each model was trained for 24 hours on a workstation with an Intel i7-6700K CPU, 16 GB RAM, and an NVIDIA GTX 1080 GPU.</p>
<h3 id="reconstruction-performance">Reconstruction Performance</h3>
<p>The models were evaluated on their ability to reconstruct SMILES strings from their fingerprints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>GRU Layers</th>
          <th>Latent Dim</th>
          <th>Perplexity</th>
          <th>Exact Match Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>seq2seq-512</td>
          <td>2</td>
          <td>256</td>
          <td>1.00897</td>
          <td>94.24%</td>
      </tr>
      <tr>
          <td>seq2seq-768</td>
          <td>3</td>
          <td>256</td>
          <td>1.00949</td>
          <td>92.92%</td>
      </tr>
      <tr>
          <td>seq2seq-1024</td>
          <td>4</td>
          <td>256</td>
          <td>1.01472</td>
          <td>90.26%</td>
      </tr>
  </tbody>
</table>
<p>Deeper models showed lower reconstruction accuracy, possibly because larger fingerprint spaces contain more unused (null) dimensions and require longer training to converge.</p>
<h3 id="classification-results">Classification Results</h3>
<p>Two labeled datasets were used for downstream classification:</p>
<ul>
<li><strong>LogP</strong>: 10,850 samples with <a href="https://en.wikipedia.org/wiki/Partition_coefficient">water-octanol partition coefficient</a> values, binarized at a threshold of 1.88</li>
<li><strong>PM2-10k</strong>: 10,000 samples with binary promiscuity class labels</li>
</ul>
<p>The seq2seq fingerprints were evaluated with three ensemble classifiers (<a href="https://en.wikipedia.org/wiki/AdaBoost">AdaBoost</a>, <a href="https://en.wikipedia.org/wiki/Gradient_boosting">GradientBoost</a>, RandomForest) against circular fingerprints (ECFP) and neural fingerprints. Results are 100-run averages of 5-fold cross-validation accuracy.</p>
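<p>The evaluation protocol (5-fold cross-validation, repeated with reshuffling and averaged) can be sketched as an index generator; the seeded shuffle per run is an assumption about how the 100 runs were varied:</p>

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Yield (train, test) index lists for one shuffled k-fold split.
    Repeating with different seeds and averaging accuracy gives the
    100-run protocol described above."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```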
<p><strong>LogP classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3674</td>
          <td>0.0074</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.6080</td>
          <td>0.0135</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.7664</strong></td>
          <td>0.0043</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.7342</td>
          <td>0.0042</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.7350</td>
          <td>0.0060</td>
      </tr>
  </tbody>
</table>
<p><strong>PM2-10k classification accuracy:</strong></p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Mean Accuracy</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Circular FP (ECFP)</td>
          <td>0.3938</td>
          <td>0.0114</td>
      </tr>
      <tr>
          <td>Neural FP</td>
          <td>0.5227</td>
          <td>0.0112</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + GradientBoost</td>
          <td><strong>0.6206</strong></td>
          <td>0.0198</td>
      </tr>
      <tr>
          <td>Seq2seq-1024 + AdaBoost</td>
          <td>0.6036</td>
          <td>0.0147</td>
      </tr>
      <tr>
          <td>Seq2seq-512 + GradientBoost</td>
          <td>0.5741</td>
          <td>0.0086</td>
      </tr>
  </tbody>
</table>
<p>The seq2seq fingerprint outperformed both baselines across all configurations. Despite the seq2seq-1024 model having lower reconstruction accuracy, it provided the best classification performance, suggesting that the longer fingerprint captures more discriminative information for downstream tasks even if the reconstruction is less exact.</p>
<h2 id="unsupervised-transfer-learning-for-molecular-properties">Unsupervised Transfer Learning for Molecular Properties</h2>
<p>The results demonstrate that unsupervised pretraining on large unlabeled molecular datasets can produce fingerprints that transfer well to supervised property prediction with limited labels. The key advantages confirmed by the experiments are:</p>
<ol>
<li><strong>Label-free training</strong>: The unsupervised approach uses essentially unlimited SMILES data, avoiding the expensive label collection process</li>
<li><strong>Task-agnostic representations</strong>: The same fingerprints work across different classification tasks (lipophilicity and promiscuity) without retraining</li>
<li><strong>Invertibility</strong>: The fingerprints contain enough information to reconstruct the original SMILES (up to 94.24% exact match), unlike hash-based methods</li>
</ol>
<p><strong>Limitations</strong> acknowledged by the authors include:</p>
<ul>
<li>Long training times (24 hours per model variant), motivating future work on distributed training</li>
<li>The relationship between fingerprint dimensionality and downstream performance is non-monotonic (768-dim underperforms 512-dim on some tasks), suggesting sensitivity to hyperparameter choices</li>
<li>Only classification tasks were evaluated; regression performance was not assessed</li>
<li>The comparison baselines are limited to ECFP and neural fingerprints from 2015</li>
</ul>
<p><strong>Future directions</strong> proposed include distributed training strategies, hyperparameter optimization methods, and semi-supervised extensions that incorporate label information into the fingerprint training.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unsupervised training</td>
          <td>LogP + PM2-full (combined)</td>
          <td>334,092 SMILES</td>
          <td>Obtained from NCATS at NIH</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>LogP</td>
          <td>10,850 samples</td>
          <td>Binary labels at LogP threshold 1.88</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>PM2-10k</td>
          <td>10,000 samples</td>
          <td>Binary promiscuity labels</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder-decoder: Multi-layer GRU with attention mechanism and dropout</li>
<li>Fingerprint dimensions: 512, 768, 1024 (with 2, 3, 4 GRU layers respectively)</li>
<li>Latent dimension: 256 for all variants</li>
<li>Downstream classifiers: AdaBoost, GradientBoost, RandomForest</li>
<li>Evaluation: 5-fold cross-validation, 100-run averages</li>
<li>Baselines: ECFP via RDKit, Neural Fingerprint from HIPS/neural-fingerprint</li>
</ul>
<h3 id="models">Models</h3>
<p>Three model variants trained for 24 hours each. The paper states code would become publicly available after acceptance, but no public repository has been confirmed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Task</th>
          <th>Configuration</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification accuracy</td>
          <td>0.7664</td>
          <td>LogP</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Classification accuracy</td>
          <td>0.6206</td>
          <td>PM2-10k</td>
          <td>seq2seq-1024 + GradientBoost</td>
      </tr>
      <tr>
          <td>Exact match reconstruction</td>
          <td>94.24%</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
      <tr>
          <td>Perplexity</td>
          <td>1.00897</td>
          <td>SMILES recovery</td>
          <td>seq2seq-512</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: Intel i7-6700K @ 4.00 GHz, 16 GB RAM, NVIDIA GTX 1080 GPU</li>
<li>Hyperparameter search and classifier training: TACC Lonestar 5 cluster</li>
<li>Training time: 24 hours per model variant</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HIPS/neural-fingerprint">Neural Fingerprint (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline comparison code</td>
      </tr>
  </tbody>
</table>
<p>The authors indicated the seq2seq fingerprint code would be released after acceptance, but no public repository has been found as of this writing. The datasets were sourced from NCATS/NIH.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, Z., Wang, S., Zhu, F., &amp; Huang, J. (2017). Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery. <em>Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB &rsquo;17)</em>, 285-294. <a href="https://doi.org/10.1145/3107411.3107424">https://doi.org/10.1145/3107411.3107424</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xu2017seq2seq,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Zheng and Wang, Sheng and Zhu, Feiyun and Huang, Junzhou}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{285--294}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{ACM}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3107411.3107424}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolBERT: Auxiliary Tasks for Molecular BERT Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/</guid><description>MolBERT applies BERT to SMILES with domain-relevant auxiliary tasks like physicochemical property prediction, improving virtual screening and QSAR.</description><content:encoded><![CDATA[<h2 id="bert-based-molecular-representations-with-auxiliary-pre-training-tasks">BERT-Based Molecular Representations with Auxiliary Pre-Training Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces MolBERT, a bidirectional Transformer (BERT) architecture applied to SMILES-based molecular representations for drug discovery. The primary contribution is a systematic study of how different domain-relevant self-supervised pre-training tasks affect the quality of learned molecular embeddings, paired with a model that achieves state-of-the-art performance on <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a> and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-activity relationship (QSAR)</a> benchmarks.</p>
<h2 id="why-domain-relevant-pre-training-matters-for-molecular-language-models">Why Domain-Relevant Pre-Training Matters for Molecular Language Models</h2>
<p>Molecular representations are foundational for predictive, generative, and analytical tasks in drug discovery. Language models applied to text-based molecular representations like SMILES have demonstrated strong performance across property prediction, reaction prediction, and molecular generation. However, several open questions remained at the time of this work:</p>
<ol>
<li><strong>Task selection for pre-training</strong>: Prior work explored masked token prediction, input translation, and property concatenation, but there was no systematic comparison of how different self-supervised tasks affect downstream performance.</li>
<li><strong>SMILES ambiguity</strong>: The same molecule can be encoded as many different SMILES strings depending on how the molecular graph is traversed. Canonicalization algorithms address this but introduce their own artifacts that may distract the model.</li>
<li><strong>Domain knowledge integration</strong>: Standard NLP pre-training objectives (e.g., masked language modeling) do not explicitly encode chemical knowledge. It was unclear whether incorporating chemistry-specific supervision during pre-training could improve representation quality.</li>
</ol>
<p>MolBERT addresses these gaps by evaluating three pre-training tasks, including a novel physicochemical property prediction objective, and measuring their individual and combined effects on downstream drug discovery benchmarks.</p>
<h2 id="three-auxiliary-tasks-for-chemistry-aware-pre-training">Three Auxiliary Tasks for Chemistry-Aware Pre-Training</h2>
<p>MolBERT uses the BERT-Base architecture (12 attention heads, 12 layers, 768-dimensional hidden states, approximately 85M parameters) and explores three self-supervised pre-training tasks:</p>
<p><strong>Masked Language Modeling (MaskedLM)</strong>: The standard BERT objective where 15% of input tokens are masked and the model predicts their identity. The loss is cross-entropy between predicted and true tokens.</p>
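<p>The 15% masking step can be sketched as follows (the mask token string and per-token Bernoulli sampling are illustrative simplifications; BERT's 80/10/10 replacement scheme is omitted):</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_tok="[MASK]", seed=0):
    """Mask ~15% of tokens; labels hold the original token at masked
    positions and None elsewhere, so the loss is computed only on
    masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for t in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_tok)
            labels.append(t)
        else:
            masked.append(t)
            labels.append(None)
    return masked, labels
```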
<p><strong>SMILES Equivalence (SMILES-Eq)</strong>: A binary classification task where the model receives two SMILES strings and predicts whether they represent the same molecule. The second string is either a random permutation of the first (same molecule, different traversal) or a randomly sampled molecule. This is optimized with cross-entropy loss.</p>
<p><strong>Physicochemical Property Prediction (PhysChemPred)</strong>: Using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, 200 real-valued molecular descriptors are computed for each molecule. The model predicts these normalized descriptors from the SMILES input using mean squared error:</p>
<p>$$\mathcal{L}_{\text{PhysChemPred}} = \frac{1}{D} \sum_{d=1}^{D} (y_d - \hat{y}_d)^2$$</p>
<p>where $D = 200$ is the number of descriptors, $y_d$ is the true normalized descriptor value, and $\hat{y}_d$ is the model&rsquo;s prediction.</p>
<p>The final training loss is the arithmetic mean of all active task losses:</p>
<p>$$\mathcal{L}_{\text{total}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathcal{L}_t$$</p>
<p>where $\mathcal{T}$ is the set of active pre-training tasks.</p>
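<p>Both loss formulas above reduce to a few lines; a sketch (task names are illustrative):</p>

```python
def physchem_loss(y_true, y_pred):
    """Mean squared error over the D normalized descriptors,
    matching the PhysChemPred loss above."""
    assert len(y_true) == len(y_pred)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def total_loss(task_losses):
    """Arithmetic mean over whichever pre-training tasks are active
    (inactive tasks contribute nothing, not zero)."""
    active = [v for v in task_losses.values() if v is not None]
    return sum(active) / len(active)
```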
<p>Additionally, MolBERT supports SMILES permutation augmentation during training, where each input molecule is represented by a randomly sampled non-canonical SMILES string rather than the canonical form. The model uses a fixed vocabulary of 42 tokens, a sequence length of 128, and relative positional embeddings (from Transformer-XL) to support arbitrary-length SMILES at inference time.</p>
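<p>MolBERT's exact 42-token vocabulary is not reproduced here, but a regex-based SMILES tokenizer in the style commonly used for chemical language models (an assumption, not MolBERT's published tokenizer) illustrates how multi-character atoms like Br and Cl and bracketed atoms are kept as single tokens:</p>

```python
import re

# Common SMILES tokenization pattern: bracket atoms, two-letter
# halogens, organic-subset atoms, bonds, ring closures, and digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens
```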
<h2 id="ablation-study-and-benchmark-evaluation">Ablation Study and Benchmark Evaluation</h2>
<h3 id="pre-training-setup">Pre-Training Setup</h3>
<p>All models were pre-trained on the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol benchmark dataset</a>, consisting of approximately 1.6M compounds curated from ChEMBL, using an 80%/5% train/validation split. Training used the Adam optimizer with a learning rate of $3 \times 10^{-5}$ for 20 epochs (ablation) or 100 epochs (final model).</p>
<h3 id="ablation-impact-of-task-combinations-on-virtual-screening">Ablation: Impact of Task Combinations on Virtual Screening</h3>
<p>The ablation study evaluated all seven possible task combinations on the RDKit virtual screening benchmark (69 datasets, 5 query molecules per target). Results measured by AUROC and BEDROC20 (an early enrichment metric with $\alpha = 20$):</p>
<table>
  <thead>
      <tr>
          <th style="text-align: center">MaskedLM</th>
          <th style="text-align: center">PhysChemPred</th>
          <th style="text-align: center">SMILES-Eq</th>
          <th style="text-align: center">AUROC (w/ perm)</th>
          <th style="text-align: center">BEDROC20 (w/ perm)</th>
          <th style="text-align: center">AUROC (w/o perm)</th>
          <th style="text-align: center">BEDROC20 (w/o perm)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.685 +/- 0.069</td>
          <td style="text-align: center">0.246 +/- 0.041</td>
          <td style="text-align: center">0.707 +/- 0.059</td>
          <td style="text-align: center">0.280 +/- 0.042</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.738 +/- 0.060</td>
          <td style="text-align: center">0.323 +/- 0.071</td>
          <td style="text-align: center">0.740 +/- 0.066</td>
          <td style="text-align: center">0.322 +/- 0.065</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.483 +/- 0.092</td>
          <td style="text-align: center">0.092 +/- 0.069</td>
          <td style="text-align: center">0.493 +/- 0.068</td>
          <td style="text-align: center">0.108 +/- 0.070</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.476 +/- 0.077</td>
          <td style="text-align: center">0.064 +/- 0.034</td>
          <td style="text-align: center">0.514 +/- 0.165</td>
          <td style="text-align: center">0.084 +/- 0.014</td>
      </tr>
      <tr>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.696 +/- 0.058</td>
          <td style="text-align: center">0.283 +/- 0.077</td>
          <td style="text-align: center">0.676 +/- 0.060</td>
          <td style="text-align: center">0.250 +/- 0.073</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">0.719 +/- 0.057</td>
          <td style="text-align: center">0.293 +/- 0.071</td>
          <td style="text-align: center">0.716 +/- 0.061</td>
          <td style="text-align: center">0.290 +/- 0.076</td>
      </tr>
      <tr>
          <td style="text-align: center">No</td>
          <td style="text-align: center">No</td>
          <td style="text-align: center">Yes</td>
          <td style="text-align: center">0.129 +/- 0.067</td>
          <td style="text-align: center">0.005 +/- 0.037</td>
          <td style="text-align: center">0.508 +/- 0.068</td>
          <td style="text-align: center">0.048 +/- 0.035</td>
      </tr>
  </tbody>
</table>
<p>Key findings from the ablation:</p>
<ul>
<li>PhysChemPred had the highest individual impact (average BEDROC20 of 0.292 alone vs. 0.266 for MaskedLM alone).</li>
<li>Combining MaskedLM + PhysChemPred achieved the best performance (BEDROC20 of 0.323), though the additive gain from MaskedLM was modest (+0.031).</li>
<li>The SMILES-Eq task consistently decreased performance when added to other task combinations.</li>
</ul>
<p>A further sub-ablation on PhysChemPred descriptor groups showed that surface descriptors alone (49 of 200 descriptors) achieved nearly the same performance as the full set, suggesting molecular surface properties provide particularly informative supervision.</p>
<h3 id="virtual-screening-results">Virtual Screening Results</h3>
<p>Using the best task combination (MaskedLM + PhysChemPred) trained for 100 epochs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>AUROC</th>
          <th>BEDROC20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolBERT (100 epochs)</td>
          <td>0.743 +/- 0.062</td>
          <td>0.344 +/- 0.062</td>
      </tr>
      <tr>
          <td>CDDD</td>
          <td>0.725 +/- 0.057</td>
          <td>0.310 +/- 0.080</td>
      </tr>
      <tr>
          <td>RDKit descriptors</td>
          <td>0.633 +/- 0.027</td>
          <td>0.217 +/- 0.000</td>
      </tr>
      <tr>
          <td>ECFC4</td>
          <td>0.603 +/- 0.056</td>
          <td>0.170 +/- 0.079</td>
      </tr>
  </tbody>
</table>
<p>MolBERT outperformed all baselines including <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a> (the prior state of the art), RDKit calculated descriptors, and extended-connectivity fingerprints (ECFC4).</p>
<h3 id="qsar-results">QSAR Results</h3>
<p>On <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> regression tasks (RMSE, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td style="text-align: center">0.687 +/- 0.08</td>
          <td style="text-align: center">0.902 +/- 0.06</td>
          <td style="text-align: center">0.567 +/- 0.06</td>
          <td style="text-align: center">0.552 +/- 0.07</td>
          <td style="text-align: center"><strong>0.531 +/- 0.04</strong></td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td style="text-align: center">1.671 +/- 0.45</td>
          <td style="text-align: center">2.876 +/- 0.38</td>
          <td style="text-align: center">1.456 +/- 0.43</td>
          <td style="text-align: center">1.523 +/- 0.66</td>
          <td style="text-align: center"><strong>0.948 +/- 0.33</strong></td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td style="text-align: center">0.738 +/- 0.04</td>
          <td style="text-align: center">0.770 +/- 0.03</td>
          <td style="text-align: center">0.669 +/- 0.02</td>
          <td style="text-align: center">0.602 +/- 0.01</td>
          <td style="text-align: center"><strong>0.561 +/- 0.03</strong></td>
      </tr>
  </tbody>
</table>
<p>On MoleculeNet classification tasks (AUROC, higher is better):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">RDKit (norm)</th>
          <th style="text-align: center">ECFC4</th>
          <th style="text-align: center">CDDD</th>
          <th style="text-align: center">MolBERT</th>
          <th style="text-align: center">MolBERT (finetune)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BACE</td>
          <td style="text-align: center">0.831</td>
          <td style="text-align: center">0.845</td>
          <td style="text-align: center">0.833</td>
          <td style="text-align: center">0.849</td>
          <td style="text-align: center"><strong>0.866</strong></td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td style="text-align: center">0.696</td>
          <td style="text-align: center">0.678</td>
          <td style="text-align: center">0.761</td>
          <td style="text-align: center">0.750</td>
          <td style="text-align: center"><strong>0.762</strong></td>
      </tr>
      <tr>
          <td>HIV</td>
          <td style="text-align: center">0.708</td>
          <td style="text-align: center">0.714</td>
          <td style="text-align: center">0.753</td>
          <td style="text-align: center">0.747</td>
          <td style="text-align: center"><strong>0.783</strong></td>
      </tr>
  </tbody>
</table>
<p>Fine-tuned MolBERT achieved the best performance on all six QSAR datasets. When used as a fixed feature extractor with an SVM, MolBERT embeddings outperformed other representations on three of six tasks.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Pre-training task selection matters significantly.</strong> The choice of auxiliary tasks during pre-training has a large effect on downstream performance. PhysChemPred provides the strongest individual signal.</li>
<li><strong>Domain-relevant auxiliary tasks improve representation quality.</strong> Predicting physicochemical properties during pre-training encodes chemical knowledge directly into the embeddings, outperforming purely linguistic objectives.</li>
<li><strong>The SMILES equivalence task hurts performance.</strong> Despite being chemically motivated, the SMILES-Eq task consistently degraded results, suggesting it may introduce conflicting learning signals.</li>
<li><strong>PhysChemPred organizes the embedding space.</strong> Analysis of pairwise cosine similarities showed that models trained with PhysChemPred assign high similarity to permutations of the same molecule and low similarity to different molecules, creating a more semantically meaningful representation space.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The paper evaluates only SMILES-based representations, inheriting all limitations of string-based molecular encodings (inability to capture 3D structure, sensitivity to tokenization).</li>
<li>The virtual screening evaluation uses a fixed number of query molecules ($n = 5$), which may not reflect realistic screening scenarios.</li>
<li>Cross-validation splits from ChemBench were used for QSAR evaluation rather than scaffold splits, which may overestimate performance on structurally novel compounds.</li>
<li>The model&rsquo;s 128-token sequence length limit may truncate larger molecules, though relative positional embeddings partially address this at inference time.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending MolBERT to learn representations for other biological entities such as proteins, and developing more advanced pre-training strategies.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>GuacaMol (ChEMBL)</td>
          <td>~1.6M compounds</td>
          <td>80% train / 5% validation split</td>
      </tr>
      <tr>
          <td>Virtual Screening</td>
          <td>RDKit benchmark v1.2</td>
          <td>69 target datasets</td>
          <td>Filtered subset with active/decoy compounds</td>
      </tr>
      <tr>
          <td>QSAR (Regression)</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
      <tr>
          <td>QSAR (Classification)</td>
          <td>BACE, BBBP, HIV</td>
          <td>Varies</td>
          <td>From MoleculeNet, ChemBench splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BERT-Base (12 heads, 12 layers, 768-dim hidden, ~85M params)</li>
<li>Optimizer: Adam, learning rate $3 \times 10^{-5}$</li>
<li>Vocabulary: 42 tokens, sequence length 128</li>
<li>Masking: 15% of tokenized input</li>
<li>Positional encoding: relative positional embeddings (Transformer-XL)</li>
<li>Fine-tuning SVM: $C = 5.0$, RBF kernel (from Winter et al.)</li>
<li>Fine-tuning head: single linear layer on pooled output</li>
<li>Embeddings: pooled output (or average sequence output when only MaskedLM is used)</li>
</ul>
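<p>The 15% masking step listed above can be sketched in isolation. This is an illustrative simplification, not the authors&rsquo; code: full BERT masking also replaces some selected tokens with random tokens or leaves them unchanged (80/10/10), and the token and mask IDs here are made up.</p>

```python
import random

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Simplified BERT-style masking: hide ~15% of positions behind a
    [MASK] id; the hidden originals become the MaskedLM targets."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    targets = {}  # position -> original token the model must recover
    for i in range(len(corrupted)):
        if rng.random() < mask_prob:
            targets[i] = corrupted[i]
            corrupted[i] = mask_id
    return corrupted, targets

tokens = list(range(1, 129))  # a dummy 128-token sequence
corrupted, targets = mask_tokens(tokens, mask_id=0)
```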
<h3 id="models">Models</h3>
<ul>
<li>BERT-Base with ~85M parameters</li>
<li>Pre-trained weights available at <a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>Virtual Screening, Classification QSAR</td>
          <td>Standard area under ROC curve</td>
      </tr>
      <tr>
          <td>BEDROC20</td>
          <td>Virtual Screening</td>
          <td>Early enrichment metric, $\alpha = 20$</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression QSAR</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
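<p>The BEDROC20 metric in the table is not derived in the paper itself; the standard Truchon&ndash;Bayly definition, with $\alpha = 20$ matching the benchmark setting, can be sketched as:</p>

```python
import math

def bedroc(ranks, n_total, alpha=20.0):
    """BEDROC early-enrichment metric (standard Truchon-Bayly formula).

    ranks:   1-indexed positions of the actives in the ranked list
    n_total: total number of ranked compounds (actives + decoys)
    """
    n_actives = len(ranks)
    ra = n_actives / n_total
    # Robust Initial Enhancement (RIE): exponentially weighted enrichment
    numerator = sum(math.exp(-alpha * r / n_total) for r in ranks)
    denominator = ra * (1 - math.exp(-alpha)) / (math.exp(alpha / n_total) - 1)
    rie = numerator / denominator
    # Affine map of RIE onto [0, 1]
    factor = ra * math.sinh(alpha / 2) / (
        math.cosh(alpha / 2) - math.cosh(alpha / 2 - alpha * ra)
    )
    constant = 1 / (1 - math.exp(alpha * (1 - ra)))
    return rie * factor + constant

perfect = bedroc(list(range(1, 11)), n_total=100)   # actives ranked 1-10
worst = bedroc(list(range(91, 101)), n_total=100)   # actives ranked last
```

<p>With $\alpha = 20$, roughly 80% of the score is contributed by the top ~8% of the ranked list, which is why BEDROC20 rewards early enrichment much more than plain AUROC.</p>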
<h3 id="hardware">Hardware</h3>
<ul>
<li>2 GPUs, 16 CPUs</li>
<li>Pre-training time: ~40 hours (20 epochs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/MolBERT">BenevolentAI/MolBERT</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Official implementation with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fabian, B., Edlich, T., Gaspar, H., Segler, M., Meyers, J., Fiscato, M., &amp; Ahmed, M. (2020). Molecular representation learning with language models and domain-relevant auxiliary tasks. <em>arXiv preprint arXiv:2011.13230</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fabian2020molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular representation learning with language models and domain-relevant auxiliary tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fabian, Benedek and Edlich, Thomas and Gaspar, H{\&#39;e}l{\&#39;e}na and Segler, Marwin and Meyers, Joshua and Fiscato, Marco and Ahmed, Mohamed}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2011.13230}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints like ECFPs, which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
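<p>Training pairs for the best-performing task (randomized SMILES &rarr; canonical SMILES) are straightforward to generate with RDKit. The sketch below is illustrative, not the authors&rsquo; data pipeline:</p>

```python
from rdkit import Chem

def translation_pair(smiles):
    """Build one (source, target) pair: a randomized SMILES as the source
    sequence, the canonical SMILES of the same molecule as the target."""
    mol = Chem.MolFromSmiles(smiles)
    source = Chem.MolToSmiles(mol, doRandom=True)  # random atom traversal
    target = Chem.MolToSmiles(mol)                 # canonical form
    return source, target

src, tgt = translation_pair("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

<p>Because <code>doRandom=True</code> randomizes the atom traversal order, the source and target differ syntactically while denoting the same molecule, which is exactly the semantic signal the translation objective exploits.</p>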
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
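<p>The two input regularizers can be sketched in isolation (illustrative numpy code with a dummy embedding table; the authors&rsquo; TensorFlow implementation will differ in detail):</p>

```python
import numpy as np

def corrupt_input(char_ids, embed_table, p_drop=0.15, noise_std=0.05, seed=0):
    """Apply the two regularizers described above:
    1. character-level dropout: each input character dropped with prob 0.15
    2. Gaussian noise (std 0.05) added to the embedded sequence
    """
    rng = np.random.default_rng(seed)
    keep_mask = rng.random(len(char_ids)) >= p_drop
    kept = [c for c, keep in zip(char_ids, keep_mask) if keep]
    vecs = np.stack([embed_table[c] for c in kept])   # (seq_len, dim)
    noisy = vecs + rng.normal(0.0, noise_std, size=vecs.shape)
    return kept, noisy

embed_table = {i: np.zeros(8) for i in range(40)}  # dummy 8-dim embeddings
kept, noisy = corrupt_input(list(range(40)), embed_table)
```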
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
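<p>The cluster-split assignment can be sketched with a plain Lloyd k-means over bit vectors (random bits stand in for real MACCS keys here; the paper&rsquo;s exact clustering setup may differ):</p>

```python
import numpy as np

def kmeans_labels(fps, k=5, iters=20, seed=0):
    """Assign each fingerprint to one of k clusters (Lloyd's algorithm).
    Cluster-split CV then holds out one cluster per fold, so test
    compounds are structurally dissimilar to the training set."""
    rng = np.random.default_rng(seed)
    X = fps.astype(float)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

fps = np.random.default_rng(1).integers(0, 2, size=(200, 167))  # MACCS-sized
labels = kmeans_labels(fps, k=5)
folds = [np.where(labels == j)[0] for j in range(5)]  # one held-out cluster each
```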
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
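<p>One screening repetition reduces to ranking the library by similarity to the query actives and scoring the ranking. A minimal sketch with synthetic embeddings (ROC-AUC computed via the Mann&ndash;Whitney statistic; the actual protocol follows Riniker et al. with 50 repeats per target):</p>

```python
import numpy as np

def screen(query_vecs, library_vecs, library_labels):
    """Rank a library by max cosine similarity to any query active, then
    score the ranking with ROC-AUC (fraction of active/decoy pairs
    ordered correctly, ties counted half)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    lib = library_vecs / np.linalg.norm(library_vecs, axis=1, keepdims=True)
    scores = (lib @ q.T).max(axis=1)  # best similarity to any query
    pos = scores[library_labels == 1]
    neg = scores[library_labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

# Synthetic example: actives cluster near the query embeddings
rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 16))
actives = queries[rng.integers(0, 5, size=20)] + 0.05 * rng.normal(size=(20, 16))
decoys = rng.normal(size=(100, 16))
library = np.vstack([actives, decoys])
labels = np.array([1] * 20 + [0] * 100)
auc = screen(queries, library, labels)
```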
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+5.0 ROC-AUC points over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The additional task of predicting molecular properties from the latent vector improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely due to InChI&rsquo;s complex syntax (counting, arithmetic). The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary property network: 3 FC layers (512, 128, 9) predicting nine molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>CDDD descriptor extraction on a GPU is comparable in speed to RDKit fingerprint calculation on a CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AMORE: Testing ChemLLM Robustness to SMILES Variants</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/amore-smiles-robustness-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/amore-smiles-robustness-framework/</guid><description>AMORE is a zero-shot framework testing whether chemical language models recognize equivalent SMILES of the same molecule via embedding retrieval.</description><content:encoded><![CDATA[<h2 id="an-empirical-framework-for-probing-chemical-understanding">An Empirical Framework for Probing Chemical Understanding</h2>
<p>This is an <strong>Empirical</strong> paper that introduces Augmented Molecular Retrieval (AMORE), a zero-shot evaluation framework for chemical language models (ChemLMs). The primary contribution is a method to assess whether ChemLMs have learned genuine molecular semantics or simply memorize textual patterns. Rather than relying on traditional NLP metrics like BLEU and ROUGE, AMORE tests whether a model&rsquo;s embedding space treats chemically equivalent <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> representations as similar. The authors evaluate 12 models across multiple architectures (encoder-only, encoder-decoder, decoder-only) on two datasets and five augmentation types, and extend the analysis to downstream MoleculeNet tasks.</p>
<h2 id="why-standard-nlp-metrics-fail-for-chemical-evaluation">Why Standard NLP Metrics Fail for Chemical Evaluation</h2>
<p>Chemical language models are typically evaluated using text-based metrics from NLP (BLEU, ROUGE, METEOR) on tasks like molecule captioning. These metrics compare word overlap and sentence fluency but cannot detect whether a model truly understands molecular structure. A SMILES string like <code>C(=O)O</code> and its canonicalized or kekulized form represent the same molecule, yet text-based metrics would penalize valid reformulations. Embedding-based metrics like BERTScore are also insufficient because they were trained on general text, not chemical notation.</p>
<p>The core research question is direct: do evaluation metrics used on ChemLMs reflect actual chemical knowledge, or do the models simply imitate understanding by learning textual features? This question has practical consequences in pharmaceuticals and healthcare, where missteps in chemical reasoning carry serious risks.</p>
<h2 id="embedding-based-retrieval-as-a-chemical-litmus-test">Embedding-Based Retrieval as a Chemical Litmus Test</h2>
<p>AMORE exploits a fundamental property of molecular representations: a single molecule can be written as multiple valid SMILES strings that are chemically identical. These serve as &ldquo;total synonyms,&rdquo; a concept without a true analogue in natural language.</p>
<p>The framework works in four steps:</p>
<ol>
<li>Take a set $X = (x_1, x_2, \ldots, x_n)$ of $n$ molecular representations.</li>
<li>Apply a transformation $f$ to obtain augmented representations $X' = (x'_1, x'_2, \ldots, x'_n)$, where $x'_i = f(x_i)$. The constraint is that $f$ must not change the underlying molecule.</li>
<li>Obtain vectorized embeddings $e(x_i)$ and $e(x'_j)$ from the model for each original and augmented SMILES.</li>
<li>Evaluate in a retrieval task: given $e(x_i)$, retrieve $e(x'_i)$ from the augmented set.</li>
</ol>
<p>The evaluation metrics are top-$k$ accuracy (whether the correct augmented SMILES ranks at position $\leq k$) and <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR). Retrieval uses <a href="https://en.wikipedia.org/wiki/FAISS">FAISS</a> for efficient nearest-neighbor search. The key insight is that if a model truly understands molecular structure, it should embed different SMILES representations of the same molecule close together.</p>
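<p>The scoring loop itself is compact. A minimal version with synthetic embeddings (brute-force numpy search in place of FAISS, cosine similarity assumed as the ranking function):</p>

```python
import numpy as np

def retrieval_metrics(orig, augm, k=5):
    """AMORE-style scoring: for each original embedding, rank all
    augmented embeddings by cosine similarity and record where the true
    counterpart lands. Returns (Acc@k, MRR)."""
    a = orig / np.linalg.norm(orig, axis=1, keepdims=True)
    b = augm / np.linalg.norm(augm, axis=1, keepdims=True)
    sims = a @ b.T                      # (n, n) similarity matrix
    order = np.argsort(-sims, axis=1)   # best match first
    # 1-indexed rank of the true counterpart i for each query i
    ranks = 1 + np.argmax(order == np.arange(len(a))[:, None], axis=1)
    acc_at_k = float((ranks <= k).mean())
    mrr = float((1.0 / ranks).mean())
    return acc_at_k, mrr

orig = np.random.default_rng(0).normal(size=(200, 32))
augm = orig + 0.01 * np.random.default_rng(1).normal(size=orig.shape)
acc1, mrr = retrieval_metrics(orig, augm, k=1)
```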
<h3 id="five-smiles-augmentation-types">Five SMILES Augmentation Types</h3>
<p>The framework uses five identity-preserving augmentations, all executed through <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>:</p>
<ol>
<li><strong>Canonicalization</strong>: Transform SMILES to the standardized RDKit canonical form.</li>
<li><strong>Hydrogen addition</strong>: Explicitly add hydrogen atoms that are normally implied (e.g., <code>C</code> becomes <code>[CH4]</code>). This dramatically increases string length.</li>
<li><strong>Kekulization</strong>: Convert aromatic ring notation to explicit alternating double bonds.</li>
<li><strong>Cycle renumbering</strong>: Replace ring-closure digit identifiers with random valid alternatives.</li>
<li><strong>Random atom order</strong>: Randomize the atom traversal order used to generate the SMILES string.</li>
</ol>
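<p>Four of the five augmentations map directly onto RDKit calls (a sketch, not the paper&rsquo;s code; cycle renumbering requires string-level manipulation of ring-closure digits and is omitted here):</p>

```python
from rdkit import Chem

def augmentations(smiles):
    """Generate identity-preserving SMILES variants of one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    kek = Chem.Mol(mol)  # work on a copy before destroying aromatic flags
    Chem.Kekulize(kek, clearAromaticFlags=True)
    return {
        "canonical": Chem.MolToSmiles(mol),
        "hydrogens": Chem.MolToSmiles(Chem.AddHs(mol)),      # explicit H atoms
        "kekulized": Chem.MolToSmiles(kek, kekuleSmiles=True),
        "random_order": Chem.MolToSmiles(mol, doRandom=True), # random traversal
    }

variants = augmentations("c1ccccc1O")  # phenol
```

<p>Parsing any of these variants back with RDKit and re-canonicalizing recovers the same canonical SMILES, which is precisely the identity the retrieval test exploits.</p>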
<h2 id="twelve-models-two-datasets-five-augmentations">Twelve Models, Two Datasets, Five Augmentations</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>The authors test 12 publicly available Transformer-based models spanning three architecture families:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Domain</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text+Chem T5-standard</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>Text+Chem T5-augm</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-large</td>
          <td>Cross-modal</td>
          <td>770M</td>
      </tr>
      <tr>
          <td>SciFive</td>
          <td>Text-only</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>PubChemDeBERTa</td>
          <td>Chemical</td>
          <td>86M</td>
      </tr>
      <tr>
          <td>ChemBERT-ChEMBL</td>
          <td>Chemical</td>
          <td>6M</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a></td>
          <td>Chemical</td>
          <td>125M</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/">BARTSmiles</a></td>
          <td>Chemical</td>
          <td>400M</td>
      </tr>
      <tr>
          <td>ZINC-RoBERTa</td>
          <td>Chemical</td>
          <td>102M</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/multimodal-molecular/nach0-multimodal-chemical-language-model/">nach0</a></td>
          <td>Chemical</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>ZINC-GPT</td>
          <td>Chemical</td>
          <td>87M</td>
      </tr>
  </tbody>
</table>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>ChEBI-20 test set</strong>: ~3,300 molecule-description pairs, used for both AMORE retrieval and molecule captioning comparisons.</li>
<li><strong>Isomers</strong> (QM9 subset): 918 molecules that are all isomers of C9H12N2O, making retrieval harder because all molecules share the same molecular formula.</li>
</ul>
<h3 id="key-results-on-chebi-20">Key Results on ChEBI-20</h3>
<p>On the ChEBI-20 dataset (Table 2 from the paper), top-1 accuracy varies enormously by augmentation type. Cycle renumbering is easiest (up to 98.48% Acc@1 for SciFive), while hydrogen addition is hardest (no model exceeds 5.97% Acc@1).</p>
<p>For the cross-modal Text+Chem T5-standard model:</p>
<table>
  <thead>
      <tr>
          <th>Augmentation</th>
          <th>Acc@1</th>
          <th>Acc@5</th>
          <th>MRR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>63.03</td>
          <td>82.76</td>
          <td>72.4</td>
      </tr>
      <tr>
          <td>Hydrogen</td>
          <td>5.46</td>
          <td>10.85</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>Kekulization</td>
          <td>76.76</td>
          <td>92.03</td>
          <td>83.8</td>
      </tr>
      <tr>
          <td>Cycle</td>
          <td>96.70</td>
          <td>99.82</td>
          <td>98.2</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>46.94</td>
          <td>74.18</td>
          <td>59.33</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-on-isomers">Key Results on Isomers</h3>
<p>Performance drops substantially on the Isomers dataset, where all molecules share the same formula. The best Acc@1 for hydrogen augmentation is just 1.53% (MolT5-large). Even for the relatively easy cycle augmentation, top scores drop from the high 90s to the low 90s for most models, and some models (BARTSmiles: 41.83%) struggle considerably.</p>
<h3 id="downstream-moleculenet-impact">Downstream MoleculeNet Impact</h3>
<p>The authors also fine-tuned models on original <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> training data and tested on augmented test sets across 9 tasks (regression, binary classification, multilabel classification). Results confirm that augmentations degrade downstream performance. For example, on ESOL regression, RMSE increased from 0.87 to 7.93 with hydrogen addition. Rankings computed using the Vote&rsquo;n&rsquo;Rank framework (using the <a href="https://en.wikipedia.org/wiki/Copeland%27s_method">Copeland rule</a>) show that hydrogen augmentation is the only one that substantially reshuffles model rankings; other augmentations preserve the original ordering.</p>
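<p>The Copeland aggregation behind those rankings can be sketched in a few lines. This is an illustrative pure-Python version, not the Vote&rsquo;n&rsquo;Rank implementation; model names and scores are placeholders:</p>

```python
from itertools import combinations

def copeland_ranking(scores):
    """Rank models by the Copeland rule.

    scores: dict mapping model name -> list of per-task scores
            (higher is better on every task).
    Each model earns +1 for every pairwise "election" it wins
    (beats the other model on a majority of tasks), -1 for each
    loss, and 0 for a tie; models are sorted by this total.
    """
    models = list(scores)
    copeland = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        wins_a = sum(sa > sb for sa, sb in zip(scores[a], scores[b]))
        wins_b = sum(sb > sa for sa, sb in zip(scores[a], scores[b]))
        if wins_a > wins_b:
            copeland[a] += 1
            copeland[b] -= 1
        elif wins_b > wins_a:
            copeland[b] += 1
            copeland[a] -= 1
    return sorted(models, key=lambda m: copeland[m], reverse=True)
```

<p>Because only pairwise majorities matter, a single catastrophic task (like hydrogen augmentation) can flip many pairwise elections at once, which is why it reshuffles the ranking while milder augmentations do not.</p>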
<h3 id="correlation-between-amore-and-captioning-metrics">Correlation Between AMORE and Captioning Metrics</h3>
<p>The differences in ROUGE/METEOR between original and augmented SMILES correlate with AMORE retrieval accuracy (Spearman correlation &gt; 0.7 with p-value = 0.003 for Acc@1). This validates AMORE as a proxy for predicting how augmentations will affect generation quality, without requiring labeled captioning data.</p>
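<p>The reported correlation is a plain Spearman rank correlation; for tie-free data it reduces to the classic closed form, sketched here without the tie correction a full implementation (e.g. SciPy&rsquo;s) would apply:</p>

```python
def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data: Pearson
    correlation of the ranks, computed via the closed form
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```
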
<h2 id="current-chemlms-learn-syntax-not-chemistry">Current ChemLMs Learn Syntax, Not Chemistry</h2>
<p>The central finding is that existing ChemLMs are not robust to identity-preserving SMILES augmentations. Several specific conclusions emerge:</p>
<ol>
<li>
<p><strong>Hydrogen augmentation is catastrophic</strong>: All models fail (&lt; 6% Acc@1 on ChEBI-20, &lt; 2% on Isomers). The authors attribute this to the near-complete absence of explicit hydrogen in pretraining data, creating a distribution shift.</p>
</li>
<li>
<p><strong>Cross-modal models outperform unimodal ones</strong>: Models trained on both text and SMILES (Text+Chem T5, MolT5) consistently achieve higher retrieval accuracy on four of five augmentations.</p>
</li>
<li>
<p><strong>Augmentation difficulty follows a consistent order</strong>: For all models, hydrogen is hardest, followed by random ordering, canonicalization, kekulization, and cycle renumbering (easiest).</p>
</li>
<li>
<p><strong>Layer-wise analysis reveals instability</strong>: Retrieval accuracy across Transformer layers is correlated across augmentation types, suggesting that representations degrade at the same layers regardless of augmentation.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> partially explains difficulty</strong>: Hydrogen augmentation produces strings ~2x longer than originals (Levenshtein ratio of 1.49), but the low correlation between Levenshtein ratio and downstream metrics (ROUGE1 correlation of -0.05 for hydrogen) suggests string length alone does not explain the failure.</p>
</li>
</ol>
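<p>Levenshtein distance itself is a standard dynamic program; the ratio below normalizes by the original string length, which is one plausible reading of the paper&rsquo;s &ldquo;Levenshtein ratio&rdquo; (the exact normalization is an assumption here):</p>

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via the classic
    dynamic program, keeping only one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def levenshtein_ratio(original, augmented):
    """Distance normalized by the original length -- an assumed
    normalization for comparing augmentation severity."""
    return levenshtein(original, augmented) / len(original)
```
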
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations. Only publicly available HuggingFace models were evaluated, excluding models like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">Molformer</a> that lack HF checkpoints. The study focuses exclusively on SMILES sequences, not 3D molecular structures or other formats like <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>. The augmentation types, while representative, do not cover all possible identity transformations.</p>
<p>The authors suggest that AMORE could serve as a regularization tool during training, for example by using metric learning to encourage models to embed SMILES variants of the same molecule close together.</p>
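<p>One minimal way to realize that suggestion is a triplet-margin term over embeddings of SMILES variants. This is a sketch of the general metric-learning idea, not anything the paper implements; all names are illustrative:</p>

```python
import numpy as np

def variant_triplet_loss(anchor, positive, negative, margin=1.0):
    """anchor, positive: embeddings of two SMILES variants of the
    SAME molecule; negative: embedding of a different molecule.
    Penalizes the model when same-molecule variants sit farther
    apart than a different molecule, up to a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)
```
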
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval evaluation</td>
          <td>ChEBI-20 test set</td>
          <td>3,300 molecules</td>
          <td>Standard benchmark for molecule captioning</td>
      </tr>
      <tr>
          <td>Retrieval evaluation</td>
          <td>Isomers (QM9 subset)</td>
          <td>918 molecules</td>
          <td>All isomers of C9H12N2O</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>Varies</td>
          <td>ESOL, Lipophilicity, FreeSolv, HIV, BBBP, BACE, Tox21, ToxCast, SIDER</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES augmentations via RDKit (canonicalization, hydrogen addition, kekulization, cycle renumbering, random atom ordering)</li>
<li>Nearest-neighbor retrieval using FAISS with L2, cosine, inner product, and HNSW metrics</li>
<li>Model ranking via Vote&rsquo;n&rsquo;Rank (Copeland rule) on MoleculeNet tasks</li>
</ul>
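<p>Stripped of FAISS, the retrieval step reduces to a nearest-neighbor lookup: embed original and augmented SMILES, then check whether each augmented embedding&rsquo;s closest original (here under cosine similarity) belongs to the same molecule. Embeddings below are placeholders for model hidden states:</p>

```python
import numpy as np

def amore_acc_at_1(orig_emb, aug_emb):
    """orig_emb, aug_emb: (n_molecules, dim) arrays where row i
    embeds molecule i. Returns top-1 retrieval accuracy under
    cosine similarity (exact search; FAISS adds approximate
    indices such as HNSW for large corpora)."""
    o = orig_emb / np.linalg.norm(orig_emb, axis=1, keepdims=True)
    a = aug_emb / np.linalg.norm(aug_emb, axis=1, keepdims=True)
    sims = a @ o.T                    # (n, n) cosine similarities
    nearest = sims.argmax(axis=1)     # index of closest original
    return float((nearest == np.arange(len(a))).mean())
```
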
<h3 id="models">Models</h3>
<p>All 12 evaluated models are publicly available on HuggingFace. No custom model training was performed for the AMORE retrieval experiments. MoleculeNet experiments used standard fine-tuning on original training splits.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acc@1</td>
          <td>Top-1 retrieval accuracy</td>
          <td>Primary AMORE metric</td>
      </tr>
      <tr>
          <td>Acc@5</td>
          <td>Top-5 retrieval accuracy</td>
          <td>Secondary AMORE metric</td>
      </tr>
      <tr>
          <td>MRR</td>
          <td>Mean Reciprocal Rank</td>
          <td>Mean of reciprocal ranks of the correct match</td>
      </tr>
      <tr>
          <td>ROUGE-2</td>
          <td>Bigram overlap for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MT evaluation metric for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
  </tbody>
</table>
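<p>The three retrieval metrics are simple to compute from ranked candidate lists; a minimal sketch (inputs are placeholders, and a missing match contributes 0 to MRR by convention here):</p>

```python
def retrieval_metrics(ranked_ids, true_ids, ks=(1, 5)):
    """ranked_ids: one ranked list of candidate ids per query.
    true_ids: the correct id for each query.
    Returns {k: Acc@k} and MRR (mean of 1/rank of the match)."""
    n = len(true_ids)
    acc = {k: sum(t in r[:k] for r, t in zip(ranked_ids, true_ids)) / n
           for k in ks}
    reciprocal = [1.0 / (r.index(t) + 1) if t in r else 0.0
                  for r, t in zip(ranked_ids, true_ids)]
    return acc, sum(reciprocal) / n
```
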
<h3 id="hardware">Hardware</h3>
<p>Computational resources were provided by HPC facilities at HSE University. Specific GPU types and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChemistryLLMs/AMORE">AMORE GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Framework code and evaluation data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ganeeva, V., Khrabrov, K., Kadurin, A., &amp; Tutubalina, E. (2025). Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. <em>Journal of Cheminformatics</em>, 17(1). <a href="https://doi.org/10.1186/s13321-025-01079-0">https://doi.org/10.1186/s13321-025-01079-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ganeeva2025measuring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ganeeva, Veronika and Khrabrov, Kuzma and Kadurin, Artur and Tutubalina, Elena}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-025-01079-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BARTSmiles: BART Pre-Training for Molecular SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/</guid><description>BARTSmiles applies BART-style denoising pre-training to 1.7B SMILES from ZINC20, achieving top results on 11 molecular property and reaction tasks.</description><content:encoded><![CDATA[<h2 id="a-bart-based-method-for-molecular-self-supervised-learning">A BART-Based Method for Molecular Self-Supervised Learning</h2>
<p>BARTSmiles is a <strong>Method</strong> paper. It introduces a self-supervised pre-training approach for molecular representations based on the BART (Bidirectional and Auto-Regressive Transformers) architecture from Lewis et al. (2019). The primary contribution is a pre-training strategy, discovered through systematic ablations, that trains a BART-large model on 1.7 billion deduplicated <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a> from the <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC20 dataset</a>. BARTSmiles achieves the best reported results on 11 tasks spanning molecular property classification, regression, and chemical reaction generation.</p>
<h2 id="scaling-self-supervised-molecular-representations-beyond-prior-work">Scaling Self-Supervised Molecular Representations Beyond Prior Work</h2>
<p>At the time of publication, large-scale self-supervised representation learning had produced significant improvements in NLP, computer vision, and speech, but molecular representation learning had not benefited from comparable scale. Previous SMILES-based pre-trained models such as <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> (Chithrananda et al., 2020) and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) used encoder-only or encoder-decoder architectures with substantially less compute. Chemformer, the most closely related prior work, also trained a BART-like model but with a fraction of the compute and data.</p>
<p>The paper argues that three gaps needed to be addressed:</p>
<ol>
<li><strong>Scale</strong>: Prior molecular pre-training used orders of magnitude less compute than NLP pre-training.</li>
<li><strong>Architecture choice</strong>: Encoder-only models like ChemBERTa cannot perform generative fine-tuning (retrosynthesis, reaction prediction), limiting their applicability.</li>
<li><strong>Pre-training recipe</strong>: Standard BART hyperparameters (e.g., 30% mask token budget) were tuned for natural language and had not been validated for molecular SMILES strings.</li>
</ol>
<h2 id="core-innovation-ablation-driven-pre-training-recipe-for-smiles">Core Innovation: Ablation-Driven Pre-Training Recipe for SMILES</h2>
<p>The key insight of BARTSmiles is that the BART denoising objective, when carefully tuned for the molecular domain, learns representations that implicitly encode downstream task information. The authors discover this through a systematic three-stage ablation:</p>
<h3 id="tokenization">Tokenization</h3>
<p>Rather than using hand-crafted tokenization rules that separate individual atoms (C, N, H) and bond symbols (#, =), BARTSmiles uses a learned SentencePiece unigram tokenizer trained on 10 million random SMILES with a vocabulary size of 1,021. On matched compute budgets, learned tokenization achieves 0.801 average AUC-ROC vs. 0.779 for hand-crafted tokenization on the ablation benchmark (HIV, BBBP, ClinTox).</p>
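<p>For contrast, a hand-crafted tokenizer of the kind BARTSmiles ablates against typically splits SMILES with a regex over bracket atoms, two-letter elements, single atoms, and bond symbols. The pattern below follows common SMILES-tokenization practice and is an illustration, not the paper&rsquo;s exact rules:</p>

```python
import re

# One token per bracket atom, two-letter element, single atom,
# bond/branch symbol, or ring-closure digit.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|se|@@|[BCNOPSFIbcnops]"
    r"|[=#\-\+\(\)/\\.%:~]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: tokenization must cover the whole string.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```

<p>A learned unigram tokenizer instead discovers multi-character substructure tokens from data, which is what the ablation found to work better on matched compute.</p>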
<h3 id="masking-strategy">Masking Strategy</h3>
<p>The BART denoising objective has three main hyperparameters: the mask token budget (fraction of tokens masked), random mask probability, and the Poisson $\lambda$ controlling mask span length. The ablation results show:</p>
<ul>
<li><strong>Mask token budget</strong>: The standard BART value of 0.30 is suboptimal for molecules. A budget of 0.20 performs best (0.821 AUC-ROC), with performance degrading at both lower (0.10: 0.753) and higher (0.40: 0.701) budgets.</li>
<li><strong>Span masking</strong>: The choice of random mask probability and $\lambda$ has a minor effect once the budget is set to 0.20. Values of random mask = 0.10 and $\lambda$ = 2.5 or 3.5 all yield 0.821.</li>
<li><strong>Token randomization</strong>: Disabling the randomize-tokens noise (where some tokens are replaced with random tokens rather than masked) improves performance from 0.821 to 0.835.</li>
</ul>
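<p>The noising step can be sketched with numpy: sample span lengths from Poisson($\lambda$), replace each span with a single mask token, and stop once the token budget is consumed. This is a simplified version of BART-style text infilling under the ablated settings; exact BARTSmiles details may differ:</p>

```python
import numpy as np

def span_mask(tokens, budget=0.20, lam=2.5, mask="<mask>", seed=0):
    """Mask Poisson-length spans until ~budget of tokens are
    consumed; each span collapses to one mask token."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    to_mask = int(budget * n)
    out, i, masked = [], 0, 0
    while i < n:
        if masked < to_mask and rng.random() < 0.15:
            span = max(1, rng.poisson(lam))          # span length >= 1
            span = min(span, to_mask - masked, n - i)  # respect budget
            out.append(mask)                          # span -> one token
            i += span
            masked += span
        else:
            out.append(tokens[i])
            i += 1
    return out
```
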
<h3 id="scale">Scale</h3>
<p>Training on the full 1.7 billion molecule ZINC20 dataset (20 hours on 1,024 A100 GPUs, totaling 20,480 A100 GPU-hours) improves performance by 5 absolute AUC-ROC points over the same model trained on 100 million samples. The previous most compute-intensive molecular pre-training used 3,330 V100-hours (Ross et al., 2021).</p>
<h3 id="implicit-task-encoding">Implicit Task Encoding</h3>
<p>The paper provides a quantitative demonstration that frozen BARTSmiles representations encode task-specific information. Using L1-regularized logistic regression on frozen 1,024-dimensional mean-pooled representations, just 7 neurons are sufficient to achieve 0.987 AUC-ROC on ClinTox (within 2 percentage points of full fine-tuning). Even a single neuron achieves 0.77 AUC-ROC on ClinTox subtask 1.</p>
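<p>The sparse-probe experiment amounts to L1-regularized logistic regression on frozen embeddings. A minimal proximal-gradient version is sketched below for illustration (the paper presumably used a standard solver); the L1 soft-threshold is what drives most weights to exactly zero, exposing the handful of informative neurons:</p>

```python
import numpy as np

def l1_logistic_probe(X, y, lam=0.05, lr=0.5, steps=3000):
    """Fit w, b minimizing logistic loss + lam * ||w||_1 via
    proximal gradient descent (gradient step + soft-threshold)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n)
        b -= lr * float(np.mean(p - y))
        # Proximal operator of the L1 penalty: soft-thresholding.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b
```
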
<h2 id="experimental-setup-moleculenet-toxicology-and-generative-benchmarks">Experimental Setup: MoleculeNet, Toxicology, and Generative Benchmarks</h2>
<h3 id="classification-tasks">Classification Tasks</h3>
<p>BARTSmiles is evaluated on 7 classification datasets from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (SIDER, ClinTox, Tox21, ToxCast, HIV, BACE, BBBP) plus 2 toxicology datasets (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames</a>, <a href="https://en.wikipedia.org/wiki/Micronucleus_test">Micronucleus Assay</a>). All classification tasks use AUC-ROC. Baselines include both supervised graph models (D-MPNN, Attentive FP, 3D InfoMax) and self-supervised methods (ChemBERTa, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MolFormer-XL</a>, GROVER-large, MolCLR, iMolCLR).</p>
<p>Selected classification results (AUC-ROC):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td><strong>0.997</strong></td>
          <td>0.954</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td><strong>0.825</strong></td>
          <td>0.805</td>
          <td>Attentive FP</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td><strong>0.705</strong></td>
          <td>0.699</td>
          <td>iMolCLR</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>0.851</td>
          <td>0.858</td>
          <td>Attentive FP</td>
      </tr>
  </tbody>
</table>
<p>The authors note that three scaffold-split datasets (HIV, BACE, BBBP) are highly sensitive to the specific split used, and they suspect some baseline results use different or random splits. These results are marked with caveats in the paper.</p>
<h3 id="regression-tasks">Regression Tasks</h3>
<p>All three MoleculeNet regression tasks (ESOL, FreeSolv, Lipophilicity) are evaluated using RMSE:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>BARTSmiles</th>
          <th>Previous Best</th>
          <th>Previous Best Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td><strong>0.095</strong></td>
          <td>0.279</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td><strong>0.114</strong></td>
          <td>0.231</td>
          <td>MoLFormer-XL</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td><strong>0.292</strong></td>
          <td>0.529</td>
          <td>MoLFormer-XL</td>
      </tr>
  </tbody>
</table>
<p>BARTSmiles achieves substantial improvements on all three regression tasks.</p>
<h3 id="generative-tasks">Generative Tasks</h3>
<p><strong><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></strong> (USPTO-50k): BARTSmiles achieves 55.6% Top-1 accuracy using a sample-128 + perplexity re-ranking strategy, compared to 55.3% for Dual-TF and 54.3% for Chemformer. Top-5 and Top-10 results are 74.2% and 80.9%, respectively.</p>
<p><strong>Chemical Reaction Prediction</strong> (USPTO MIT/LEF/STEREO): BARTSmiles with beam search outperforms the <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a> baseline across all six evaluation settings. On USPTO-MIT (split), BARTSmiles achieves 91.8% vs. 90.4% for the Transformer baseline.</p>
<h3 id="fine-tuning-recipe">Fine-Tuning Recipe</h3>
<p>The fine-tuning approach is designed to minimize hyperparameter tuning:</p>
<ul>
<li>Batch size 16, 10 epochs, polynomial decay learning rate schedule with warmup at 16% of training</li>
<li>Grid search over dropout (0.1, 0.2, 0.3) and learning rate ($5 \times 10^{-6}$, $1 \times 10^{-5}$, $3 \times 10^{-5}$)</li>
<li>Stochastic Weight Averaging (SWA) over three sets of four checkpoints</li>
<li>For generative tasks: R3F regularization (Aghajanyan et al., 2020a) and full fp32 precision</li>
<li>For generation: beam search (beam size 10) or sample 128 sequences with perplexity re-ranking</li>
</ul>
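<p>The SWA step in the recipe is just an element-wise mean over checkpoint parameters. Sketched here over plain numpy state dicts for clarity (frameworks like PyTorch provide SWA utilities natively):</p>

```python
import numpy as np

def swa_average(checkpoints):
    """checkpoints: list of dicts mapping parameter name ->
    np.ndarray. Returns the element-wise average, as used to
    combine each set of four checkpoints."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0)
            for k in keys}
```
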
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li><strong>Scale matters for molecular pre-training</strong>: Training on 1.7B molecules with 20,480 A100 GPU-hours yields 5 absolute points of AUC-ROC improvement over training on 100M molecules.</li>
<li><strong>Domain-specific ablation is necessary</strong>: The optimal BART masking configuration for molecules (20% budget, no token randomization) differs from the standard NLP configuration (30% budget, with randomization).</li>
<li><strong>Frozen representations capture task structure</strong>: A small number of neurons from the frozen model can nearly match full fine-tuning performance on certain tasks, suggesting the pre-training objective implicitly encodes molecular properties.</li>
<li><strong>Interpretability aligns with domain knowledge</strong>: Integrated Gradients attribution on fine-tuned BARTSmiles highlights known structural alerts (e.g., <a href="https://en.wikipedia.org/wiki/Nitro_compound">nitro groups</a> in mutagenic compounds, hydroxyl groups in soluble compounds).</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li><strong>Scaffold split sensitivity</strong>: Results on HIV, BACE, and BBBP are sensitive to the specific scaffold split, making direct comparison with baselines difficult.</li>
<li><strong>Pre-training data distribution</strong>: The <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_distance">Fréchet distance</a> analysis shows that some downstream datasets (BBBP, SIDER) are far from ZINC20 in representation space, which may explain weaker performance on those tasks.</li>
<li><strong>Fingerprints carry complementary information</strong>: On the Ames and Micronucleus Assay datasets, BARTSmiles alone does not beat fingerprint-based baselines. Combining BARTSmiles with ECFP4 fingerprints closes the gap, implying that SMILES-based pre-training does not fully capture all structural information.</li>
<li><strong>Compute requirements</strong>: Pre-training requires 1,024 A100 GPUs, which limits accessibility.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest investigating the impact of pre-training data composition, noting that ZINC20 contains over a billion molecules but its distribution may be irrelevant for many downstream tasks. They also propose further collaboration between ML and chemistry experts to discover new molecular substructure-property relationships.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pre-training, fine-tuning, and evaluation scripts with pre-trained weights</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20 (deduplicated)</td>
          <td>~1.7B molecules</td>
          <td>Canonicalized SMILES, 10K validation holdout</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>MoleculeNet (7 datasets)</td>
          <td>1,427-41,127 compounds</td>
          <td>AUC-ROC metric</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MoleculeNet (3 datasets)</td>
          <td>642-4,200 compounds</td>
          <td>RMSE metric</td>
      </tr>
      <tr>
          <td>Toxicology</td>
          <td>Ames, MN Assay</td>
          <td>6,512 / 641 compounds</td>
          <td>Cross-validation for Ames; external test for MN</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>USPTO-50k</td>
          <td>Standard split</td>
          <td>Top-K accuracy</td>
      </tr>
      <tr>
          <td>Reaction prediction</td>
          <td>USPTO (MIT/LEF/STEREO)</td>
          <td>Standard splits</td>
          <td>Top-1 accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: BART-Large (pre-layer norm Transformer encoder-decoder)</li>
<li>Tokenizer: SentencePiece unigram, vocabulary size 1,021, max sequence length 128</li>
<li>Pre-training objective: BART denoising (mask token budget 0.20, Poisson span masking with $\lambda$ = 2.5, no token randomization)</li>
<li>Fine-tuning: polynomial decay LR, SWA, grid search over dropout and LR</li>
<li>Generative fine-tuning: R3F regularization, fp32 precision, Adam initialized from pre-training moving averages</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>BART-Large architecture (exact parameter count not specified in paper)</li>
<li>Pre-trained checkpoint released on GitHub</li>
<li>Maximum sequence length: 128 tokens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>BARTSmiles</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ClinTox</td>
          <td>AUC-ROC</td>
          <td>0.997</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>AUC-ROC</td>
          <td>0.825</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.095</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>0.114</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>RMSE</td>
          <td>0.292</td>
          <td>New SOTA</td>
      </tr>
      <tr>
          <td>USPTO-50k Retro (Top-1)</td>
          <td>Accuracy</td>
          <td>55.6%</td>
          <td>New SOTA (sample + re-rank)</td>
      </tr>
      <tr>
          <td>USPTO-MIT Rxn (Split)</td>
          <td>Accuracy</td>
          <td>91.8%</td>
          <td>New SOTA (beam-10)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 1,024 NVIDIA A100 GPUs for 20 hours (20,480 A100 GPU-hours)</li>
<li>Ablation runs: 128 A100 GPUs per run</li>
<li>Framework: FairSeq with FairScale (fully sharded data parallel), automatic mixed precision</li>
<li>Experiment tracking: Aim</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chilingaryan, G., Tamoyan, H., Tevosyan, A., Babayan, N., Khondkaryan, L., Hambardzumyan, K., Navoyan, Z., Khachatrian, H., &amp; Aghajanyan, A. (2024). BARTSmiles: Generative Masked Language Models for Molecular Representations. <em>Journal of Chemical Information and Modeling</em>, 64(15), 5832-5843. <a href="https://doi.org/10.1021/acs.jcim.4c00512">https://doi.org/10.1021/acs.jcim.4c00512</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2024 (preprint: arXiv 2022)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/YerevaNN/BARTSmiles">BARTSmiles GitHub Repository (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chilingaryan2024bartsmiles,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BARTSmiles: Generative Masked Language Models for Molecular Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chilingaryan, Gayane and Tamoyan, Hovhannes and Tevosyan, Ani and Babayan, Nelly and Khondkaryan, Lusine and Hambardzumyan, Karen and Navoyan, Zaven and Khachatrian, Hrant and Aghajanyan, Armen}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5832--5843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00512}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SELFormer: A SELFIES-Based Molecular Language Model</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/selformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/selformer/</guid><description>A SELFIES-based RoBERTa model pretrained on 2M ChEMBL molecules for molecular property prediction on MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-chemical-language-model">A SELFIES-Based Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$) with a secondary <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>SELFormer applies the RoBERTa transformer architecture to <a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">SELFIES</a> molecular string representations instead of the <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> notation used by prior chemical language models. The model is pretrained via masked language modeling (MLM) on 2M drug-like compounds from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> and fine-tuned for molecular property prediction tasks on <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks. The authors release pretrained models, fine-tuning code, and datasets as open-source resources.</p>
<h2 id="why-selfies-over-smiles-for-pretraining">Why SELFIES Over SMILES for Pretraining?</h2>
<p>Existing chemical language models, including <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MolBERT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MolFormer</a>, all use SMILES as their input representation. SMILES has well-documented validity and robustness issues: arbitrary perturbations to a SMILES string frequently produce syntactically invalid outputs. This means a pretrained model must spend capacity learning SMILES grammar rules rather than chemical semantics.</p>
<p><a href="/notes/computational-chemistry/molecular-representations/selfies-original-paper/">SELFIES</a> addresses this by construction: every possible SELFIES string decodes to a valid molecule. Despite this theoretical advantage and SELFIES&rsquo; growing adoption in generative chemistry, no prior work had systematically evaluated SELFIES as input for large-scale transformer pretraining. SELFormer fills this gap by providing a direct comparison between SELFIES-based and SMILES-based chemical language models on standard benchmarks.</p>
<h2 id="masked-language-modeling-on-guaranteed-valid-molecular-strings">Masked Language Modeling on Guaranteed-Valid Molecular Strings</h2>
<p>SELFormer uses byte-level Byte-Pair Encoding (BPE) to tokenize SELFIES strings, then pretrains a RoBERTa encoder using the standard MLM objective. 15% of input tokens are masked, and the model minimizes the cross-entropy loss over the masked positions:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = -\frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $\mathcal{M}$ is the set of masked token indices, $x_i$ is the true token at position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted input context, and $\theta$ are the model parameters.</p>
<p>The key insight is that because SELFIES guarantees 100% validity, every masked token prediction corresponds to a valid molecular fragment. The model never wastes capacity predicting invalid chemistry. For fine-tuning, a two-layer classification or regression head is added on top of the encoder&rsquo;s output embedding.</p>
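<p>The MLM loss above translates directly into numpy for a toy vocabulary; logits, targets, and mask indices below are stand-ins for model outputs:</p>

```python
import numpy as np

def mlm_loss(logits, targets, masked_idx):
    """Cross-entropy averaged over masked positions only.
    logits: (seq_len, vocab) unnormalized scores,
    targets: (seq_len,) true token ids,
    masked_idx: indices of the masked set M."""
    z = logits[masked_idx]
    z = z - z.max(axis=1, keepdims=True)  # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(masked_idx)), targets[masked_idx]]
    return -picked.mean()
```
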
<p>Two model sizes were trained. Notably, the larger SELFormer uses fewer attention heads (4) but more hidden layers (12) than SELFormer-Lite (12 heads, 8 layers). This counterintuitive configuration emerged from the authors&rsquo; hyperparameter search over ~100 models, where deeper architectures with fewer heads outperformed wider, shallower ones:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>SELFormer-Lite</th>
          <th>SELFormer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Attention Heads</td>
          <td>12</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Hidden Layers</td>
          <td>8</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Batch Size</td>
          <td>16</td>
          <td>16</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>5e-5</td>
          <td>5e-5</td>
      </tr>
      <tr>
          <td>Weight Decay</td>
          <td>0.01</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>Pretraining Epochs</td>
          <td>100</td>
          <td>100</td>
      </tr>
      <tr>
          <td>Parameters</td>
          <td>58.3M</td>
          <td>86.7M</td>
      </tr>
  </tbody>
</table>
<h2 id="benchmarking-against-smiles-transformers-and-graph-models">Benchmarking Against SMILES Transformers and Graph Models</h2>
<p>SELFormer was pretrained on 2.08M drug-like compounds from ChEMBL v30 (converted from SMILES to SELFIES), then fine-tuned on nine MoleculeNet tasks. All evaluations use scaffold splitting via the Chemprop library.</p>
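<p>Scaffold splitting groups molecules by their Bemis-Murcko scaffold so that structurally related compounds never straddle the train/test boundary. A library-free sketch of the grouping logic (real pipelines such as Chemprop derive the scaffold key with RDKit; the string keys below are stand-ins):</p>

```python
from collections import defaultdict

def scaffold_split(mols, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups, largest first, to train/valid/test."""
    groups = defaultdict(list)
    for m in mols:
        groups[scaffold_of(m)].append(m)
    train, valid, test = [], [], []
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        if len(train) + len(members) <= frac_train * len(mols):
            train += members
        elif len(valid) + len(members) <= frac_valid * len(mols):
            valid += members
        else:
            test += members
    return train, valid, test

# Toy molecules tagged with a fake scaffold label (sizes 6, 2, 1, 1).
mols = [("m%d" % i,
         "scafA" if i < 6 else "scafB" if i < 8 else
         "scafC" if i < 9 else "scafD")
        for i in range(10)]
train, valid, test = scaffold_split(mols, scaffold_of=lambda m: m[1])
```

<p>Because whole groups move together, a scaffold present in the test set is guaranteed absent from training, which is what makes scaffold splits harder than random splits.</p>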
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BACE</th>
          <th>BBBP</th>
          <th>HIV</th>
          <th>Tox21</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td>0.832</td>
          <td><strong>0.902</strong></td>
          <td>0.681</td>
          <td>0.653</td>
          <td><strong>0.745</strong></td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>0.799</td>
          <td>0.728</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MolBERT</td>
          <td><strong>0.866</strong></td>
          <td>0.762</td>
          <td><strong>0.783</strong></td>
          <td>-</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.809</td>
          <td>0.710</td>
          <td>0.771</td>
          <td>0.759</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td><strong>0.890</strong></td>
          <td>0.736</td>
          <td><strong>0.806</strong></td>
          <td><strong>0.787</strong></td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.856</td>
          <td>0.724</td>
          <td><strong>0.806</strong></td>
          <td>0.781</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.855</td>
          <td><strong>0.908</strong></td>
          <td>-</td>
          <td><strong>0.848</strong></td>
          <td>0.649</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE, scaffold split, lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>PDBbind</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SELFormer</td>
          <td><strong>0.682</strong></td>
          <td>2.797</td>
          <td>0.735</td>
          <td>1.488</td>
      </tr>
      <tr>
          <td>ChemBERTa-2</td>
          <td>-</td>
          <td>-</td>
          <td>0.986</td>
          <td>-</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>1.050</td>
          <td><strong>2.082</strong></td>
          <td><strong>0.683</strong></td>
          <td><strong>1.397</strong></td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.798</td>
          <td><strong>1.877</strong></td>
          <td>0.660</td>
          <td>-</td>
      </tr>
      <tr>
          <td>KPGT</td>
          <td>0.803</td>
          <td>2.121</td>
          <td><strong>0.600</strong></td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The ablation study compared SELFormer vs. SELFormer-Lite across pretrained-only, 25-epoch, and 50-epoch fine-tuning configurations on randomly split datasets. SELFormer consistently outperformed SELFormer-Lite, confirming the benefit of the deeper (12-layer) architecture.</p>
<h2 id="strong-classification-performance-with-compact-pretraining">Strong Classification Performance with Compact Pretraining</h2>
<p>SELFormer&rsquo;s strongest results come on classification tasks where molecular substructure matters:</p>
<ul>
<li><strong>SIDER</strong>: Best overall ROC-AUC (0.745), outperforming the next best method (MolCLR at 0.652) by 9.3 percentage points. The authors attribute this to SELFIES&rsquo; ability to capture subtle structural differences relevant to drug side effects.</li>
<li><strong>BBBP</strong>: Second best (0.902), behind only KPGT (0.908). SELFormer scored 17.4 percentage points above ChemBERTa-2 (0.728) on this task.</li>
<li><strong>BACE/HIV vs. ChemBERTa-2</strong>: SELFormer outperformed ChemBERTa-2 by 3.3 points on BACE (0.832 vs 0.799), 17.4 on BBBP, and 5.9 on HIV (0.681 vs 0.622). Since both models use similar RoBERTa architectures, this comparison is suggestive of a SELFIES advantage, though differences in pretraining corpus (ChEMBL vs PubChem), corpus size, and training procedure confound a clean attribution to the input representation alone.</li>
<li><strong>ESOL regression</strong>: Best RMSE (0.682) vs GEM (0.798), a 14.5% relative improvement.</li>
</ul>
<p>Limitations are also apparent:</p>
<ul>
<li><strong>HIV and Tox21</strong>: SELFormer underperforms graph-based methods (MolCLR, GEM, KPGT) on these larger datasets. The authors attribute this to insufficient hyperparameter search given computational constraints.</li>
<li><strong>FreeSolv and Lipophilicity regression</strong>: D-MPNN and graph-based methods maintain an edge, suggesting that explicit 2D/3D structural inductive biases remain valuable for certain property types.</li>
<li><strong>Small pretraining corpus</strong>: At 2M molecules, SELFormer&rsquo;s corpus is orders of magnitude smaller than MolFormer&rsquo;s 1.1B. Despite this, SELFormer outperforms MolFormer on SIDER (0.745 vs 0.690), which the authors read as evidence of a representational advantage for SELFIES.</li>
<li><strong>Single-task ablation scope</strong>: Some architectural claims rest on limited task coverage, and broader benchmarking would strengthen the conclusions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v30</td>
          <td>2,084,725 compounds (2,084,472 after SELFIES conversion)</td>
          <td>Drug-like bioactive small molecules</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>1,513</td>
          <td><a href="https://en.wikipedia.org/wiki/Beta-secretase_1">Beta-secretase 1</a> inhibitor binding</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td><a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">Blood-brain barrier</a> permeability</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>HIV replication inhibition</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>SIDER</td>
          <td>1,427</td>
          <td>Drug side effects (27 classes)</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Tox21</td>
          <td>7,831</td>
          <td>Toxicity (12 targets)</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>Octanol/water distribution coefficient</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PDBbind</td>
          <td>11,908</td>
          <td>Binding affinity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (MLM), 15% token masking</li>
<li><strong>Tokenization</strong>: Byte-level Byte-Pair Encoding (BPE) on SELFIES strings</li>
<li><strong>SMILES to SELFIES conversion</strong>: SELFIES API with Pandaral.lel for parallelization</li>
<li><strong>Splitting</strong>: Scaffold splitting via Chemprop library (80/10/10 train/validation/test)</li>
<li><strong>Fine-tuning</strong>: Two-layer classification/regression head on encoder output; up to 200 epochs with hyperparameter search</li>
</ul>
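<p>The 15% masking step can be sketched as follows. This version replaces every selected token with a mask symbol; whether SELFormer also uses BERT's 80/10/10 corruption mix is not stated here, so that refinement is omitted:</p>

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rate=0.15, rng=None):
    """Return (corrupted, labels): labels hold the true token at masked
    positions and None elsewhere."""
    rng = rng or random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            corrupted.append(MASK)
            labels.append(tok)
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

# Toy SELFIES-style token stream.
toks = ["[C]", "[C]", "[O]", "[Branch1]", "[C]", "[N]"] * 10
corrupted, labels = mask_tokens(toks)
```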
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: RoBERTa (HuggingFace Transformers)</li>
<li><strong>SELFormer</strong>: 12 hidden layers, 4 attention heads, 86.7M parameters</li>
<li><strong>SELFormer-Lite</strong>: 8 hidden layers, 12 attention heads, 58.3M parameters</li>
<li><strong>Hyperparameter search</strong>: Sequential search over ~100 configurations on 100K molecule subset</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Area under receiver operating characteristic curve</td>
      </tr>
      <tr>
          <td>PRC-AUC</td>
          <td>Classification</td>
          <td>Area under precision-recall curve (reported for random splits)</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Root mean squared error</td>
      </tr>
  </tbody>
</table>
<p>Results reported on scaffold split and random split datasets.</p>
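<p>ROC-AUC equals the probability that a randomly chosen positive is ranked above a randomly chosen negative. A dependency-free sketch using that rank formulation (the scores below are illustrative, not model outputs):</p>

```python
def roc_auc(labels, scores):
    """Probability a positive outranks a negative; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)  # one positive (0.6) loses to one negative (0.7)
```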
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 2x NVIDIA A5000 GPUs</li>
<li><strong>Hyperparameter optimization time</strong>: ~11 days</li>
<li><strong>Full pretraining</strong>: 100 epochs on 2.08M molecules</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/HUBioDataLab/SELFormer">SELFormer GitHub</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/HUBioDataLab/SELFormer">SELFormer on HuggingFace</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretrained SELFormer weights</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v30</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source pretraining data</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Benchmark</td>
          <td>Unknown</td>
          <td>Downstream evaluation tasks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yüksel, A., Ulusoy, E., Ünlü, A., &amp; Doğan, T. (2023). SELFormer: Molecular Representation Learning via SELFIES Language Models. <em>Machine Learning: Science and Technology</em>, 4(2), 025035. <a href="https://doi.org/10.1088/2632-2153/acdb30">https://doi.org/10.1088/2632-2153/acdb30</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/HUBioDataLab/SELFormer">GitHub Repository (SELFormer)</a></li>
<li><a href="https://huggingface.co/HUBioDataLab/SELFormer">HuggingFace Model Hub (SELFormer)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yuksel2023selformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{SELFormer}: Molecular Representation Learning via {SELFIES} Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Y{\&#34;u}ksel, Atakan and Ulusoy, Erva and {\&#34;U}nl{\&#34;u}, Atabey and Do{\u{g}}an, Tunca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{025035}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/acdb30}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoLFormer: Large-Scale Chemical Language Representations</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/</guid><description>A linear-attention transformer pretrained on 1.1B SMILES from PubChem and ZINC for molecular property prediction across MoleculeNet benchmarks.</description><content:encoded><![CDATA[<h2 id="a-billion-scale-chemical-language-model">A Billion-Scale Chemical Language Model</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$).</p>
<p>MoLFormer is a transformer encoder pretrained via masked language modeling on 1.1 billion <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> and <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a>. The key architectural choices are linear attention (for $O(N)$ complexity instead of $O(N^2)$) and rotary positional embeddings (RoPE). The resulting model, MoLFormer-XL, produces molecular embeddings that outperform or match GNN baselines across a wide range of <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification and regression tasks, including quantum-chemical property prediction from SMILES alone.</p>
<h2 id="bridging-the-gap-between-molecular-languages-and-graph-neural-networks">Bridging the Gap Between Molecular Languages and Graph Neural Networks</h2>
<p>Prior chemical language models like <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> were pretrained on relatively small datasets (10M-77M molecules) and generally underperformed GNNs on molecular property prediction. The core question: does a transformer trained on a sufficiently large SMILES corpus learn enough chemical structure to compete with graph-based methods that have explicit topological inductive biases?</p>
<p>Two specific challenges motivated this work:</p>
<ul>
<li><strong>Scale</strong>: Chemical space contains an estimated $10^{60}$ to $10^{100}$ plausible molecules, yet labeled property data is scarce. Self-supervised pretraining on the ~1.1B unlabeled molecules available in public databases could provide a general-purpose representation.</li>
<li><strong>Efficiency</strong>: Standard transformer attention is $O(N^2)$ in sequence length, making billion-scale pretraining impractical without architectural modifications.</li>
</ul>
<h2 id="linear-attention-with-rotary-positional-embeddings">Linear Attention with Rotary Positional Embeddings</h2>
<p>MoLFormer&rsquo;s two key architectural choices are its attention mechanism and positional encoding scheme.</p>
<p><strong>Standard attention</strong> computes:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle) v_n}{\sum_{n=1}^{N} \exp(\langle q_m, k_n \rangle)}
$$</p>
<p>MoLFormer replaces this with <strong>linear attention</strong> using a generalized feature map $\varphi$, combined with <strong>rotary positional embeddings</strong> $R_m$ applied before the feature map:</p>
<p>$$
\text{Attention}_m(Q, K, V) = \frac{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle v_n}{\sum_{n=1}^{N} \langle \varphi(R_m q_m), \varphi(R_n k_n) \rangle}
$$</p>
<p>This differs from the original RoFormer formulation, which applies the rotation after the feature map. The authors found that rotating the raw queries and keys before projection led to faster convergence and lower validation loss. The combination of linear attention and adaptive sequence-length bucketing reduces GPU requirements from ~1000 to 16 for training on the full 1.1B corpus.</p>
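<p>The ordering detail (rotate the raw queries/keys, then apply the feature map) can be made concrete with a small numpy sketch. The elu(x)+1 feature map and the simplified rotation below are assumptions standing in for the paper's generalized feature map, but the rotate-before-$\varphi$ structure matches the formula above:</p>

```python
import numpy as np

def rope(x):
    """Simplified rotary embedding: rotate dim pairs by position-dependent angles."""
    n, d = x.shape
    half = d // 2
    ang = np.arange(n)[:, None] * (10000.0 ** (-np.arange(half) / half))[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def phi(x):
    """elu(x) + 1: a positive feature map commonly used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    # Rotation applied to raw q/k BEFORE the feature map (the paper's variant).
    qf, kf = phi(rope(q)), phi(rope(k))
    kv = kf.T @ v        # (d, d_v): shared across all queries -> O(N) overall
    z = kf.sum(axis=0)   # normalizer terms
    return (qf @ kv) / (qf @ z)[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 4))
k = rng.normal(size=(8, 4))
v = rng.normal(size=(8, 3))
out = linear_attention(q, k, v)  # shape (8, 3)
```

<p>The key point is that <code>kf.T @ v</code> and <code>kf.sum(axis=0)</code> are computed once and reused for every query, which is what removes the $O(N^2)$ pairwise term.</p>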
<p>The model uses masked language modeling (15% token masking, following BERT conventions) with a vocabulary of 2,362 SMILES tokens. Sequence length is capped at 202 tokens, covering 99.4% of all molecules.</p>
<h2 id="broad-moleculenet-benchmarking-with-scaling-ablations">Broad MoleculeNet Benchmarking with Scaling Ablations</h2>
<p>MoLFormer-XL was evaluated on 11 MoleculeNet tasks against supervised GNNs, self-supervised GNNs, and prior language models.</p>
<p><strong>Classification tasks</strong> (ROC-AUC, scaffold split; values reported as percentages in the original paper, converted to proportions here for consistency):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>SIDER</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.847</strong></td>
          <td><strong>0.948</strong></td>
          <td>0.822</td>
          <td>0.882</td>
          <td><strong>0.690</strong></td>
      </tr>
      <tr>
          <td>N-Gram</td>
          <td>0.912</td>
          <td>0.769</td>
          <td>0.855</td>
          <td>0.830</td>
          <td>0.876</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>MolCLR</td>
          <td>0.736</td>
          <td>0.798</td>
          <td>0.932</td>
          <td>0.806</td>
          <td><strong>0.890</strong></td>
          <td>0.680</td>
      </tr>
      <tr>
          <td>GEM</td>
          <td>0.724</td>
          <td>0.781</td>
          <td>0.901</td>
          <td>0.806</td>
          <td>0.856</td>
          <td>0.672</td>
      </tr>
      <tr>
          <td>Hu et al.</td>
          <td>0.708</td>
          <td>0.787</td>
          <td>0.789</td>
          <td>0.802</td>
          <td>0.859</td>
          <td>0.652</td>
      </tr>
      <tr>
          <td>GeomGCL</td>
          <td>-</td>
          <td>0.850</td>
          <td>0.919</td>
          <td>-</td>
          <td>-</td>
          <td>0.648</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>0.643</td>
          <td>-</td>
          <td>0.906</td>
          <td>0.622</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Regression tasks</strong> (RMSE for ESOL/FreeSolv/Lipophilicity, avg MAE for QM9/QM8):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>QM9</th>
          <th>QM8</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoLFormer-XL</td>
          <td><strong>1.5894</strong></td>
          <td><strong>0.0102</strong></td>
          <td><strong>0.2787</strong></td>
          <td><strong>0.2308</strong></td>
          <td><strong>0.5289</strong></td>
      </tr>
      <tr>
          <td>A-FP</td>
          <td>2.6355</td>
          <td>0.0282</td>
          <td>0.5030</td>
          <td>0.736</td>
          <td>0.578</td>
      </tr>
      <tr>
          <td>MPNN</td>
          <td>3.1898</td>
          <td>0.0143</td>
          <td>0.58</td>
          <td>1.150</td>
          <td>0.7190</td>
      </tr>
      <tr>
          <td>GC</td>
          <td>4.3536</td>
          <td>0.0148</td>
          <td>0.970</td>
          <td>1.40</td>
          <td>0.655</td>
      </tr>
  </tbody>
</table>
<p>MoLFormer-XL also outperforms geometry-aware GNNs (DimeNet, GeomGCL, GEM) on ESOL (0.279 vs 0.575), FreeSolv (0.231 vs 0.866), and Lipophilicity (0.529 vs 0.541).</p>
<p><strong>Key ablation findings</strong>:</p>
<ul>
<li><strong>Data scale matters</strong>: Performance improves monotonically from 10% subsets through the full 1.1B corpus. Training on 100% ZINC alone performed worst, likely due to its smaller vocabulary and less diverse molecule lengths.</li>
<li><strong>Model depth matters</strong>: MoLFormer-Base (6 layers) underperforms MoLFormer-XL (12 layers) on most tasks.</li>
<li><strong>Fine-tuning &raquo; frozen</strong>: Fine-tuning the full encoder consistently outperforms using frozen embeddings with a downstream classifier.</li>
<li><strong>Rotary &gt; absolute at scale</strong>: Rotary embeddings underperform absolute embeddings on smaller pretraining sets but overtake them once the corpus exceeds 1B molecules.</li>
</ul>
<h2 id="smiles-transformers-learn-molecular-geometry">SMILES Transformers Learn Molecular Geometry</h2>
<p>The most striking finding is that MoLFormer&rsquo;s attention patterns correlate with 3D interatomic distances, despite training only on 1D SMILES strings.</p>
<p>Using QM9 molecules with known 3D geometries, the authors computed cosine similarity between attention maps and spatial distance matrices across three distance categories:</p>
<table>
  <thead>
      <tr>
          <th>Distance Category</th>
          <th>Range</th>
          <th>Linear Attention (Rotary)</th>
          <th>Full Attention (Rotary)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Short</td>
          <td>$\leq$ 2 Å</td>
          <td>0.594-0.602</td>
          <td>0.598-0.615</td>
      </tr>
      <tr>
          <td>Medium</td>
          <td>2-4 Å</td>
          <td>0.724-0.730</td>
          <td>0.716-0.727</td>
      </tr>
      <tr>
          <td>Long</td>
          <td>4-10 Å</td>
          <td>0.209-0.211</td>
          <td>0.204-0.210</td>
      </tr>
  </tbody>
</table>
<p>The strong correlation in the short and medium categories indicates the model captures covalent bond connectivity and near-neighbor spatial relationships. Linear attention shows marginally higher cosine similarity than full attention on medium-range distances (0.724-0.730 vs 0.716-0.727), though the differences are small.</p>
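<p>The reported numbers are cosine similarities between flattened attention maps and distance information restricted to a band. A minimal sketch of one way to realize that comparison (toy matrices, not QM9 data; the paper's exact binning may differ):</p>

```python
import numpy as np

def band_cosine(attn, dist, lo, hi):
    """Cosine similarity between a flattened attention map and the indicator
    of a distance band (entries with lo < d <= hi)."""
    band = ((dist > lo) & (dist <= hi)).astype(float)
    a, b = attn.ravel(), band.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-atom molecule: symmetric pairwise distances in angstroms.
dist = np.array([[0.0, 1.5, 2.5, 4.5],
                 [1.5, 0.0, 1.4, 3.0],
                 [2.5, 1.4, 0.0, 1.6],
                 [4.5, 3.0, 1.6, 0.0]])
attn = np.full((4, 4), 0.25)               # uniform attention baseline
short = band_cosine(attn, dist, 0.0, 2.0)  # covalent-range band
```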
<p>MoLFormer-XL embeddings also correlate more strongly with molecular fingerprint similarity (0.64 vs 0.48 for ChemBERTa) and maximum common subgraph size (-0.60 vs -0.44), confirming that the representations encode structural information.</p>
<p><strong>Limitations</strong>:</p>
<ul>
<li><strong>Quantum-chemical energies</strong>: SchNet and DimeNet (which encode explicit 3D geometry) outperform MoLFormer-XL on QM9 atomization energy tasks, with DimeNet achieving roughly 10x lower MAE on U0_atom (0.008 vs 0.083 eV). 3D information remains important for these properties.</li>
<li><strong>Sequence length cap</strong>: The 202-token limit excludes 0.6% of molecules, potentially limiting applicability to larger structures.</li>
<li><strong>SMILES canonicalization</strong>: The model depends on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> canonical SMILES; sensitivity to non-canonical forms is not evaluated.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>PubChem</td>
          <td>111M molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC</td>
          <td>~1B molecules</td>
          <td>Canonical SMILES via RDKit</td>
      </tr>
      <tr>
          <td>Pretraining (combined)</td>
          <td>PubChem + ZINC</td>
          <td>~1.1B molecules</td>
          <td>MoLFormer-XL training set</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP, Tox21, ClinTox, HIV, BACE, SIDER</td>
          <td>1,427-41,127</td>
          <td>MoleculeNet scaffold splits</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QM9, QM8, ESOL, FreeSolv, Lipophilicity</td>
          <td>642-133,885</td>
          <td>MoleculeNet random splits (QM9/QM8), scaffold (others)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining objective</strong>: Masked language modeling (15% selection: 80% masked, 10% random, 10% unchanged)</li>
<li><strong>Tokenization</strong>: SMILES tokenizer from Schwaller et al., vocabulary of 2,362 tokens</li>
<li><strong>Sequence length</strong>: 1-202 tokens (99.4% coverage)</li>
<li><strong>Optimizer</strong>: Fused LAMB (via APEX), chosen for stability with large batch sizes and no need for learning rate warm-up</li>
<li><strong>Adaptive bucketing</strong>: Sequences grouped by length into buckets to minimize padding waste</li>
</ul>
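<p>The bucketing idea above can be sketched as sorting sequences by length and batching neighbors, so each batch pads only to its local maximum rather than the global cap (the bucket granularity here is illustrative):</p>

```python
def bucket_batches(seqs, batch_size):
    """Group length-sorted sequences into batches to minimize padding."""
    ordered = sorted(seqs, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        width = len(batch[-1])  # local max length within this batch
        batches.append([s + ["<pad>"] * (width - len(s)) for s in batch])
    return batches

# Toy token sequences of lengths 3, 10, 4, 9, 2, 8.
seqs = [["C"] * n for n in (3, 10, 4, 9, 2, 8)]
batches = bucket_batches(seqs, batch_size=3)
```

<p>Here the short sequences pad to width 4 instead of 10, so padding work scales with each batch's own lengths rather than the longest molecule in the corpus.</p>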
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Transformer encoder with linear attention and rotary positional embeddings</li>
<li><strong>MoLFormer-XL</strong>: 12 layers, 12 attention heads, hidden size 768</li>
<li><strong>MoLFormer-Base</strong>: 6 layers (ablation only)</li>
<li><strong>Feature map size</strong>: 32 (generalized feature map for linear attention)</li>
<li><strong>Frozen head</strong>: Fully connected model with hyperparameter sweep (learning rate, batch size, hidden dim, number of layers)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Scaffold splits per MoleculeNet</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (ESOL, FreeSolv, Lipophilicity)</td>
          <td>Scaffold splits</td>
      </tr>
      <tr>
          <td>Avg MAE</td>
          <td>Regression (QM9, QM8)</td>
          <td>Random splits per MoleculeNet</td>
      </tr>
  </tbody>
</table>
<p>QM9 results also reported with 5-fold cross-validation for robustness.</p>
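<p>Both regression metrics are simple aggregates; a dependency-free sketch with toy values (the averaged-MAE form mirrors how multi-target QM9/QM8 scores are summarized):</p>

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over one target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def avg_mae(y_true_cols, y_pred_cols):
    """Mean absolute error per target, averaged across targets."""
    maes = [sum(abs(t - p) for t, p in zip(tc, pc)) / len(tc)
            for tc, pc in zip(y_true_cols, y_pred_cols)]
    return sum(maes) / len(maes)

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # errors 0, 0, 2
```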
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: GPU cluster with nodes containing either 8 NVIDIA Tesla V100 (32GB) or 8 Ampere A100 (40GB) GPUs connected via NVLink and InfiniBand</li>
<li><strong>GPU reduction</strong>: Linear attention + bucketing reduced GPU requirements from ~1000 to 16</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/molformer">IBM/molformer</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Pretraining, fine-tuning, and attention visualization</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">MoLFormer-XL (HuggingFace)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>Pretrained weights (46.8M parameters)</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>111M molecules</td>
      </tr>
      <tr>
          <td><a href="https://zinc.docking.org/">ZINC</a></td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>~1B molecules</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Chenthamarakshan, V., Padhi, I., Mroueh, Y., &amp; Das, P. (2022). Large-Scale Chemical Language Representations Capture Molecular Structure and Properties. <em>Nature Machine Intelligence</em>, 4, 1256-1264. <a href="https://doi.org/10.1038/s42256-022-00580-7">https://doi.org/10.1038/s42256-022-00580-7</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/molformer">GitHub Repository (MoLFormer)</a></li>
<li><a href="https://huggingface.co/ibm/MoLFormer-XL-both-10pct">HuggingFace Models</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2022molformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Large-Scale Chemical Language Representations Capture Molecular Structure and Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Chenthamarakshan, Vijil and Padhi, Inkit and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1256--1264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00580-7}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-3: Open Source Chemical Foundation Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-3/</link><pubDate>Fri, 26 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-3/</guid><description>An open-source framework integrating DeepChem and Ray for training and benchmarking chemical foundation models like MoLFormer and GROVER at scale.</description><content:encoded><![CDATA[<h2 id="core-contribution-an-open-source-framework">Core Contribution: An Open-Source Framework</h2>
<p>This is primarily a <strong>Resource ($\Psi_{\text{Resource}}$)</strong> paper, with secondary <strong>Method ($\Psi_{\text{Method}}$)</strong> contributions.</p>
<ul>
<li><strong>Resource Basis</strong>: The core contribution is &ldquo;ChemBERTa-3,&rdquo; an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.</li>
<li><strong>Method Basis</strong>: It trains models like &ldquo;c3-MoLFormer&rdquo; to reproduce and validate the infrastructure.</li>
</ul>
<h2 id="the-pretraining-scalability-challenge">The Pretraining Scalability Challenge</h2>
<ul>
<li><strong>Scalability Challenges</strong>: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.</li>
<li><strong>Proprietary Barriers</strong>: Many high-performing chemical foundation models (e.g., the full <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer-XL</a>) are partially closed-source or difficult to reproduce.</li>
<li><strong>Benchmarking Inconsistencies</strong>: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results produced with differing scaffold splitting algorithms, making those comparisons unreliable.</li>
</ul>
<h2 id="unified-infrastructure--standardized-benchmarking">Unified Infrastructure &amp; Standardized Benchmarking</h2>
<ul>
<li><strong>Unified Infrastructure</strong>: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.</li>
<li><strong>Standardized Benchmarking</strong>: Identification that MoLFormer&rsquo;s scaffold splitting algorithm differs from the standard DeepChem/<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> splitter, and the subsequent standardization of these benchmarks for fair comparison.</li>
<li><strong>New DeepChem Tools</strong>: Introduction of the <code>ModularTorchModel</code> class for flexible loss computation and <code>HuggingFaceModel</code> wrappers to bridge ecosystems.</li>
</ul>
<h2 id="benchmarking-transformers-vs-graph-models">Benchmarking Transformers vs. Graph Models</h2>
<ul>
<li><strong>Architecture Comparison</strong>: Benchmarked Transformers (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a>) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).</li>
<li><strong>Pretraining Scale Disparity</strong>:
<ul>
<li>Transformers were pretrained on ZINC20 subsets ranging from 10M to 1.1B molecules (combining ZINC and PubChem).</li>
<li>Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.</li>
</ul>
</li>
<li><strong>Reproducibility Validation</strong>: Trained &ldquo;c3-MoLFormer&rdquo; (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.</li>
<li><strong>Scaffold Split Analysis</strong>: Compared performance metrics using &ldquo;DeepChem scaffold splits&rdquo; vs. &ldquo;MoLFormer scaffold splits&rdquo; to quantify the impact of data leakage/overlap.</li>
</ul>
<h2 id="overcoming-scaffold-splitting-inconsistencies">Overcoming Scaffold Splitting Inconsistencies</h2>
<ul>
<li><strong>Scaling Transformers vs. Graphs</strong>: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.</li>
<li><strong>Benchmarking sensitivity</strong>: MoLFormer&rsquo;s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap between train and test sets (yielding a lower <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a>, generally quantified via $1 - \frac{|A \cap B|}{|A \cup B|}$) than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.</li>
<li><strong>Infrastructure Viability</strong>: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.</li>
<li><strong>Open Source Release</strong>: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.</li>
</ul>
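<p>As a concrete illustration (not code from the paper), the Tanimoto distance used to quantify train/test structural overlap reduces to a one-line set computation. A real pipeline would compare RDKit fingerprint bit sets; plain Python sets capture the same arithmetic:</p>

```python
def tanimoto_distance(a: set, b: set) -> float:
    """Tanimoto (Jaccard) distance: 1 - |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Toy "fingerprints" as sets of substructure identifiers.
train_fp = {1, 2, 3, 4}
test_fp_close = {1, 2, 3, 5}   # high overlap -> low distance
test_fp_far = {7, 8, 9}        # no overlap  -> maximal distance

print(tanimoto_distance(train_fp, test_fp_close))  # 0.4
print(tanimoto_distance(train_fp, test_fp_far))    # 1.0
```

<p>Splits with lower average train–test Tanimoto distance leak more structural information, which is the paper&rsquo;s explanation for MoLFormer&rsquo;s inflated baseline gap.</p>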
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pretraining</strong>:
<ul>
<li><strong>Source</strong>: <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC20</a> (1.4B compounds) and PubChem.</li>
<li><strong>Scale</strong>: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.</li>
</ul>
</li>
<li><strong>Fine-tuning</strong>:
<ul>
<li><strong>Suite</strong>: MoleculeNet.</li>
<li><strong>Tasks</strong>: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).</li>
<li><strong>Splits</strong>: Critical distinction made between &ldquo;DeepChem scaffold splits&rdquo; (80/10/10) and &ldquo;MoLFormer scaffold splits&rdquo; (which can be downloaded from <a href="https://ibm.ent.box.com/v/MoLFormer-data"><code>https://ibm.ent.box.com/v/MoLFormer-data</code></a>). The paper notes these algorithms differ.</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (<code>pip install --pre deepchem</code>) and specific dependencies found within the <code>requirements.txt</code>. Pretraining scripts are available in the <code>chemberta3_benchmarking/pretraining</code> directory of the repository.</li>
<li><strong>Data Preparation</strong>: Featurization workflows (e.g., <code>CircularFingerprint</code>, <code>RDKitConformer</code>) are documented under <code>chemberta3_benchmarking/data/data_preprocessing/</code> in the codebase.</li>
<li><strong>Modular Training</strong>: Uses <code>ModularTorchModel</code> to allow loss computation from intermediate values and flexible component connection.</li>
<li><strong>Training Brittleness</strong>:
<ul>
<li><strong>Optimizer</strong>: Linear learning rate scheduler with warmup.</li>
<li><strong>Instability Handling</strong>: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.</li>
<li><strong>Numerical Issues</strong>: Addressed NaN values by pretraining on a small dataset with low LR before scaling up.</li>
</ul>
</li>
</ul>
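<p>The checkpoint-and-restart mitigation described above can be sketched as a generic training-loop wrapper. This is a hypothetical illustration, not the ChemBERTa-3 code; <code>loss_fn</code>, <code>spike_factor</code>, and the integer &ldquo;state&rdquo; are invented stand-ins for a real optimizer step and model checkpoint:</p>

```python
def train_with_spike_recovery(steps, loss_fn, spike_factor=2.0):
    """Checkpoint every accepted step; on a loss spike, roll back to the
    last stable state and retry. `loss_fn(state)` advances one step and
    returns (new_state, loss). All names are illustrative only."""
    state, checkpoint = 0, 0
    prev_loss = float("inf")
    for _ in range(steps):
        new_state, loss = loss_fn(state)
        if loss > spike_factor * prev_loss:
            state = checkpoint  # restart from last stable checkpoint
            continue
        state, checkpoint, prev_loss = new_state, new_state, loss
    return state

# Simulated loss trace with one spike at the third step; the retry succeeds.
losses = iter([1.0, 0.9, 5.0, 0.85, 0.8])
final_state = train_with_spike_recovery(5, lambda s: (s + 1, next(losses)))
print(final_state)  # 4
```

<p>In practice the &ldquo;retry&rdquo; differs because data shuffling and optimizer state change between attempts, which is why the authors describe the behavior as persistent brittleness rather than a solved problem.</p>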
<h3 id="models">Models</h3>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a></strong>: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM"><code>DeepChem/ChemBERTa-100M-MLM</code></a>) are hosted on Hugging Face so researchers can pull them directly via the <code>transformers</code> library. The core pretraining objective minimized the standard MLM loss:
$$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$
where $\mathcal{M}$ represents the set of masked SMILES token indices, and $\hat{y}_{i}$ is the model&rsquo;s predicted probability for the correct token given the corrupted sequence context.</li>
<li><strong>MoLFormer (c3-MoLFormer)</strong>: Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., <a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B"><code>DeepChem/MoLFormer-c3-1.1B</code></a>) are similarly available on Hugging Face.
<ul>
<li>Tokenizer: <code>ibm/MoLFormer-XL-both-10pct</code> tokenizer.</li>
</ul>
</li>
<li><strong>Graph Models</strong>:
<ul>
<li><strong>GROVER</strong>: Graph Transformer with node/edge/graph level self-supervision.</li>
<li><strong>InfoGraph</strong>: Maximizes mutual information between graph-level and substructure representations.</li>
<li><strong>InfoMax3D</strong>: Incorporates 3D conformer data (via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> ETKDGv2) into contrastive pretraining.</li>
<li><strong>DMPNN</strong>: Directed Message Passing Neural Network (Chemprop variant).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics</strong>: <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> for classification; RMSE for regression (MAE for QM9).</li>
<li><strong>Baselines</strong>: Random Forest, GCN, DMPNN trained on fine-tuning splits only.</li>
<li><strong>Protocol</strong>: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under <code>chemberta3_benchmarking/models_benchmarking/</code> and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.</li>
<li><strong>Key Results</strong>:
<ul>
<li><em>c3-MoLFormer-1.1B</em> achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.</li>
<li>When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Cloud (AWS)</strong>:
<ul>
<li><strong>Compute</strong>: 40 NVIDIA T4 GPUs (<code>g4dn.12xlarge</code> spot instances for pretraining, <code>g4dn.2xlarge</code> for benchmarking).</li>
<li><strong>Cost</strong>: ~$4000 for MoLFormer 1.1B pretraining.</li>
<li><strong>Time</strong>: ~10 days (260 hours) for 1.1B model pretraining.</li>
<li><strong>Setup</strong>: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository&rsquo;s <code>infra/</code> and <code>spot/</code> folders.</li>
</ul>
</li>
<li><strong>On-Premise HPC</strong>:
<ul>
<li><strong>Compute</strong>: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.</li>
<li><strong>Environment</strong>: Ray multi-node multi-GPU framework.</li>
</ul>
</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training, fine-tuning, and benchmarking framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-1.1B">DeepChem/MoLFormer-c3-1.1B</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer re-implementation pretrained on 1.1B molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/ChemBERTa-100M-MLM">DeepChem/ChemBERTa-100M-MLM</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>ChemBERTa pretrained on 100M ZINC molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-100M">DeepChem/MoLFormer-c3-100M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 100M molecules</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/DeepChem/MoLFormer-c3-550M">DeepChem/MoLFormer-c3-550M</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>MoLFormer pretrained on 550M molecules</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. <em>Digital Discovery</em>, 5, 662-685. <a href="https://doi.org/10.1039/D5DD00348B">https://doi.org/10.1039/D5DD00348B</a></p>
<p><strong>Publication</strong>: Digital Discovery 2026</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/deepforestsci/chemberta3">ChemBERTa-3 GitHub Repository</a></li>
<li><a href="https://deepchem.io/">DeepChem Project</a></li>
<li><a href="https://huggingface.co/DeepChem">DeepChem Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{singhChemBERTa3OpenSource2026,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-3}}: an open source training framework for chemical foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2026}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{662-685}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{The Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D5DD00348B}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span> = <span style="color:#e6db74">{https://doi.org/10.1039/D5DD00348B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa-2: Scaling Molecular Transformers to 77M</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/</guid><description>Optimizing transformer pretraining for molecules using MLM vs MTR objectives, scaling to 77M compounds from PubChem for improved property prediction.</description><content:encoded><![CDATA[<h2 id="classifying-chemberta-2s-methodological-contributions">Classifying ChemBERTa-2&rsquo;s Methodological Contributions</h2>
<p>This is primarily a <strong>Methodological</strong> paper with a secondary <strong>Resource</strong> contribution.</p>
<p>It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine &ldquo;how well&rdquo; these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.</p>
<p><strong>Key methodological indicators</strong>:</p>
<ul>
<li><strong>Baseline comparison</strong>: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa-1</a>) with prominent benchmark tables</li>
<li><strong>Ablation studies</strong>: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size</li>
<li><strong>Scaling analysis</strong>: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance</li>
</ul>
<h2 id="motivations-for-scaling-molecular-transformers">Motivations for Scaling Molecular Transformers</h2>
<p>The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a &ldquo;chemical foundation model&rdquo;.</p>
<p><strong>Key motivations</strong>:</p>
<ul>
<li><strong>Label scarcity</strong>: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant</li>
<li><strong>Scaling hypothesis</strong>: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP</li>
<li><strong>Efficiency</strong>: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and weakly supervised (MTR, using <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> computed properties as labels) approaches</li>
</ul>
<h2 id="novelty-in-multi-task-regression-objectives">Novelty in Multi-Task Regression Objectives</h2>
<p><strong>Scale</strong>: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a>).</p>
<p><strong>Pipeline optimization</strong>: A direct, controlled comparison of <strong>Masked Language Modeling (MLM)</strong> vs. <strong>Multi-Task Regression (MTR)</strong> pretraining objectives on identical datasets.</p>
<p><strong>Proxy selection</strong>: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.</p>
<h2 id="experimental-pretraining-setup-on-77m-compounds">Experimental Pretraining Setup on 77M Compounds</h2>
<h3 id="pretraining-setup">Pretraining Setup</h3>
<p><strong>Datasets</strong>: Subsets of <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> containing 5M, 10M, and 77M unique SMILES.</p>
<p><strong>Tasks</strong>:</p>
<ul>
<li><strong>MLM</strong>: Masking 15% of tokens (following RoBERTa procedure). The model is optimized by minimizing the cross-entropy loss over the predicted masked tokens:
$$ \mathcal{L}_{MLM} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}) $$
where $\mathcal{M}$ represents the set of masked token indices.</li>
<li><strong>MTR</strong>: Predicting 200 calculated molecular properties (via RDKit) simultaneously using a mean squared error objective:
$$ \mathcal{L}_{MTR} = \frac{1}{200} \sum_{j=1}^{200} \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_{ij} - y_{ij} \right)^2 $$
Continuous target labels $y_{ij}$ are mean-normalized prior to training to equilibrate the disparate scales of different chemical properties.</li>
</ul>
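<p>Both objectives reduce to simple averages, which a stdlib-only sketch with toy numbers (not the actual training code, which operates on RoBERTa logits) makes concrete:</p>

```python
import math

def mlm_loss(predicted_probs):
    """Cross-entropy over masked positions: mean of -log P(correct token)."""
    return -sum(math.log(p) for p in predicted_probs) / len(predicted_probs)

def mtr_loss(preds, targets):
    """MSE averaged over all properties and molecules.

    preds/targets: one list per molecule of (mean-normalized) property values.
    """
    n_mols, n_props = len(preds), len(preds[0])
    total = sum((p - t) ** 2
                for mol_p, mol_t in zip(preds, targets)
                for p, t in zip(mol_p, mol_t))
    return total / (n_mols * n_props)

# Two masked tokens, each predicted with probability 0.5 -> loss = ln 2.
print(mlm_loss([0.5, 0.5]))                  # ≈ 0.6931
# One molecule, two properties, errors of 1 and 0 -> MSE = 0.5.
print(mtr_loss([[1.0, 2.0]], [[0.0, 2.0]]))  # 0.5
```

<p>The mean-normalization of MTR targets matters because the 200 RDKit properties span wildly different numeric scales; without it, a few large-magnitude properties would dominate the summed squared error.</p>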
<p><strong>Hyperparameter search</strong>: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.</p>
<h3 id="downstream-validation">Downstream Validation</h3>
<p><strong>Finetuning</strong>: Evaluated on 8 tasks from <strong><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).</p>
<p><strong>Analysis</strong>: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.</p>
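<p>A scaffold split assigns whole scaffold groups, never individual molecules, to train/valid/test. The sketch below assumes Bemis-Murcko scaffold strings are already computed (in practice via RDKit inside DeepChem&rsquo;s <code>ScaffoldSplitter</code>); the greedy largest-group-first fill is a simplification of DeepChem&rsquo;s behavior, not its exact algorithm:</p>

```python
from collections import defaultdict

def scaffold_split(smiles, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test, largest groups first.

    `scaffolds[i]` is the Bemis-Murcko scaffold string for `smiles[i]`,
    assumed precomputed here. Returns index lists for each partition.
    """
    groups = defaultdict(list)
    for i, scaf in enumerate(scaffolds):
        groups[scaf].append(i)
    # Fill train with the largest scaffold groups, then valid, then test.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(smiles)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test

# Ten molecules across four scaffolds -> an 8/1/1 split.
mols = [f"m{i}" for i in range(10)]
scafs = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = scaffold_split(mols, scafs)
print(len(train), len(valid), len(test))  # 8 1 1
```

<p>Because every molecule sharing a scaffold lands in the same partition, the test set contains only unseen scaffolds, which is what makes scaffold splits a harder (and fairer) generalization test than random splits.</p>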
<h2 id="key-performance-outcomes-and-scaling-realities">Key Performance Outcomes and Scaling Realities</h2>
<p><strong>Highly competitive performance</strong>: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks, though the margins demonstrate that task-specific baselines remain notably robust.</p>
<p><strong>MTR superiority</strong>: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM on every finetuning task evaluated. MTR is substantially slower than MLM due to the larger input size from the 200-element label vector, but MLM loss serves as a reliable proxy for MTR loss, enabling cheaper architecture search before committing to full MTR pretraining.</p>
<p><strong>Scaling laws versus downstream utility</strong>: Pretraining loss improved by 25-35% when increasing the dataset from 5M to 77M compounds. However, this improvement in pretraining loss does not uniformly transfer to downstream tasks. For MTR models, SR-p53 ROC-AUC decreases monotonically from 0.834 (5M) to 0.827 (10M) to 0.817 (77M), and Lipophilicity RMSE is worse at 77M (0.798) than at 5M (0.758), despite a dip at 10M (0.744). This variability in transfer challenges the assumption that pretraining improvements always yield downstream gains.</p>
<p><strong>Transfer learning</strong>: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The pretraining corpus is derived from <strong>PubChem</strong>.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Pretraining</strong></td>
          <td>PubChem</td>
          <td>77M SMILES</td>
          <td>Canonicalized and globally shuffled. Subsets of 5M and 10M used. <strong>Note: Exact splits and datasets are not published.</strong></td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>PubChem</td>
          <td>100k SMILES</td>
          <td>A fixed set held out from the 77M corpus. <strong>Note: Exact 100k subset is not published.</strong></td>
      </tr>
      <tr>
          <td><strong>MTR Labels</strong></td>
          <td>RDKit</td>
          <td>200 props</td>
          <td>200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. <strong>Note: Calculated labels are not published and must be re-computed.</strong></td>
      </tr>
      <tr>
          <td><strong>Finetuning</strong></td>
          <td>MoleculeNet</td>
          <td>1.5k - 8k</td>
          <td>Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pretraining Objectives:</strong></p>
<ol>
<li><strong>Masked Language Modeling (MLM)</strong>: Follows RoBERTa procedure. Masks 15% of tokens. Max sequence length 512.</li>
<li><strong>Multi-Task Regression (MTR)</strong>: Predicting 200 RDKit properties. Labels are mean-normalized.</li>
</ol>
<p><strong>Tokenizer:</strong></p>
<ul>
<li>Dictionary of common SMILES characters</li>
<li>Maximum vocabulary size: <strong>591 tokens</strong></li>
</ul>
<p><strong>Optimization:</strong></p>
<ul>
<li><strong>Patience</strong>: Early stopping set to one pass through the dataset to ensure full coverage</li>
<li><strong>Hyperparameter search</strong>: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate. <strong>Note: The precise configuration of the winning models that were scaled to 77M is absent from the paper.</strong></li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Based on <strong>RoBERTa</strong> (HuggingFace implementation)</li>
<li><strong>Parameter scale</strong>: Models ranged between <strong>5M and 46M parameters</strong></li>
<li><strong>Selection</strong>: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset</li>
<li><strong>Checkpoints</strong>: Pre-trained weights are hosted by DeepChem on <a href="https://huggingface.co/DeepChem">Hugging Face</a>. Direct links include <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MTR">DeepChem/ChemBERTa-77M-MTR</a> and <a href="https://huggingface.co/DeepChem/ChemBERTa-77M-MLM">DeepChem/ChemBERTa-77M-MLM</a> (Note: Model cards are currently empty).</li>
<li><strong>Code Reference</strong>: While the <a href="https://github.com/deepchem/deepchem">DeepChem</a> repository is referenced for code, isolated training scripts tailored to recreate ChemBERTa-2&rsquo;s exact pipeline are not separated from the generalized deepchem library tooling.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Benchmarks were performed on <strong>MoleculeNet</strong> using DeepChem.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Tasks</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>RMSE</strong> ($\downarrow$)</td>
          <td>Delaney, Lipo, BACE (Reg), Clearance</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.8).</td>
      </tr>
      <tr>
          <td><strong>ROC-AUC</strong> ($\uparrow$)</td>
          <td>BBBP, ClinTox, HIV, Tox21, BACE (Cls)</td>
          <td>D-MPNN</td>
          <td>ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: AWS EC2 instances with <strong>Nvidia T4 GPUs</strong></li>
<li><strong>Strategy</strong>: AWS Spot instances were used to reduce cost; implemented frequent checkpointing to handle interruptions.</li>
<li><strong>Note</strong>: For MTR, they wrote a custom data loader wrapper around HuggingFace&rsquo;s text loader to handle CSV parsing efficiency, as the default CSV loader was a major bottleneck for the 200-element target vectors.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. <em>arXiv preprint arXiv:2209.01712</em>. <a href="https://doi.org/10.48550/arXiv.2209.01712">https://doi.org/10.48550/arXiv.2209.01712</a></p>
<p><strong>Publication</strong>: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa-1 Paper</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{ahmadChemBERTa2ChemicalFoundation2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa-2}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = sep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2209.01712}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemBERTa: Molecular Property Prediction via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/</guid><description>A systematic evaluation of RoBERTa transformers pretrained on 77M PubChem SMILES for molecular property prediction tasks.</description><content:encoded><![CDATA[<h2 id="taxonomy-and-paper-contributions">Taxonomy and Paper Contributions</h2>
<p>This is primarily a <strong>Method</strong> paper ($\Psi_{\text{Method}}$), with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>It is a methodological investigation because it systematically evaluates a specific architecture (Transformers/RoBERTa) against established State-of-the-Art (SOTA) baselines like directed Message Passing Neural Networks (D-MPNNs) to determine &ldquo;how well does this work?&rdquo; in the chemical domain. It ablates dataset size, tokenization, and input representation.</p>
<p>It is also a resource paper as it introduces &ldquo;PubChem-77M,&rdquo; a curated dataset of 77 million SMILES strings designed to facilitate large-scale self-supervised pretraining for the community.</p>
<h2 id="overcoming-data-scarcity-in-property-prediction">Overcoming Data Scarcity in Property Prediction</h2>
<p>The primary motivation is <strong>data scarcity</strong> in molecular property prediction. Graph Neural Networks (GNNs) achieve strong performance on property prediction tasks when provided with sufficient labeled data. Generating these labels requires costly and time-consuming laboratory testing, leading to severe data scarcity in specialized chemical domains.</p>
<p>Massive quantities of <strong>unlabeled chemical structure data</strong> exist in the form of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. Inspired by the success of Transformers in NLP, where self-supervised pretraining on large corpora yields strong transfer learning, the authors aim to use these unlabeled datasets to learn effective molecular representations. Additionally, Transformers benefit from a mature software ecosystem (HuggingFace) that offers efficiency advantages over GNNs.</p>
<h2 id="pretraining-scaling-laws-and-novelty">Pretraining Scaling Laws and Novelty</h2>
<p>Previous works applied Transformers to SMILES strings. This paper advances the field by systematically evaluating scaling laws and architectural components for this domain. Specifically:</p>
<ul>
<li><strong>Scaling Analysis</strong>: It explicitly tests how pretraining dataset size (100K to 10M) impacts downstream performance.</li>
<li><strong>Tokenizer Comparison</strong>: It compares standard NLP <a href="https://en.wikipedia.org/wiki/Byte-pair_encoding">Byte-Pair Encoding (BPE)</a> against a chemically-aware &ldquo;SmilesTokenizer&rdquo;.</li>
<li><strong>Representation Comparison</strong>: It evaluates if the robust <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> string representation offers advantages over standard SMILES in a Transformer context.</li>
</ul>
<h2 id="experimental-setup-pretraining-and-finetuning">Experimental Setup: Pretraining and Finetuning</h2>
<p>The authors trained <strong>ChemBERTa</strong> (based on RoBERTa) using Masked Language Modeling (MLM) on subsets of the <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> dataset. The core training objective minimizes the cross-entropy loss over a corrupted input where a subset of basic tokens, denoted by $\mathcal{M}$, are masked:</p>
<p>$$
\mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}}; \theta)
$$</p>
<p>where $x_i$ is the original token at masked position $i$, $x_{\setminus \mathcal{M}}$ is the corrupted SMILES context with the positions in $\mathcal{M}$ masked out, and $\theta$ denotes the network parameters.</p>
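<p>As a concrete (toy) illustration of this average: given the model&rsquo;s predicted probability for the original token at each masked position, the loss is simply the mean negative log-probability over $\mathcal{M}$. A minimal stdlib-only sketch; the positions and probabilities below are invented for illustration, not taken from the paper:</p>

```python
import math

# Hypothetical probabilities P(x_i | x_{\M}; theta) that the model assigns
# to the ORIGINAL token at each masked position of one SMILES string.
masked_token_probs = {3: 0.70, 8: 0.25, 14: 0.90}  # position -> P(correct token)

# L_MLM = -(1/|M|) * sum_{i in M} log P(x_i | x_{\M}; theta)
mlm_loss = -sum(math.log(p) for p in masked_token_probs.values()) / len(masked_token_probs)

print(round(mlm_loss, 4))
```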
<ul>
<li><strong>Pretraining</strong>: Models were pretrained on dataset sizes of 100K, 250K, 1M, and 10M compounds.</li>
<li><strong>Baselines</strong>: Performance was compared against D-MPNN (Graph Neural Network), Random Forest (RF), and SVM using 2048-bit Morgan Fingerprints.</li>
<li><strong>Downstream Tasks</strong>: Finetuning was performed individually on small <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks: BBBP (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>), ClinTox (clinical toxicity), HIV, and Tox21 (p53 stress-response). This poses a transfer learning challenge, as the model must adapt from pretraining on 10 million molecules to classifying datasets ranging from ~1.5K to ~41K examples.</li>
<li><strong>Ablations</strong>:
<ul>
<li><strong>Tokenization</strong>: BPE vs. SmilesTokenizer on the 1M dataset, evaluated on Tox21.</li>
<li><strong>Input</strong>: SMILES vs. SELFIES strings on the Tox21 task.</li>
</ul>
</li>
</ul>
<h2 id="results-vs-graph-neural-network-baselines">Results vs. Graph Neural Network Baselines</h2>
<p>The main comparison between ChemBERTa (pretrained on 10M compounds) and Chemprop baselines on MoleculeNet tasks is summarized below (Table 1 from the paper):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP ROC</th>
          <th>BBBP PRC</th>
          <th>ClinTox ROC</th>
          <th>ClinTox PRC</th>
          <th>HIV ROC</th>
          <th>HIV PRC</th>
          <th>Tox21 ROC</th>
          <th>Tox21 PRC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemBERTa 10M</td>
          <td>0.643</td>
          <td>0.620</td>
          <td>0.733</td>
          <td>0.975</td>
          <td>0.622</td>
          <td>0.119</td>
          <td>0.728</td>
          <td>0.207</td>
      </tr>
      <tr>
          <td>D-MPNN</td>
          <td>0.708</td>
          <td>0.697</td>
          <td>0.906</td>
          <td>0.993</td>
          <td>0.752</td>
          <td>0.152</td>
          <td>0.688</td>
          <td>0.429</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>0.681</td>
          <td>0.692</td>
          <td>0.693</td>
          <td>0.968</td>
          <td>0.780</td>
          <td>0.383</td>
          <td>0.724</td>
          <td>0.335</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>0.702</td>
          <td>0.724</td>
          <td>0.833</td>
          <td>0.986</td>
          <td>0.763</td>
          <td>0.364</td>
          <td>0.708</td>
          <td>0.345</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Scaling Improvements &amp; Training Dynamics</strong>: Performance scales predictably with pretraining data size. Increasing data from 100K to 10M improved ROC-AUC by +0.110 and PRC-AUC by +0.059 on average across BBBP, ClinTox, and Tox21 (HIV was omitted due to resource constraints). Notably, the researchers had to halt pretraining on the 10M subset after just 3 epochs due to overfitting, suggesting that simple 15% token masking may not pose a sufficiently difficult learning problem for large-scale chemical representation learning.</li>
<li><strong>Performance Limits vs. GNNs</strong>: ChemBERTa generally performs below the D-MPNN baseline. On the Tox21 dataset, ChemBERTa-10M achieved a higher ROC-AUC (0.728) than D-MPNN (0.688); nonetheless, it recorded a substantially lower PRC-AUC (0.207 vs 0.429). This gap indicates that current Transformer iterations lack the explicit inductive biases of graph algorithms and struggle with the severe class imbalances typical of chemical datasets.</li>
<li><strong>Ablation Limitations (Tokenization &amp; SELFIES)</strong>: The authors&rsquo; ablation studies for tokenization (SmilesTokenizer narrowly beating BPE) and input representation (SELFIES performing comparably to SMILES) were evaluated exclusively on the single Tox21 task. Deriving broad architectural conclusions regarding &ldquo;semantically-aware tokenization&rdquo; or string robustness from an $N=1$ empirical evaluation is a significant limitation of the study. Broader benchmarking is required to validate these findings.</li>
<li><strong>Interpretability</strong>: Attention heads organically learn to track chemically relevant substructures (like specific functional groups and aromatic rings), mimicking the inductive biases of graph convolutions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The authors curated a massive dataset for pretraining and utilized standard benchmarks for evaluation.</p>
<ul>
<li><strong>Pretraining Data</strong>: <strong>PubChem-77M</strong>.
<ul>
<li>Source: 77 million unique SMILES from PubChem.</li>
<li>Preprocessing: Canonicalized and globally shuffled.</li>
<li>Subsets used: 100K, 250K, 1M, and 10M subsets.</li>
<li><em>Availability Note</em>: The authors provided a direct link to the <a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">canonicalized 10M compound subset</a> used for their largest experiments. Full reproducibility of the smaller (100K, 250K, 1M) or full 77M sets may require re-extracting from PubChem.</li>
</ul>
</li>
<li><strong>Evaluation Data</strong>: <strong>MoleculeNet</strong>.
<ul>
<li>Tasks: BBBP (2,039), ClinTox (1,478), HIV (41,127), Tox21 (7,831).</li>
<li>Splitting: 80/10/10 train/valid/test split using a <strong>scaffold splitter</strong> to ensure chemical diversity between splits.</li>
</ul>
</li>
</ul>
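<p>The scaffold splitter can be sketched as a greedy assignment: group molecules by scaffold, then fill train, valid, and test so that no scaffold ever crosses a split boundary. A minimal sketch under the assumption that scaffold keys are already computed (in practice DeepChem derives Bemis&ndash;Murcko scaffolds with RDKit); the single-letter scaffold keys and the ten-molecule toy set below are purely illustrative:</p>

```python
from collections import defaultdict

def scaffold_split(mols, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: each scaffold's molecules land in exactly
    one split, with larger scaffold groups placed first."""
    groups = defaultdict(list)
    for idx, scaffold in mols:          # mols: (index, scaffold_key) pairs
        groups[scaffold].append(idx)
    # Place scaffold sets from largest to smallest.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Ten molecules spread over four hypothetical scaffolds A-D.
mols = list(enumerate("AAAAABBBCD"))
train, valid, test = scaffold_split(mols)
print(len(train), len(valid), len(test))  # -> 8 1 1
```

<p>Because whole scaffold groups move together, test-set molecules are structurally novel relative to training, which is what makes scaffold splits harder (and more realistic) than random splits.</p>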
<h3 id="algorithms">Algorithms</h3>
<p>The core training methodology mirrors standard BERT/RoBERTa procedures adapted for chemical strings.</p>
<ul>
<li><strong>Objective</strong>: Masked Language Modeling (MLM) with <strong>15% token masking</strong>.</li>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>BPE</strong>: Byte-Pair Encoder (vocab size 52K).</li>
<li><strong>SmilesTokenizer</strong>: Regex-based custom tokenizer available in DeepChem (documented <a href="https://deepchem.readthedocs.io/en/latest/tokenizers.html#smilestokenizer">here</a>).</li>
</ul>
</li>
<li><strong>Sequence Length</strong>: Maximum sequence length of <strong>512 tokens</strong>.</li>
<li><strong>Finetuning</strong>: Appended a linear classification layer; backpropagated through the base model for up to 25 epochs with early stopping on ROC-AUC.</li>
</ul>
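<p>The SmilesTokenizer&rsquo;s regex-based approach can be sketched with the widely used SMILES tokenization pattern, which splits on chemically meaningful units (bracket atoms, two-letter halogens such as Cl and Br, ring-closure digits, bond symbols) rather than learned byte pairs. A rough stdlib-only sketch; the production tokenizer additionally manages a fixed vocabulary and special tokens:</p>

```python
import re

# Regex over chemically meaningful SMILES units: bracket atoms ([NH4+], [C@@H]),
# two-letter halogens (Cl, Br), aromatic atoms, bonds, ring closures, digits.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    return SMILES_REGEX.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
print(tokenize_smiles("CCl"))                    # ['C', 'Cl'], not ['C', 'C', 'l']
```

<p>Unlike BPE, this guarantees that a bracket atom or a chlorine never gets split into chemically meaningless fragments.</p>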
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: <strong>RoBERTa</strong> (via HuggingFace).
<ul>
<li>Layers: 6</li>
<li>Attention Heads: 12 per layer (72 attention heads across the 6 layers).</li>
<li><em>Implementation Note</em>: The original training notebooks and scripts are maintained in the authors&rsquo; <a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry repository</a>, alongside the primary downstream tasks integrated into DeepChem. A <a href="https://github.com/deepchem/deepchem/blob/master/examples/tutorials/22_Transfer_Learning_With_HuggingFace_tox21.ipynb">full Tox21 transfer learning tutorial</a> has been incorporated into the DeepChem repository.</li>
</ul>
</li>
<li><strong>Baselines</strong> (via Chemprop library):
<ul>
<li><strong>D-MPNN</strong>: Directed Message Passing Neural Network with default hyperparameters.</li>
<li><strong>RF/SVM</strong>: Scikit-learn Random Forest and SVM using 2048-bit Morgan fingerprints (<a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>).</li>
</ul>
</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Performance is measured using dual metrics to account for class imbalance common in toxicity datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ROC-AUC</strong></td>
          <td>Area Under Receiver Operating Characteristic Curve</td>
      </tr>
      <tr>
          <td><strong>PRC-AUC</strong></td>
          <td>Area Under Precision-Recall Curve (vital for imbalanced data)</td>
      </tr>
  </tbody>
</table>
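<p>The need for both metrics can be seen with a toy ranking: with one positive among twenty examples, ranking it third still yields a high ROC-AUC, while average precision (a common PRC-AUC estimate) collapses. A stdlib-only sketch with invented scores; scikit-learn&rsquo;s <code>roc_auc_score</code> and <code>average_precision_score</code> are the production equivalents:</p>

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability a random positive outscores a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """PRC-AUC estimated as average precision over the ranked positives."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total, ap = 0, sum(y_true), 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / k / total
    return ap

# One positive among 20, ranked third by strictly decreasing scores.
y = [0, 0, 1] + [0] * 17
s = [round(1 - 0.05 * i, 2) for i in range(20)]
print(round(roc_auc(y, s), 3), round(average_precision(y, s), 3))  # -> 0.895 0.333
```

<p>The same dissociation appears in the results above: ChemBERTa&rsquo;s Tox21 ROC-AUC (0.728) beats D-MPNN&rsquo;s (0.688) while its PRC-AUC (0.207) trails badly (0.429).</p>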
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: Single <strong>NVIDIA V100 GPU</strong>.</li>
<li><strong>Training Time</strong>: Approximately <strong>48 hours</strong> for the 10M compound subset.</li>
<li><strong>Carbon Footprint</strong>: Estimated 17.1 kg $\text{CO}_2\text{eq}$ (offset by Google Cloud).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training notebooks and finetuning scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration of ChemBERTa and SmilesTokenizer</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">ChemBERTa-zinc-base-v1</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained RoBERTa on 100K ZINC SMILES</td>
      </tr>
      <tr>
          <td><a href="https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pubchem_10m.txt.zip">PubChem-10M subset</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Canonicalized 10M compound subset used for largest experiments</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code and pre-trained models are available, and the 10M pretraining subset is downloadable. However, smaller subsets (100K, 250K, 1M) may need re-extraction from PubChem, and exact hyperparameter details for finetuning (learning rate, batch size) are not fully specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chithrananda, S., Grand, G., &amp; Ramsundar, B. (2020). ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction. <em>arXiv preprint arXiv:2010.09885</em>. <a href="https://doi.org/10.48550/arXiv.2010.09885">https://doi.org/10.48550/arXiv.2010.09885</a></p>
<p><strong>Publication</strong>: arXiv 2020 (Preprint)</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://huggingface.co/seyonec/ChemBERTa-zinc-base-v1">HuggingFace Model Hub (ChemBERTa-zinc-base-v1)</a> - <em>Additional pre-trained variations on PubChem &amp; ZINC datasets are available on the author&rsquo;s <a href="https://huggingface.co/seyonec">seyonec</a> HF profile.</em></li>
<li><a href="https://github.com/seyonechithrananda/bert-loves-chemistry">bert-loves-chemistry GitHub Repository</a> - <em>Notebooks and scripts used for MLM pretraining and finetuning evaluations.</em></li>
</ul>
<h3 id="bibtex">BibTeX</h3>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{chithranandaChemBERTaLargeScaleSelfSupervised2020,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{ChemBERTa}}: {{Large-Scale Self-Supervised Pretraining}} for {{Molecular Property Prediction}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{ChemBERTa}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2020</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{arXiv:2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span> = <span style="color:#e6db74">{2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span> = <span style="color:#e6db74">{cs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.48550/arXiv.2010.09885}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span> = <span style="color:#e6db74">{arXiv}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>