<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Property Prediction on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/</link><description>Recent content in Property Prediction on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/index.xml" rel="self" type="application/rss+xml"/><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
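<p>The tokenization and masking scheme can be sketched as follows. The regex below is an illustrative approximation of such a SMILES tokenizer (not the paper's exact pattern), and the masking follows the 80/10/10 split described above:</p>

```python
import random
import re

# Illustrative SMILES tokenizer: bracketed atoms, two-letter elements
# (Br, Cl, Si), single atoms, then SMILES syntax characters.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOSPFI]|[bcnops]"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles: str) -> list[str]:
    return SMILES_TOKEN_RE.findall(smiles)

def mask_tokens(tokens, mask_rate=0.15, vocab=("C", "N", "O"), rng=random):
    """BERT-style masking: each token is selected with prob. mask_rate;
    of the selected tokens, 80% become [MASK], 10% become a random token,
    10% stay unchanged. Loss is computed only at the returned positions."""
    out, loss_positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            loss_positions.append(i)
            r = rng.random()
            if r < 0.8:
                out[i] = "[MASK]"
            elif r < 0.9:
                out[i] = rng.choice(vocab)
            # else: token kept unchanged, but still contributes to the loss
    return out, loss_positions

tokens = tokenize("CC(=O)Oc1ccccc1Cl")  # note "Cl" stays a single token
```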
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and $\sqrt{d_k}$ is a scaling factor. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
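<p>In code, the per-head attention above amounts to the following minimal NumPy sketch (dimensions are illustrative, not the paper's):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """O_h = softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 5, 16                                        # 5 tokens, head dim 16
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
```

In the full model, the outputs of all heads are then concatenated and linearly projected, as described above.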
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
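<p>The unweighted multitask objective can be sketched as follows (a minimal NumPy illustration of summing per-task losses, not the paper's implementation):</p>

```python
import numpy as np

def binary_cross_entropy(p, y):
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def mse(pred, y):
    return float(np.mean((pred - y) ** 2))

def multitask_loss(cls_outputs, reg_outputs):
    """Simple sum of per-task losses with no learned weighting:
    cross-entropy for classification heads, MSE for regression heads."""
    total = 0.0
    for p, y in cls_outputs:     # (predicted probabilities, binary labels)
        total += binary_cross_entropy(p, y)
    for pred, y in reg_outputs:  # (predicted values, targets)
        total += mse(pred, y)
    return total
```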
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, and their predictions are fused (averaged) into a more robust final prediction.</li>

</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
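<p>The enumeration step can be sketched with RDKit (assumed available here), whose <code>doRandom</code> flag produces random-order SMILES; the duplicate-retry loop mirrors the paper's strategy of re-sampling when an identical string is drawn:</p>

```python
from rdkit import Chem

def enumerate_smiles(smiles: str, n: int = 20, max_tries: int = 100) -> list[str]:
    """Generate up to n distinct random SMILES for one molecule,
    retrying up to max_tries per string when a duplicate is drawn
    (a sketch of 20x augmentation, not the paper's exact code)."""
    mol = Chem.MolFromSmiles(smiles)
    seen: list[str] = []
    for _ in range(n):
        for _ in range(max_tries):
            s = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
            if s not in seen:
                seen.append(s)
                break
    return seen

variants = enumerate_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

Every variant canonicalizes back to the same molecule, which is what makes enumeration a label-preserving augmentation.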
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines except on CYP2C19-sub and BACE (by less than 1.1%).</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by margins of 5% to more than 10%.</li>
<li>Improvements were statistically significant at the 95% confidence level (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For a solubility task (LogS/LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acyl chloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation. Sampling is retried (up to 100 times) when an enumerated SMILES duplicates a previously generated one.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
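<p>The duplicate-handling strategies above can be made concrete with a small sketch that post-processes the $m$ random-SMILES draws for a single compound (an illustration of the strategies' logic, not the authors' code; the $\sqrt{m}$ cap follows strategy 4's $f(m)$):</p>

```python
import math
from collections import Counter

def apply_duplication_strategy(draws: list[str], m: int, strategy: str) -> list[str]:
    """Post-process m random-SMILES draws for one compound according
    to the chosen duplicate-handling strategy (strategies 2-4 above)."""
    if strategy == "with_duplication":
        return draws                                   # keep all m draws
    counts = Counter(draws)
    if strategy == "without_duplication":
        return list(counts)                            # one copy of each string
    if strategy == "reduced_duplication":
        cap = max(1, math.isqrt(m))                    # keep ~sqrt(m) copies
        return [s for s, c in counts.items() for _ in range(min(c, cap))]
    raise ValueError(strategy)
```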
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
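<p>The aggregation and confidence measure can be sketched as follows (a minimal NumPy illustration with a toy stand-in for the trained model $M_{\Theta}$):</p>

```python
import numpy as np

def fuse_predictions(model, smiles_variants):
    """Test-time augmentation as ensembling: predict on k random SMILES
    of one compound, return the mean as the compound-level prediction
    and the standard deviation as a confidence measure."""
    preds = np.array([model(s) for s in smiles_variants])
    return preds.mean(), preds.std()

# Toy stand-in for a trained model: score a SMILES string by its length.
toy_model = lambda s: float(len(s))
mean_pred, uncertainty = fuse_predictions(toy_model, ["CCO", "C(C)O", "OCC"])
```

A high standard deviation flags compounds where the model's output depends strongly on the particular SMILES writing, i.e. where it is uncertain.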
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
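<p>A confidence curve of this kind can be sketched as follows (an illustration of the analysis with made-up numbers, not the paper's data):</p>

```python
import numpy as np

def confidence_curve(errors, uncertainties):
    """Mean absolute error over the retained set when keeping only the
    k most confident compounds, for k = 1..N (sketch of the analysis)."""
    order = np.argsort(uncertainties)                 # most confident first
    errs = np.abs(np.asarray(errors, float))[order]
    return np.cumsum(errs) / np.arange(1, len(errs) + 1)

errors = [0.1, 0.9, 0.3, 1.5]                         # per-compound errors
uncertainties = [0.05, 0.80, 0.20, 1.10]              # per-SMILES prediction stds
curve = confidence_curve(errors, uncertainties)
```

When uncertainty tracks error, as here, the curve rises as less confident compounds are admitted, which is the behavior the paper reports.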
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and $R^2$ of 0.712, compared to 1.031 RMSE and 0.494 $R^2$ for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 $R^2$) performs comparably, which the authors note without further explanation.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
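<p>The encoding and aggregation steps above can be sketched in a few lines (a simplified illustration; the actual pipeline builds its alphabet from the training data and generates random SMILES with RDKit&rsquo;s enumeration):</p>

```python
def one_hot_encode(smiles, alphabet, max_len):
    """One-hot encode a SMILES string character-by-character, padding
    with all-zero rows up to max_len."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

# Alphabet derived from a toy "training set" of one molecule.
alphabet = sorted(set("CC(=O)Oc1ccccc1"))
enc = one_hot_encode("CC(=O)O", alphabet, max_len=10)
print(len(enc), len(enc[0]))  # 10 padded positions, one column per character
```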
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformers for Molecular Property Prediction Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformers-molecular-property-prediction-review/</guid><description>A systematic review of 16 transformer models for molecular property prediction, analyzing architecture, data, tokenization, and benchmarking gaps.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformers-for-molecular-property-prediction">A Systematization of Transformers for Molecular Property Prediction</h2>
<p>This is a <strong>Systematization</strong> paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper&rsquo;s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.</p>
<h2 id="the-problem-inconsistent-evaluation-hinders-progress">The Problem: Inconsistent Evaluation Hinders Progress</h2>
<p>Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. However, the field faces several challenges:</p>
<ol>
<li><strong>Small labeled datasets</strong>: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.</li>
<li><strong>No standardized evaluation protocol</strong>: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.</li>
<li><strong>Unclear design choices</strong>: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.</li>
</ol>
<p>The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.</p>
<h2 id="seven-design-questions-for-molecular-transformers">Seven Design Questions for Molecular Transformers</h2>
<p>The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.</p>
<h3 id="reviewed-models">Reviewed Models</h3>
<p>The paper catalogs 16 models organized by architecture:</p>
<table>
  <thead>
      <tr>
          <th>Architecture</th>
          <th>Base Model</th>
          <th>Models</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Encoder-Decoder</td>
          <td>Transformer, BART</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-transformer/">ST</a>, Transformer-CNN, <a href="/notes/chemistry/molecular-representations/encoders/x-mol-pretraining-molecular-understanding/">X-Mol</a>, <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>BERT</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a>, MAT, <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, Mol-BERT, Chen et al., K-BERT, FP-BERT, <a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Encoder-Only</td>
          <td>RoBERTa</td>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a>, <a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
      </tr>
      <tr>
          <td>Decoder-Only</td>
          <td>XLNet</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a> (RT)</td>
      </tr>
  </tbody>
</table>
<p>The core attention mechanism shared by all these models is the scaled dot-product attention:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V
$$</p>
<p>where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.</p>
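<p>As a concrete reference, the formula above translates directly into NumPy (a generic sketch, not code from any reviewed model):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed row-wise with a
    numerically stable softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one attended vector per query
```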
<h3 id="question-1-which-database-and-how-many-molecules">Question 1: Which Database and How Many Molecules?</h3>
<p>Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Database</th>
          <th>Size</th>
          <th>Language</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>ChEMBL</td>
          <td>900K</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>ChEMBL (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>)</td>
          <td>1.6M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>PubChem</td>
          <td>100K-10M</td>
          <td>SMILES, SELFIES</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>PubChem</td>
          <td>5M-77M</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>ZINC</td>
          <td>2M</td>
          <td>List of atoms</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>ZINC + PubChem</td>
          <td>1.1B</td>
          <td>SMILES</td>
      </tr>
      <tr>
          <td>Chen et al.</td>
          <td>C, CP, CPZ</td>
          <td>2M-775M</td>
          <td>SMILES</td>
      </tr>
  </tbody>
</table>
<p>A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that the model trained on 5M molecules using MLM performed comparably to 77M molecules for BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.</p>
<h3 id="question-2-which-chemical-language">Question 2: Which Chemical Language?</h3>
<p>Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.</p>
<p>Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.</p>
<h3 id="question-3-how-to-tokenize">Question 3: How to Tokenize?</h3>
<p>Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.</p>
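<p>For illustration, atom-level tokenization is typically implemented with a regular expression along these lines (a sketch; each reviewed model defines its own vocabulary and special tokens):</p>

```python
import re

# A widely used atom-level SMILES pattern: bracket atoms are kept whole,
# two-letter elements (Br, Cl) take priority over single letters.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer must cover the full string"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin: 21 tokens
```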
<h3 id="question-4-how-to-add-positional-embeddings">Question 4: How to Add Positional Embeddings?</h3>
<p>Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.</p>
<p>MolFormer&rsquo;s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively.</p>
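<p>To make the RoPE comparison concrete: rotary embeddings rotate pairs of feature dimensions by position-dependent angles, so that dot products between tokens depend on their relative positions. A minimal NumPy sketch (not MolFormer&rsquo;s implementation):</p>

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional embedding: rotate each (x1_i, x2_i) feature pair
    of token t by angle t * base**(-i/half). Assumes even feature dim."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # per-pair frequencies
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(1).standard_normal((5, 8))
rotated = rope(x)
# Rotations preserve per-token norms.
print(np.allclose(np.linalg.norm(rotated, axis=1), np.linalg.norm(x, axis=1)))
```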
<p>The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.</p>
<h3 id="question-5-how-many-parameters">Question 5: How Many Parameters?</h3>
<p>Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Dimensions</th>
          <th>Heads</th>
          <th>Layers</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ST</td>
          <td>256</td>
          <td>4</td>
          <td>4</td>
          <td>7M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>768</td>
          <td>12</td>
          <td>12</td>
          <td>85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>768</td>
          <td>12</td>
          <td>6, 12</td>
          <td>43M, 85M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a></td>
          <td>768</td>
          <td>12, 4</td>
          <td>8, 12</td>
          <td>57M, 85M</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>1024</td>
          <td>16</td>
          <td>8</td>
          <td>101M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>768</td>
          <td>12</td>
          <td>6</td>
          <td>43M</td>
      </tr>
  </tbody>
</table>
<p>SELFormer and MolFormer both tested different model sizes. SELFormer&rsquo;s larger model (approximately 86M parameters) showed approximately 0.034 better ROC-AUC for BBBP compared to the smaller model. MolFormer&rsquo;s larger model (approximately 87M parameters) performed approximately 0.04 better ROC-AUC on average for BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.</p>
<h3 id="question-6-which-pre-training-objectives">Question 6: Which Pre-training Objectives?</h3>
<p>Pre-training objectives fall into domain-agnostic and domain-specific categories:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Pre-training Objective</th>
          <th>Fine-tuning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molformer/">MolFormer</a></td>
          <td>MLM</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/smiles-bert/">SMILES-BERT</a></td>
          <td>MLM</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a></td>
          <td>MLM, PhysChemPred, SMILES-EQ</td>
          <td>Frozen, Update</td>
      </tr>
      <tr>
          <td>K-BERT</td>
          <td>Atom feature, MACCS prediction, CL</td>
          <td>Update last layer</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta-2/">ChemBERTa-2</a></td>
          <td>MLM, MTR</td>
          <td>Update</td>
      </tr>
      <tr>
          <td>MAT</td>
          <td>MLM, 2D Adjacency, 3D Distance</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">ChemFormer</a></td>
          <td>Denoising Span MLM, Augmentation</td>
          <td>Update</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">RT</a></td>
          <td>PLM (Permutation Language Modeling)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT&rsquo;s PhysChemPred alone performed on par with the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening). The SMILES-EQ objective (identifying equivalent SMILES) was found to lower performance when combined with other objectives. K-BERT&rsquo;s contrastive learning objective did not significantly change performance (average ROC-AUC of 0.806 vs. 0.807 with and without CL).</p>
<p>ChemBERTa-2&rsquo;s Multi-Task Regression (MTR) objective performed noticeably better than MLM-only for almost all four classification tasks across pre-training dataset sizes.</p>
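<p>The MLM objective shared by most of these models corrupts inputs following BERT&rsquo;s standard scheme: roughly 15% of positions are selected for prediction; of those, 80% become a mask token, 10% a random token, and 10% stay unchanged. A stdlib-only sketch (not any reviewed model&rsquo;s implementation):</p>

```python
import random

def mlm_mask(tokens, vocab, mask_token="[MASK]", p=0.15, seed=0):
    """BERT-style MLM corruption. Labels record the original token at
    selected positions; None elsewhere (ignored by the loss)."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < p:
            labels[i] = tok
            r = rng.random()
            if r < 0.8:
                corrupted[i] = mask_token       # 80%: mask
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random token
            # remaining 10%: leave unchanged
    return corrupted, labels

tokens = list("CC(=O)Oc1ccccc1")
corrupted, labels = mlm_mask(tokens, vocab=list("CNOcno()=1"))
print(sum(l is not None for l in labels), "positions selected for prediction")
```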
<h3 id="question-7-how-to-fine-tune">Question 7: How to Fine-tune?</h3>
<p>Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.</p>
<h2 id="benchmarking-challenges-and-performance-comparison">Benchmarking Challenges and Performance Comparison</h2>
<h3 id="downstream-datasets">Downstream Datasets</h3>
<p>The review focuses on nine benchmark datasets across three categories from <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Tasks</th>
          <th>Type</th>
          <th>Application</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>Hydration free energy</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td>1 regression</td>
          <td>Physical chemistry</td>
          <td>LogD at pH 7.4</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,050</td>
          <td>1 classification</td>
          <td>Physiology</td>
          <td>Blood-brain barrier</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>1,484</td>
          <td>2 classification</td>
          <td>Physiology</td>
          <td>Clinical trial toxicity</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>1,427</td>
          <td>27 classification</td>
          <td>Physiology</td>
          <td>Drug side effects</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>7,831</td>
          <td>12 classification</td>
          <td>Physiology</td>
          <td>Nuclear receptor/stress pathways</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>1,513</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Beta-secretase 1 binding</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>1 classification</td>
          <td>Biophysics</td>
          <td>Anti-HIV activity</td>
      </tr>
  </tbody>
</table>
<h3 id="inconsistencies-in-evaluation">Inconsistencies in Evaluation</h3>
<p>The authors document substantial inconsistencies that prevent fair model comparison:</p>
<ol>
<li><strong>Data splitting</strong>: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.</li>
<li><strong>Different test sets</strong>: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.</li>
<li><strong>Varying repetitions</strong>: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.</li>
<li><strong>Metric inconsistency</strong>: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.</li>
</ol>
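<p>The second point is easy to demonstrate: two random splits of the same dataset that differ only in seed share only a small fraction of test molecules (a stdlib sketch over indices; scaffold splits instead group molecules by their Bemis-Murcko scaffolds):</p>

```python
import random

def random_split(n, test_fraction=0.2, seed=0):
    """Index-level 80/20 random split with a fixed seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * (1 - test_fraction))
    return idx[:cut], idx[cut:]

# Same split type, different seeds: largely disjoint test sets, so
# reported metrics are not directly comparable across papers.
_, test_a = random_split(1128, seed=0)  # ESOL-sized dataset
_, test_b = random_split(1128, seed=1)
overlap = len(set(test_a) & set(test_b)) / len(test_a)
print(f"test-set overlap: {overlap:.0%}")
```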
<h3 id="performance-findings">Performance Findings</h3>
<p>When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.</p>
<p>For BBBP, the Mol-BERT model reported lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT performed better than its ML (approximately 0.08 higher ROC-AUC) and DL (approximately 0.03 higher ROC-AUC) baselines. For SIDER, Mol-BERT performed approximately 0.1 better ROC-AUC than its corresponding MPNN. For regression, MAT and MolBERT showed improved performance over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity. For example, MAT performed approximately 0.2 lower RMSE than an SVM model and approximately 0.03 lower RMSE than the Weave model on ESOL.</p>
<h2 id="key-takeaways-and-future-directions">Key Takeaways and Future Directions</h2>
<p>The review concludes with six main takeaways:</p>
<ol>
<li><strong>Performance</strong>: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.</li>
<li><strong>Scaling</strong>: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.</li>
<li><strong>Pre-training data</strong>: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.</li>
<li><strong>Chemical language</strong>: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.</li>
<li><strong>Domain knowledge</strong>: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.</li>
<li><strong>Benchmarking</strong>: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.</li>
</ol>
<p>The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.</p>
<h3 id="models">Models</h3>
<p>Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/Transformers4MPP_review">Transformers4MPP_review</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Figure generation code and compiled data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sultan, A., Sieg, J., Mathea, M., &amp; Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. <em>Journal of Chemical Information and Modeling</em>, 64(16), 6259-6280. <a href="https://doi.org/10.1021/acs.jcim.4c00747">https://doi.org/10.1021/acs.jcim.4c00747</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sultan2024transformers,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6259--6280}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.4c00747}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
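<p>The character-level encoding described above can be sketched in plain Python. This is illustrative only: the special tokens and the toy symbol set below are assumptions, not the paper&rsquo;s exact 66-symbol vocabulary.</p>

```python
def build_vocab(smiles_corpus):
    """Character-level vocabulary; the paper's charset has 66 symbols."""
    chars = sorted({ch for s in smiles_corpus for ch in s})
    vocab = {"[pad]": 0, "[sos]": 1, "[eos]": 2}  # special tokens are illustrative
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab

def encode(smiles, vocab, max_len=110):
    """Map a SMILES string (up to 110 characters) to padded token ids."""
    ids = [vocab["[sos]"]] + [vocab[ch] for ch in smiles] + [vocab["[eos]"]]
    return ids + [vocab["[pad]"]] * (max_len + 2 - len(ids))

vocab = build_vocab(["CCO", "c1ccccc1", "CC(=O)O"])
seq = encode("CCO", vocab)  # length 112: sos + 3 chars + eos + padding
```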
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,\; \text{step} / \text{warmup}) / \max(\text{step},\; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
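<p>This schedule can be written out directly (a minimal sketch; the parameter names are mine, the constants are the paper&rsquo;s):</p>

```python
def learning_rate(step, factor=20.0, warmup=16000, floor=1e-4):
    """Noam-style schedule: linear warmup to factor/warmup, then 1/step
    decay, clipped below at floor (1e-4 in the paper)."""
    lr = factor * min(1.0, step / warmup) / max(step, warmup)
    return max(lr, floor)

peak = learning_rate(16000)   # 20 / 16000 = 1.25e-3 at the end of warmup
late = learning_rate(400000)  # 20 / 400000 = 5e-5, clipped up to 1e-4
```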
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
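<p>The test-time averaging can be sketched as follows. <code>model</code> and <code>enumerate_smiles</code> are stand-ins for the frozen-encoder predictor and a SMILES randomizer (e.g. RDKit&rsquo;s <code>MolToSmiles</code> with <code>doRandom=True</code>); neither is the paper&rsquo;s exact implementation.</p>

```python
def predict_augmented(model, enumerate_smiles, smiles, n=10):
    """Average predictions over n SMILES variants of the same molecule,
    as done during both training and inference in the paper."""
    variants = [smiles] + enumerate_smiles(smiles, n - 1)
    preds = [model(s) for s in variants]
    return sum(preds) / len(preds)

# Toy usage with stand-in callables:
identity_enum = lambda s, k: [s] * k  # a real enumerator would randomize atom order
pred = predict_augmented(len, identity_enum, "CCO")  # averages 10 identical values
```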
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
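<p>A minimal epsilon-stabilized LRP step for one dense layer illustrates the conservation-minus-bias behavior described above. This is a sketch, not the paper&rsquo;s implementation; the Highway signal-take-all rule is omitted.</p>

```python
import numpy as np

def lrp_dense(a, W, b, R_out, eps=1e-9):
    """Redistribute output relevance R_out onto inputs a of a dense layer
    in proportion to the contributions z_ij = a_i * W_ij. With nonzero
    bias b, part of the relevance is absorbed, so sum(R_in) < sum(R_out)."""
    z = a @ W + b                        # pre-activations of the layer
    s = R_out / (z + eps * np.sign(z))   # stabilized relevance ratio
    return a * (W @ s)                   # back-distribute along contributions

a = np.array([1.0, 2.0])
W = np.array([[1.0], [1.0]])
R_in = lrp_dense(a, W, np.zeros(1), np.array([3.0]))
# With zero bias, relevance is conserved: R_in sums to 3.0
```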
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight between 12 and 600, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
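<p>The regression metric listed above is the standard coefficient of determination; a plain-Python version for reference:</p>

```python
def r_squared(y_true, y_pred):
    """r^2 = 1 - SS_res / SS_tot, as used for the regression benchmarks."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```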
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES2Vec: Interpretable Chemical Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/smiles2vec-interpretable-property-prediction/</guid><description>SMILES2Vec uses a Bayesian-optimized CNN-GRU architecture to predict chemical properties directly from SMILES strings with an interpretable explanation mask.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-rnn-for-chemical-property-prediction-from-smiles">A General-Purpose RNN for Chemical Property Prediction from SMILES</h2>
<p>SMILES2Vec is a <strong>Method</strong> paper that introduces a deep recurrent neural network architecture for predicting chemical properties directly from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> text representations. The primary contributions are: (1) a Bayesian-optimized CNN-<a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> architecture that serves as a general-purpose predictor for diverse chemical properties (toxicity, activity, solubility, <a href="https://en.wikipedia.org/wiki/Solvation">solvation</a> energy), (2) an explanation mask mechanism that provides interpretable predictions by identifying which SMILES characters drive the network&rsquo;s decisions, and (3) evidence that representation learning from raw SMILES can match or outperform models using hand-crafted molecular descriptors.</p>
<h2 id="motivation-beyond-engineered-features-in-chemical-modeling">Motivation: Beyond Engineered Features in Chemical Modeling</h2>
<p>At the time of writing (2017), deep learning models in chemistry relied heavily on engineered <a href="https://en.wikipedia.org/wiki/Molecular_descriptor">molecular descriptors</a> and fingerprints as input features. Over 5,000 molecular descriptors had been developed since the late 1940s, and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a>/QSPR modeling remained the dominant paradigm. The authors identified two key limitations with this approach:</p>
<ol>
<li><strong>Restricted search space</strong>: Engineered features limit the neural network&rsquo;s ability to discover potentially useful representations that domain experts have not anticipated.</li>
<li><strong>Incomplete domain knowledge</strong>: For complex properties where first-principles understanding is incomplete, the lack of appropriate descriptors constrains model performance.</li>
</ol>
<p>In contrast, computer vision and NLP had shown that deep learning models trained on raw data (unaltered images, raw text) could learn powerful representations without feature engineering. The chemical SMILES notation, a text-based encoding of molecular structure that serves as the standard interchange format in cheminformatics, provided a natural analog to text data for NLP-style modeling.</p>
<p>A secondary motivation was interpretability. Most ML and DL models for chemistry operated as black boxes, which posed particular problems for regulated applications like FDA drug approval where mechanistic explanations are required.</p>
<h2 id="core-innovation-cnn-gru-architecture-with-explanation-masks">Core Innovation: CNN-GRU Architecture with Explanation Masks</h2>
<h3 id="architecture-design-via-bayesian-optimization">Architecture Design via <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a></h3>
<p>SMILES2Vec treats SMILES strings as character-level text input. The network processes one-hot encoded characters (padded to length 250, covering 99.9% of the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database) through three stages:</p>
<ol>
<li><strong>Embedding layer</strong>: Maps one-hot character vectors to a learned embedding space (size 50)</li>
<li><strong>1D convolutional layer</strong>: 192 filters with kernel size 3, stride 1</li>
<li><strong>Bidirectional GRU layers</strong>: Two layers with 224 and 384 units respectively</li>
</ol>
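<p>The input encoding feeding this pipeline (one-hot characters padded to 250 positions) can be sketched with NumPy; the charset below is a toy subset, not ChEMBL&rsquo;s full character set.</p>

```python
import numpy as np

def one_hot_smiles(smiles, charset, max_len=250):
    """One-hot encode a SMILES string, zero-padded to max_len positions
    (250 covers 99.9% of ChEMBL, per the paper)."""
    index = {ch: i for i, ch in enumerate(charset)}
    x = np.zeros((max_len, len(charset)), dtype=np.float32)
    for pos, ch in enumerate(smiles):
        x[pos, index[ch]] = 1.0
    return x

x = one_hot_smiles("CCO", list("#()+-=1CNOclns"))  # toy 14-symbol charset
```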
<p>The authors explored four architectural classes (GRU, LSTM, CNN-GRU, CNN-LSTM) using Bayesian optimization via SigOpt. Each class was evaluated over 60 trials, optimizing embedding size, convolutional filter count, and RNN layer widths. The CNN-GRU class was selected as the best compromise: CNN-LSTM performed best on classification (Tox21), while GRU-based networks excelled at regression (FreeSolv). The final architecture is summarized by the hyperparameters:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding</td>
          <td>Size</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Conv1D</td>
          <td>Filters</td>
          <td>192</td>
      </tr>
      <tr>
          <td>BiGRU Layer 1</td>
          <td>Units</td>
          <td>224</td>
      </tr>
      <tr>
          <td>BiGRU Layer 2</td>
          <td>Units</td>
          <td>384</td>
      </tr>
  </tbody>
</table>
<h3 id="explanation-mask-for-interpretability">Explanation Mask for Interpretability</h3>
<p>The explanation mask is a post-hoc interpretability mechanism. Given a trained (frozen) SMILES2Vec base model, a separate explanation network learns to produce a per-character mask over the input SMILES string. The mask is trained to preserve the base model&rsquo;s output while masking as much input as possible. The loss function for a single sample is:</p>
<p>$$
\text{Loss}_i = \left\| f(\text{SMILES}_i, \theta) - \text{Sol}(\text{SMILES}_i) \right\|_2 + 10^{-6} \left\| \text{MASK}_i \right\|_2 + 0.05 \, H(\text{MASK}_i)
$$</p>
<p>where $f(\text{SMILES}_i, \theta)$ is the base network prediction, $\text{Sol}(\text{SMILES}_i)$ is the ground truth solubility, $H$ is the entropy of the normalized mask, and $\text{MASK}_i$ is the per-character mask vector. The L2 term encourages sparsity and the entropy term penalizes uniform attention distributions.</p>
<p>The explanation network itself is a 20-layer residual network with SELU activations, ending in a 1D convolution of length 1, batch normalization, and a softplus activation. The softplus output ranges from 0 (fully masked) to infinity (amplified attention), allowing the mask to both suppress and emphasize specific SMILES characters.</p>
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The model was evaluated on four datasets from the <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark and the ESOL solubility dataset:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Property</th>
          <th>Task</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>Toxicity</td>
          <td>Multi-task classification</td>
          <td>8,014</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Activity</td>
          <td>Single-task classification</td>
          <td>41,193</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Solvation energy</td>
          <td>Single-task regression</td>
          <td>643</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Solubility</td>
          <td>Single-task regression</td>
          <td>1,128</td>
      </tr>
  </tbody>
</table>
<p>SMILES strings longer than 250 characters were excluded. Classification datasets (Tox21, HIV) used a 1/6 test split with minority-class oversampling; regression datasets (FreeSolv, ESOL) used a 1/10 test split. All experiments used 5-fold cross-validation.</p>
<h3 id="training-protocol">Training Protocol</h3>
<ul>
<li><strong>Optimizer</strong>: RMSprop with learning rate $10^{-3}$, $\rho = 0.9$, $\epsilon = 10^{-8}$</li>
<li><strong>Batch size</strong>: 32</li>
<li><strong>Epochs</strong>: 250 with early stopping (patience of 25 epochs based on validation loss)</li>
<li><strong>Classification loss</strong>: Binary cross-entropy</li>
<li><strong>Regression loss</strong>: Mean absolute error</li>
<li><strong>Metrics</strong>: AUC for classification, RMSE for regression</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>SMILES2Vec was compared against:</p>
<ul>
<li><strong>MLP with engineered features</strong>: Standard multi-layer perceptron using molecular fingerprints (from MoleculeNet)</li>
<li><strong>Molecular graph convolutions</strong>: Graph-based neural network from MoleculeNet</li>
<li><strong>Chemception</strong>: CNN operating on 2D chemical images</li>
</ul>
<h3 id="bayesian-optimization-protocol">Bayesian Optimization Protocol</h3>
<p>Only two datasets were used for architecture optimization: the nr-ahr toxicity task from Tox21 (classification) and FreeSolv (regression). The remaining datasets (full Tox21, HIV, ESOL) served purely for generalization evaluation. A fixed test set was held out during optimization, and correlation between validation and test metrics (0.54 for Tox21, 0.78 for FreeSolv) confirmed limited overfitting to the validation set.</p>
<h2 id="results-competitive-accuracy-with-interpretable-predictions">Results: Competitive Accuracy with Interpretable Predictions</h2>
<h3 id="property-prediction-performance">Property Prediction Performance</h3>
<p>SMILES2Vec achieved the following validation metrics (with a pre-training approach from ChemNet improving performance slightly):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SMILES2Vec</th>
          <th>SMILES2Vec + Pre-training</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>AUC</td>
          <td>0.80</td>
          <td>0.81</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>AUC</td>
          <td>0.78</td>
          <td>0.80</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE (kcal/mol)</td>
          <td>1.4</td>
          <td>1.2</td>
          <td>1.3</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.63</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Exact numbers for the MLP and Chemception baselines were reported only as a bar chart (Figure 6). The paper states that MLP with fingerprints performed worst across all tasks, and that Chemception fell between the MLP and the graph/SMILES methods.</p>
<p>Key findings:</p>
<ul>
<li>SMILES2Vec outperformed MLP models using engineered features across all tasks, despite using no feature engineering.</li>
<li>Against graph convolutions (the state-of-the-art at the time), SMILES2Vec matched on classification (Tox21: 0.81 vs 0.81, HIV: 0.80 vs 0.80) and outperformed on regression (FreeSolv: 1.2 vs 1.3).</li>
<li>SMILES2Vec outperformed Chemception (2D image CNN) on classification tasks but slightly underperformed on regression, which the authors attributed to SMILES lacking explicit atomic number information.</li>
</ul>
<h3 id="interpretability-evaluation">Interpretability Evaluation</h3>
<p>On the ESOL solubility dataset, the explanation mask was evaluated against first-principles chemical knowledge. The authors separated compounds into soluble (log solubility &gt; 1.0) and insoluble (log solubility &lt; -5.0) categories and defined ground truth: soluble compounds should attend to hydrophilic atoms (O, N) while insoluble compounds should attend to hydrophobic atoms (C, F, Cl, Br, I). The top-3 character accuracy was 88%, confirming that SMILES2Vec learned representations consistent with known functional group chemistry.</p>
<p>Qualitative analysis of the masks showed that for low-solubility molecules, characters corresponding to hydrophobic groups (c, C, Cl) received high attention, while high-solubility molecules showed attention focused on hydrophilic groups (O, N).</p>
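<p>The top-3 scoring itself is simple to state in code (a sketch of the evaluation logic as described; single-character tokens only, so two-character atoms like Cl are ignored here):</p>

```python
HYDROPHILIC = {"O", "N"}            # expected focus for soluble compounds
HYDROPHOBIC = {"C", "c", "F", "I"}  # expected focus for insoluble compounds

def top3_hit(smiles, attention, ground_truth):
    """True if any of the 3 highest-attention characters is in the ground-truth set."""
    ranked = sorted(range(len(smiles)), key=lambda i: attention[i], reverse=True)
    return any(smiles[i] in ground_truth for i in ranked[:3])
```

<p>Averaging <code>top3_hit</code> over all held-out molecules in a category gives the reported top-3 accuracy.</p>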
<h3 id="limitations">Limitations</h3>
<ul>
<li>The interpretability evaluation was limited to solubility, a well-understood property with simple first-principles rules. The authors acknowledged that quantifying interpretability for complex properties (toxicity, activity) where no simple ground truth exists is nontrivial.</li>
<li>The Bayesian optimization used only a subset of datasets, so the architecture may not be globally optimal across all chemical tasks.</li>
<li>SMILES strings lack explicit atomic number information, which may limit performance on physical property prediction compared to image or graph representations.</li>
<li>The explanation mask approach requires training a separate 20-layer network per property, adding computational overhead.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Architecture optimization</td>
          <td>Tox21 (nr-ahr task)</td>
          <td>8,014</td>
          <td>Single toxicity task for Bayesian optimization</td>
      </tr>
      <tr>
          <td>Architecture optimization</td>
          <td>FreeSolv</td>
          <td>643</td>
          <td>Solvation free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21 (full, 12 tasks)</td>
          <td>8,014</td>
          <td>Multi-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,193</td>
          <td>Single-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Solubility regression, also used for interpretability</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet. The ESOL dataset is from Delaney (2004).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Bayesian optimization via SigOpt (60 trials per architectural class, 4 classes, 6 manually seeded initial designs per class)</li>
<li>RMSprop optimizer with standard settings</li>
<li>Explanation mask trained with Adam, learning rate annealed from $10^{-2}$ to $10^{-6}$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Final architecture: Embedding(50) -&gt; Conv1D(192, kernel=3, stride=1) -&gt; BiGRU(224) -&gt; BiGRU(384)</li>
<li>Explanation network: 20-layer residual network with SELU activations</li>
<li>No pre-trained weights or code were released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC</td>
          <td>Tox21</td>
          <td>0.81</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>HIV</td>
          <td>0.80</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>FreeSolv</td>
          <td>1.2 kcal/mol</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>ESOL</td>
          <td>0.63</td>
          <td>Base model</td>
      </tr>
      <tr>
          <td>Top-3 accuracy</td>
          <td>ESOL interpretability</td>
          <td>88%</td>
          <td>Explanation mask</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The authors report using TensorFlow with GPU acceleration via NVIDIA cuDNN libraries. Specific GPU models and training times were not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No code, models, or data artifacts were released by the authors. The datasets used are publicly available through MoleculeNet.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Goh, G. B., Hodas, N. O., Siegel, C., &amp; Vishnu, A. (2017). SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. <em>arXiv preprint arXiv:1712.02034</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{goh2017smiles2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Goh, Garrett B. and Hodas, Nathan O. and Siegel, Charles and Vishnu, Abhinav}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.02034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1712.02034}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolPMoFiT: Inductive Transfer Learning for QSAR</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/</guid><description>MolPMoFiT adapts ULMFiT for QSAR by pre-training an LSTM language model on 1M ChEMBL SMILES and fine-tuning on small molecular property datasets.</description><content:encoded><![CDATA[<h2 id="transfer-learning-meets-molecular-property-prediction">Transfer Learning Meets Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces MolPMoFiT (Molecular Prediction Model Fine-Tuning), a transfer learning approach for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSPR/QSAR</a> modeling. The primary contribution is adapting the ULMFiT framework from NLP to molecular property prediction by treating <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES strings</a> as a chemical language. A general-purpose molecular structure prediction model (MSPM) is pre-trained on one million <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> molecules via self-supervised next-token prediction, then fine-tuned for specific QSAR endpoints. The approach achieves competitive or superior results to graph neural networks and descriptor-based methods across four benchmark datasets, with particular benefits for small datasets.</p>
<h2 id="the-small-data-problem-in-qsar-modeling">The Small Data Problem in QSAR Modeling</h2>
<p>Deep learning models for molecular property prediction typically require large labeled training sets to learn useful structural representations. While methods like graph convolutional neural networks and SMILES-based models have achieved strong results on well-studied endpoints, they must be trained from scratch for each new task. This is a problem in drug discovery, where small labeled datasets remain common for specialized endpoints like <a href="https://en.wikipedia.org/wiki/Allosteric_regulation">allosteric inhibition</a>, renal clearance, and inhibitor residence times.</p>
<p>Transfer learning had already shown transformative impact in computer vision (ImageNet pre-training) and NLP (ELMo, BERT, ULMFiT). In chemistry, prior transfer learning efforts included ChemNet (supervised pre-training on computed descriptors), <a href="/notes/chemistry/molecular-representations/encoders/mol2vec-unsupervised-chemical-intuition/">Mol2vec</a> (unsupervised substructure embeddings), and pre-trained graph neural networks. However, a systematic application of the ULMFiT self-supervised pre-training pipeline to SMILES-based molecular models had not been explored. MolPMoFiT fills this gap by treating the vast corpus of unlabeled molecular structures as a self-supervised training signal, analogous to how language models learn from unlabeled text.</p>
<h2 id="core-innovation-ulmfit-adapted-for-smiles">Core Innovation: ULMFiT Adapted for SMILES</h2>
<p>MolPMoFiT adapts ULMFiT&rsquo;s three-stage transfer learning pipeline to molecular property prediction:</p>
<p><strong>Stage 1: General-Domain MSPM Pre-training.</strong> A molecular structure prediction model is trained on one million curated ChEMBL molecules to predict the next token in a SMILES string. This is purely self-supervised: the SMILES string provides its own labels. The model learns general chemical syntax and structural patterns.</p>
<p><strong>Stage 2: Task-Specific MSPM Fine-tuning (Optional).</strong> The general MSPM is further fine-tuned on the unlabeled SMILES of the target task dataset. This adapts the language model to the specific chemical distribution of interest (e.g., HIV inhibitors vs. general bioactive molecules). Discriminative fine-tuning adjusts learning rates per layer:</p>
<p>$$\eta^{layer-1} = \eta^{layer} / 2.6$$</p>
<p>where higher layers (containing more task-specific features) receive higher learning rates.</p>
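<p>In code, this geometric decay gives each layer group its own rate (a sketch; 2.6 is the ULMFiT factor, and the toy top-layer rate matches the paper&rsquo;s classifier-head rate):</p>

```python
def discriminative_lrs(top_lr, n_groups, factor=2.6):
    """Learning rate per layer group, decaying by `factor` from the top down."""
    return [top_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(3e-2, 4)  # lowest group trains ~17x slower than the head
```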
<p><strong>Stage 3: QSAR/QSPR Model Fine-tuning.</strong> The embedding and encoder weights from the pre-trained MSPM are transferred to a new model with a task-specific classifier head. Fine-tuning uses three key techniques from ULMFiT:</p>
<ul>
<li><strong>Discriminative fine-tuning</strong>: Different learning rates per layer group</li>
<li><strong>Gradual unfreezing</strong>: Layers are unfrozen sequentially (classifier first, then progressively deeper LSTM layers)</li>
<li><strong>One cycle policy</strong>: Learning rate scheduling following Smith&rsquo;s approach</li>
</ul>
<p>The model architecture is AWD-LSTM (ASGD Weight-Dropped LSTM) with an embedding dimension of 400, three LSTM layers with 1152 hidden units, and dropouts applied at every layer (embedding, input, weights, hidden). The QSAR classifier concatenates max pooling, mean pooling, and the last hidden state $h_T$ from the final LSTM layer, feeding this into two feedforward layers.</p>
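<p>The classifier input described above (last hidden state plus max and mean pooling) can be sketched in plain Python:</p>

```python
def concat_pooling(hidden_states):
    """Concatenate [h_T ; max-pool ; mean-pool] over a sequence of hidden vectors."""
    dim = len(hidden_states[0])
    max_pool = [max(h[d] for h in hidden_states) for d in range(dim)]
    mean_pool = [sum(h[d] for h in hidden_states) / len(hidden_states) for d in range(dim)]
    return hidden_states[-1] + max_pool + mean_pool

# Three timesteps of a 2-dimensional hidden state
feats = concat_pooling([[1.0, -1.0], [3.0, 0.0], [2.0, 2.0]])
```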
<p><strong>SMILES Augmentation.</strong> Since multiple valid SMILES can represent the same molecule through different atom orderings, the authors use <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> as data augmentation. For regression tasks, Gaussian noise ($\sigma_{noise}$) is added to labels of augmented SMILES to simulate experimental error. Test-time augmentation (TTA) averages predictions across the canonical SMILES and four randomized SMILES.</p>
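<p>The regression-side augmentation amounts to replicating each label once per enumerated SMILES and perturbing the copies (a minimal sketch; the function name and seed handling are illustrative):</p>

```python
import random

def augment_labels(y, n_copies, sigma_noise, seed=0):
    """One label per enumerated SMILES; copies get Gaussian noise to mimic
    experimental error, while the canonical SMILES keeps the exact label."""
    rng = random.Random(seed)
    return [y] + [y + rng.gauss(0.0, sigma_noise) for _ in range(n_copies - 1)]

# FreeSolv-style settings: 50x augmentation with sigma_noise = 0.5
labels = augment_labels(-3.2, 50, 0.5)
```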
<h2 id="benchmarks-across-four-qsar-datasets">Benchmarks Across Four QSAR Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,200</td>
          <td>Regression (logD)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Solvation">solvation energy</a>)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>Classification (replication inhibition)</td>
          <td>AUROC</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>)</td>
          <td>AUROC</td>
      </tr>
  </tbody>
</table>
<p>All datasets use the same 10 random 80:10:10 splits from <a href="/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/">Yang et al. (2019)</a> for fair comparison. Both random and scaffold splits were evaluated, with scaffold splits representing a more realistic test of generalization to novel chemical scaffolds.</p>
<h3 id="baselines">Baselines</h3>
<p>Models were compared against results reported by Yang et al. (2019): directed message passing neural network (D-MPNN), D-MPNN with RDKit features, random forest on Morgan fingerprints, feed-forward networks on Morgan fingerprints, and feed-forward networks on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> descriptors.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>The same set of fine-tuning hyperparameters was used across all four tasks (tuned on the HIV dataset):</p>
<table>
  <thead>
      <tr>
          <th>Layer Group</th>
          <th>Base Learning Rate</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Linear head only</td>
          <td>3e-2</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final LSTM layer</td>
          <td>5e-3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final two LSTM layers</td>
          <td>5e-4</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Full model</td>
          <td>5e-5</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
<p>Data augmentation settings were task-specific: lipophilicity training SMILES augmented 25x ($\sigma_{noise} = 0.3$); FreeSolv augmented 50x ($\sigma_{noise} = 0.5$); HIV active class augmented 60x and inactive 2x; BBBP positive class 10x and negative 30x.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="benchmark-results">Benchmark Results</h3>
<p><strong>Lipophilicity (random split):</strong> MolPMoFiT achieved RMSE of $0.565 \pm 0.037$ with TTA and $0.625 \pm 0.032$ without, outperforming D-MPNN and other baselines.</p>
<p><strong>FreeSolv (random split):</strong> RMSE of $1.197 \pm 0.127$ with TTA. The small dataset size (642 compounds) led to high variance across splits.</p>
<p><strong>BBBP (random split):</strong> AUROC of $0.950 \pm 0.020$, outperforming all comparison models. Task-specific MSPM fine-tuning showed no clear benefit over the general MSPM.</p>
<p><strong>HIV (random split):</strong> General MolPMoFiT achieved AUROC of $0.828 \pm 0.029$ with TTA. Task-specific fine-tuning yielded a slightly higher $0.834 \pm 0.025$ with TTA.</p>
<p>Scaffold splits consistently produced lower performance than random splits across all datasets, as expected for out-of-distribution generalization.</p>
<h3 id="transfer-learning-impact">Transfer Learning Impact</h3>
<p>Across all four datasets and varying training set sizes, MolPMoFiT consistently outperformed models trained from scratch with the same architecture. The improvement was most pronounced at smaller training set sizes, confirming the utility of pre-trained representations for low-data regimes.</p>
<h3 id="smiles-augmentation-analysis">SMILES Augmentation Analysis</h3>
<p>Training data augmentation provided significant improvements across all tasks. For classification (HIV, BBBP), augmentation improved performance regardless of whether class re-balancing was applied. For regression (lipophilicity, FreeSolv), both SMILES augmentation and label noise were beneficial, with optimal noise levels varying by dataset.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note a fundamental limitation: the model learns mappings from individual SMILES strings to properties rather than from molecular structures to properties. SMILES augmentation acts as a regularization technique to mitigate this, making the model more robust to different SMILES representations of the same molecule. The task-specific MSPM fine-tuning stage did not consistently improve results, requiring further investigation. All hyperparameters were tuned on one dataset (HIV) and applied uniformly, which may not be optimal for all endpoints.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (curated)</td>
          <td>1M molecules</td>
          <td>Filtered: no mixtures, max 50 heavy atoms, standardized with MolVS, canonized with RDKit</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>MoleculeNet benchmark</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>AWD-LSTM architecture with embedding dim 400, three LSTM layers (1152 hidden units), dropouts at all layers</li>
<li>ULMFiT fine-tuning: discriminative learning rates ($\eta^{layer-1} = \eta^{layer}/2.6$), gradual unfreezing, one cycle policy</li>
<li>SMILES character-level tokenization with special handling for two-character tokens (Cl, Br) and bracket-enclosed tokens</li>
<li>SMILES enumeration for data augmentation with optional Gaussian label noise for regression</li>
</ul>
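<p>The tokenization scheme above can be reproduced with a single regular expression (a sketch consistent with the description, not the authors&rsquo; exact code):</p>

```python
import re

# Bracket atoms ([nH], [O-], ...) and the two-character atoms Cl/Br are kept
# as single tokens; everything else is split per character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|.")

def tokenize(smiles):
    return SMILES_TOKEN.findall(smiles)

print(tokenize("c1ccc(Cl)cc1"))  # ['c', '1', 'c', 'c', 'c', '(', 'Cl', ')', 'c', 'c', '1']
```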
<h3 id="models">Models</h3>
<ul>
<li>General-domain MSPM pre-trained on 1M ChEMBL molecules (10 epochs)</li>
<li>Task-specific MSPMs fine-tuned per dataset (optional stage)</li>
<li>QSAR models fine-tuned with transferred embeddings and encoder</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Split</th>
          <th>Metric</th>
          <th>MolPMoFiT (TTA)</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lipophilicity</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$0.565 \pm 0.037$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$0.635 \pm 0.031$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$1.197 \pm 0.127$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$2.082 \pm 0.460$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.950 \pm 0.020$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.931 \pm 0.025$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.828 \pm 0.029$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.816 \pm 0.022$</td>
          <td>D-MPNN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P4000 GPU (single GPU)</li>
<li>General-domain MSPM pre-training: approximately 1 day</li>
<li>Pre-training needs to be done only once; fine-tuning is fast per task</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>PyTorch + fastai v1 implementation with curated datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2020). Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. <em>Journal of Cheminformatics</em>, 12, 27. <a href="https://doi.org/10.1186/s13321-020-00430-x">https://doi.org/10.1186/s13321-020-00430-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2020molpmofit,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00430-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/chemistry/molecular-design/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (<a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Cutting the network in half (from ~60M to ~37M parameters) allows processing of longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
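<p>Steps 1&ndash;3 are ordinary text preprocessing; a sketch of the numerical-token replacement and <code>[CLS]</code> prepending (the regexes are illustrative, and the paper&rsquo;s exact rules may differ):</p>

```python
import re

def compress_description(text):
    """Replace bond lengths/angles with placeholders and prepend [CLS]."""
    text = re.sub(r"\d+\.\d+\s*(?:A|Å)", "[NUM]", text)             # bond distances
    text = re.sub(r"\d+(?:\.\d+)?\s*(?:degrees|°)", "[ANG]", text)  # bond angles
    return "[CLS] " + text

compact = compress_description(
    "Si-O bond lengths are 1.61 A and O-Si-O angles are 109.5 degrees."
)
print(compact)  # [CLS] Si-O bond lengths are [NUM] and O-Si-O angles are [ANG].
```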
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
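<p>These schemes are one-liners; the only subtlety, as with any label scaling, is that the statistics ($\mu$, $\sigma$, $Y_{\min}$, $Y_{\max}$) should be computed from the training labels only:</p>

```python
import math

def z_score(y, mu, sigma):
    return [(v - mu) / sigma for v in y]

def min_max(y, y_min, y_max):
    return [(v - y_min) / (y_max - y_min) for v in y]

def log_norm(y):
    return [math.log(v + 1.0) for v in y]

band_gaps = [0.0, 1.0, 3.0]  # toy labels in eV
scaled = min_max(band_gaps, min(band_gaps), max(band_gaps))
```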
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (A^3/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (A^3/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline on each task by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction (vs. ALIGNN), roughly 65% on volume prediction (vs. DeeperGATGNN), and about 3% on band gap classification (Is-gap-direct, vs. CGCNN). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained the advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
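<p>Label scaling here simply standardizes the regression targets before training and inverts the transform at prediction time. A minimal sketch (z-score scaling is one common choice; the exact per-task transform is not restated here, so treat the details as an assumption):</p>

```python
import statistics

def scale_labels(labels):
    """Z-score scaling of regression targets (one common choice;
    the exact scaling used per task is an assumption here)."""
    mean = statistics.fmean(labels)
    std = statistics.stdev(labels)
    return [(y - mean) / std for y in labels], mean, std

def unscale(pred, mean, std):
    # Map a model prediction back to the original label space.
    return pred * std + mean
```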
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/chemistry/molecular-design/generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Repetitions</strong>: each model was run 5 times on the test set; the averaged MAE is reported</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
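<p>The [NUM] and [ANG] strategies replace literal numbers in the crystal descriptions with placeholder tokens before tokenization. A rough sketch of such preprocessing (the regex patterns are illustrative assumptions, not the authors' exact rules):</p>

```python
import re

def preprocess_description(text, mode="num_ang"):
    """Sketch of the numeric preprocessing strategies (patterns are
    assumptions). In "num_ang" mode, numbers followed by 'degrees'
    become [ANG] and all remaining numbers become [NUM]."""
    if mode == "num_ang":
        text = re.sub(r"\d+(?:\.\d+)?(?=\s*degrees)", "[ANG]", text)
        text = re.sub(r"\d+(?:\.\d+)?", "[NUM]", text)
    elif mode == "no_numbers":
        text = re.sub(r"\d+(?:\.\d+)?", "", text)
    return text
```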
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Perplexity for Molecule Ranking and CLM Bias Detection</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/perplexity-molecule-ranking-bias-clms/</guid><description>Perplexity scoring enables intrinsic molecule ranking and pretraining bias detection in chemical language models for de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-method-for-intrinsic-scoring-and-bias-detection-in-chemical-language-models">A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces two contributions to the chemical language model (CLM) pipeline for <a href="/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/">de novo molecular design</a>. First, the authors propose using perplexity as a model-intrinsic score to rank generated <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a &ldquo;delta score&rdquo; that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.</p>
<h2 id="the-ranking-and-bias-problem-in-clm-based-molecule-generation">The Ranking and Bias Problem in CLM-Based Molecule Generation</h2>
<p>Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) <a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">transfer learning</a> (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce &ldquo;pretraining bias,&rdquo; where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.</p>
<p>Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.</p>
<h2 id="perplexity-scoring-and-the-delta-score-for-bias-estimation">Perplexity Scoring and the Delta Score for Bias Estimation</h2>
<p>The core innovation is the application of <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:</p>
<p>$$
\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_{2}(p_{i})}
$$</p>
<p>Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.</p>
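<p>As a concrete illustration, perplexity can be computed directly from the per-character probabilities (a minimal sketch using base-2 logarithms, the standard convention for perplexity):</p>

```python
import math

def perplexity(char_probs):
    """Perplexity of a SMILES string from the per-character
    probabilities assigned by the CLM. Lower means the string
    better matches the model's learned distribution."""
    n = len(char_probs)
    avg_log2 = sum(math.log2(p) for p in char_probs) / n
    return 2 ** (-avg_log2)
```

For example, a string whose characters each receive probability 0.5 has perplexity exactly 2, regardless of its length.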
<p>To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):</p>
<p>$$
\text{delta} = \text{rank}_{ft} - \text{rank}_{pt}
$$</p>
<p>A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.</p>
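<p>A sketch of the delta computation. The rank convention used here (the best-scoring, lowest-perplexity molecule receives the largest rank value, so positive deltas flag molecules favored by the fine-tuned model) is an assumption chosen to match the sign interpretation above:</p>

```python
def ranks_best_high(perplexities):
    """Assign ranks so the lowest-perplexity molecule gets the
    highest rank value (a convention assumed for illustration)."""
    order = sorted(range(len(perplexities)),
                   key=lambda i: perplexities[i], reverse=True)
    ranks = [0] * len(perplexities)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def delta_scores(ppl_finetuned, ppl_pretrained):
    # delta = rank_ft - rank_pt; positive -> fine-tuned model
    # favors the molecule relative to the pretrained model.
    rft = ranks_best_high(ppl_finetuned)
    rpt = ranks_best_high(ppl_pretrained)
    return [f - p for f, p in zip(rft, rpt)]
```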
<p>The multinomial sampling probability for each character is computed via the softmax function:</p>
<p>$$
p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}}
$$</p>
<p>where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).</p>
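<p>This sampling step can be sketched in a few lines (illustrative only; the study's implementation uses Keras):</p>

```python
import math
import random

def sample_char(logits, temperature=1.0):
    """Draw the next SMILES character index by temperature softmax
    over the CLM's output logits (T = 1 in the study)."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # random.choices draws index i with probability probs[i]
    idx = random.choices(range(len(logits)), weights=probs, k=1)[0]
    return idx, probs
```

Lower temperatures sharpen the distribution toward the highest-logit character; higher temperatures flatten it toward uniform sampling.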
<h2 id="experimental-setup-10-protein-targets-across-four-data-regimes">Experimental Setup: 10 Protein Targets Across Four Data Regimes</h2>
<p>The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).</p>
<p><strong>Model architecture</strong>: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.</p>
<p><strong>Pretraining</strong>: The model was pretrained on 1,683,181 molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.</p>
<p><strong>Fine-tuning</strong>: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL &gt; 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).</p>
<table>
  <thead>
      <tr>
          <th>CHEMBL ID</th>
          <th>Target</th>
          <th>Protein Classification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CHEMBL1836</td>
          <td>Prostanoid EP4 receptor</td>
          <td><a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">G protein-coupled receptor</a></td>
      </tr>
      <tr>
          <td>CHEMBL1945</td>
          <td>Melatonin receptor 1A</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL1983</td>
          <td>Serotonin 1D (5-HT1D) receptor</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL202</td>
          <td><a href="https://en.wikipedia.org/wiki/Dihydrofolate_reductase">Dihydrofolate reductase</a></td>
          <td>Oxidoreductase</td>
      </tr>
      <tr>
          <td>CHEMBL3522</td>
          <td><a href="https://en.wikipedia.org/wiki/Cytochrome_P450">Cytochrome P450</a> 17A1</td>
          <td>Cytochrome P450</td>
      </tr>
      <tr>
          <td>CHEMBL4029</td>
          <td>Interleukin-8 receptor A</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL5073</td>
          <td>CaM kinase I delta</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5137</td>
          <td>Metabotropic glutamate receptor 2</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL5408</td>
          <td>Serine/threonine-protein kinase TBK1</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5608</td>
          <td>NT-3 growth factor receptor</td>
          <td>Kinase</td>
      </tr>
  </tbody>
</table>
<p><strong>Sampling comparison</strong>: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.</p>
<p><strong>Molecular similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> was computed using Morgan fingerprints (radius 2, length 1024) and 2D <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprints via RDKit (2019.03.2).</p>
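<p>The Tanimoto coefficient itself reduces to intersection over union of fingerprint on-bits. A dependency-free sketch (the study used RDKit Morgan and pharmacophore fingerprints; representing a fingerprint as a set of on-bit indices is an assumption for illustration):</p>

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints
    given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    inter = len(bits_a & bits_b)
    union = len(bits_a | bits_b)
    return inter / union
```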
<h2 id="key-findings-multinomial-sampling-outperforms-beam-search">Key Findings: Multinomial Sampling Outperforms Beam Search</h2>
<p><strong>Perplexity correlates with molecular similarity.</strong> The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.</p>
<p><strong>Multinomial sampling produces better-ranked molecules than beam search.</strong> With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.</p>
<p><strong>Perplexity scoring narrows the quality distribution.</strong> The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.</p>
<p><strong>Pretraining bias is substantial.</strong> The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect &ldquo;generic&rdquo; pretraining rather than task-focused fine-tuning.</p>
<p><strong>Perplexity alone partially mitigates bias.</strong> Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.</p>
<p><strong>SMILES validity remained high.</strong> Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">SMILES augmentation</a> remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v28</td>
          <td>1,683,181 molecules</td>
          <td>Canonical SMILES, 20-90 characters, salts and duplicates removed</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>ChEMBL v28 (split)</td>
          <td>84,160 molecules</td>
          <td>Random split from pretraining set</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ChEMBL v28 (per target)</td>
          <td>5, 10, 20, or 40 molecules</td>
          <td>pChEMBL &gt; 6, 10 targets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>LSTM-based CLM with character-level SMILES prediction</li>
<li>Multinomial sampling at $T = 1$</li>
<li>Beam search at $k = 10$ and $k = 50$</li>
<li>Perplexity computed per Equation 1; delta score per Equation 2</li>
<li>Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization</li>
<li>5,820,515 parameters total</li>
<li>One-hot encoded SMILES input</li>
<li>Pretrained weights available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity</td>
          <td>Model confidence in generated SMILES</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Delta score</td>
          <td>Rank difference between fine-tuned and pretrained models</td>
          <td>Positive indicates task-relevant generation</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Morgan and pharmacophore fingerprints</td>
          <td>Compared to fine-tuning set</td>
      </tr>
      <tr>
          <td>Pearson correlation</td>
          <td>Perplexity vs. Tanimoto distance</td>
          <td>Stabilizes at ~0.5</td>
      </tr>
      <tr>
          <td>SMILES validity</td>
          <td>Fraction of valid SMILES strings</td>
          <td>Consistently &gt; 90%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ETHmodlab/CLM_perplexity">CLM_perplexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework, pretrained weights, and training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ETHmodlab/molecular_design_with_beam_search">Beam search implementation</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Referenced beam search implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moret, M., Grisoni, F., Katzberger, P., &amp; Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. <em>Journal of Chemical Information and Modeling</em>, 62(5), 1199-1206. <a href="https://doi.org/10.1021/acs.jcim.2c00079">https://doi.org/10.1021/acs.jcim.2c00079</a></p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moret2022perplexity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1199--1206}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c00079}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoleculeNet: Benchmarking Molecular Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/</guid><description>MoleculeNet curates 17 datasets across quantum mechanics, physical chemistry, biophysics, and physiology with standardized splits and metrics for molecular ML.</description><content:encoded><![CDATA[<h2 id="a-resource-paper-for-molecular-machine-learning-benchmarking">A Resource Paper for Molecular Machine Learning Benchmarking</h2>
<p>This is a <strong>Resource</strong> paper. MoleculeNet provides a standardized benchmark suite for evaluating molecular machine learning methods. Its primary contribution is the curation of 17 public datasets spanning four categories of molecular properties, together with standardized evaluation metrics, multiple dataset splitting strategies, and open-source implementations of featurization and learning algorithms via the DeepChem library.</p>
<h2 id="why-molecular-ml-needed-a-unified-benchmark">Why Molecular ML Needed a Unified Benchmark</h2>
<p>Prior to MoleculeNet, algorithmic progress in molecular machine learning was difficult to measure. Individual papers benchmarked proposed methods on different datasets with different metrics, making cross-method comparison unreliable. Several factors make molecular ML particularly challenging:</p>
<ol>
<li><strong>Data scarcity</strong>: Molecular datasets are much smaller than those available for computer vision or NLP, since obtaining accurate chemical property measurements requires specialized instruments and expert supervision.</li>
<li><strong>Heterogeneous outputs</strong>: Properties of interest range from quantum mechanical characteristics to macroscopic physiological effects on the human body.</li>
<li><strong>Variable input structures</strong>: Molecules have arbitrary size, variable connectivity, and many possible 3D conformers, all of which must be encoded into fixed-length representations for conventional ML algorithms.</li>
<li><strong>No standard evaluation protocol</strong>: Without prescribed metrics, splits, or data subsets, two methods using the same underlying database (e.g., PubChem) could be entirely incomparable.</li>
</ol>
<p>Existing databases like PubChem, ChEMBL, and the Quantum Machine collections provided raw data but did not define evaluation protocols suitable for machine learning development. MoleculeNet bridges this gap, following the precedent set by ImageNet in computer vision and WordNet in NLP.</p>
<h2 id="core-design-datasets-splits-metrics-and-featurizations">Core Design: Datasets, Splits, Metrics, and Featurizations</h2>
<p>MoleculeNet is organized around four components: curated datasets, splitting methods, evaluation metrics, and molecular featurizations.</p>
<h3 id="datasets-across-four-property-categories">Datasets Across Four Property Categories</h3>
<p>The benchmark includes 17 datasets covering over 700,000 compounds and more than 800 tasks. These are organized into four categories reflecting different levels of molecular properties:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Compounds</th>
          <th>Task Type</th>
          <th>Rec. Split</th>
          <th>Rec. Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quantum Mechanics</td>
          <td>QM7</td>
          <td>1</td>
          <td>7,165</td>
          <td>Regression</td>
          <td>Stratified</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM7b</td>
          <td>14</td>
          <td>7,211</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM8</td>
          <td>12</td>
          <td>21,786</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM9</td>
          <td>12</td>
          <td>133,885</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>1,128</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>643</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>Lipophilicity</td>
          <td>1</td>
          <td>4,200</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA</td>
          <td>128</td>
          <td>439,863</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>MUV</td>
          <td>17</td>
          <td>93,127</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>HIV</td>
          <td>1</td>
          <td>41,913</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>PDBbind</td>
          <td>1</td>
          <td>11,908</td>
          <td>Regression</td>
          <td>Time</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>BACE</td>
          <td>1</td>
          <td>1,522</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>2,053</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>Tox21</td>
          <td>12</td>
          <td>8,014</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ToxCast</td>
          <td>617</td>
          <td>8,615</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>SIDER</td>
          <td>27</td>
          <td>1,427</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ClinTox</td>
          <td>2</td>
          <td>1,491</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p><strong>Quantum mechanics</strong> datasets (QM7, QM7b, QM8, QM9) contain DFT-computed electronic properties for subsets of the <a href="/notes/chemistry/datasets/gdb-17/">GDB</a> database. <strong>Physical chemistry</strong> datasets cover solubility (ESOL), hydration free energy (FreeSolv), and lipophilicity. <strong>Biophysics</strong> datasets include high-throughput screening results (PCBA, MUV), HIV inhibition activity, protein-ligand binding affinity (PDBbind), and BACE-1 inhibition. <strong>Physiology</strong> datasets cover blood-brain barrier penetration (BBBP), toxicity (Tox21, ToxCast), side effects (SIDER), and clinical trial toxicity (ClinTox).</p>
<h3 id="data-splitting-strategies">Data Splitting Strategies</h3>
<p>MoleculeNet implements four splitting methods, all using an 80/10/10 train/validation/test ratio:</p>
<ul>
<li><strong>Random splitting</strong>: Standard random assignment to subsets.</li>
<li><strong>Scaffold splitting</strong>: Separates molecules by their 2D structural frameworks (Bemis-Murcko scaffolds), providing a harder generalization test since structurally different molecules appear in different subsets.</li>
<li><strong>Stratified splitting</strong>: Ensures each subset contains the full range of label values (used for QM7).</li>
<li><strong>Time splitting</strong>: Trains on older data and tests on newer data to mimic real-world development (used for PDBbind).</li>
</ul>
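<p>As a sketch, the scaffold split can be approximated by grouping molecules under a precomputed scaffold key and assigning whole groups to subsets until the 80/10/10 budget fills. The <code>scaffold_of</code> mapping below is a stand-in for the real Bemis-Murcko scaffold computation (done with RDKit in DeepChem's implementation).</p>

```python
from collections import defaultdict

def scaffold_split(mols, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test, largest first.

    mols: molecule identifiers; scaffold_of: molecule -> scaffold key
    (a stand-in for a Bemis-Murcko scaffold computed with RDKit).
    """
    groups = defaultdict(list)
    for m in mols:
        groups[scaffold_of[m]].append(m)
    # Larger scaffold groups are placed into train first.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(valid) + len(g) <= frac_valid * n:
            valid += g
        else:
            test += g
    return train, valid, test
```

<p>Because every scaffold group moves as a unit, no 2D framework appears in more than one subset, which is precisely what makes the split a harder generalization test than random assignment.</p>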
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>Regression tasks use MAE or RMSE depending on the dataset. Classification tasks use either ROC-AUC or PRC-AUC. The choice between ROC-AUC and PRC-AUC depends on class imbalance: PRC-AUC is recommended for datasets with positive rates below 2% (PCBA, MUV), since precision-recall curves better capture performance under extreme imbalance.</p>
<p>The false positive rate and precision are defined as:</p>
<p>$$
\text{FPR} = \frac{\text{false positive}}{\text{false positive} + \text{true negative}}
$$</p>
<p>$$
\text{precision} = \frac{\text{true positive}}{\text{false positive} + \text{true positive}}
$$</p>
<p>When positive samples form only a small fraction of the data, each false positive depresses precision far more than it raises the FPR, which makes PRC-AUC more informative than ROC-AUC.</p>
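<p>The effect is easy to check numerically. The hypothetical confusion matrices below keep the classifier's per-class error rates fixed (80% recall, 10% FPR) and only change the positive rate; FPR stays at 0.1 while precision collapses.</p>

```python
def fpr(tp, fp, tn, fn):
    """False positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def precision(tp, fp, tn, fn):
    """Precision (PPV): TP / (TP + FP)."""
    return tp / (tp + fp)

# Balanced data: 500 positives, 500 negatives.
balanced = dict(tp=400, fn=100, fp=50, tn=450)
# Imbalanced data: 10 positives, 990 negatives, same per-class rates.
imbalanced = dict(tp=8, fn=2, fp=99, tn=891)

print(fpr(**balanced), precision(**balanced))      # 0.1, ~0.889
print(fpr(**imbalanced), precision(**imbalanced))  # 0.1, ~0.075
```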
<h3 id="featurization-methods">Featurization Methods</h3>
<p>MoleculeNet implements six molecular featurization approaches:</p>
<ol>
<li><strong>ECFP (Extended-Connectivity Fingerprints)</strong>: Fixed-length binary fingerprints capturing topological substructures via hashing.</li>
<li><strong><a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb Matrix</a></strong>: Encodes nuclear charges and 3D coordinates through atomic self-energies and Coulomb repulsion:</li>
</ol>
<p>$$
M_{IJ} = \begin{cases} 0.5 Z_{I}^{2.4} &amp; \text{for } I = J \\ \frac{Z_{I} Z_{J}}{|\mathbf{R}_{I} - \mathbf{R}_{J}|} &amp; \text{for } I \neq J \end{cases}
$$</p>
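<p>The definition transcribes directly into code. The charges and coordinates below are a hypothetical H<sub>2</sub> example with distances already in atomic units, so no unit conversion appears.</p>

```python
import math

def coulomb_matrix(Z, R):
    """Z: nuclear charges; R: (x, y, z) coordinates in atomic units."""
    n = len(Z)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * Z[i] ** 2.4  # atomic self-energy
            else:
                # Coulomb repulsion between nuclei I and J
                M[i][j] = Z[i] * Z[j] / math.dist(R[i], R[j])
    return M

# Hypothetical H2 at a bond length of 1.4 bohr.
M = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

<p>In practice the matrix is padded to a fixed size and made permutation-invariant (e.g., by sorting rows by norm or using eigenvalues) before being fed to a fixed-input model.</p>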
<ol start="3">
<li><strong>Grid Featurizer</strong>: Designed for PDBbind, incorporating both ligand and protein structural information including salt bridges, hydrogen bonds, and SPLIF fingerprints.</li>
<li><strong>Symmetry Functions</strong>: Preserve rotational and permutation symmetry through radial and angular functions between atom pairs and triplets.</li>
<li><strong>Graph Convolutions</strong>: Compute initial atom feature vectors and neighbor lists from molecular graphs.</li>
<li><strong>Weave</strong>: Similar to graph convolutions but also computes pairwise atom features encoding bond properties, graph distance, and ring information.</li>
</ol>
<h2 id="benchmarked-models-and-experimental-setup">Benchmarked Models and Experimental Setup</h2>
<p>MoleculeNet benchmarks 12 learning algorithms divided into conventional methods and graph-based methods.</p>
<h3 id="conventional-methods">Conventional Methods</h3>
<ul>
<li><strong>Logistic Regression</strong> (classification only)</li>
<li><strong>Kernel SVM</strong> with radial basis function kernel</li>
<li><strong>Kernel Ridge Regression (KRR)</strong></li>
<li><strong>Random Forests</strong></li>
<li><strong>Gradient Boosting</strong> (XGBoost)</li>
<li><strong>Singletask/Multitask Networks</strong>: Fully connected networks with shared layers across tasks</li>
<li><strong>Bypass Networks</strong>: Multitask networks augmented with per-task &ldquo;bypass&rdquo; layers that directly connect inputs to outputs</li>
<li><strong>Influence Relevance Voting (IRV)</strong>: Refined K-nearest neighbor classifiers using Jaccard-Tanimoto similarity:</li>
</ul>
<p>$$
S(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$</p>
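<p>On binary fingerprints the similarity reduces to set operations over the indices of the on-bits; a minimal sketch with two hypothetical fingerprints:</p>

```python
def tanimoto(a, b):
    """Jaccard-Tanimoto similarity of two binary fingerprints,
    each given as the set of its on-bit indices."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Hypothetical fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({0, 3, 5}, {3, 5, 7}))  # 2 / 4 = 0.5
```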
<h3 id="graph-based-methods">Graph-Based Methods</h3>
<ul>
<li><strong>Graph Convolutional Models (GC)</strong>: Extend circular fingerprints with learnable convolutions over molecular graphs.</li>
<li><strong>Weave Models</strong>: Update atom features using information from all other atoms and their pairwise features.</li>
<li><strong>Directed Acyclic Graph (DAG) Models</strong>: Define directed bonds toward a central atom and propagate features through the directed graph.</li>
<li><strong>Deep Tensor Neural Networks (DTNN)</strong>: Use nuclear charges and distance matrices directly, updating atom embeddings based on pairwise physical distances.</li>
<li><strong>ANI-1</strong>: Learns transferable potentials using symmetry function features with atom-type-specific neural networks.</li>
<li><strong>Message Passing Neural Networks (MPNN)</strong>: Generalized framework with edge-dependent message functions and set2set readout.</li>
</ul>
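<p>What these models share is a message-passing core: each atom aggregates information from its neighbors and updates its feature vector. A framework-free sketch of one round with plain sum aggregation (real models insert learned message and update functions at the commented step):</p>

```python
def message_passing_step(features, adjacency):
    """One round of sum-aggregation message passing.

    features: atom -> list of floats (current features).
    adjacency: atom -> list of neighboring atoms.
    """
    updated = {}
    for atom, feat in features.items():
        msg = [0.0] * len(feat)
        for nbr in adjacency[atom]:
            # Real MPNNs would pass features[nbr] through a learned,
            # possibly edge-conditioned, message function first.
            for k, v in enumerate(features[nbr]):
                msg[k] += v
        updated[atom] = [f + m for f, m in zip(feat, msg)]
    return updated

# Hypothetical 3-atom chain 0-1-2 with scalar features.
feats = {0: [1.0], 1: [2.0], 2: [3.0]}
adj = {0: [1], 1: [0, 2], 2: [1]}
print(message_passing_step(feats, adj))  # {0: [3.0], 1: [6.0], 2: [5.0]}
```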
<h3 id="experimental-protocol">Experimental Protocol</h3>
<p>Gaussian process hyperparameter optimization was applied to each dataset-model combination, followed by three independent runs with different random seeds. All results are reported as means with standard deviations. Variable training-size experiments were conducted on Tox21, FreeSolv, and QM7 to study data efficiency.</p>
<h2 id="key-findings-across-property-categories">Key Findings Across Property Categories</h2>
<h3 id="biophysics-and-physiology">Biophysics and Physiology</h3>
<p>Graph convolutional and weave models showed strong performance on larger datasets with less overfitting than conventional methods; on Tox21, graph-based models reached the performance of multitask networks with far less training data (30% versus 90%). However, for smaller single-task datasets (under 3,000 samples), kernel SVM and ensemble tree methods were more robust. On highly imbalanced datasets like MUV (0.20% positive rate), graph-based models struggled to control false positives.</p>
<p>Multitask training had a regularizing effect, reducing the gap between train and test scores compared to single-task models. Bypass networks consistently matched or exceeded vanilla multitask networks, confirming that per-task layers add explanatory power.</p>
<h3 id="physical-chemistry">Physical Chemistry</h3>
<p>Graph-based methods (GC, DAG, MPNN, Weave) provided significant improvements over single-task networks for predicting solubility, solvation energy, and lipophilicity. The best models achieved accuracy comparable to ab initio predictions (within 0.5 RMSE for ESOL, within 1.5 kcal/mol for FreeSolv). On FreeSolv, a weave model trained on approximately 200 samples matched the accuracy of alchemical free energy calculations.</p>
<h3 id="quantum-mechanics">Quantum Mechanics</h3>
<p>Models incorporating 3D distance information (DTNN, MPNN, KRR with Coulomb matrix) substantially outperformed models using only topological features. Between them, DTNN and MPNN accounted for the best-performing model on 28 of 39 tasks across the QM datasets. The choice of physics-aware featurization proved more important than the choice of learning algorithm for these tasks.</p>
<h3 id="summary-of-best-performances">Summary of Best Performances</h3>
<p>Graph-based models outperformed conventional methods on 11 of 17 datasets. Key results on the test set:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>Best Conventional</th>
          <th>Best Graph-Based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM7</td>
          <td>MAE</td>
          <td>KRR (CM): 10.22</td>
          <td>DTNN: 8.75</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>MAE</td>
          <td>Multitask (CM): 4.35</td>
          <td>DTNN: 2.35</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>XGBoost: 0.99</td>
          <td>MPNN: 0.58</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>XGBoost: 1.74</td>
          <td>MPNN: 1.15</td>
      </tr>
      <tr>
          <td>PCBA</td>
          <td>PRC-AUC</td>
          <td>Logreg: 0.129</td>
          <td>GC: 0.136</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.822</td>
          <td>GC: 0.829</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.792</td>
          <td>GC: 0.763</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>ROC-AUC</td>
          <td>RF: 0.867</td>
          <td>Weave: 0.806</td>
      </tr>
  </tbody>
</table>
<p>Conventional methods (KernelSVM, RF) still won on several smaller or scaffold-split datasets (HIV, BACE, MUV, PDBbind, BBBP, SIDER), highlighting that graph-based models are not universally superior, particularly under data scarcity or challenging splits.</p>
<h2 id="conclusions-and-limitations">Conclusions and Limitations</h2>
<p>MoleculeNet demonstrated that learnable representations broadly offer the best performance for molecular machine learning. However, the authors identify several important caveats:</p>
<ol>
<li><strong>Data scarcity</strong>: Graph-based methods are not robust enough on complex tasks with limited training data.</li>
<li><strong>Class imbalance</strong>: On heavily imbalanced classification datasets, conventional methods such as kernel SVM outperform learnable featurizations with respect to recall of positives.</li>
<li><strong>Task-specific featurizations</strong>: For quantum mechanical and biophysical datasets, incorporating physics-aware features (<a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrix</a>, 3D coordinates) is more important than the choice of learning algorithm.</li>
<li><strong>Data-driven physical chemistry</strong>: On FreeSolv, data-driven methods outperformed ab initio calculations with moderate data, suggesting data-driven approaches will become increasingly important as methods and datasets mature.</li>
</ol>
<p>The authors express hope that MoleculeNet will stimulate algorithmic development similar to how ImageNet catalyzed breakthroughs in computer vision. Future directions include extending coverage to 3D protein structure prediction, DNA topological modeling, and other areas of molecular science.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All 17 datasets are publicly available and integrated into the DeepChem Python package. Users can load any dataset with a single library call.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM benchmark</td>
          <td>QM7/QM7b/QM8/QM9</td>
          <td>7K-134K compounds</td>
          <td>DFT-computed properties from GDB subsets</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL/FreeSolv/Lipophilicity</td>
          <td>643-4,200 compounds</td>
          <td>Experimental measurements</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA/MUV/HIV/PDBbind/BACE</td>
          <td>1.5K-440K compounds</td>
          <td>Bioassay and binding data</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP/Tox21/ToxCast/SIDER/ClinTox</td>
          <td>1.4K-8.6K compounds</td>
          <td>Toxicity and drug safety data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All splitting methods (random, scaffold, stratified, time) and featurizations (ECFP, Coulomb matrix, grid, symmetry functions, graph convolutions, weave) are implemented in DeepChem. Hyperparameters were tuned via Gaussian process optimization. Three random seeds were used per experiment.</p>
<h3 id="models">Models</h3>
<p>All 12 models are implemented in DeepChem, built on Scikit-Learn and TensorFlow. No pretrained weights are provided; models are trained from scratch on each dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics include MAE, RMSE, ROC-AUC, and PRC-AUC as specified per dataset. Multi-task datasets report mean metric values across all tasks.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors used Stanford&rsquo;s Sherlock and Xstream GPU nodes. Specific GPU types and training times per model are provided in Table S1 of the supplementary material.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source library with all datasets, featurizations, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., &amp; Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. <em>Chemical Science</em>, 9(2), 513-530. <a href="https://doi.org/10.1039/c7sc02664a">https://doi.org/10.1039/c7sc02664a</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2018moleculenet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MoleculeNet: a benchmark for molecular machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N. and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S. and Leswing, Karl and Pande, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{513--530}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c7sc02664a}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking Molecular Property Prediction at Scale</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/systematic-study-molecular-property-prediction/</guid><description>A study training 62,820 models finds fixed molecular representations often outperform learned representations for property prediction.</description><content:encoded><![CDATA[<h2 id="a-large-scale-empirical-study-of-molecular-property-prediction">A Large-Scale Empirical Study of Molecular Property Prediction</h2>
<p>This is an <strong>Empirical</strong> paper that systematically benchmarks molecular property prediction across multiple dimensions: molecular representations, model architectures, evaluation metrics, data splitting strategies, and chemical space generalization. The primary contribution is a rigorous, large-scale comparison (62,820 trained models) showing that traditional machine learning models on fixed molecular representations frequently outperform recent deep representation learning approaches, and that several overlooked evaluation factors (statistical testing, metric choice, activity cliffs, dataset size) significantly influence conclusions about model performance.</p>
<h2 id="motivation-overlooked-evaluation-pitfalls-in-molecular-property-prediction">Motivation: Overlooked Evaluation Pitfalls in Molecular Property Prediction</h2>
<p>Molecular property prediction is a core task in AI-driven drug discovery, and recent years have seen a proliferation of representation learning methods (transformers on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, GNNs on molecular graphs) claiming improved performance on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet benchmark datasets</a>. However, the authors identify several systemic problems in how these methods are evaluated:</p>
<ol>
<li><strong>Heavy reliance on MoleculeNet benchmarks</strong>, which may not reflect real-world drug discovery challenges. Some benchmark tasks (e.g., SIDER, ClinTox) are arguably unreasonable because they try to predict outcomes from chemical structure alone when other factors (food-drug interactions, patient-level variables) dominate.</li>
<li><strong>Lack of statistical rigor.</strong> Most papers report mean metrics over 3 or 10 splits without statistical tests. Without rigorous analysis, improved metrics could be statistical noise.</li>
<li><strong>Inconsistent data splits.</strong> Across studies, the actual splits vary because seeds and splitting implementations differ, making cross-paper comparisons unreliable.</li>
<li><strong>Inappropriate metrics.</strong> AUROC, the default for classification, can overestimate performance, especially on imbalanced datasets. Precision-oriented metrics (PPV, NPV) may be more relevant for virtual screening.</li>
<li><strong>Neglect of activity cliffs.</strong> Most studies only evaluate inter-scaffold generalization via scaffold splits, ignoring intra-scaffold generalization where structurally similar molecules exhibit drastically different activities (<a href="/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/">activity cliffs</a>).</li>
</ol>
<h2 id="core-contribution-fixed-representations-often-outperform-learned-representations">Core Contribution: Fixed Representations Often Outperform Learned Representations</h2>
<p>The central finding is that traditional ML models (RF, SVM, XGBoost) operating on fixed molecular representations (RDKit2D descriptors, Morgan fingerprints, MACCS keys, AtomPairs) frequently outperform recent self-supervised pretrained models (<a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a>, GROVER) across diverse datasets. The authors frame the paper around a central thesis:</p>
<blockquote>
<p>&ldquo;A model cannot save an unqualified dataset which cannot remedy an improper evaluation for an ambiguous chemical space generalization claim.&rdquo;</p></blockquote>
<p>Key findings on representations and models:</p>
<ul>
<li><strong>RF on RDKit2D descriptors</strong> achieves the best performance on BACE, BBBP, ESOL, and Lipop under scaffold split; MolBERT matches RF only on HIV.</li>
<li><strong>Concatenating RDKit2D descriptors to GROVER&rsquo;s learned embeddings (GROVER_RDKit)</strong> significantly improves performance, suggesting the learned representations alone are insufficient and that fixed descriptors carry substantial predictive signal.</li>
<li><strong>For binding activity datasets</strong> (<a href="https://en.wikipedia.org/wiki/Opioid_receptor">opioid receptors</a> MOR, DOR, KOR), MorganBits fingerprints outperform other representations, consistent with the structural nature of binding.</li>
<li><strong>PhysChem descriptors</strong> excel on datasets where properties correlate strongly with simple molecular features (e.g., ESOL has a near-linear relationship between MolLogP and solubility), but perform poorly on binding activity datasets where the relationship is more complex.</li>
</ul>
<h2 id="experimental-setup-62820-models-across-diverse-datasets">Experimental Setup: 62,820 Models Across Diverse Datasets</h2>
<h3 id="models-evaluated">Models evaluated</h3>
<p>The study evaluates nine models across three categories:</p>
<ul>
<li><strong>Traditional ML</strong>: Random Forest (RF), Support Vector Machine (SVM), XGBoost</li>
<li><strong>Regular neural networks</strong>: RNN (GRU variant), GCN, GIN</li>
<li><strong>Pretrained models</strong>: MolBERT (SMILES-based, ~85M parameters, pretrained on 1.6M molecules), GROVER (graph-based, ~48M parameters, pretrained on ~10M molecules), and GROVER_RDKit (GROVER with concatenated RDKit2D descriptors)</li>
</ul>
<h3 id="molecular-representations">Molecular representations</h3>
<p>Six fixed representations are evaluated: RDKit2D descriptors (200 features), PhysChem descriptors (11 features), MACCS keys, MorganBits fingerprints, MorganCounts fingerprints, and AtomPairs fingerprints. Morgan fingerprints use radius 2 and 2048 bits, as testing showed little difference among common parameter choices.</p>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MoleculeNet benchmarks</td>
          <td>BACE, BBBP, HIV</td>
          <td>Classification</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>MoleculeNet benchmarks</td>
          <td>ESOL, FreeSolv, Lipop</td>
          <td>Regression</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>Opioids-related</td>
          <td>MDR1, CYP2D6, CYP3A4, MOR, DOR, KOR</td>
          <td>Classification + Regression</td>
          <td>ChEMBL</td>
      </tr>
      <tr>
          <td>Activity datasets</td>
          <td>24 targets</td>
          <td>Regression</td>
          <td>Cortes-Ciriano et al.</td>
      </tr>
      <tr>
          <td>Activity datasets</td>
          <td>30 targets (MoleculeACE)</td>
          <td>Regression</td>
          <td>Tilborg et al.</td>
      </tr>
      <tr>
          <td>Descriptor datasets</td>
          <td>MolWt, NumAtoms (16 sizes each)</td>
          <td>Regression</td>
          <td>ZINC250k</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation-protocol">Evaluation protocol</h3>
<ul>
<li>Both scaffold and random splits (80:10:10 ratio)</li>
<li><strong>30 different random seeds</strong> per experiment for statistical rigor</li>
<li><a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Mann-Whitney U test</a> for pairwise significance ($p &lt; 0.05$, two-sided)</li>
<li>Multiple metrics per task: AUROC, AUPRC, PPV, NPV for classification; RMSE, MAE, $R^2$, Pearson $R$ for regression</li>
</ul>
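<p>The U statistic behind the Mann-Whitney test can be computed directly; in practice one would call <code>scipy.stats.mannwhitneyu</code>, which also returns the p-value. A tie-aware sketch of the statistic alone:</p>

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x versus sample y.

    Counts the pairs where x wins; ties contribute 0.5. Converting U
    to a p-value (normal approximation or exact) is left to scipy.
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Two hypothetical sets of per-split AUROC scores.
rf_scores = [0.84, 0.86, 0.85, 0.83]
bert_scores = [0.80, 0.82, 0.85, 0.79]
print(mann_whitney_u(rf_scores, bert_scores))  # 13.5 of a possible 16
```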
<h3 id="key-metrics">Key metrics</h3>
<p>Classification:</p>
<p>$$
\text{PPV} = \frac{\text{TP}}{\text{TP} + \text{FP}}
$$</p>
<p>$$
\text{NPV} = \frac{\text{TN}}{\text{TN} + \text{FN}}
$$</p>
<p>Regression:</p>
<p>$$
\text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}
$$</p>
<p>$$
\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|
$$</p>
<p>$$
\text{Pearson}_R = \frac{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})(\hat{y}_i - \bar{y}_{pred})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})^2 \sum_{i=1}^{N} (\hat{y}_i - \bar{y}_{pred})^2}}
$$</p>
<p>$$
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y}_{obs})^2}
$$</p>
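<p>These formulas translate line-for-line into code, and a toy example makes the metric-disagreement problem concrete: predictions that are perfectly correlated with the labels but systematically offset give a Pearson R of exactly 1 while $R^2$ goes negative.</p>

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def pearson_r(y, yhat):
    my, mp = sum(y) / len(y), sum(yhat) / len(yhat)
    num = sum((a - my) * (b - mp) for a, b in zip(y, yhat))
    den = math.sqrt(sum((a - my) ** 2 for a in y)
                    * sum((b - mp) ** 2 for b in yhat))
    return num / den

def r_squared(y, yhat):
    my = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - my) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical labels and systematically offset predictions:
# perfectly correlated, so Pearson R = 1, yet R^2 is negative.
y, yhat = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(rmse(y, yhat), mae(y, yhat))             # 3.0 3.0
print(pearson_r(y, yhat), r_squared(y, yhat))  # 1.0 -12.5
```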
<h2 id="key-findings-metrics-activity-cliffs-and-dataset-size">Key Findings: Metrics, Activity Cliffs, and Dataset Size</h2>
<h3 id="statistical-testing-is-essential">Statistical testing is essential</h3>
<p>Without statistical tests, there is a real risk of drawing incorrect conclusions. Analysis of individual splits shows that in certain splits, MolBERT or GROVER can appear to outperform RF, even though on aggregate with proper statistical testing, RF is significantly better. For example, in BBBP, RF dominates in 20 of 30 splits, but the remaining 10 could mislead a researcher using only a single split.</p>
<h3 id="metric-choice-changes-conclusions">Metric choice changes conclusions</h3>
<p>Different evaluation metrics can lead to contradictory conclusions about the same models:</p>
<ul>
<li>In BBBP under scaffold split, RF significantly outperforms other models by AUROC, but shows similar performance when evaluated by PPV or NPV.</li>
<li>In FreeSolv, GROVER outperforms RF by Pearson $R$ ($p &lt; 0.05$) but shows similar performance by $R^2$.</li>
<li>Pearson $R$ can paint a much more optimistic picture than $R^2$: even when $R^2$ drops to zero or below, Pearson $R$ can remain around 0.5.</li>
<li>AUROC can be over-optimistic, especially on imbalanced datasets like CYP2D6 and CYP3A4.</li>
</ul>
<p>The authors argue that PPV and NPV are more practically relevant for <a href="/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/">virtual screening</a> than AUROC or AUPRC, since the goal is to identify true hits among predicted positives (or true non-binders among predicted negatives).</p>
<h3 id="activity-cliffs-pose-a-major-challenge">Activity cliffs pose a major challenge</h3>
<p>Activity cliffs, defined as <a href="https://en.wikipedia.org/wiki/IC50">IC50</a> values spanning at least two orders of magnitude within one scaffold, are prevalent in the opioid-related datasets. Although AC scaffolds represent only about 10% of scaffolds, they encompass 25-46% of all molecules:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>AC scaffolds (%)</th>
          <th>AC molecules (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MDR1</td>
          <td>62 (10.2%)</td>
          <td>594 (41.3%)</td>
      </tr>
      <tr>
          <td>CYP2D6</td>
          <td>124 (9.3%)</td>
          <td>710 (31.0%)</td>
      </tr>
      <tr>
          <td>CYP3A4</td>
          <td>146 (7.2%)</td>
          <td>926 (25.2%)</td>
      </tr>
      <tr>
          <td>MOR</td>
          <td>213 (13.1%)</td>
          <td>1627 (46.1%)</td>
      </tr>
      <tr>
          <td>DOR</td>
          <td>178 (11.6%)</td>
          <td>1342 (41.6%)</td>
      </tr>
      <tr>
          <td>KOR</td>
          <td>218 (13.1%)</td>
          <td>1502 (45.2%)</td>
      </tr>
  </tbody>
</table>
<p>Prediction performance is consistently worse for AC molecules, indicating limited intra-scaffold generalization. Removing edge-case molecules (those in scaffolds whose pIC50 values span the 5-7 range) from test sets generally improves classification performance, confirming that activity cliffs are a key source of prediction error.</p>
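<p>The AC definition above can be sketched directly: a scaffold is an AC scaffold if its pIC50 values span at least 2 log units (IC50 varying by two orders of magnitude). Scaffold keys and values below are illustrative, not taken from the paper&rsquo;s datasets:</p>

```python
# Map scaffold -> pIC50 values of the molecules sharing that scaffold.
scaffold_pic50 = {
    "c1ccccc1": [5.1, 5.3, 7.5],  # range 2.4 log units -> AC scaffold
    "c1ccncc1": [6.0, 6.2, 6.4],  # range 0.4 -> not an AC scaffold
    "C1CCNCC1": [4.8, 8.1],       # range 3.3 -> AC scaffold
}

def activity_cliff_scaffolds(groups, span=2.0):
    """Scaffolds whose pIC50 range covers at least `span` log units."""
    return {s for s, vals in groups.items()
            if len(vals) >= 2 and max(vals) - min(vals) >= span}

ac = activity_cliff_scaffolds(scaffold_pic50)
n_ac = sum(len(v) for s, v in scaffold_pic50.items() if s in ac)
n_tot = sum(len(v) for v in scaffold_pic50.values())
print(sorted(ac))                                   # ['C1CCNCC1', 'c1ccccc1']
print(f"{n_ac}/{n_tot} molecules in AC scaffolds")  # 5/8 molecules
```

<p>As in the table above, a small share of scaffolds can account for a much larger share of molecules.</p>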
<h3 id="dataset-size-is-critical-for-representation-learning">Dataset size is critical for representation learning</h3>
<p>Experiments on descriptor datasets (predicting MolWt and NumAtoms) reveal clear patterns:</p>
<ul>
<li>With fewer than 1K data points, traditional ML on fixed representations outperforms all neural network models except pretrained GROVER, which shows competitive performance in the low-data regime.</li>
<li>MolBERT shows severely limited performance (RMSE &gt; 200 for MolWt) with fewer than 10K data points.</li>
<li>RNN achieves the best performance when dataset size exceeds 10K, demonstrating the promise of representation learning in the &ldquo;big-data&rdquo; regime.</li>
<li>SVM achieves near-zero RMSE on datasets larger than 10K when paired with AtomPairs fingerprints.</li>
<li>GROVER&rsquo;s performance does not substantially improve with increasing dataset size, while MolBERT improves at 100K but is slow to benefit from more data.</li>
</ul>
<h3 id="representation-learning-models-show-higher-metric-variability">Representation learning models show higher metric variability</h3>
<p>Representation learning models, particularly GROVER, exhibit higher variability in performance metrics across splits. This variability correlates negatively with mean performance: models with higher variability tend to perform worse on average. The authors emphasize the importance of reporting metric variability alongside means.</p>
<h3 id="scaffold-split-versus-random-split">Scaffold split versus random split</h3>
<p>Prediction performance under scaffold split is consistently worse than under random split, confirming the inter-scaffold generalization challenge. Notably, random split alleviates the intra-scaffold generalization challenge because some AC scaffolds are seen during training.</p>
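<p>A minimal sketch of a deterministic scaffold split, assuming scaffold groups are precomputed (in practice, Bemis-Murcko scaffolds via RDKit); assigning whole scaffolds, largest first, to the training set keeps every test scaffold unseen during training:</p>

```python
def scaffold_split(scaffold_to_idxs, test_frac=0.2):
    """Greedy scaffold split: whole scaffolds go to train (largest first)
    until the train quota is met; remaining scaffolds form the test set.
    Greedy assignment means split sizes are only approximate."""
    n_total = sum(len(v) for v in scaffold_to_idxs.values())
    n_train_target = int((1.0 - test_frac) * n_total)
    train, test = [], []
    for _, idxs in sorted(scaffold_to_idxs.items(),
                          key=lambda kv: len(kv[1]), reverse=True):
        (train if len(train) < n_train_target else test).extend(idxs)
    return train, test

# Hypothetical scaffold groups (molecule indices per scaffold).
groups = {"scafA": [0, 1, 2, 3, 4, 5], "scafB": [6, 7, 8], "scafC": [9]}
train, test = scaffold_split(groups)
print(train, test)  # [0, 1, 2, 3, 4, 5, 6, 7, 8] [9]
```

<p>Under a random split, molecules from <code>scafA</code> would land on both sides, which is exactly why random splits mask the inter-scaffold generalization challenge.</p>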
<h3 id="descriptors-correlate-with-specific-properties">Descriptors correlate with specific properties</h3>
<p>PhysChem descriptors excel on datasets where molecular properties correlate with simple descriptors (e.g., MolLogP has near $-1$ correlation with ESOL labels). For binding activity datasets, correlation coefficients mostly fall within $[-0.5, 0.5]$, explaining why PhysChem descriptors show limited performance on those tasks, while structural fingerprints are more useful.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Uncertainty from model training</strong> (random initialization, mini-batch shuffling) was not fully addressed. Ensembling was not evaluated due to computational cost.</li>
<li><strong>Experimental uncertainty in labels</strong> (noise, measurement error in pIC50 values) was not modeled, though it can be <a href="https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity">heteroscedastic</a> and impact performance.</li>
<li><strong>Model explainability</strong> was not covered, although it is important for building trust in AI tools for drug discovery.</li>
<li>The study focused on GROVERbase only (not GROVERlarge) due to computational constraints.</li>
</ol>
<p>Future directions include: exploring better ways to use fixed representations alongside learned ones, developing techniques for chemical space generalization (both inter- and intra-scaffold), incorporating experimental uncertainty into model training and evaluation, and generating larger high-quality datasets to fully harness representation learning models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmark</td>
          <td>MoleculeNet (BACE, BBBP, HIV, ESOL, FreeSolv, Lipop)</td>
          <td>642-41,127 molecules</td>
          <td>Downloaded from MolMapNet; max length &lt; 400</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>Opioids-related (MDR1, CYP2D6, CYP3A4, MOR, DOR, KOR)</td>
          <td>Varies</td>
          <td>Collected from ChEMBL27; pIC50 values</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>Cortes-Ciriano et al. 24 targets</td>
          <td>Varies</td>
          <td>Activity data for drug targets</td>
      </tr>
      <tr>
          <td>Activity</td>
          <td>MoleculeACE 30 targets</td>
          <td>Varies</td>
          <td>Activity cliffs emphasis</td>
      </tr>
      <tr>
          <td>Descriptor</td>
          <td>MolWt, NumAtoms from <a href="/notes/chemistry/datasets/zinc-22/">ZINC250k</a></td>
          <td>0.1K to 100K</td>
          <td>16 dataset sizes per descriptor</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>RF: 500 trees (following Chemprop)</li>
<li>SVM: linear kernel</li>
<li>XGBoost: gradient boosting regressor/classifier with default hyperparameters</li>
<li>RNN: GRU variant, hidden size 512, 3 fully connected layers</li>
<li>GCN/GIN: embedding dimension 300, 5 convolutional layers, hidden size 512</li>
<li>MolBERT: BERTBase architecture, 768 embedding, 12 layers, 12 heads, ~85M parameters (769 fine-tuned)</li>
<li>GROVER: GROVERbase, ~48M parameters (~5.2M fine-tuned)</li>
<li>All splits repeated 30 times with seeds 0-29</li>
</ul>
<h3 id="models">Models</h3>
<p>All model configurations, splits, and raw predictions are available in the <a href="https://github.com/dengjianyuan/Respite_MPP">GitHub repository</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics: AUROC, AUPRC, PPV, NPV (classification); RMSE, MAE, $R^2$, Pearson $R$ (regression). Statistical testing via Mann-Whitney U test ($p &lt; 0.05$, two-sided). <a href="https://en.wikipedia.org/wiki/Youden%27s_J_statistic">Youden&rsquo;s $J$ statistic</a> used to determine classification threshold for PPV/NPV.</p>
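<p>A sketch of the thresholding step: Youden&rsquo;s $J$ picks the score cutoff maximizing $\text{TPR} - \text{FPR}$, after which PPV and NPV follow from the confusion matrix. Labels and scores below are illustrative, not the paper&rsquo;s predictions:</p>

```python
import numpy as np

def confusion(y_true, scores, t):
    """Confusion-matrix counts at score threshold t (predict positive if >= t)."""
    pred = scores >= t
    tp = np.sum(pred & (y_true == 1)); fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1)); tn = np.sum(~pred & (y_true == 0))
    return tp, fp, fn, tn

def youden_threshold(y_true, scores):
    """Threshold maximizing Youden's J = TPR - FPR."""
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        tp, fp, fn, tn = confusion(y_true, scores, t)
        j = tp / (tp + fn) - fp / (fp + tn)
        if j > best_j:
            best_j, best_t = j, t
    return best_t

y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
s = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9])
t = youden_threshold(y, s)
tp, fp, fn, tn = confusion(y, s, t)
ppv, npv = tp / (tp + fp), tn / (tn + fn)
print(f"threshold={t:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

<p>PPV and NPV answer the screening question directly: of the compounds predicted active (or inactive), how many truly are?</p>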
<h3 id="hardware">Hardware</h3>
<p>All neural network experiments run on a single NVIDIA V100 GPU for 100 epochs. Batch size 32 for most experiments; 256 for GROVER on HIV due to compute time (MolBERT takes ~3 hours per split on HIV at batch size 32; GROVER takes ~5 hours at batch size 256). The study is partially funded by Stony Brook University OVPR Seed Grant, using the AI Institute at Stony Brook for computational resources.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dengjianyuan/Respite_MPP">Respite_MPP</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Code, data, and raw predictions</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.1038/s41467-023-41948-6">Nature Communications article</a></td>
          <td>Paper</td>
          <td>CC-BY-4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Deng, J., Yang, Z., Wang, H., Ojima, I., Samaras, D., &amp; Wang, F. (2023). A systematic study of key elements underlying molecular property prediction. <em>Nature Communications</em>, 14, 6395. <a href="https://doi.org/10.1038/s41467-023-41948-6">https://doi.org/10.1038/s41467-023-41948-6</a></p>
<p><strong>Publication</strong>: Nature Communications 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/dengjianyuan/Respite_MPP">Respite_MPP GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{deng2023systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic study of key elements underlying molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Deng, Jianyuan and Yang, Zhibo and Wang, Hehe and Ojima, Iwao and Samaras, Dimitris and Wang, Fusheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6395}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-023-41948-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ROGI-XD: Roughness of Pretrained Molecular Representations</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</link><pubDate>Tue, 24 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/rogi-xd-roughness-pretrained-representations/</guid><description>ROGI-XD enables cross-representation roughness comparison, showing pretrained chemical models produce no smoother QSPR surfaces than fingerprints.</description><content:encoded><![CDATA[<h2 id="evaluating-chemical-foundation-models-through-surface-roughness">Evaluating Chemical Foundation Models Through Surface Roughness</h2>
<p>This is a <strong>Systematization</strong> paper that introduces a metric reformulation (ROGI-XD) and uses it to evaluate whether pretrained chemical models (PCMs) learn representations that produce smoother <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">quantitative structure-property relationship</a> (QSPR) surfaces than simple baselines. The key finding is negative: pretrained representations are no smoother than molecular fingerprints or descriptors, offering a principled explanation for their inconsistent performance on property prediction benchmarks.</p>
<h2 id="the-smoothness-gap-in-chemical-foundation-models">The Smoothness Gap in Chemical Foundation Models</h2>
<p>Chemical foundation models like ChemBERTa, ChemGPT, and graph-based pretrained networks promise to learn meaningful molecular representations from large unlabeled datasets via self-supervised learning. However, empirical benchmarks consistently show mixed results: these learned representations sometimes match and sometimes underperform simple baselines like Morgan fingerprints or RDKit descriptors.</p>
<p>Prior work by Deng et al. demonstrated that a random forest trained on 2048-bit Morgan fingerprints was competitive with, or superior to, pretrained models like <a href="/notes/chemistry/molecular-representations/encoders/molbert-molecular-representations/">MolBERT</a> and GROVER on <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> and opioid bioactivity tasks. The authors sought to explain this pattern through the lens of QSPR surface roughness: if pretrained representations do not produce smoother mappings from molecular structure to property, they cannot consistently outperform baselines.</p>
<h2 id="rogi-xd-a-dimensionality-independent-roughness-metric">ROGI-XD: A Dimensionality-Independent Roughness Metric</h2>
<p>The original ROuGhness Index (ROGI) captures global surface roughness by measuring the loss in property dispersion as a dataset is progressively coarse-grained through <a href="https://en.wikipedia.org/wiki/Hierarchical_clustering">hierarchical clustering</a>. However, ROGI values are not comparable across representations of different dimensionalities because distances between randomly sampled points increase with dimension, artificially deflating ROGI for high-dimensional representations.</p>
<p>ROGI-XD addresses this by changing the integration variable. Instead of integrating over normalized distance threshold $t$, ROGI-XD integrates over $1 - \log N_{\text{clusters}} / \log N$, where $N_{\text{clusters}}$ is the number of clusters at a given dendrogram step and $N$ is the dataset size. This variable captures the degree of coarse-graining independent of representation dimensionality, producing comparable roughness values across representations ranging from 14 dimensions (descriptors) to 2048 dimensions (ChemGPT).</p>
<p>The procedure follows five steps: (1) cluster molecules using <a href="https://en.wikipedia.org/wiki/Complete-linkage_clustering">complete linkage</a> at distance threshold $t$, (2) coarse-grain by replacing each property label $y_i$ with its cluster mean $\bar{y}_j$, (3) compute the standard deviation $\sigma_t$ of the coarse-grained dataset, (4) repeat for all dendrogram steps, and (5) compute the area under the curve of $2(\sigma_0 - \sigma_t)$ versus the new integration variable.</p>
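<p>The five steps above can be sketched with SciPy&rsquo;s hierarchical-clustering utilities. This is a simplified illustration; normalization details may differ from the published implementation:</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def rogi_xd(X, y):
    """Simplified ROGI-XD sketch: complete-linkage clustering, coarse-graining
    by cluster means, integration over 1 - log(N_clusters)/log(N)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Z = linkage(pdist(X), method="complete")
    xs, sigmas = [], []
    # t = 0 keeps every molecule in its own cluster; merge heights follow.
    for t in np.concatenate(([0.0], Z[:, 2])):
        labels = fcluster(Z, t=t, criterion="distance")
        # Step 2: replace each property label with its cluster mean.
        y_cg = np.array([y[labels == c].mean() for c in labels])
        xs.append(1.0 - np.log(len(np.unique(labels))) / np.log(n))
        sigmas.append(y_cg.std())  # Step 3: dispersion after coarse-graining
    xs, sigmas = np.array(xs), np.array(sigmas)
    order = np.argsort(xs)
    f = 2.0 * (sigmas[0] - sigmas[order])
    # Step 5: trapezoidal area under 2*(sigma_0 - sigma_t).
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(xs[order])))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
smooth = X[:, 0]                 # property tied to one input coordinate
rough = rng.permutation(smooth)  # same values, scrambled over structures
print(rogi_xd(X, smooth), rogi_xd(X, rough))
```

<p>Because coarse-graining can only reduce dispersion (law of total variance), the integrand $2(\sigma_0 - \sigma_t)$ is non-negative; rougher property assignments lose dispersion earlier in the dendrogram, inflating the area.</p>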
<h2 id="representations-and-tasks-evaluated">Representations and Tasks Evaluated</h2>
<p>The study compares seven molecular representations:</p>
<table>
  <thead>
      <tr>
          <th>Representation</th>
          <th>Type</th>
          <th>Dimensionality</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>Fixed</td>
          <td>14</td>
          <td>RDKit (14 properties)</td>
      </tr>
      <tr>
          <td>Morgan FP</td>
          <td>Fixed</td>
          <td>512</td>
          <td>Radius 2, 512-bit</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>Pretrained</td>
          <td>128</td>
          <td>Character-based <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> VAE, <a href="/notes/chemistry/datasets/zinc-22/">ZINC 250k</a></td>
      </tr>
      <tr>
          <td>GIN</td>
          <td>Pretrained</td>
          <td>300</td>
          <td>Node attribute masking, ZINC 250k</td>
      </tr>
      <tr>
          <td>ChemBERTa</td>
          <td>Pretrained</td>
          <td>384</td>
          <td>77M molecules, masked LM</td>
      </tr>
      <tr>
          <td>ChemGPT</td>
          <td>Pretrained</td>
          <td>2048</td>
          <td>PubChem 10M, causal LM</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>Baseline</td>
          <td>128</td>
          <td>Uniform $[0,1]^{128}$</td>
      </tr>
  </tbody>
</table>
<p>These are evaluated on 17 regression tasks drawn from two sources: ADMET datasets from the Therapeutics Data Commons (TDC) and toy datasets generated using <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> oracle functions. Five ML models are used for cross-validation: KNN, MLP, <a href="https://en.wikipedia.org/wiki/Partial_least_squares_regression">PLS</a>, random forest, and SVR.</p>
<h2 id="pretrained-representations-are-not-smoother">Pretrained Representations Are Not Smoother</h2>
<p>ROGI-XD correlates strongly with cross-validated RMSE across representations (median Pearson $r = 0.72$-$0.88$ depending on model), compared to the original ROGI which produces weak cross-representation correlations (median $r \in [-0.32, 0.28]$). When correlating over both representations and tasks simultaneously, ROGI-XD achieves $r = 0.91$-$0.99$ versus $r = 0.68$-$0.84$ for the original ROGI.</p>
<p>Using this validated metric, the authors find that pretrained representations do not produce smoother QSPR surfaces than fingerprints or descriptors. In more than 50% of tasks, both descriptors and fingerprints generate smoother surfaces. The median relative ROGI-XD increase for pretrained representations is 9.1-21.3% compared to descriptors and 2.3-10.1% compared to fingerprints, indicating rougher surfaces.</p>
<p>As a practical tool, ROGI-XD can guide representation selection without exhaustive benchmarking. Selecting the representation with the lowest ROGI-XD for each task and then optimizing over model architecture results in only a 6.8% average relative increase in best-case model error across the 17 tasks. In 8 of 17 tasks, the lowest ROGI-XD correctly identifies the optimal representation.</p>
<p>Fine-tuning can improve smoothness. On the Lipophilicity task ($N_{\text{tot}} = 4200$), fine-tuning the VAE with a contrastive loss reduces ROGI-XD from 0.254 to 0.107 ($\pm 0.02$), well below the descriptor baseline of 0.227. On the smaller CACO2 task ($N_{\text{tot}} = 910$), fine-tuning yields ROGI-XD of 0.143 ($\pm 0.05$), comparable to descriptors at 0.132. The impact of fine-tuning is sensitive to both the task and the amount of labeled data.</p>
<h2 id="implications-for-chemical-foundation-model-development">Implications for Chemical Foundation Model Development</h2>
<p>The lack of smoothness in pretrained QSPR surfaces explains the inconsistent empirical performance of chemical foundation models. The authors note that ROGI-XD is thematically similar to a contrastive loss, as both scale proportionally with the frequency and severity of activity cliffs. This connection suggests that imposing stronger smoothness assumptions during pretraining, for example through weak supervision on calculable molecular properties, could help produce representations that generalize better to downstream property prediction. ROGI-XD provides a practical tool for evaluating new pretraining strategies without exhaustive benchmark testing: a representation with lower ROGI-XD on a given task is likely to yield lower model error.</p>
<p>A limitation is that the study treats pretrained representations as static (frozen features). Fine-tuning introduces many additional design choices and can substantially improve representation quality, but this evaluation is left for future work. Additionally, the survey of pretrained models is not exhaustive and focuses on four representative architectures.</p>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/coleygroup/rogi-xd">coleygroup/rogi-xd</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained models and notebooks; results reproducible via <code>make all</code></td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (VAE, GIN)</td>
          <td>ZINC 250k</td>
          <td>250,000</td>
          <td>80/20 train/val split</td>
      </tr>
      <tr>
          <td>Pretraining (ChemBERTa)</td>
          <td>PubChem</td>
          <td>77M</td>
          <td>Masked language modeling</td>
      </tr>
      <tr>
          <td>Pretraining (ChemGPT)</td>
          <td>PubChem 10M</td>
          <td>10M</td>
          <td>Causal language modeling</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TDC ADMET</td>
          <td>~900-10,000 per task</td>
          <td>12 regression tasks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GuacaMol oracles</td>
          <td>10,000 per task</td>
          <td>5 synthetic tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>ROGI-XD</strong>: Hierarchical clustering (complete linkage) with integration over $1 - \log N_{\text{clusters}} / \log N$</li>
<li><strong>Cross-validation</strong>: 5-fold CV with KNN, MLP, PLS, RF (n_estimators=50), SVR from scikit-learn</li>
<li><strong>Fine-tuning loss</strong>: $\mathscr{L} = \mathscr{L}_{\text{CE}} + \beta \cdot \mathscr{L}_{\text{KL}} + \gamma \cdot \mathscr{L}_{\text{cont}}$ with $\beta = 0.1$, $\gamma = 50$; contrastive term uses cosine distance in latent space and absolute value in target space</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Two AMD Ryzen Threadripper PRO 3995WX CPUs, four NVIDIA A5000 GPUs, 512 GB RAM, Ubuntu 20.04 LTS.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Graff, D. E., Pyzer-Knapp, E. O., Jordan, K. E., Shakhnovich, E. I., &amp; Coley, C. W. (2023). Evaluating the roughness of structure-property relationships using pretrained molecular representations. <em>Digital Discovery</em>, 2(5), 1452-1460. <a href="https://doi.org/10.1039/d3dd00088e">https://doi.org/10.1039/d3dd00088e</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/coleygroup/rogi-xd">ROGI-XD Code Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{graff2023roughness,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating the roughness of structure--property relationships using pretrained molecular representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Graff, David E. and Pyzer-Knapp, Edward O. and Jordan, Kirk E. and Shakhnovich, Eugene I. and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1452--1460}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d3dd00088e}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
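<p>A sketch of this tokenization for non-negative values with a fixed number of decimal places (the precision handling here is an assumption for illustration):</p>

```python
def tokenize_float(value, precision=1):
    """Split a non-negative float into digit tokens 'v_p', where v is the
    digit and p its decimal place, e.g. 12.3 -> ['1_1', '2_0', '3_-1']."""
    digits = f"{value:.{precision}f}".replace(".", "")
    top = len(str(int(value))) - 1  # decimal place of the leading digit
    return [f"{d}_{top - i}" for i, d in enumerate(digits)]

print(tokenize_float(12.3))  # ['1_1', '2_0', '3_-1']
print(tokenize_float(0.5))   # ['0_0', '5_-1']
```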
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
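<p>A quick check of the encoding formula above: embeddings of digit tokens at the same decimal place should lie closer together when their float values are closer. The embedding dimension of 16 is an arbitrary choice for illustration:</p>

```python
import numpy as np

def numerical_encoding(v, p, d_model=16):
    """NE_Float(v, p, j) = (-1)^j * v * 10^p / (j + 1), per embedding dim j."""
    j = np.arange(d_model)
    return (-1.0) ** j * (v * 10.0 ** p) / (j + 1)

# '2_0' should sit nearer to '1_0' than '9_0' does, since |2-1| < |9-1|.
e1, e2, e9 = (numerical_encoding(v, 0) for v in (1, 2, 9))
d12 = np.linalg.norm(e2 - e1)
d19 = np.linalg.norm(e9 - e1)
print(d12 < d19, round(d19 / d12, 1))  # True 8.0
```

<p>Distances scale linearly with the difference in encoded value, giving the monotone decay described above.</p>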
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
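<p>The first finding can be illustrated with a small tokenization sketch. The token format below is hypothetical (the RT&rsquo;s actual vocabulary differs); it only shows the idea of turning a continuous value into digit tokens, each tagged with its decimal place, so that regression becomes one classification per digit:</p>

```python
def tokenize_property(name: str, value: float, precision: int = 3) -> list[str]:
    """Turn a continuous property into a sequence of discrete tokens.

    Each digit becomes its own token annotated with its decimal place
    (illustrative format, not the RT's exact vocabulary).
    """
    tokens = [f"<{name}>"]
    text = f"{value:.{precision}f}"
    place = text.index(".") - 1  # decimal place of the leading digit
    for ch in text:
        if ch == ".":
            continue
        tokens.append(f"_{ch}_{place}_")  # e.g. '_8_-1_' = digit 8 at 10^-1
        place -= 1
    return tokens

tokenize_property("qed", 0.84)
# → ['<qed>', '_0_0_', '_8_-1_', '_4_-2_', '_0_-3_']
```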
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has fewer than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Language Models Learn Complex Molecular Distributions</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/</guid><description>RNN language models trained on SMILES and SELFIES outperform graph models at learning complex, multi-modal, and large-scale molecular distributions.</description><content:encoded><![CDATA[<h2 id="rnn-language-models-as-flexible-molecular-generators">RNN Language Models as Flexible Molecular Generators</h2>
<p>This is an <strong>Empirical</strong> paper that investigates the capacity of simple recurrent neural network (RNN) language models to learn complex molecular distributions. The core finding is that LSTM-based models trained on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> (SM-RNN) or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> (SF-RNN) string representations consistently outperform popular graph generative models (JTVAE, CGVAE) across three increasingly challenging generative modeling tasks. The paper positions language models as flexible, scalable alternatives to graph-based approaches for molecular generation.</p>
<h2 id="scaling-beyond-standard-benchmarks">Scaling Beyond Standard Benchmarks</h2>
<p>Most molecular generative models are evaluated on relatively small, drug-like molecules from datasets like <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> or <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>. These standard benchmarks do not test whether models can handle larger, more structurally diverse molecules or distributions with complex shapes (multi-modal, heavy-tailed). This gap matters because there is increasing interest in larger, more complex molecules for therapeutics, including peptides and natural products.</p>
<p>Graph generative models like JTVAE and CGVAE impose structural constraints (tree decompositions, valency restrictions) that help with validity but limit their ability to scale. Language models, by contrast, only need to generate a single character sequence, making them inherently more flexible.</p>
<h2 id="three-challenging-generative-modeling-tasks">Three Challenging Generative Modeling Tasks</h2>
<p>The paper introduces three benchmark tasks designed to stress-test generative models:</p>
<h3 id="task-1-penalized-logp-distribution">Task 1: Penalized LogP Distribution</h3>
<p>A dataset of approximately 160K molecules from ZINC15 with penalized <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a> scores exceeding 4.0. The training distribution is sharply peaked around 4.0 to 4.5 with a subtle tail extending above 6.0. Molecules in the tail tend to have long carbon chains and fewer rings. The challenge is learning this skewed distribution rather than just finding individual high-scoring molecules.</p>
<h3 id="task-2-multi-modal-distribution">Task 2: Multi-Modal Distribution</h3>
<p>A composite dataset of approximately 200K molecules drawn from four sources with distinct molecular weight ranges:</p>
<ul>
<li><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (MW $\leq$ 185)</li>
<li>ZINC (185 $\leq$ MW $\leq$ 425)</li>
<li>Harvard Clean Energy Project (460 $\leq$ MW $\leq$ 600)</li>
<li>POLYMERS (MW $&gt;$ 600)</li>
</ul>
<p>Models must learn to generate from all four modes simultaneously, each with very different molecular structures.</p>
<h3 id="task-3-large-scale-molecules">Task 3: Large-Scale Molecules</h3>
<p>The largest molecules in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> with more than 100 heavy atoms, yielding approximately 300K molecules with molecular weights ranging from 1,250 to 5,000. These include small biomolecules, photovoltaics, peptides, and cyclic peptides. This task is particularly challenging because the SMILES/SELFIES strings are very long.</p>
<h2 id="evaluation-by-distributional-fidelity">Evaluation by Distributional Fidelity</h2>
<p>The evaluation framework focuses on how well a model learns the full training distribution rather than generating individual good molecules. The primary quantitative metric is the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> (earth mover&rsquo;s distance) between molecular property distributions of generated and training molecules:</p>
<p>$$W(P, Q) = \inf_{\gamma \in \Gamma(P,Q)} \int |x - y| \, d\gamma(x, y)$$</p>
<p>Properties evaluated include LogP, synthetic accessibility (SA), quantitative estimate of drug-likeness (QED), molecular weight (MW), Bertz complexity (BCT), and natural product likeness (NP). An oracle baseline is computed by measuring the Wasserstein distance between different random samples of the training data itself.</p>
<p>Standard metrics (validity, uniqueness, novelty) are also reported but are secondary to distributional fidelity.</p>
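<p>The paper computes these distances with SciPy (<code>scipy.stats.wasserstein_distance</code>). For equal-sized 1D samples, the optimal transport plan simply matches sorted values pairwise, so $W_1$ reduces to the mean absolute difference of the sorted samples. A dependency-free sketch:</p>

```python
def wasserstein_1d(sample_p, sample_q):
    """Empirical 1-Wasserstein distance between two equal-sized 1D samples.

    With equal sample sizes the optimal coupling pairs sorted values,
    so W1 is the mean absolute difference after sorting. (SciPy's
    wasserstein_distance also handles unequal sizes and weights.)
    """
    assert len(sample_p) == len(sample_q)
    p, q = sorted(sample_p), sorted(sample_q)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)
```

<p>Applied per property (LogP, SA, QED, MW, etc.) between generated and training molecules, with the oracle baseline computed between two disjoint training samples.</p>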
<h2 id="architecture-lstm-language-models">Architecture: LSTM Language Models</h2>
<p>The language models use standard LSTM architectures trained autoregressively on molecular strings. Two variants are compared:</p>
<ul>
<li><strong>SM-RNN</strong>: Trained on canonical SMILES</li>
<li><strong>SF-RNN</strong>: Trained on SELFIES representations</li>
</ul>
<p>Hyperparameters are tuned via random search over learning rate ($\in [0.0001, 0.001]$), hidden units ($\in [100, 1000]$), layers (1 to 5), and dropout ($\in [0.0, 0.5]$). Model selection uses a combination of standard metrics and Wasserstein distance rankings.</p>
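<p>A minimal sketch of such a random search over the reported ranges; the paper&rsquo;s exact sampler and trial budget are not specified, so the loop below is purely illustrative:</p>

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the reported ranges."""
    return {
        "lr": rng.uniform(1e-4, 1e-3),
        "hidden_units": rng.randint(100, 1000),
        "layers": rng.randint(1, 5),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)  # seed for reproducibility
trials = [sample_config(rng) for _ in range(20)]
```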
<p>The graph model baselines include JTVAE (junction tree VAE) and CGVAE (constrained graph VAE), along with several additional baselines (MolGAN, GraphNVP, and others).</p>
<h2 id="results-language-models-outperform-graph-models-across-all-tasks">Results: Language Models Outperform Graph Models Across All Tasks</h2>
<h3 id="penalized-logp">Penalized LogP</h3>
<p>Both RNN models learn the sharp training distribution far better than graph models. The SM-RNN achieves the lowest Wasserstein distances across most properties. The graph models produce substantial out-of-distribution mass around penalized LogP scores of 1.75 to 2.25, failing to capture the peaked nature of the training distribution.</p>
<p>Critically, the RNNs also learn the subtle tail above penalized LogP of 6.0, generating molecules with long carbon chains and fewer rings that match the structural characteristics of high-scoring training molecules. CGVAE and JTVAE almost entirely miss this tail.</p>
<h3 id="multi-modal-distribution">Multi-Modal Distribution</h3>
<p>Both RNN models capture all four modes of the training distribution. JTVAE entirely misses the GDB13 mode and poorly learns the ZINC and CEP modes. CGVAE learns GDB13 but misses the CEP mode. The SM-RNN again achieves the best Wasserstein metrics.</p>
<h3 id="large-scale-molecules">Large-Scale Molecules</h3>
<p>This is the most discriminating task. Both JTVAE and CGVAE completely fail to train on these large molecules. JTVAE&rsquo;s tree decomposition produces a vocabulary of approximately 11,000 substructures, making training intractable. Only the RNN models succeed, with the SF-RNN achieving slightly better distributional match due to SELFIES guaranteeing 100% validity even for very long strings.</p>
<p>Both RNN models also learn the bimodal LogP structure within the large-molecule distribution and can generate molecules with substructures resembling peptides, including backbone chains and standard amino acid side chains.</p>
<h3 id="summary-of-wasserstein-distance-results">Summary of Wasserstein Distance Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>MW</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>SM-RNN</td>
          <td>0.095</td>
          <td>0.031</td>
          <td>0.007</td>
          <td>3.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>SF-RNN</td>
          <td>0.177</td>
          <td>0.290</td>
          <td>0.010</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>JTVAE</td>
          <td>0.536</td>
          <td>0.289</td>
          <td>0.081</td>
          <td>35.9</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>CGVAE</td>
          <td>1.000</td>
          <td>2.120</td>
          <td>0.115</td>
          <td>69.3</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SM-RNN</td>
          <td>0.081</td>
          <td>0.025</td>
          <td>0.006</td>
          <td>5.5</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SF-RNN</td>
          <td>0.286</td>
          <td>0.179</td>
          <td>0.023</td>
          <td>11.4</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>JTVAE</td>
          <td>0.495</td>
          <td>0.274</td>
          <td>0.034</td>
          <td>27.7</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>CGVAE</td>
          <td>1.617</td>
          <td>1.802</td>
          <td>0.076</td>
          <td>30.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SM-RNN</td>
          <td>1.367</td>
          <td>0.213</td>
          <td>0.003</td>
          <td>124.5</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SF-RNN</td>
          <td>1.095</td>
          <td>0.342</td>
          <td>0.010</td>
          <td>67.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>JTVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>CGVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="smiles-vs-selfies-trade-off">SMILES vs. SELFIES Trade-off</h3>
<p>The SMILES and SELFIES RNNs show complementary strengths. The SF-RNN consistently achieves better standard metrics (validity, uniqueness, novelty) across all tasks, while the SM-RNN achieves better Wasserstein distance metrics. The authors suggest that the SELFIES grammar may reduce memorization of the training data, improving novelty but slightly hurting distributional fidelity.</p>
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. Language models cannot account for molecular geometry or 3D information, which is important for many applications. The study evaluates distributional fidelity but does not test downstream utility for specific molecular design tasks (e.g., optimizing for a particular biological target). Additionally, while the graph models (JTVAE, CGVAE) are more interpretable, the language models operate as black boxes over string representations. The comparison is also limited to two specific graph model architectures, and more recent or specialized graph models may close the performance gap. Finally, trained model weights are only available upon request rather than being publicly released.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">danielflamshep/genmoltasks</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>Processed training data and generated samples</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Three custom datasets constructed from ZINC15, <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>, Harvard Clean Energy Project, POLYMERS, and PubChem. Processed data available at the GitHub repository.</p>
<p><strong>Code</strong>: LSTM networks implemented in PyTorch using the char-rnn code from the <a href="https://github.com/molecularsets/moses">MOSES repository</a>. Baselines use the official <a href="https://github.com/wengong-jin/icml18-jtnn">JTVAE</a> and <a href="https://github.com/microsoft/constrained-graph-variational-autoencoder">CGVAE</a> implementations. No unified training script is provided in the repository.</p>
<p><strong>Evaluation</strong>: Wasserstein distances computed using SciPy. Molecular properties computed using RDKit. 10K molecules generated from each model for evaluation.</p>
<p><strong>Hyperparameters</strong>: Task-specific configurations reported. For example, the LogP task SM-RNN uses 2 hidden layers with 400 units, dropout of 0.2, and learning rate of 0.0001.</p>
<p><strong>Hardware</strong>: Models were trained on Compute Canada clusters. Specific GPU types and training times are not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D., Zhu, K., &amp; Aspuru-Guzik, A. (2022). Language models can learn complex molecular distributions. <em>Nature Communications</em>, 13, 3293. <a href="https://doi.org/10.1038/s41467-022-30839-x">https://doi.org/10.1038/s41467-022-30839-x</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/danielflamshep/genmoltasks">GitHub: danielflamshep/genmoltasks</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2022language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can learn complex molecular distributions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Zhu, Kevin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3293}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-022-30839-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Exposing Limitations of Molecular ML with Activity Cliffs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/activity-cliffs-benchmark/</guid><description>A benchmark of 24 ML methods on activity cliff compounds across 30 drug targets, showing descriptor-based models outperform deep learning.</description><content:encoded><![CDATA[<h2 id="a-benchmark-for-activity-cliff-prediction">A Benchmark for Activity Cliff Prediction</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for <a href="/notes/chemistry/molecular-design/property-prediction/">molecular property prediction</a> in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.</p>
<h2 id="activity-cliffs-as-a-blind-spot-in-molecular-ml">Activity Cliffs as a Blind Spot in Molecular ML</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Chemical_similarity">similarity principle</a> underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a 10x difference in $K_i$).</p>
<p>Despite their importance for <a href="https://en.wikipedia.org/wiki/Hit_to_lead">hit-to-lead optimization</a>, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.</p>
<p>The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.</p>
<h2 id="defining-and-detecting-activity-cliffs">Defining and Detecting Activity Cliffs</h2>
<p>The authors use three complementary similarity metrics to identify activity cliffs:</p>
<ol>
<li><strong>Substructure similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto coefficient</a> on extended connectivity fingerprints (ECFPs), capturing shared radial substructures</li>
<li><strong>Scaffold similarity</strong>: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences</li>
<li><strong>SMILES similarity</strong>: <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> on canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, capturing character-level insertions, deletions, and translocations</li>
</ol>
<p>Pairs with $\geq 90\%$ similarity on <strong>any one</strong> of the three metrics and $&gt; 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.</p>
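<p>The union criterion can be sketched as follows. Fingerprints are represented here as plain bit sets (in practice ECFPs would come from RDKit), and the three similarity values for a pair are assumed precomputed:</p>

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity on fingerprint bit sets: |A & B| / |A | B|."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter) if (fp_a or fp_b) else 0.0

def is_activity_cliff(similarities, ki_a, ki_b,
                      sim_cut=0.9, fold_cut=10.0) -> bool:
    """Union criterion: ANY similarity >= 0.9 AND a >10-fold potency gap."""
    fold_change = max(ki_a, ki_b) / min(ki_a, ki_b)
    return any(s >= sim_cut for s in similarities) and fold_change > fold_cut
```

<p>Note that a single high similarity metric suffices, which is what makes the criterion a union rather than an intersection over the three structural views.</p>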
<h2 id="24-methods-across-30-drug-targets">24 Methods Across 30 Drug Targets</h2>
<p>The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v29 datasets (48,707 total molecules).</p>
<p><strong>Traditional ML algorithms</strong>: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.</p>
<p><strong>Deep learning methods</strong>: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/<a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a> (SMILES-based), and an MLP on ECFPs.</p>
<p>Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:</p>
<p>$$
\text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}}
$$</p>
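<p>The metric above is simply the RMSE restricted to the cliff subset of the test set; a minimal sketch:</p>

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_cliff(y_true, y_pred, is_cliff):
    """RMSE computed only on activity cliff compounds in the test set."""
    pairs = [(t, p) for t, p, c in zip(y_true, y_pred, is_cliff) if c]
    return rmse([t for t, _ in pairs], [p for _, p in pairs])
```

<p>Reporting both values side by side is what exposes models that achieve low overall RMSE while failing on the cliff compounds.</p>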
<p>Key results:</p>
<ul>
<li><strong>Molecular descriptors matter more than algorithms</strong>: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p &lt; 0.05$, <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> with <a href="https://en.wikipedia.org/wiki/False_discovery_rate">Benjamini-Hochberg correction</a>).</li>
<li><strong>SVM + ECFPs wins on average</strong>: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.</li>
<li><strong>Deep learning underperforms</strong>: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.</li>
<li><strong>Large case-by-case variation</strong>: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.</li>
</ul>
<h2 id="simple-descriptors-beat-complex-architectures-on-cliffs">Simple Descriptors Beat Complex Architectures on Cliffs</h2>
<p>The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.</p>
<p>Key observations:</p>
<ul>
<li><strong>RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average)</strong>, so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.</li>
<li><strong>Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation</strong>: Datasets with $&gt; 1000$ training molecules show $r &gt; 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.</li>
<li><strong>No relationship between % cliff compounds and model performance</strong>: The fraction of cliff compounds in a dataset did not predict model performance, and no target-family-specific effects were found.</li>
<li><strong>Transfer learning helped SMILES models (LSTM) but not graph models</strong>: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.</li>
</ul>
<p>The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Source</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>ChEMBL v29</td>
          <td>48,707 molecules (35,632 unique) across 30 targets</td>
          <td>Curated for duplicates, salts, outliers</td>
      </tr>
      <tr>
          <td>Smallest dataset</td>
          <td>JAK1</td>
          <td>615 molecules</td>
          <td>7% activity cliffs</td>
      </tr>
      <tr>
          <td>Largest dataset</td>
          <td>DRD3</td>
          <td>3,657 molecules</td>
          <td>39% activity cliffs</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Activity cliff detection</strong>: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $&gt; 10\times$ potency difference</li>
<li><strong>Splitting</strong>: <a href="https://en.wikipedia.org/wiki/Spectral_clustering">Spectral clustering</a> on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion</li>
<li><strong>Hyperparameter optimization</strong>: <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> with Gaussian process, max 50 combinations, 5-fold cross-validation</li>
<li><strong>SMILES augmentation</strong>: 10-fold for all SMILES-based methods</li>
<li><strong>Transfer learning</strong>: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> compounds</li>
</ul>
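<p>The cliff-preserving 80/20 split can be sketched in pure Python (illustrative only: the spectral clustering step on ECFPs is omitted here, and the benchmark additionally stratifies within the 5 clusters):</p>

```python
import random

def stratified_cliff_split(indices, is_cliff, test_frac=0.2, seed=0):
    """80/20 split preserving the proportion of activity cliff compounds.

    indices: molecule indices to split.
    is_cliff: boolean per index, True for activity cliff compounds.

    Cliff and non-cliff molecules are sampled into the test set
    separately, so both partitions keep roughly the same cliff fraction.
    """
    rng = random.Random(seed)
    cliff = [i for i in indices if is_cliff[i]]
    non_cliff = [i for i in indices if not is_cliff[i]]
    test = []
    for group in (cliff, non_cliff):
        shuffled = group[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_frac)
        test.extend(shuffled[:n_test])
    test_set = set(test)
    train = [i for i in indices if i not in test_set]
    return train, test
```

Fixing the seed keeps the split reproducible, which matters when 24 methods are compared on the same 30 per-target partitions.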
<h3 id="models">Models</h3>
<ul>
<li><strong>Traditional ML</strong>: KNN, RF, GBM, SVM (scikit-learn v1.0.2)</li>
<li><strong>Descriptors</strong>: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)</li>
<li><strong>GNNs</strong>: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling</li>
<li><strong>SMILES models</strong>: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer</li>
<li><strong>Total models trained</strong>: 720 (24 methods $\times$ 30 targets)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE</td>
          <td>All test molecules</td>
          <td>Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$</td>
      </tr>
      <tr>
          <td>$\text{RMSE}_{\text{cliff}}$</td>
          <td>Activity cliff compounds only</td>
          <td>RMSE restricted to cliff molecules in test set</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE">MoleculeACE</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Benchmark platform with all 30 curated datasets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data">Curated datasets</a></td>
          <td>Data</td>
          <td>MIT</td>
          <td>Processed ChEMBL bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: van Tilborg, D., Alenicheva, A., &amp; Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. <em>Journal of Chemical Information and Modeling</em>, 62(23), 5938-5951. <a href="https://doi.org/10.1021/acs.jcim.2c01073">https://doi.org/10.1021/acs.jcim.2c01073</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molML/MoleculeACE">MoleculeACE GitHub Repository</a></li>
<li><a href="https://chemrxiv.org/engage/chemrxiv/article-details/630cc44058843b8403a19810">ChemRxiv Preprint</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{vantilborg2022activity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exposing the Limitations of Molecular Machine Learning with Activity Cliffs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5938--5951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01073}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>