<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Property Prediction on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/</link><description>Recent content in Property Prediction on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/index.xml" rel="self" type="application/rss+xml"/><item><title>MTL-BERT: Multitask BERT for Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/mtl-bert-multitask-smiles-enumeration/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/mtl-bert-multitask-smiles-enumeration/</guid><description>MTL-BERT combines BERT pretraining, multitask learning, and SMILES enumeration for molecular property prediction across 60 drug discovery datasets.</description><content:encoded><![CDATA[<h2 id="a-multitask-bert-framework-for-molecular-property-prediction">A Multitask BERT Framework for Molecular Property Prediction</h2>
<p>MTL-BERT is a <strong>Method</strong> paper that introduces a multitask learning framework built on BERT for predicting molecular properties from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a>. The primary contribution is the combination of three strategies to address data scarcity in drug discovery: (1) masked token pretraining on 1.7 million unlabeled molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, (2) multitask fine-tuning across 60 property prediction datasets simultaneously, and (3) <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES enumeration</a> as a data augmentation technique applied during pretraining, fine-tuning, and inference. The model achieves strong performance across 60 <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> and molecular property datasets (44 classification and 16 regression), outperforming baselines including GNNs, XGBoost with molecular fingerprints, and prior <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a> approaches.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p>Deep learning methods for molecular property prediction face a fundamental tension: they require large amounts of labeled data to learn effectively, but labeled bioactivity data is scarce due to the cost and time of laboratory experiments. Existing approaches at the time of publication addressed this in isolation. Graph neural networks (GNNs) learn from molecular graphs but are typically shallow (2-3 layers) and prone to overfitting on small datasets. The original SMILES-BERT model applied masked language modeling to SMILES strings but fine-tuned separately for each task, missing opportunities to share information across related properties. Fixed molecular representations like <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a> (continuous and data-driven descriptors) cannot be further optimized for specific downstream tasks.</p>
<p>The authors identify three specific gaps: (1) single-task fine-tuning wastes the correlations between related ADMET properties (e.g., <a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a> relates to many ADMET endpoints), (2) using only canonical SMILES limits the model&rsquo;s ability to learn robust molecular features, and (3) no prior work had combined pretraining, multitask learning, and SMILES enumeration into a unified framework.</p>
<h2 id="three-strategies-combined-pretraining-multitask-learning-and-smiles-enumeration">Three Strategies Combined: Pretraining, Multitask Learning, and SMILES Enumeration</h2>
<p>The core innovation of MTL-BERT is the synergistic combination of three strategies in a single pipeline.</p>
<h3 id="masked-smiles-pretraining">Masked SMILES Pretraining</h3>
<p>Following the BERT paradigm, MTL-BERT pretrains on 1.7 million unlabeled molecules from ChEMBL using a masked token recovery task. For each SMILES string, 15% of tokens are randomly selected: 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% remain unchanged. The loss is computed only at masked positions. Unlike the original BERT, MTL-BERT omits the next-sentence prediction task since there is no sequential relationship between SMILES strings (following the RoBERTa finding that this task is unnecessary).</p>
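<p>The 80/10/10 corruption scheme above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name, the <code>[MASK]</code> placeholder string, and the dictionary of labels are assumptions.</p>

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """BERT-style masking: of the ~15% of selected positions, 80% become
    [MASK], 10% become a random vocabulary token, and 10% stay unchanged.
    The loss is computed only at the returned label positions."""
    corrupted = list(tokens)
    labels = {}  # position -> original token to recover
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK_TOKEN
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: token left intact, but the model must still predict it
    return corrupted, labels

# Usage on a tokenized SMILES (aspirin):
tokens = ["C", "C", "(", "=", "O", ")", "O", "c", "1", "c", "c", "c", "c", "c", "1"]
corrupted, labels = mask_tokens(tokens, vocab=["C", "c", "O", "(", ")", "1", "="],
                                rng=random.Random(0))
```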
<p>SMILES strings are tokenized with a regular expression that captures multi-character tokens (e.g., Si, Br, Cl) and common SMILES syntax. The model uses positional encoding to capture token order.</p>
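<p>A tokenizer in this style can be written with a single regular expression. The pattern below is the one commonly used in the chemical language modeling literature, not necessarily the paper's exact expression; note that multi-character atoms must appear before single characters in the alternation.</p>

```python
import re

# Bracketed atoms ([nH], [C@H]) and two-letter elements (Cl, Br) are
# matched before single characters; bonds, branches, and ring digits follow.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: every character must be covered by some token.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens
```

For example, <code>tokenize_smiles("CCCl")</code> yields <code>["C", "C", "Cl"]</code> rather than splitting the chlorine into two tokens.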
<h3 id="transformer-architecture">Transformer Architecture</h3>
<p>The model uses a standard Transformer encoder with multihead self-attention. The scaled dot-product attention computes:</p>
<p>$$\mathbf{O}_h = \text{softmax}\left(\frac{\mathbf{Q}_h \mathbf{K}_h^T}{\sqrt{d_k}}\right) \mathbf{V}_h$$</p>
<p>where $\mathbf{Q}_h$, $\mathbf{K}_h$, and $\mathbf{V}_h$ are the query, key, and value matrices for head $h$, and $\sqrt{d_k}$ is a scaling factor. The outputs from all heads are concatenated and projected. Each attention sublayer is followed by a position-wise feedforward network with GELU activation, layer normalization, and residual connections.</p>
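<p>Written out for a single head in NumPy (an illustrative sketch with arbitrary dimensions, not the authors' implementation):</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """O = softmax(Q K^T / sqrt(d_k)) V for one attention head.

    Q, K: (seq, d_k); V: (seq, d_v). Returns the output and the
    attention weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights
```

In the full model, the outputs of all heads are concatenated and linearly projected, as described above.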
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
          <th>Fine-tuning Performance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MTL-BERT_SMALL</td>
          <td>4</td>
          <td>4</td>
          <td>128</td>
          <td>512</td>
          <td>0.931</td>
          <td>0.826</td>
      </tr>
      <tr>
          <td>MTL-BERT_MEDIUM</td>
          <td>8</td>
          <td>8</td>
          <td>256</td>
          <td>1,024</td>
          <td>0.962</td>
          <td>0.852</td>
      </tr>
      <tr>
          <td>MTL-BERT_LARGE</td>
          <td>12</td>
          <td>12</td>
          <td>576</td>
          <td>2,304</td>
          <td>0.974</td>
          <td>0.848</td>
      </tr>
  </tbody>
</table>
<p>The medium model was selected for its best fine-tuning performance with lower computational cost, despite the large model achieving higher pretraining recovery accuracy. The slight performance drop for the large model suggests mild overfitting.</p>
<h3 id="multitask-fine-tuning-with-task-tokens">Multitask Fine-tuning with Task Tokens</h3>
<p>During fine-tuning, task tokens ([T0], [T1], &hellip;) are prepended to each input SMILES string. The Transformer output at each task token position is passed through a task-specific two-layer feedforward network for the corresponding prediction task. An attention mask prevents direct information exchange between task tokens, allowing each task to learn directly from SMILES tokens without interference. This design also reduces the discrepancy between pretraining (no task tokens visible) and fine-tuning.</p>
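<p>One plausible construction of that attention mask is sketched below. This is an interpretation, not the paper's code: only task-token-to-task-token attention is blocked here, and whether SMILES positions are also prevented from attending to task tokens is a detail the description leaves open.</p>

```python
import numpy as np

def build_task_token_mask(n_tasks: int, n_smiles: int) -> np.ndarray:
    """Boolean attention mask for a sequence [T0 .. T{n_tasks-1}, s1 .. s_n].

    True = attention allowed. Every position may attend to the SMILES
    tokens and to itself; distinct task tokens are masked from each other,
    so each task head reads the molecule directly without interference
    from other tasks."""
    n = n_tasks + n_smiles
    mask = np.ones((n, n), dtype=bool)
    for i in range(n_tasks):
        for j in range(n_tasks):
            if i != j:
                mask[i, j] = False  # block task-token <-> task-token attention
    return mask
```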
<p>Cross-entropy loss is used for classification tasks and mean squared error for regression tasks. The total multitask loss is a simple sum of per-task losses without learned weighting.</p>
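<p>The unweighted multitask objective is simple enough to state directly (a minimal sketch; the dictionary-based interface and binary cross-entropy form are assumptions for illustration):</p>

```python
import numpy as np

def multitask_loss(preds, targets, task_types):
    """Unweighted sum of per-task losses: binary cross-entropy for
    classification heads, mean squared error for regression heads.

    preds, targets: dict task_name -> np.ndarray of predictions/labels.
    task_types: dict task_name -> "classification" or "regression"."""
    total = 0.0
    for task, y_hat in preds.items():
        y = targets[task]
        if task_types[task] == "classification":
            p = np.clip(y_hat, 1e-7, 1 - 1e-7)  # avoid log(0)
            total += -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        else:
            total += np.mean((y_hat - y) ** 2)
    return total
```

Because the per-task losses are simply summed, tasks with larger label scales implicitly receive more weight, a point the paper's limitations discussion touches on.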
<h3 id="smiles-enumeration-as-data-augmentation">SMILES Enumeration as Data Augmentation</h3>
<p>A molecule can be represented by multiple valid SMILES strings by varying starting atoms and traversal orders. MTL-BERT applies SMILES enumeration at all three stages:</p>
<ol>
<li><strong>Pretraining</strong>: Enumerated SMILES increase diversity of the self-supervised training data.</li>
<li><strong>Fine-tuning</strong>: Each dataset is augmented 20x with random SMILES variants, increasing data diversity and helping the model learn position-invariant features.</li>
<li><strong>Inference</strong>: Multiple SMILES are generated per test molecule, and their predictions are averaged (fused) into a more robust final prediction.</li>
</ol>
<p>The 20x augmentation factor was chosen based on prior work showing diminishing returns beyond this level while significantly increasing computational cost.</p>
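<p>SMILES enumeration itself is a short RDKit routine. The sketch below is an illustration under the assumption that RDKit is available; <code>doRandom=True</code> randomizes the atom traversal order, and the retry cap mirrors the paper's repeated-search limit of 100 attempts.</p>

```python
from rdkit import Chem

def enumerate_smiles(smiles: str, n: int, max_tries: int = 100) -> list[str]:
    """Generate up to n distinct random SMILES for one molecule,
    retrying (up to max_tries total draws) when a randomized traversal
    repeats a string already seen."""
    mol = Chem.MolFromSmiles(smiles)
    seen, tries = [], 0
    while len(seen) < n and tries < max_tries:
        s = Chem.MolToSmiles(mol, canonical=False, doRandom=True)
        tries += 1
        if s not in seen:
            seen.append(s)
    return seen

variants = enumerate_smiles("CC(=O)Oc1ccccc1", n=5)  # aspirin
```

Every variant parses back to the same molecule, so the labels of the original compound carry over unchanged.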
<h2 id="experimental-evaluation-across-60-datasets">Experimental Evaluation Across 60 Datasets</h2>
<h3 id="setup">Setup</h3>
<p>MTL-BERT was evaluated on 60 datasets (44 classification, 16 regression) covering ADMET properties and common molecular benchmarks. Datasets were sourced from ADMETlab and <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>. Each dataset was split 8:1:1 (train/validation/test), and experiments were repeated 10 times with random splits, reporting mean and standard deviation.</p>
<p>Classification tasks were evaluated with <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> and accuracy; regression tasks with $R^2$ and RMSE.</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ul>
<li><strong>ECFP4-XGBoost</strong>: Extended-connectivity fingerprints (diameter 4) with gradient boosting</li>
<li><strong>Graph Attention Network (GAT)</strong></li>
<li><strong>Graph Convolutional Network (GCN)</strong></li>
<li><strong>AttentiveFP</strong>: A GNN with attention for molecular property prediction</li>
<li><strong>CDDD</strong>: Continuous and data-driven descriptors from a pretrained RNN auto-encoder</li>
</ul>
<h3 id="ablation-study">Ablation Study</h3>
<p>Three model variants were compared to isolate contributions:</p>
<ul>
<li><strong>MTL-BERT</strong>: Full model (pretraining + multitask + SMILES enumeration)</li>
<li><strong>STL-BERT</strong>: Single-task fine-tuning with SMILES enumeration (no multitask)</li>
<li><strong>Cano-BERT</strong>: Canonical SMILES only, single-task fine-tuning (equivalent to SMILES-BERT)</li>
</ul>
<p>Cano-BERT showed more than 10% degradation on several datasets (CL, Fu, LC50DM) compared to STL-BERT, demonstrating the importance of SMILES enumeration. MTL-BERT outperformed STL-BERT on most datasets, with improvements exceeding 5% on $F_{20\%}$, SR-ARE, and SR-ATAD5, confirming that multitask learning provides additional benefit on top of enumeration.</p>
<h3 id="results-vs-baselines">Results vs. Baselines</h3>
<p>MTL-BERT outperformed all baselines on nearly all 60 datasets. Specific findings:</p>
<ul>
<li>ECFP4-XGBoost performed inconsistently, doing well on some tasks (e.g., $F_{30\%}$, BACE, CL) but poorly on others, reflecting the limitation of fixed-length fingerprint representations.</li>
<li>GNNs generally improved over fingerprints but still suffered from data scarcity, falling behind ECFP4-XGBoost by more than 3% on $F_{30\%}$, Carcinogenicity, CL, and VD.</li>
<li>MTL-BERT surpassed all baselines on every task except CYP2C19-sub and BACE, where it trailed the best baseline by less than 1.1%.</li>
<li>On 14 tasks (NR-ER, NR-PPAR-gamma, SR-ARE, SR-ATAD5, SR-HSE, SR-MMP, Bioconcentration Factor, Fu, LC50FM, Lipophilicity, CL, PPB, VD, LC50DM), MTL-BERT exceeded the best baseline by 5% to more than 10%.</li>
<li>Improvements were statistically significant (paired t-test, $P \leq 0.001$).</li>
</ul>
<h3 id="representation-analysis">Representation Analysis</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pretrained token embeddings (from 1,000 randomly selected molecules, approximately 35,000 tokens) showed that:</p>
<ul>
<li>Tokens of the same type cluster together (capturing atomic type information).</li>
<li>Within type clusters, sub-groups correspond to different chemical environments (e.g., oxygen atoms in nitrate groups vs. carbonyl groups).</li>
<li>Nearby embeddings share similar molecular neighborhood environments.</li>
</ul>
<h3 id="attention-based-interpretability">Attention-based Interpretability</h3>
<p>The model&rsquo;s attention weights provide interpretability for predictions:</p>
<ul>
<li>For the solubility and lipophilicity tasks (LogS, LogD), attention concentrated on polar groups, which are known determinants of aqueous solubility.</li>
<li>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> (mutagenicity), attention focused on <a href="https://en.wikipedia.org/wiki/Azide">azide</a>, nitrosamide, <a href="https://en.wikipedia.org/wiki/Acyl_chloride">acylchloride</a>, and nitrite groups, which are known mutagenic structural alerts.</li>
</ul>
<h2 id="performance-gains-from-combined-strategies-with-interpretable-attention">Performance Gains from Combined Strategies with Interpretable Attention</h2>
<p>MTL-BERT demonstrates that the combination of pretraining, multitask learning, and SMILES enumeration is more effective than any individual strategy for molecular property prediction. The ablation study provides clear evidence for the additive benefit of each component.</p>
<p>Key strengths include the breadth of evaluation (60 datasets covering diverse ADMET endpoints), the consistent improvement over multiple baseline types (fingerprints, GNNs, pretrained representations), and the interpretable attention mechanism that highlights chemically meaningful substructures.</p>
<p>Limitations to note: the simple sum of multitask losses (no learned task weighting) may not be optimal when tasks have very different scales or when some tasks are unrelated. The authors observe slight degradation on a few datasets (AMES, CYP1A2-Sub, FreeSolv), suggesting negative transfer in those cases. The 20x SMILES enumeration significantly increases computational cost during fine-tuning and inference. The paper does not report wall-clock training times or GPU hours, making it difficult to assess the practical cost of the enumeration strategy. Hardware details are not specified beyond acknowledgment of the High-Performance Computing Center at Central South University.</p>
<p>The hierarchical clustering of task representations reveals meaningful task groupings (e.g., LogD and LogP cluster together due to their shared relationship with water solubility), supporting the premise that multitask learning captures cross-task correlations.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>1.7M molecules</td>
          <td>Unlabeled SMILES; 10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning/Evaluation</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>60 datasets (44 classification, 16 regression)</td>
          <td>8:1:1 train/val/test split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Masked token prediction (15% masking rate: 80% [MASK], 10% random, 10% unchanged). Adam optimizer, learning rate 1e-4, batch size 512, 50 epochs.</li>
<li><strong>Fine-tuning</strong>: Adam optimizer, learning rate 5e-5, batch size 64, dropout 0.1. Cross-entropy for classification, MSE for regression. Early stopping with patience 20, max 200 epochs.</li>
<li><strong>SMILES enumeration</strong>: 20x augmentation. Repeated search up to 100 times if enumerated SMILES is identical to a previous one.</li>
<li><strong>Inference fusion</strong>: Predictions from multiple enumerated SMILES are averaged.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>MTL-BERT_MEDIUM (selected model): 8 layers, 8 attention heads, 256 embedding size, 1,024 FFN size</li>
<li>Pretraining recovery accuracy: 0.962</li>
<li>1,000 task tokens pre-allocated for future tasks</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Classification</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>$R^2$</td>
          <td>Regression</td>
          <td>Primary metric</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression</td>
          <td>Secondary metric</td>
      </tr>
  </tbody>
</table>
<p>All experiments repeated 10 times with random splits; mean and standard deviation reported.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The authors acknowledge the High-Performance Computing Center of Central South University.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/MTL-BERT">MTL-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Pretraining data source</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Fine-tuning benchmark</td>
      </tr>
      <tr>
          <td><a href="https://admetmesh.scbdd.com/">ADMETlab</a></td>
          <td>Dataset</td>
          <td>Free for academic use</td>
          <td>ADMET property datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yi, J.-C., Zeng, X.-X., Yang, C.-Q., Lu, A.-P., Hou, T.-J., &amp; Cao, D.-S. (2022). Pushing the boundaries of molecular property prediction for drug discovery with multitask learning BERT enhanced by SMILES enumeration. <em>Research</em>, 2022, Article 0004. <a href="https://doi.org/10.34133/research.0004">https://doi.org/10.34133/research.0004</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2022mtlbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yi, Jia-Cai and Zeng, Xiang-Xiang and Yang, Can-Qun and Lu, Ai-Ping and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{Article 0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.34133/research.0004}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science (AAAS)}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
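<p>The "reduced duplication" compromise (strategy 4) can be made concrete with a small sketch. This reflects my reading of the strategy, with the generation of random SMILES abstracted away into an input list; the cap of $\sqrt{m}$ copies per repeated string is the paper's $f(m)$.</p>

```python
import math
from collections import Counter

def reduce_duplicates(random_smiles: list[str], m: int) -> list[str]:
    """'Reduced duplication': cap each repeated SMILES at sqrt(m) copies,
    a middle ground between keeping all duplicates (with-duplication)
    and keeping one copy of each (without-duplication)."""
    cap = max(1, round(math.sqrt(m)))
    kept, counts = [], Counter()
    for s in random_smiles:
        if counts[s] < cap:
            counts[s] += 1
            kept.append(s)
    return kept
```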
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
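<p>The test-time ensemble above fits in a few lines (a sketch: the model is a stand-in callable, and the aggregation $A$ is the mean, as in the paper):</p>

```python
import statistics

def predict_with_confidence(model, random_smiles: list[str]):
    """Test-time augmentation: predict on k random SMILES of the same
    compound, average them for the final prediction, and report the
    standard deviation of the per-SMILES predictions as a confidence
    measure (high std = low confidence)."""
    per_smiles = [model(s) for s in random_smiles]
    return statistics.mean(per_smiles), statistics.stdev(per_smiles)
```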
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
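<p>Such a confidence curve can be computed as follows (a sketch of the procedure as I understand it; the paper does not publish this exact code):</p>

```python
import numpy as np

def confidence_curve(errors, uncertainties):
    """Mean absolute error of the retained set as compounds are ranked
    by uncertainty. Entry k-1 is the mean error when keeping only the
    k most confident compounds (lowest per-SMILES std)."""
    order = np.argsort(uncertainties)            # most confident first
    sorted_errors = np.asarray(errors, dtype=float)[order]
    return np.cumsum(sorted_errors) / np.arange(1, len(sorted_errors) + 1)
```

A monotonically increasing curve (lower error at the confident end) indicates that the per-SMILES standard deviation is a useful uncertainty signal.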
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and $R^2$ of 0.712, compared to 1.031 RMSE and 0.494 $R^2$ for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 $R^2$) performs comparably, which the authors note without further explanation.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
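<p>As a rough illustration of the encoding step, the sketch below (an illustrative mock-up, not the Maxsmi source) pads a SMILES string to a fixed maximum length and one-hot encodes it over a character vocabulary. In the actual pipeline, augmented variants are first generated with RDKit's random SMILES enumeration (e.g., <code>Chem.MolToSmiles(mol, doRandom=True)</code>); the padding symbol and toy vocabulary here are assumptions.</p>

```python
PAD = "_"  # padding symbol (choice of symbol is an assumption)

def one_hot_encode(smiles, vocab, max_len):
    """One-hot encode a SMILES string, right-padded to max_len.

    `vocab` must contain every character in `smiles` plus PAD.
    Returns a max_len x len(vocab) matrix of 0/1 ints.
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    padded = smiles.ljust(max_len, PAD)
    matrix = []
    for ch in padded:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix

vocab = [PAD, "C", "O", "(", ")", "=", "1", "c"]
enc = one_hot_encode("CC(=O)O", vocab, max_len=10)  # acetic acid
```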
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer-CNN: SMILES Embeddings for QSAR Modeling</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/transformer-cnn-qsar-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/transformer-cnn-qsar-modeling/</guid><description>Transformer-CNN uses SMILES embeddings from a canonicalization Transformer with a CNN head for interpretable QSAR property prediction.</description><content:encoded><![CDATA[<h2 id="transformer-based-smiles-embeddings-for-property-prediction">Transformer-Based SMILES Embeddings for Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces Transformer-CNN, a two-stage architecture for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> (Quantitative Structure-Activity Relationship) modeling. The primary contribution is a transfer learning approach: a Transformer model is first trained on the task of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> canonicalization (mapping non-canonical SMILES to canonical forms), and the encoder&rsquo;s internal representations are then used as &ldquo;dynamic SMILES embeddings&rdquo; for downstream property prediction via a convolutional neural network (TextCNN). The authors also contribute an interpretability framework based on Layer-wise Relevance Propagation (LRP) that traces predictions back to individual atom contributions.</p>
<h2 id="from-descriptors-to-learned-embeddings-in-qsar">From Descriptors to Learned Embeddings in QSAR</h2>
<p>Traditional QSAR methods rely on hand-engineered molecular descriptors (fragment counts, physicochemical features) coupled with feature selection and classical ML algorithms. While deep learning approaches that operate on raw SMILES strings or molecular graphs have reduced the need for manual feature engineering, they typically require large training datasets to learn effective representations from scratch. QSAR datasets, in contrast, often contain only hundreds of molecules, making it difficult to train end-to-end deep models.</p>
<p>The authors identify two specific gaps. First, existing SMILES-based autoencoders such as <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a> (Continuous and Data-Driven molecular Descriptors) produce fixed-length latent vectors, discarding positional information that could be useful for property prediction and interpretation. Second, QSAR models built on deep architectures generally lack interpretability, making it hard to verify that predictions rely on chemically meaningful structural features rather than spurious correlations.</p>
<h2 id="dynamic-smiles-embeddings-via-canonicalization-pre-training">Dynamic SMILES Embeddings via Canonicalization Pre-training</h2>
<p>The core insight is that training a Transformer to perform SMILES canonicalization (a Seq2Seq task mapping non-canonical SMILES to canonical SMILES) produces an encoder whose internal states serve as information-rich, position-dependent molecular embeddings.</p>
<h3 id="pre-training-on-smiles-canonicalization">Pre-training on SMILES Canonicalization</h3>
<p>The Transformer encoder-decoder is trained on approximately 17.7 million canonicalization pairs derived from the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database (SMILES with length up to 110 characters). Each molecule is augmented 10 times by generating non-canonical SMILES variants, plus one identity pair where both sides are canonical. The training uses character-level tokenization with a 66-symbol vocabulary covering drug-like molecules including stereochemistry, charges, and inorganic ions.</p>
<p>The Transformer architecture follows Vaswani et al. with 3 layers and 10 self-attention heads. The learning rate schedule follows:</p>
<p>$$\lambda = \text{factor} \cdot \min(1.0,; \text{step} / \text{warmup}) / \max(\text{step},; \text{warmup})$$</p>
<p>where factor = 20, warmup = 16,000 steps, and $\lambda$ is clipped at a minimum of $10^{-4}$. Training runs for 10 epochs (275,907 batches per epoch) without early stopping.</p>
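<p>Written out as code, this is a linear warmup followed by 1/step decay with a floor. The helper below is a sketch of the formula with the paper's constants, not the authors' implementation:</p>

```python
def lr_schedule(step, factor=20.0, warmup=16_000, floor=1e-4):
    """Learning rate: linear warmup to factor/warmup, then 1/step decay,
    clipped below at `floor` (constants taken from the paper)."""
    lam = factor * min(1.0, step / warmup) / max(step, warmup)
    return max(lam, floor)
```

<p>The peak rate at the end of warmup is factor/warmup = 1.25 &times; 10<sup>-3</sup>; the 1/step decay reaches the 10<sup>-4</sup> floor at step = factor/floor = 200,000.</p>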
<p>On validation with 500,000 generated ChEMBL-like SMILES, the model correctly canonicalizes 83.6% of all samples. Performance drops for stereochemistry (37.2% for @-containing SMILES) and cis/trans notation (73.9%).</p>
<h3 id="from-encoder-states-to-qsar-predictions">From Encoder States to QSAR Predictions</h3>
<p>After pre-training, the encoder&rsquo;s output for a molecule with $N$ characters is a matrix of dimensions $(N, \text{EMBEDDINGS})$. Unlike fixed-length CDDD descriptors, these &ldquo;dynamic embeddings&rdquo; preserve positional information, meaning equivalent characters receive different embedding values depending on their context and position.</p>
<p>To handle variable-length embeddings, the authors use a TextCNN architecture (from DeepChem) with 1D convolutional filters at kernel sizes (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20) producing (100, 200, 200, 200, 200, 100, 100, 100, 100, 100, 160, 160) filters respectively. After GlobalMaxPool and concatenation, the features pass through Dropout (rate = 0.25), a Dense layer ($N = 512$), a Highway layer, and finally an output layer (1 neuron for regression, 2 for classification).</p>
<p>The Transformer weights are frozen during QSAR training. The Adam optimizer is used with a fixed learning rate of $10^{-4}$ and early stopping on a 10% held-out validation set. Critically, SMILES augmentation ($n = 10$) is applied during both training and inference, with the final prediction being the average over augmented SMILES for each molecule.</p>
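<p>The reason a convolution-plus-GlobalMaxPool head can consume variable-length embeddings is that each filter reduces to a single scalar regardless of sequence length. The toy sketch below (pure Python, one scalar filter; an illustration rather than the TextCNN code) shows that the pooled response to a motif is unchanged when the sequence is padded or the motif shifts position:</p>

```python
def conv1d_global_max(seq, kernel):
    """Valid-mode 1D convolution of a scalar sequence with a single
    filter, followed by global max pooling -- one TextCNN branch,
    in miniature."""
    k = len(kernel)
    responses = [
        sum(seq[i + j] * kernel[j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]
    return max(responses)

motif = [1.0, 1.0]                           # fires on two adjacent 1s
short = [0.0, 1.0, 1.0, 0.0]
long_ = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0]  # same motif, shifted and padded
```

<p>Both sequences yield the same pooled response (2.0), which is what makes the concatenated max-pooled features a fixed-size vector for the downstream dense layers.</p>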
<h3 id="interpretability-via-layer-wise-relevance-propagation">Interpretability via Layer-wise Relevance Propagation</h3>
<p>The LRP algorithm propagates relevance scores from the output back through the CNN layers to the Transformer encoder output (which is position-wise). The relevance conservation property holds:</p>
<p>$$y = R = f(x) = \sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} = \cdots = \sum_{l \in (1)} R_{l}$$</p>
<p>In practice, biases absorb some relevance, so the total propagated to the input is less than the output:</p>
<p>$$\sum_{l \in (L)} R_{l} = \sum_{l \in (L-1)} R_{l} + B$$</p>
<p>For gated connections in the Highway block, the authors implement the signal-take-all redistribution rule. The interpretation algorithm generates one SMILES per non-hydrogen atom (each drawn starting from that atom), runs LRP on each, and averages contributions. If more than 50% of relevance dissipates on biases, the interpretation may be unreliable, serving as an applicability domain indicator.</p>
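<p>The following is a minimal epsilon-rule LRP step for a single dense layer, written from the conservation equations above rather than from the authors' code; it makes the bias-absorption term explicit. The epsilon stabilizer and the toy weights are assumptions for illustration.</p>

```python
def lrp_dense(x, W, b, R_out, eps=1e-9):
    """Redistribute output relevance R_out to the inputs of a dense layer
    in proportion to the contributions z_ij = x[i] * W[i][j].

    Biases absorb b[j] / z[j] of each unit's relevance, which is why
    sum(R_in) can fall short of sum(R_out)."""
    n_in, n_out = len(x), len(R_out)
    z = [sum(x[i] * W[i][j] for i in range(n_in)) + b[j] for j in range(n_out)]
    return [
        sum(x[i] * W[i][j] / (z[j] + eps) * R_out[j] for j in range(n_out))
        for i in range(n_in)
    ]

x = [1.0, 2.0]
W = [[0.5, -0.2], [0.3, 0.4]]
b = [0.1, 0.0]
z = [1.2, 0.6]          # pre-activations, used here as the output relevance
R_in = lrp_dense(x, W, b, R_out=z)
```

<p>Here sum(R_in) = 1.7 while sum(R_out) = 1.8; the missing 0.1 is exactly the bias of the first unit, matching the conservation-with-bias equation above.</p>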
<h2 id="benchmarks-across-18-regression-and-classification-datasets">Benchmarks Across 18 Regression and Classification Datasets</h2>
<p>The authors evaluate on the same 18 datasets (9 regression, 9 classification) used in their previous SMILES augmentation study, enabling direct comparison. All experiments use five-fold cross-validation.</p>
<h3 id="regression-results-r2">Regression Results ($r^2$)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MP (19,104)</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center"><strong>0.86</strong></td>
          <td style="text-align: center">0.85</td>
      </tr>
      <tr>
          <td>BP (11,893)</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.98</td>
          <td style="text-align: center">0.97</td>
          <td style="text-align: center"><strong>0.98</strong></td>
          <td style="text-align: center">0.98</td>
      </tr>
      <tr>
          <td>BCF (378)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center"><strong>0.85</strong></td>
          <td style="text-align: center">0.81</td>
      </tr>
      <tr>
          <td>FreeSolv (642)</td>
          <td style="text-align: center"><strong>0.94</strong></td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
      </tr>
      <tr>
          <td>LogS (1,311)</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.91</td>
      </tr>
      <tr>
          <td>Lipo (4,200)</td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.60</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center"><strong>0.74</strong></td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.73</td>
          <td style="text-align: center">0.72</td>
          <td style="text-align: center">0.66</td>
          <td style="text-align: center"><strong>0.76</strong></td>
          <td style="text-align: center">0.75</td>
      </tr>
      <tr>
          <td>DHFR (739)</td>
          <td style="text-align: center">0.62</td>
          <td style="text-align: center">0.63</td>
          <td style="text-align: center">0.46</td>
          <td style="text-align: center"><strong>0.67</strong></td>
          <td style="text-align: center">0.61</td>
      </tr>
      <tr>
          <td>LEL (483)</td>
          <td style="text-align: center">0.19</td>
          <td style="text-align: center">0.25</td>
          <td style="text-align: center">0.20</td>
          <td style="text-align: center"><strong>0.27</strong></td>
          <td style="text-align: center">0.23</td>
      </tr>
  </tbody>
</table>
<h3 id="classification-results-auc">Classification Results (AUC)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th style="text-align: center">Descriptor-based</th>
          <th style="text-align: center">SMILES-based (augm=10)</th>
          <th style="text-align: center">Transformer-CNN (no augm)</th>
          <th style="text-align: center">Transformer-CNN (augm=10)</th>
          <th style="text-align: center">CDDD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HIV (41,127)</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.74</td>
      </tr>
      <tr>
          <td>AMES (6,542)</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.86</td>
          <td style="text-align: center"><strong>0.89</strong></td>
          <td style="text-align: center">0.86</td>
      </tr>
      <tr>
          <td>BACE (1,513)</td>
          <td style="text-align: center">0.88</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center">0.89</td>
          <td style="text-align: center"><strong>0.91</strong></td>
          <td style="text-align: center">0.90</td>
      </tr>
      <tr>
          <td>ClinTox (1,478)</td>
          <td style="text-align: center"><strong>0.77</strong></td>
          <td style="text-align: center">0.76</td>
          <td style="text-align: center">0.71</td>
          <td style="text-align: center">0.77</td>
          <td style="text-align: center">0.73</td>
      </tr>
      <tr>
          <td>Tox21 (7,831)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.83</strong></td>
          <td style="text-align: center">0.81</td>
          <td style="text-align: center">0.82</td>
          <td style="text-align: center">0.82</td>
      </tr>
      <tr>
          <td>BBBP (2,039)</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.90</td>
          <td style="text-align: center"><strong>0.92</strong></td>
          <td style="text-align: center">0.89</td>
      </tr>
      <tr>
          <td>JAK3 (886)</td>
          <td style="text-align: center">0.79</td>
          <td style="text-align: center"><strong>0.80</strong></td>
          <td style="text-align: center">0.70</td>
          <td style="text-align: center">0.78</td>
          <td style="text-align: center">0.76</td>
      </tr>
      <tr>
          <td>BioDeg (1,737)</td>
          <td style="text-align: center">0.92</td>
          <td style="text-align: center"><strong>0.93</strong></td>
          <td style="text-align: center">0.91</td>
          <td style="text-align: center">0.93</td>
          <td style="text-align: center">0.92</td>
      </tr>
      <tr>
          <td>RP AR (930)</td>
          <td style="text-align: center">0.85</td>
          <td style="text-align: center"><strong>0.87</strong></td>
          <td style="text-align: center">0.83</td>
          <td style="text-align: center">0.87</td>
          <td style="text-align: center">0.86</td>
      </tr>
  </tbody>
</table>
<h3 id="key-comparisons">Key Comparisons</h3>
<p>Baselines include descriptor-based methods (the best from LibSVM, Random Forest, XGBoost, ASNN, and DNNs), direct SMILES-based models with augmentation, and CDDD descriptors analyzed by the same classical ML methods. CDDD descriptors come from the Sml2canSml autoencoder approach, which produces fixed 512-dimensional vectors.</p>
<p>Transformer-CNN with augmentation matches or exceeds all baselines on 14 of 18 datasets. The effect of augmentation is dramatic: without it, Transformer-CNN underperforms substantially (e.g., BCF drops from 0.85 to 0.71, JAK3 from 0.78 to 0.70). This confirms that the internal consensus from multiple SMILES representations is essential to the method&rsquo;s effectiveness.</p>
<p>A practical advantage over CDDD is that Transformer-CNN imposes no constraints on molecular properties (CDDD requires logP in (-5, 7), molecular weight under 12,600, 3-50 heavy atoms, and organic molecules only), since the Transformer was trained on the full diversity of ChEMBL.</p>
<h3 id="interpretability-case-studies">Interpretability Case Studies</h3>
<p>For <a href="https://en.wikipedia.org/wiki/Ames_test">AMES</a> mutagenicity, the LRP analysis of 1-Bromo-4-nitrobenzene correctly identifies the nitro group and halogen as structural alerts, consistent with known mutagenicity rules. For aqueous solubility of <a href="https://en.wikipedia.org/wiki/Haloperidol">haloperidol</a>, the model assigns positive contributions to hydroxyl, carbonyl, and aliphatic nitrogen groups (which increase solubility) and negative contributions to aromatic carbons (which decrease it). Both cases align with established chemical knowledge, supporting the trustworthiness of the model.</p>
<h2 id="effective-transfer-learning-for-small-qsar-datasets">Effective Transfer Learning for Small QSAR Datasets</h2>
<p>Transformer-CNN achieves competitive or superior QSAR performance across 18 diverse benchmarks by combining three ingredients: (1) Transformer-based pre-training via SMILES canonicalization, (2) SMILES augmentation during training and inference, and (3) a lightweight CNN head. The method requires minimal hyperparameter tuning, as the Transformer weights are frozen and the CNN architecture is fixed.</p>
<p>The authors acknowledge several limitations and future directions:</p>
<ul>
<li>Stereochemistry canonicalization accuracy is low (37.2%), which could impact models for stereo-sensitive properties</li>
<li>The LRP interpretability depends on sufficient relevance propagation (at least 50% reaching the input layer)</li>
<li>The variance among augmented SMILES predictions could serve as a confidence estimate, but this is left to future work</li>
<li>Applicability domain assessment based on SMILES reconstruction quality is proposed but not fully developed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (SMILES &lt;= 110 chars)</td>
          <td>17.7M pairs</td>
          <td>10x augmentation + 1 identity pair per molecule</td>
      </tr>
      <tr>
          <td>Validation (canon.)</td>
          <td>Generated ChEMBL-like SMILES</td>
          <td>500,000</td>
          <td>From a molecular generator</td>
      </tr>
      <tr>
          <td>QSAR benchmarks</td>
          <td>9 regression + 9 classification</td>
          <td>378-41,127</td>
          <td>Available on OCHEM (<a href="https://ochem.eu">https://ochem.eu</a>)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer: 3 layers, 10 self-attention heads, character-level tokenization (66 symbols)</li>
<li>TextCNN: 12 kernel sizes (1-10, 15, 20) with 100-200 filters each, GlobalMaxPool, Dense(512), Highway, Dropout(0.25)</li>
<li>Augmentation: n=10 non-canonical SMILES per molecule during training and inference</li>
<li>LRP: signal-take-all redistribution for Highway gates, standard LRP for Dense and Conv layers</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Transformer encoder weights pre-trained on canonicalization task (frozen during QSAR training)</li>
<li>QSAR CNN trained with Adam optimizer, learning rate $10^{-4}$, early stopping</li>
<li>Pre-trained embeddings and standalone prediction models available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: coefficient of determination $r^2 = 1 - SS_{\text{res}} / SS_{\text{tot}}$</li>
<li>Classification: Area Under the ROC Curve (AUC)</li>
<li>Five-fold cross-validation with bootstrap standard errors</li>
</ul>
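<p>For reference, the coefficient of determination from the formula above can be computed directly (a generic implementation, equivalent to what scikit-learn's <code>r2_score</code> provides):</p>

```python
def r_squared(y_true, y_pred):
    """r^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

<p>Perfect predictions give 1.0, and always predicting the mean of the targets gives 0.0.</p>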
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P6000, Titan Xp, and Titan V GPUs (donated by NVIDIA)</li>
<li>TensorFlow v1.12.0, RDKit v2018.09.2</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bigchem/transformer-cnn">transformer-cnn</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Source code, pre-trained embeddings, standalone prediction models</td>
      </tr>
      <tr>
          <td><a href="https://ochem.eu">OCHEM</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online platform hosting the method, training datasets, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Karpov, P., Godin, G., &amp; Tetko, I. V. (2020). Transformer-CNN: Swiss knife for QSAR modeling and interpretation. <em>Journal of Cheminformatics</em>, 12, 17. <a href="https://doi.org/10.1186/s13321-020-00423-w">https://doi.org/10.1186/s13321-020-00423-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{karpov2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-{CNN}: Swiss knife for {QSAR} modeling and interpretation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Karpov, Pavel and Godin, Guillaume and Tetko, Igor V.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00423-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMILES2Vec: Interpretable Chemical Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/smiles2vec-interpretable-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/smiles2vec-interpretable-property-prediction/</guid><description>SMILES2Vec uses a Bayesian-optimized CNN-GRU architecture to predict chemical properties directly from SMILES strings with an interpretable explanation mask.</description><content:encoded><![CDATA[<h2 id="a-general-purpose-rnn-for-chemical-property-prediction-from-smiles">A General-Purpose RNN for Chemical Property Prediction from SMILES</h2>
<p>SMILES2Vec is a <strong>Method</strong> paper that introduces a deep recurrent neural network architecture for predicting chemical properties directly from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> text representations. The primary contributions are: (1) a Bayesian-optimized CNN-<a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> architecture that serves as a general-purpose predictor for diverse chemical properties (toxicity, activity, solubility, <a href="https://en.wikipedia.org/wiki/Solvation">solvation</a> energy), (2) an explanation mask mechanism that provides interpretable predictions by identifying which SMILES characters drive the network&rsquo;s decisions, and (3) evidence that representation learning from raw SMILES can match or outperform models using hand-crafted molecular descriptors.</p>
<h2 id="motivation-beyond-engineered-features-in-chemical-modeling">Motivation: Beyond Engineered Features in Chemical Modeling</h2>
<p>At the time of writing (2017), deep learning models in chemistry relied heavily on engineered <a href="https://en.wikipedia.org/wiki/Molecular_descriptor">molecular descriptors</a> and fingerprints as input features. Over 5,000 molecular descriptors had been developed since the late 1940s, and <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a>/QSPR modeling remained the dominant paradigm. The authors identified two key limitations with this approach:</p>
<ol>
<li><strong>Restricted search space</strong>: Engineered features limit the neural network&rsquo;s ability to discover potentially useful representations that domain experts have not anticipated.</li>
<li><strong>Incomplete domain knowledge</strong>: For complex properties where first-principles understanding is incomplete, the lack of appropriate descriptors constrains model performance.</li>
</ol>
<p>In contrast, computer vision and NLP had shown that deep learning models trained on raw data (unaltered images, raw text) could learn powerful representations without feature engineering. The chemical SMILES notation, a text-based encoding of molecular structure that serves as the standard interchange format in cheminformatics, provided a natural analog to text data for NLP-style modeling.</p>
<p>A secondary motivation was interpretability. Most ML and DL models for chemistry operated as black boxes, which posed particular problems for regulated applications like FDA drug approval where mechanistic explanations are required.</p>
<h2 id="core-innovation-cnn-gru-architecture-with-explanation-masks">Core Innovation: CNN-GRU Architecture with Explanation Masks</h2>
<h3 id="architecture-design-via-bayesian-optimization">Architecture Design via <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian Optimization</a></h3>
<p>SMILES2Vec treats SMILES strings as character-level text input. The network processes one-hot encoded characters (padded to length 250, covering 99.9% of the <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> database) through three stages:</p>
<ol>
<li><strong>Embedding layer</strong>: Maps one-hot character vectors to a learned embedding space (size 50)</li>
<li><strong>1D convolutional layer</strong>: 192 filters with kernel size 3, stride 1</li>
<li><strong>Bidirectional GRU layers</strong>: Two layers with 224 and 384 units respectively</li>
</ol>
<p>The authors explored four architectural classes (GRU, LSTM, CNN-GRU, CNN-LSTM) using Bayesian optimization via SigOpt. Each class was evaluated over 60 trials, optimizing embedding size, convolutional filter count, and RNN layer widths. The CNN-GRU class was selected as the best compromise: CNN-LSTM performed best on classification (Tox21), while GRU-based networks excelled at regression (FreeSolv). The final architecture is summarized by the hyperparameters:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Parameter</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Embedding</td>
          <td>Size</td>
          <td>50</td>
      </tr>
      <tr>
          <td>Conv1D</td>
          <td>Filters</td>
          <td>192</td>
      </tr>
      <tr>
          <td>BiGRU Layer 1</td>
          <td>Units</td>
          <td>224</td>
      </tr>
      <tr>
          <td>BiGRU Layer 2</td>
          <td>Units</td>
          <td>384</td>
      </tr>
  </tbody>
</table>
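<p>As a concrete illustration of the input pipeline, the sketch below one-hot encodes a SMILES string padded to length 250, as the paper describes; the vocabulary here is a hypothetical subset, not the paper's full ChEMBL character set.</p>

```python
import numpy as np

def one_hot_smiles(smiles, vocab, max_len=250):
    """One-hot encode a SMILES string, padded with zeros to max_len."""
    char_to_idx = {c: i for i, c in enumerate(vocab)}
    encoded = np.zeros((max_len, len(vocab)), dtype=np.float32)
    for pos, char in enumerate(smiles[:max_len]):
        encoded[pos, char_to_idx[char]] = 1.0
    return encoded  # positions beyond len(smiles) stay all-zero (padding)

# Illustrative vocabulary only; the paper's character set covers 99.9% of ChEMBL.
vocab = list("CNOclnos()=#123456789[]+-@HSFPBI")
x = one_hot_smiles("CCO", vocab)  # ethanol
```

The resulting `(250, vocab_size)` array is what the embedding layer consumes, mapping each one-hot row to a 50-dimensional learned vector.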
<h3 id="explanation-mask-for-interpretability">Explanation Mask for Interpretability</h3>
<p>The explanation mask is a post-hoc interpretability mechanism. Given a trained (frozen) SMILES2Vec base model, a separate explanation network learns to produce a per-character mask over the input SMILES string. The mask is trained to preserve the base model&rsquo;s output while masking as much input as possible. The loss function for a single sample is:</p>
<p>$$
\text{Loss}_i = \| f(\text{SMILES}_i, \theta) - \text{Sol}(\text{SMILES}_i) \|_2 + 10^{-6} \, \| \text{MASK}_i \|_2 + 0.05 \, H(\text{MASK}_i)
$$</p>
<p>where $f(\text{SMILES}_i, \theta)$ is the base network prediction, $\text{Sol}(\text{SMILES}_i)$ is the ground truth solubility, $H$ is the entropy of the normalized mask, and $\text{MASK}_i$ is the per-character mask vector. The norm penalty on the mask encourages suppressing as much of the input as possible, while the entropy term penalizes uniform attention distributions.</p>
<p>The explanation network itself is a 20-layer residual network with SELU activations, ending in a 1D convolution of length 1, batch normalization, and a softplus activation. The softplus output ranges from 0 (fully masked) to infinity (amplified attention), allowing the mask to both suppress and emphasize specific SMILES characters.</p>
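<p>A minimal numpy sketch of the mask loss above, assuming a scalar base-model prediction; <code>f_pred</code> and <code>sol_true</code> stand in for the frozen base network's output and the ground-truth solubility, and the mask values are invented for illustration.</p>

```python
import numpy as np

def mask_loss(f_pred, sol_true, mask, norm_coef=1e-6, ent_coef=0.05):
    """Per-sample explanation-mask loss: fidelity + mask norm + entropy."""
    fidelity = np.abs(f_pred - sol_true)       # preserve the base model's output
    mask_norm = np.linalg.norm(mask, ord=2)    # discourage large masks
    p = mask / mask.sum()                      # normalize mask to a distribution
    entropy = -np.sum(p * np.log(p + 1e-12))   # penalize uniform attention
    return fidelity + norm_coef * mask_norm + ent_coef * entropy

mask = np.array([0.1, 2.0, 0.1, 0.1])  # hypothetical mask over 4 SMILES characters
loss = mask_loss(f_pred=-3.1, sol_true=-3.0, mask=mask)
```

A uniform mask incurs a higher entropy penalty than a peaked one, which is what pushes the explanation network toward attending to a few decisive characters.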
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The model was evaluated on four datasets from the <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark and the ESOL solubility dataset:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Property</th>
          <th>Task</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>Toxicity</td>
          <td>Multi-task classification</td>
          <td>8,014</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Activity</td>
          <td>Single-task classification</td>
          <td>41,193</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Solvation energy</td>
          <td>Single-task regression</td>
          <td>643</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Solubility</td>
          <td>Single-task regression</td>
          <td>1,128</td>
      </tr>
  </tbody>
</table>
<p>SMILES strings longer than 250 characters were excluded. Classification datasets (Tox21, HIV) used a 1/6 test split with minority-class oversampling; regression datasets (FreeSolv, ESOL) used a 1/10 test split. All experiments used 5-fold cross-validation.</p>
<h3 id="training-protocol">Training Protocol</h3>
<ul>
<li><strong>Optimizer</strong>: RMSprop with learning rate $10^{-3}$, $\rho = 0.9$, $\epsilon = 10^{-8}$</li>
<li><strong>Batch size</strong>: 32</li>
<li><strong>Epochs</strong>: 250 with early stopping (patience of 25 epochs based on validation loss)</li>
<li><strong>Classification loss</strong>: Binary cross-entropy</li>
<li><strong>Regression loss</strong>: Mean absolute error</li>
<li><strong>Metrics</strong>: AUC for classification, RMSE for regression</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>SMILES2Vec was compared against:</p>
<ul>
<li><strong>MLP with engineered features</strong>: Standard multi-layer perceptron using molecular fingerprints (from MoleculeNet)</li>
<li><strong>Molecular graph convolutions</strong>: Graph-based neural network from MoleculeNet</li>
<li><strong>Chemception</strong>: CNN operating on 2D chemical images</li>
</ul>
<h3 id="bayesian-optimization-protocol">Bayesian Optimization Protocol</h3>
<p>Only two datasets were used for architecture optimization: the nr-ahr toxicity task from Tox21 (classification) and FreeSolv (regression). The remaining datasets (full Tox21, HIV, ESOL) served purely for generalization evaluation. A fixed test set was held out during optimization, and the correlation between validation and test metrics (0.54 for Tox21, 0.78 for FreeSolv) suggested limited overfitting to the validation set.</p>
<h2 id="results-competitive-accuracy-with-interpretable-predictions">Results: Competitive Accuracy with Interpretable Predictions</h2>
<h3 id="property-prediction-performance">Property Prediction Performance</h3>
<p>SMILES2Vec achieved the following validation metrics (with a pre-training approach from ChemNet improving performance slightly):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SMILES2Vec</th>
          <th>SMILES2Vec + Pre-training</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Tox21</td>
          <td>AUC</td>
          <td>0.80</td>
          <td>0.81</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>AUC</td>
          <td>0.78</td>
          <td>0.80</td>
          <td>0.80</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE (kcal/mol)</td>
          <td>1.4</td>
          <td>1.2</td>
          <td>1.3</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.63</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Exact numbers for MLP and Chemception baselines were reported only in a bar chart (Figure 6) and not as precise values. The paper states that MLP with fingerprints performed worst across all tasks, and Chemception fell between MLP and the graph/SMILES methods.</p>
<p>Key findings:</p>
<ul>
<li>SMILES2Vec outperformed MLP models using engineered features across all tasks, despite using no feature engineering.</li>
<li>Against graph convolutions (the state-of-the-art at the time), SMILES2Vec matched on classification (Tox21: 0.81 vs 0.81, HIV: 0.80 vs 0.80) and outperformed on regression (FreeSolv: 1.2 vs 1.3).</li>
<li>SMILES2Vec outperformed Chemception (2D image CNN) on classification tasks but slightly underperformed on regression, which the authors attributed to SMILES lacking explicit atomic number information.</li>
</ul>
<h3 id="interpretability-evaluation">Interpretability Evaluation</h3>
<p>On the ESOL solubility dataset, the explanation mask was evaluated against first-principles chemical knowledge. The authors separated compounds into soluble (log solubility &gt; 1.0) and insoluble (log solubility &lt; -5.0) categories and defined ground truth: soluble compounds should attend to hydrophilic atoms (O, N) while insoluble compounds should attend to hydrophobic atoms (C, F, Cl, Br, I). The top-3 character accuracy was 88%, confirming that SMILES2Vec learned representations consistent with known functional group chemistry.</p>
<p>Qualitative analysis of the masks showed that for low-solubility molecules, characters corresponding to hydrophobic groups (c, C, Cl) received high attention, while high-solubility molecules showed attention focused on hydrophilic groups (O, N).</p>
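<p>The top-3 evaluation described above reduces to a simple check; this sketch (with an invented mask and target atom sets) counts a hit when any of the three highest-attention characters belongs to the expected hydrophilic or hydrophobic set.</p>

```python
def top3_hit(smiles, mask, target_chars):
    """True if any of the 3 highest-mask characters is in the expected atom set."""
    order = sorted(range(len(smiles)), key=lambda i: mask[i], reverse=True)
    return any(smiles[i] in target_chars for i in order[:3])

# Hypothetical soluble molecule: mask concentrates on the hydrophilic O and N.
hit = top3_hit("CCON", [0.1, 0.2, 0.9, 0.8], {"O", "N"})
# Hydrophobic-only string with no mass on hydrophilic atoms: no hit.
miss = top3_hit("CCCC", [0.4, 0.3, 0.2, 0.1], {"O", "N"})
```

The reported 88% is the fraction of evaluated compounds for which this check succeeds.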
<h3 id="limitations">Limitations</h3>
<ul>
<li>The interpretability evaluation was limited to solubility, a well-understood property with simple first-principles rules. The authors acknowledged that quantifying interpretability for complex properties (toxicity, activity) where no simple ground truth exists is nontrivial.</li>
<li>The Bayesian optimization used only a subset of datasets, so the architecture may not be globally optimal across all chemical tasks.</li>
<li>SMILES strings lack explicit atomic number information, which may limit performance on physical property prediction compared to image or graph representations.</li>
<li>The explanation mask approach requires training a separate 20-layer network per property, adding computational overhead.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Architecture optimization</td>
          <td>Tox21 (nr-ahr task)</td>
          <td>8,014</td>
          <td>Single toxicity task for Bayesian optimization</td>
      </tr>
      <tr>
          <td>Architecture optimization</td>
          <td>FreeSolv</td>
          <td>643</td>
          <td>Solvation free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Tox21 (full, 12 tasks)</td>
          <td>8,014</td>
          <td>Multi-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,193</td>
          <td>Single-task classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Solubility regression, also used for interpretability</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet. The ESOL dataset is from Delaney (2004).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Bayesian optimization via SigOpt (60 trials per architectural class, 4 classes, 6 manually seeded initial designs per class)</li>
<li>RMSprop optimizer with standard settings</li>
<li>Explanation mask trained with Adam, learning rate annealed from $10^{-2}$ to $10^{-6}$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Final architecture: Embedding(50) -&gt; Conv1D(192, kernel=3, stride=1) -&gt; BiGRU(224) -&gt; BiGRU(384)</li>
<li>Explanation network: 20-layer residual network with SELU activations</li>
<li>No pre-trained weights or code were released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Dataset</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC</td>
          <td>Tox21</td>
          <td>0.81</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>AUC</td>
          <td>HIV</td>
          <td>0.80</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>FreeSolv</td>
          <td>1.2 kcal/mol</td>
          <td>With pre-training</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>ESOL</td>
          <td>0.63</td>
          <td>Base model</td>
      </tr>
      <tr>
          <td>Top-3 accuracy</td>
          <td>ESOL interpretability</td>
          <td>88%</td>
          <td>Explanation mask</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The authors report using TensorFlow with GPU acceleration via NVIDIA cuDNN libraries. Specific GPU models and training times were not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No code, models, or data artifacts were released by the authors. The datasets used are publicly available through MoleculeNet.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Goh, G. B., Hodas, N. O., Siegel, C., &amp; Vishnu, A. (2017). SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties. <em>arXiv preprint arXiv:1712.02034</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{goh2017smiles2vec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SMILES2Vec: An Interpretable General-Purpose Deep Neural Network for Predicting Chemical Properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Goh, Garrett B. and Hodas, Nathan O. and Siegel, Charles and Vishnu, Abhinav}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.02034}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.1712.02034}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolPMoFiT: Inductive Transfer Learning for QSAR</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/</guid><description>MolPMoFiT adapts ULMFiT for QSAR by pre-training an LSTM language model on 1M ChEMBL SMILES and fine-tuning on small molecular property datasets.</description><content:encoded><![CDATA[<h2 id="transfer-learning-meets-molecular-property-prediction">Transfer Learning Meets Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that introduces MolPMoFiT (Molecular Prediction Model Fine-Tuning), a transfer learning approach for <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSPR/QSAR</a> modeling. The primary contribution is adapting the ULMFiT framework from NLP to molecular property prediction by treating <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES strings</a> as a chemical language. A general-purpose molecular structure prediction model (MSPM) is pre-trained on one million <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> molecules via self-supervised next-token prediction, then fine-tuned for specific QSAR endpoints. The approach achieves competitive or superior results to graph neural networks and descriptor-based methods across four benchmark datasets, with particular benefits for small datasets.</p>
<h2 id="the-small-data-problem-in-qsar-modeling">The Small Data Problem in QSAR Modeling</h2>
<p>Deep learning models for molecular property prediction typically require large labeled training sets to learn useful structural representations. While methods like graph convolutional neural networks and SMILES-based models have achieved strong results on well-studied endpoints, they must be trained from scratch for each new task. This poses a particular challenge for the small labeled datasets that remain common in drug discovery for specialized endpoints like <a href="https://en.wikipedia.org/wiki/Allosteric_regulation">allosteric inhibition</a>, renal clearance, and inhibitor residence times.</p>
<p>Transfer learning had already shown transformative impact in computer vision (ImageNet pre-training) and NLP (ELMo, BERT, ULMFiT). In chemistry, prior transfer learning efforts included ChemNet (supervised pre-training on computed descriptors), <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/mol2vec-unsupervised-chemical-intuition/">Mol2vec</a> (unsupervised substructure embeddings), and pre-trained graph neural networks. However, a systematic application of the ULMFiT self-supervised pre-training pipeline to SMILES-based molecular models had not been explored. MolPMoFiT fills this gap by treating the vast corpus of unlabeled molecular structures as a self-supervised training signal, analogous to how language models learn from unlabeled text.</p>
<h2 id="core-innovation-ulmfit-adapted-for-smiles">Core Innovation: ULMFiT Adapted for SMILES</h2>
<p>MolPMoFiT adapts ULMFiT&rsquo;s three-stage transfer learning pipeline to molecular property prediction:</p>
<p><strong>Stage 1: General-Domain MSPM Pre-training.</strong> A molecular structure prediction model is trained on one million curated ChEMBL molecules to predict the next token in a SMILES string. This is purely self-supervised: the SMILES string provides its own labels. The model learns general chemical syntax and structural patterns.</p>
<p><strong>Stage 2: Task-Specific MSPM Fine-tuning (Optional).</strong> The general MSPM is further fine-tuned on the unlabeled SMILES of the target task dataset. This adapts the language model to the specific chemical distribution of interest (e.g., HIV inhibitors vs. general bioactive molecules). Discriminative fine-tuning adjusts learning rates per layer:</p>
<p>$$\eta^{layer-1} = \eta^{layer} / 2.6$$</p>
<p>where higher layers (containing more task-specific features) receive higher learning rates.</p>
<p><strong>Stage 3: QSAR/QSPR Model Fine-tuning.</strong> The embedding and encoder weights from the pre-trained MSPM are transferred to a new model with a task-specific classifier head. Fine-tuning uses three key techniques from ULMFiT:</p>
<ul>
<li><strong>Discriminative fine-tuning</strong>: Different learning rates per layer group</li>
<li><strong>Gradual unfreezing</strong>: Layers are unfrozen sequentially (classifier first, then progressively deeper LSTM layers)</li>
<li><strong>One cycle policy</strong>: Learning rate scheduling following Smith&rsquo;s approach</li>
</ul>
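<p>The discriminative learning-rate rule can be made concrete; this pure-Python sketch (not the authors' fastai code) builds the per-group rates by dividing by 2.6 at each step down the network, so the classifier head trains fastest and the deepest, most general layers train slowest.</p>

```python
def discriminative_lrs(base_lr, n_groups, factor=2.6):
    """Per-layer-group learning rates following eta^{layer-1} = eta^{layer} / 2.6."""
    # Group 0 is the deepest (most general) layers; the last group is the head.
    return [base_lr / factor ** (n_groups - 1 - i) for i in range(n_groups)]

lrs = discriminative_lrs(base_lr=3e-2, n_groups=4)
# head gets 3e-2; the deepest group gets 3e-2 / 2.6**3
```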
<p>The model architecture is AWD-LSTM (ASGD Weight-Dropped LSTM) with an embedding dimension of 400, three LSTM layers with 1152 hidden units, and dropouts applied at every layer (embedding, input, weights, hidden). The QSAR classifier concatenates max pooling, mean pooling, and the last hidden state $h_T$ from the final LSTM layer, feeding this into two feedforward layers.</p>
<p><strong>SMILES Augmentation.</strong> Since multiple valid SMILES can represent the same molecule through different atom orderings, the authors use <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES enumeration</a> as data augmentation. For regression tasks, Gaussian noise ($\sigma_{noise}$) is added to labels of augmented SMILES to simulate experimental error. Test-time augmentation (TTA) averages predictions across the canonical SMILES and four randomized SMILES.</p>
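<p>The label-noise augmentation and TTA steps can be sketched in a few lines of numpy; <code>predict</code> here is a toy stand-in scorer, and a real pipeline would generate the SMILES variants with an enumeration tool such as RDKit.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_labels(y, n_aug, sigma_noise):
    """Replicate each regression label n_aug times with Gaussian noise,
    simulating experimental error on the augmented SMILES."""
    return np.concatenate([y_i + rng.normal(0.0, sigma_noise, size=n_aug) for y_i in y])

def tta_predict(predict, smiles_variants):
    """Test-time augmentation: average predictions over the canonical
    SMILES plus randomized variants of the same molecule."""
    return float(np.mean([predict(s) for s in smiles_variants]))

# Toy predictor keyed on string length, for illustration only.
pred = tta_predict(lambda s: float(len(s)), ["CCO", "OCC", "C(C)O"])
y_aug = augment_labels(np.array([-3.0, 1.2]), n_aug=25, sigma_noise=0.3)
```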
<h2 id="benchmarks-across-four-qsar-datasets">Benchmarks Across Four QSAR Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Task</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,200</td>
          <td>Regression (logD)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Solvation">solvation energy</a>)</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>41,127</td>
          <td>Classification (replication inhibition)</td>
          <td>AUROC</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">blood-brain barrier</a>)</td>
          <td>AUROC</td>
      </tr>
  </tbody>
</table>
<p>All datasets use the same 10 random 80:10:10 splits from <a href="/notes/computational-chemistry/benchmark-problems/systematic-study-molecular-property-prediction/">Yang et al. (2019)</a> for fair comparison. Both random and scaffold splits were evaluated, with scaffold splits representing a more realistic test of generalization to novel chemical scaffolds.</p>
<h3 id="baselines">Baselines</h3>
<p>Models were compared against results reported by Yang et al. (2019): directed message passing neural network (D-MPNN), D-MPNN with RDKit features, random forest on Morgan fingerprints, feed-forward networks on Morgan fingerprints, and feed-forward networks on <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> descriptors.</p>
<h3 id="hyperparameters">Hyperparameters</h3>
<p>The same set of fine-tuning hyperparameters was used across all four tasks (tuned on the HIV dataset):</p>
<table>
  <thead>
      <tr>
          <th>Layer Group</th>
          <th>Base Learning Rate</th>
          <th>Epochs</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Linear head only</td>
          <td>3e-2</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final LSTM layer</td>
          <td>5e-3</td>
          <td>4</td>
      </tr>
      <tr>
          <td>+ Final two LSTM layers</td>
          <td>5e-4</td>
          <td>4</td>
      </tr>
      <tr>
          <td>Full model</td>
          <td>5e-5</td>
          <td>6</td>
      </tr>
  </tbody>
</table>
<p>Data augmentation settings were task-specific: lipophilicity training SMILES augmented 25x ($\sigma_{noise} = 0.3$); FreeSolv augmented 50x ($\sigma_{noise} = 0.5$); HIV active class augmented 60x and inactive 2x; BBBP positive class 10x and negative 30x.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="benchmark-results">Benchmark Results</h3>
<p><strong>Lipophilicity (random split):</strong> MolPMoFiT achieved RMSE of $0.565 \pm 0.037$ with TTA and $0.625 \pm 0.032$ without, outperforming D-MPNN and other baselines.</p>
<p><strong>FreeSolv (random split):</strong> RMSE of $1.197 \pm 0.127$ with TTA. The small dataset size (642 compounds) led to high variance across splits.</p>
<p><strong>BBBP (random split):</strong> AUROC of $0.950 \pm 0.020$, outperforming all comparison models. Task-specific MSPM fine-tuning showed no clear benefit over the general MSPM.</p>
<p><strong>HIV (random split):</strong> General MolPMoFiT achieved AUROC of $0.828 \pm 0.029$ with TTA. Task-specific fine-tuning yielded a slightly higher $0.834 \pm 0.025$ with TTA.</p>
<p>Scaffold splits consistently produced lower performance than random splits across all datasets, as expected for out-of-distribution generalization.</p>
<h3 id="transfer-learning-impact">Transfer Learning Impact</h3>
<p>Across all four datasets and varying training set sizes, MolPMoFiT consistently outperformed models trained from scratch with the same architecture. The improvement was most pronounced at smaller training set sizes, confirming the utility of pre-trained representations for low-data regimes.</p>
<h3 id="smiles-augmentation-analysis">SMILES Augmentation Analysis</h3>
<p>Training data augmentation provided significant improvements across all tasks. For classification (HIV, BBBP), augmentation improved performance regardless of whether class re-balancing was applied. For regression (lipophilicity, FreeSolv), both SMILES augmentation and label noise were beneficial, with optimal noise levels varying by dataset.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note a fundamental limitation: the model learns mappings from individual SMILES strings to properties rather than from molecular structures to properties. SMILES augmentation acts as a regularization technique to mitigate this, making the model more robust to different SMILES representations of the same molecule. The task-specific MSPM fine-tuning stage did not consistently improve results, requiring further investigation. All hyperparameters were tuned on one dataset (HIV) and applied uniformly, which may not be optimal for all endpoints.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL (curated)</td>
          <td>1M molecules</td>
          <td>Filtered: no mixtures, max 50 heavy atoms, standardized with MolVS, canonized with RDKit</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,200</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HIV</td>
          <td>41,127</td>
          <td>MoleculeNet benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>MoleculeNet benchmark</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>AWD-LSTM architecture with embedding dim 400, three LSTM layers (1152 hidden units), dropouts at all layers</li>
<li>ULMFiT fine-tuning: discriminative learning rates ($\eta^{layer-1} = \eta^{layer}/2.6$), gradual unfreezing, one cycle policy</li>
<li>SMILES character-level tokenization with special handling for two-character tokens (Cl, Br) and bracket-enclosed tokens</li>
<li>SMILES enumeration for data augmentation with optional Gaussian label noise for regression</li>
</ul>
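<p>The tokenization rule above can be expressed as a short regex; this is a sketch of the scheme described (bracketed atoms and the two-character elements Cl/Br kept as single tokens), not the authors' exact implementation.</p>

```python
import re

# Alternation order matters: bracketed atoms first, then Cl/Br,
# then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|.")

def tokenize(smiles):
    """Split a SMILES string into character-level tokens with multi-char atoms intact."""
    return SMILES_TOKEN.findall(smiles)

tokens = tokenize("C[C@H](Cl)Br")
```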
<h3 id="models">Models</h3>
<ul>
<li>General-domain MSPM pre-trained on 1M ChEMBL molecules (10 epochs)</li>
<li>Task-specific MSPMs fine-tuned per dataset (optional stage)</li>
<li>QSAR models fine-tuned with transferred embeddings and encoder</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Split</th>
          <th>Metric</th>
          <th>MolPMoFiT (TTA)</th>
          <th>Best Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Lipophilicity</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$0.565 \pm 0.037$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$0.635 \pm 0.031$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Random</td>
          <td>RMSE</td>
          <td>$1.197 \pm 0.127$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>Scaffold</td>
          <td>RMSE</td>
          <td>$2.082 \pm 0.460$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.950 \pm 0.020$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.931 \pm 0.025$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Random</td>
          <td>AUROC</td>
          <td>$0.828 \pm 0.029$</td>
          <td>D-MPNN</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>Scaffold</td>
          <td>AUROC</td>
          <td>$0.816 \pm 0.022$</td>
          <td>D-MPNN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>NVIDIA Quadro P4000 GPU (single GPU)</li>
<li>General-domain MSPM pre-training: approximately 1 day</li>
<li>Pre-training needs to be done only once; fine-tuning is fast per task</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XinhaoLi74/MolPMoFiT">MolPMoFiT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>PyTorch + fastai v1 implementation with curated datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Li, X., &amp; Fourches, D. (2020). Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT. <em>Journal of Cheminformatics</em>, 12, 27. <a href="https://doi.org/10.1186/s13321-020-00430-x">https://doi.org/10.1186/s13321-020-00430-x</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{li2020molpmofit,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Li, Xinhao and Fourches, Denis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00430-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LLM-Prop: Predicting Crystal Properties from Text</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/llm-prop-crystal-property-prediction/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/llm-prop-crystal-property-prediction/</guid><description>LLM-Prop fine-tunes the T5 encoder on crystal text descriptions to predict band gap, volume, and other properties, outperforming GNN baselines.</description><content:encoded><![CDATA[<h2 id="text-based-crystal-property-prediction-with-llms">Text-Based Crystal Property Prediction with LLMs</h2>
<p>LLM-Prop is a <strong>Method</strong> paper that proposes using the encoder portion of <a href="https://en.wikipedia.org/wiki/T5_(language_model)">T5</a> (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">property prediction</a>, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.</p>
<h2 id="why-text-instead-of-crystal-graphs">Why Text Instead of Crystal Graphs?</h2>
<p>Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:</p>
<ol>
<li><strong>Periodicity encoding</strong>: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.</li>
<li><strong>Information incorporation</strong>: Critical structural information like bond angles, <a href="https://en.wikipedia.org/wiki/Space_group">space group</a> symmetry, and <a href="https://en.wikipedia.org/wiki/Wyckoff_positions">Wyckoff sites</a> is difficult to incorporate into graph representations.</li>
<li><strong>Expressiveness</strong>: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.</li>
</ol>
<p>Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.</p>
<h2 id="core-innovation-t5-encoder-with-careful-fine-tuning">Core Innovation: T5 Encoder with Careful Fine-Tuning</h2>
<p>The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (T5-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:</p>
<ul>
<li>Discarding the decoder roughly halves the network (from ~60M to ~37M parameters), freeing memory to process longer input sequences</li>
<li>Longer sequences mean more crystal information can be included</li>
<li>The encoder-only approach avoids T5&rsquo;s known weakness at regression in text-to-text format</li>
</ul>
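<p>Concretely, the prediction path reduces to a linear map applied to the encoder embedding of the first token. A minimal numpy sketch of the shapes involved (the encoder output and head weights here are random placeholders; in LLM-Prop they come from the fine-tuned T5-small encoder and a jointly trained head):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the T5-small encoder output: one d_model-dim vector per token.
# In LLM-Prop the first position holds the prepended [CLS] token.
seq_len, d_model = 888, 512
encoder_out = rng.normal(size=(seq_len, d_model))

# Linear regression head (weights illustrative; learned during fine-tuning).
W = rng.normal(size=(d_model, 1)) / np.sqrt(d_model)
b = np.zeros(1)

cls_embedding = encoder_out[0]      # embedding of the prepended [CLS] token
prediction = cls_embedding @ W + b  # scalar property, e.g. band gap in eV
print(prediction.shape)             # (1,)
```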
<p>The framework applies several preprocessing strategies to the crystal text descriptions:</p>
<ol>
<li><strong>Stopword removal</strong>: Standard English stopwords are removed, except digits and symbols carrying chemical information</li>
<li><strong>Numerical token replacement</strong>: Bond distances are replaced with a <code>[NUM]</code> token and bond angles with <code>[ANG]</code>, reducing sequence length while preserving structural cues</li>
<li><strong>[CLS] token prepending</strong>: A classification token is added at the start, and its learned embedding is used as input to the prediction layer</li>
<li><strong>Label scaling</strong>: For regression tasks, targets are normalized using z-score, min-max, or log normalization</li>
</ol>
<p>The normalization schemes are defined as:</p>
<p>$$
\hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma}
$$</p>
<p>$$
\hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}}
$$</p>
<p>$$
\hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1)
$$</p>
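<p>The three schemes are one-liners in practice; a small numpy sketch (function names are ours, not from the paper):</p>

```python
import numpy as np

def z_score(y):
    # (Y_i - mu) / sigma
    return (y - y.mean()) / y.std()

def min_max(y):
    # (Y_i - Y_min) / (Y_max - Y_min), scaled into [0, 1]
    return (y - y.min()) / (y.max() - y.min())

def log_norm(y):
    # log(Y_i + 1), useful for heavy-tailed targets such as unit cell volume
    return np.log(y + 1.0)

band_gaps = np.array([0.0, 1.2, 3.4, 5.6])  # illustrative targets in eV
print(min_max(band_gaps))                   # scaled into [0, 1]
```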
<p>The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens <code>[NUM]</code>, <code>[ANG]</code>, and <code>[CLS]</code> are added to the vocabulary.</p>
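<p>The numerical-token replacement can be sketched as a regex pass (illustrative only; the patterns in real Robocrystallographer output are more varied than these):</p>

```python
import re

def mask_quantities(description):
    """Replace bond angles with [ANG] and bond distances with [NUM]."""
    # Angles first, so they are not half-consumed by the distance rule.
    description = re.sub(r"\d+(\.\d+)?\s*(degrees|°)", "[ANG]", description)
    description = re.sub(r"\d+(\.\d+)?\s*(Å|angstroms?)", "[NUM]", description)
    return description

text = "Si is bonded to four O atoms at 1.61 Å; O-Si-O angles are 109.5 degrees."
print(mask_quantities(text))
# Si is bonded to four O atoms at [NUM]; O-Si-O angles are [ANG].
```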
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="dataset-textedge">Dataset: TextEdge</h3>
<p>The authors collected data from the <a href="https://en.wikipedia.org/wiki/Materials_Project">Materials Project</a> database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Type</th>
          <th>Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Band gap (eV)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Unit cell volume (Å³/cell)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Formation energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy per atom (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Energy above hull (eV/atom)</td>
          <td>Regression</td>
          <td>MAE (lower is better)</td>
      </tr>
      <tr>
          <td>Is-gap-direct</td>
          <td>Classification</td>
          <td>AUC (higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="baselines">Baselines</h3>
<p>Seven baselines were compared:</p>
<ul>
<li><strong>GNN-based</strong>: CGCNN, MEGNet, ALIGNN, DeeperGATGNN</li>
<li><strong>Classic ML</strong>: XGBoost, Random Forest (on Robocrystallographer features)</li>
<li><strong>Text-based</strong>: MatBERT (domain-specific pre-trained BERT, ~110M parameters)</li>
</ul>
<p>All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.</p>
<h3 id="main-results-llm-prop-vs-gnn-baselines">Main Results: LLM-Prop vs. GNN Baselines</h3>
<p>When using crystal text descriptions as input, LLM-Prop achieved:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN</td>
          <td>0.293</td>
          <td>188.834</td>
          <td>0.046</td>
          <td>0.082</td>
          <td>0.040</td>
          <td>0.830</td>
      </tr>
      <tr>
          <td>MEGNet</td>
          <td>0.304</td>
          <td>297.948</td>
          <td>0.077</td>
          <td>0.056</td>
          <td>0.051</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>ALIGNN</td>
          <td>0.250</td>
          <td>129.580</td>
          <td>0.027</td>
          <td>0.059</td>
          <td>0.028</td>
          <td>0.678</td>
      </tr>
      <tr>
          <td>DeeperGATGNN</td>
          <td>0.291</td>
          <td>111.857</td>
          <td>0.081</td>
          <td>0.116</td>
          <td>0.045</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>LLM-Prop (Descr.)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.252</strong></td>
          <td>0.056</td>
          <td>0.067</td>
          <td>0.047</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>LLM-Prop outperformed the best GNN baseline (ALIGNN) by approximately 8% on <a href="https://en.wikipedia.org/wiki/Band_gap">band gap</a> prediction and 65% on volume prediction, and beat the strongest GNN classifier (CGCNN) by about 3% AUC on band gap classification (Is-gap-direct). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained an advantage.</p>
<h3 id="llm-prop-vs-matbert">LLM-Prop vs. MatBERT</h3>
<p>LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&amp;[ANG]):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Band gap (eV)</th>
          <th>Volume (Å³/cell)</th>
          <th>FEPA (eV/atom)</th>
          <th>EPA (eV/atom)</th>
          <th>Ehull (eV/atom)</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MatBERT (best)</td>
          <td>0.258</td>
          <td>54.969</td>
          <td>0.071</td>
          <td>0.098</td>
          <td>0.050</td>
          <td>0.722</td>
      </tr>
      <tr>
          <td>LLM-Prop (best)</td>
          <td><strong>0.231</strong></td>
          <td><strong>39.138</strong></td>
          <td><strong>0.056</strong></td>
          <td><strong>0.067</strong></td>
          <td><strong>0.047</strong></td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>Note: LLM-Prop&rsquo;s best band gap (0.231) comes from the &ldquo;w/o Numbers&rdquo; configuration, while the best volume (39.138) comes from &ldquo;w/ Numbers&rdquo;. The best Is-gap-direct AUC (0.857) uses the &ldquo;[NUM]&amp;[ANG]&rdquo; configuration.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The contribution of each preprocessing strategy was evaluated:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Band gap</th>
          <th>Volume</th>
          <th>Is-gap-direct (AUC)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LLM-Prop (baseline)</td>
          <td>0.256</td>
          <td>69.352</td>
          <td>0.796</td>
      </tr>
      <tr>
          <td>+ modified tokenizer</td>
          <td>0.247</td>
          <td>78.632</td>
          <td>0.785</td>
      </tr>
      <tr>
          <td>+ label scaling</td>
          <td>0.242</td>
          <td>44.515</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>+ [CLS] token</td>
          <td>0.231</td>
          <td>39.520</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>+ [NUM] token</td>
          <td>0.251</td>
          <td>86.090</td>
          <td>0.793</td>
      </tr>
      <tr>
          <td>+ [ANG] token</td>
          <td>0.242</td>
          <td>64.965</td>
          <td>0.810</td>
      </tr>
      <tr>
          <td>- stopwords</td>
          <td>0.252</td>
          <td>56.593</td>
          <td>0.779</td>
      </tr>
      <tr>
          <td>LLM-Prop+all (no space group)</td>
          <td>0.235</td>
          <td>97.457</td>
          <td>0.705</td>
      </tr>
      <tr>
          <td>LLM-Prop+all</td>
          <td><strong>0.229</strong></td>
          <td>42.259</td>
          <td><strong>0.857</strong></td>
      </tr>
  </tbody>
</table>
<p>The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.</p>
<h3 id="data-efficiency-and-transfer-learning">Data Efficiency and Transfer Learning</h3>
<p>LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.</p>
<p>Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Volume-to-Band gap (Test)</th>
          <th>Band gap-to-Volume (Test)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CGCNN-transfer</td>
          <td>0.295</td>
          <td>182.997</td>
      </tr>
      <tr>
          <td>ALIGNN-transfer</td>
          <td>0.322</td>
          <td>136.164</td>
      </tr>
      <tr>
          <td>MatBERT-transfer</td>
          <td>0.266</td>
          <td>54.289</td>
      </tr>
      <tr>
          <td>LLM-Prop-transfer</td>
          <td><strong>0.244</strong></td>
          <td><strong>50.753</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p><strong>Key findings</strong>:</p>
<ul>
<li>Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text</li>
<li>A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary</li>
<li>Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning</li>
<li>Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used</li>
</ul>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>The origin of LLM-Prop&rsquo;s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself</li>
<li>LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data</li>
<li>The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency</li>
<li>Current LLMs&rsquo; inability to reason about numerical values limits the use of quantitative information in descriptions</li>
</ul>
<p><strong>Future directions</strong> suggested by the authors include investigating techniques to use <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/">CIF files</a> directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>TextEdge</td>
          <td>144,931 crystals</td>
          <td>From Materials Project (Nov 2022), text generated by Robocrystallographer</td>
      </tr>
      <tr>
          <td>Training split</td>
          <td>TextEdge</td>
          <td>125,098</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Validation split</td>
          <td>TextEdge</td>
          <td>9,945</td>
          <td>Random split</td>
      </tr>
      <tr>
          <td>Test split</td>
          <td>TextEdge</td>
          <td>9,888</td>
          <td>Random split</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam with one-cycle learning rate scheduler</li>
<li><strong>Learning rate</strong>: 1e-3 for LLM-Prop, 5e-5 for MatBERT</li>
<li><strong>Dropout</strong>: 0.2 for LLM-Prop, 0.5 for MatBERT</li>
<li><strong>Batch size</strong>: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop</li>
<li><strong>Epochs</strong>: 200-300 depending on task</li>
<li><strong>Loss</strong>: MAE for regression, BCE for classification</li>
<li><strong>Evaluation</strong>: MAE for regression, AUC for classification</li>
<li><strong>Runs</strong>: each model evaluated 5 times on the test set; average MAE reported</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Base model</strong>: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)</li>
<li><strong>Vocabulary size</strong>: 32k (retrained tokenizer)</li>
<li><strong>Max input tokens</strong>: 888 (default) or 2000</li>
<li><strong>Special tokens</strong>: [CLS], [NUM], [ANG]</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/vertaix/LLM-Prop">LLM-Prop</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://drive.google.com/drive/folders/1YCDBzwjwNRIc1FRkB662G3Y5AOWaokUG">TextEdge + Checkpoints</a></td>
          <td>Dataset + Model</td>
          <td>Not specified</td>
          <td>Benchmark dataset and trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPUs</strong>: NVIDIA RTX A6000</li>
<li><strong>Training time</strong>: ~40 minutes per epoch for LLM-Prop</li>
<li><strong>Inference</strong>: ~1 minute for 10,000 materials on one GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rubungo, A. N., Arnold, C. B., Rand, B. P., &amp; Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. <em>npj Computational Materials</em>, 11, 186. <a href="https://doi.org/10.1038/s41524-025-01536-2">https://doi.org/10.1038/s41524-025-01536-2</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rubungo2025llmprop,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LLM-Prop: predicting the properties of crystalline materials using large language models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{npj Computational Materials}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{186}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41524-025-01536-2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Perplexity for Molecule Ranking and CLM Bias Detection</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/perplexity-molecule-ranking-bias-clms/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/perplexity-molecule-ranking-bias-clms/</guid><description>Perplexity scoring enables intrinsic molecule ranking and pretraining bias detection in chemical language models for de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-method-for-intrinsic-scoring-and-bias-detection-in-chemical-language-models">A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models</h2>
<p>This is a <strong>Method</strong> paper that introduces two contributions to the chemical language model (CLM) pipeline for <a href="/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/clms-de-novo-drug-design-review/">de novo molecular design</a>. First, the authors propose using perplexity as a model-intrinsic score to rank generated <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a &ldquo;delta score&rdquo; that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.</p>
<h2 id="the-ranking-and-bias-problem-in-clm-based-molecule-generation">The Ranking and Bias Problem in CLM-Based Molecule Generation</h2>
<p>Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">transfer learning</a> (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce &ldquo;pretraining bias,&rdquo; where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.</p>
<p>Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.</p>
<h2 id="perplexity-scoring-and-the-delta-score-for-bias-estimation">Perplexity Scoring and the Delta Score for Bias Estimation</h2>
<p>The core innovation is the application of <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a>, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:</p>
<p>$$
\text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_{2}(p_{i})}
$$</p>
<p>Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.</p>
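<p>The score takes only a few lines to compute from the per-character probabilities (a sketch; the function name is ours, using base-2 logarithms so that a uniform per-character probability $p$ yields a perplexity of $1/p$):</p>

```python
import math

def perplexity(char_probs):
    """Length-normalized perplexity of one generated SMILES string.

    char_probs: the probability the CLM assigned to each character,
    in generation order.
    """
    n = len(char_probs)
    avg_log2 = sum(math.log2(p) for p in char_probs) / n
    return 2.0 ** (-avg_log2)

print(perplexity([0.9] * 20))  # ≈ 1.11: a confident model scores low
print(perplexity([0.5] * 20))  # 2.0: less confident, higher perplexity
```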
<p>To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):</p>
<p>$$
\text{delta} = \text{rank}_{ft} - \text{rank}_{pt}
$$</p>
<p>A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.</p>
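<p>A minimal sketch of the delta score (names are ours; ranks are assigned so that a lower-perplexity molecule receives a higher rank value, matching the sign convention above where positive delta means the fine-tuned model favors the molecule):</p>

```python
def rank_by_perplexity(ppl):
    """Rank value per molecule: lowest perplexity gets the highest rank."""
    order = sorted(range(len(ppl)), key=lambda i: ppl[i], reverse=True)
    ranks = [0] * len(ppl)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def delta_scores(ppl_finetuned, ppl_pretrained):
    """delta = rank_ft - rank_pt per molecule; negative flags pretraining bias."""
    rf = rank_by_perplexity(ppl_finetuned)
    rp = rank_by_perplexity(ppl_pretrained)
    return [f - p for f, p in zip(rf, rp)]

# Molecule 0: favored by the fine-tuned model (positive delta).
# Molecule 2: favored by the pretrained model (negative delta, bias flag).
print(delta_scores([1.1, 2.0, 3.0], [3.0, 2.0, 1.1]))  # [2, 0, -2]
```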
<p>The multinomial sampling probability for each character is computed via the softmax function:</p>
<p>$$
p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}}
$$</p>
<p>where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).</p>
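<p>As a quick sketch of the sampling distribution (a generic temperature softmax, not code from the paper):</p>

```python
import numpy as np

def char_probs(logits, T=1.0):
    """Softmax with temperature over the CLM's output logits.

    T = 1 matches the study's setting; T < 1 sharpens the
    distribution toward the argmax, T > 1 flattens it.
    """
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]   # one logit per dictionary character
p = char_probs(logits)
print(round(p.sum(), 12))  # 1.0
```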
<h2 id="experimental-setup-10-protein-targets-across-four-data-regimes">Experimental Setup: 10 Protein Targets Across Four Data Regimes</h2>
<p>The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).</p>
<p><strong>Model architecture</strong>: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.</p>
<p><strong>Pretraining</strong>: The model was pretrained on 1,683,181 molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.</p>
<p><strong>Fine-tuning</strong>: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL &gt; 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).</p>
<table>
  <thead>
      <tr>
          <th>CHEMBL ID</th>
          <th>Target</th>
          <th>Protein Classification</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CHEMBL1836</td>
          <td>Prostanoid EP4 receptor</td>
          <td><a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">G protein-coupled receptor</a></td>
      </tr>
      <tr>
          <td>CHEMBL1945</td>
          <td>Melatonin receptor 1A</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL1983</td>
          <td>Serotonin 1D (5-HT1D) receptor</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL202</td>
          <td><a href="https://en.wikipedia.org/wiki/Dihydrofolate_reductase">Dihydrofolate reductase</a></td>
          <td>Oxidoreductase</td>
      </tr>
      <tr>
          <td>CHEMBL3522</td>
          <td><a href="https://en.wikipedia.org/wiki/Cytochrome_P450">Cytochrome P450</a> 17A1</td>
          <td>Cytochrome P450</td>
      </tr>
      <tr>
          <td>CHEMBL4029</td>
          <td>Interleukin-8 receptor A</td>
          <td>Family A GPCR</td>
      </tr>
      <tr>
          <td>CHEMBL5073</td>
          <td>CaM kinase I delta</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5137</td>
          <td>Metabotropic glutamate receptor 2</td>
          <td>G protein-coupled receptor</td>
      </tr>
      <tr>
          <td>CHEMBL5408</td>
          <td>Serine/threonine-protein kinase TBK1</td>
          <td>Kinase</td>
      </tr>
      <tr>
          <td>CHEMBL5608</td>
          <td>NT-3 growth factor receptor</td>
          <td>Kinase</td>
      </tr>
  </tbody>
</table>
<p><strong>Sampling comparison</strong>: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.</p>
<p><strong>Molecular similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> was computed using Morgan fingerprints (radius 2, length 1024) and 2D <a href="https://en.wikipedia.org/wiki/Pharmacophore">pharmacophore</a> fingerprints via RDKit (2019.03.2).</p>
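<p>For reference, the Tanimoto coefficient reduces to a set operation over the on-bits of two fingerprints; a library-free sketch (the study itself uses RDKit fingerprints, not python sets):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets:
    |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

a = {1, 5, 9, 42}   # on-bit indices of one Morgan fingerprint
b = {1, 5, 10}
print(tanimoto(a, b))  # 2 shared bits / 5 total bits = 0.4
```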
<h2 id="key-findings-multinomial-sampling-outperforms-beam-search">Key Findings: Multinomial Sampling Outperforms Beam Search</h2>
<p><strong>Perplexity correlates with molecular similarity.</strong> The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.</p>
<p><strong>Multinomial sampling produces better-ranked molecules than beam search.</strong> With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.</p>
<p><strong>Perplexity scoring narrows the quality distribution.</strong> The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.</p>
<p><strong>Pretraining bias is substantial.</strong> The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect &ldquo;generic&rdquo; pretraining rather than task-focused fine-tuning.</p>
<p><strong>Perplexity alone partially mitigates bias.</strong> Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.</p>
<p><strong>SMILES validity remained high.</strong> Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/">SMILES augmentation</a> remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL v28</td>
          <td>1,683,181 molecules</td>
          <td>Canonical SMILES, 20-90 characters, salts and duplicates removed</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>ChEMBL v28 (split)</td>
          <td>84,160 molecules</td>
          <td>Random split from pretraining set</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ChEMBL v28 (per target)</td>
          <td>5, 10, 20, or 40 molecules</td>
          <td>pChEMBL &gt; 6, 10 targets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>LSTM-based CLM with character-level SMILES prediction</li>
<li>Multinomial sampling at $T = 1$</li>
<li>Beam search at $k = 10$ and $k = 50$</li>
<li>Perplexity computed per Equation 1; delta score per Equation 2</li>
<li>Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs</li>
</ul>
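<p>The ranking pipeline above can be sketched in a few lines. This is an illustrative reading of the paper&rsquo;s Equations 1 and 2, not its implementation: the per-token log-probability interface and the sign convention of the delta score are assumptions.</p>

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Sketch of Equation 1: PP = exp(-(1/N) * sum_i log p_i)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def delta_score(rank_finetuned: int, rank_pretrained: int) -> int:
    """Sketch of Equation 2: rank shift between models.

    Positive when the fine-tuned model ranks the molecule higher
    (lower rank index) than the pretrained model did, indicating
    task-relevant generation; the exact sign convention is assumed.
    """
    return rank_pretrained - rank_finetuned

def rank_by_perplexity(log_prob_lists: list[list[float]]) -> list[int]:
    """Indices of generated SMILES sorted by ascending perplexity (lower is better)."""
    scores = [perplexity(lp) for lp in log_prob_lists]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```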
<h3 id="models">Models</h3>
<ul>
<li>4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization</li>
<li>5,820,515 parameters total</li>
<li>One-hot encoded SMILES input</li>
<li>Pretrained weights available in the GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perplexity</td>
          <td>Model confidence in generated SMILES</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Delta score</td>
          <td>Rank difference between fine-tuned and pretrained models</td>
          <td>Positive indicates task-relevant generation</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Morgan and pharmacophore fingerprints</td>
          <td>Compared to fine-tuning set</td>
      </tr>
      <tr>
          <td>Pearson correlation</td>
          <td>Perplexity vs. Tanimoto distance</td>
          <td>Stabilizes at ~0.5</td>
      </tr>
      <tr>
          <td>SMILES validity</td>
          <td>Fraction of valid SMILES strings</td>
          <td>Consistently &gt; 90%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ETHmodlab/CLM_perplexity">CLM_perplexity</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework, pretrained weights, and training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ETHmodlab/molecular_design_with_beam_search">Beam search implementation</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Referenced beam search implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Moret, M., Grisoni, F., Katzberger, P., &amp; Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. <em>Journal of Chemical Information and Modeling</em>, 62(5), 1199-1206. <a href="https://doi.org/10.1021/acs.jcim.2c00079">https://doi.org/10.1021/acs.jcim.2c00079</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ETHmodlab/CLM_perplexity">GitHub: CLM_perplexity (MIT License)</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{moret2022perplexity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1199--1206}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c00079}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Regression Transformer: Prediction Meets Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/regression-transformer/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/regression-transformer/</guid><description>The Regression Transformer unifies property prediction and conditional generation in one multitask model by casting regression as sequence modelling.</description><content:encoded><![CDATA[<h2 id="a-multitask-model-that-unifies-regression-and-generation">A Multitask Model That Unifies Regression and Generation</h2>
<p>The Regression Transformer (RT) is a <strong>Method</strong> paper. It introduces a single model architecture that can both predict continuous molecular properties and conditionally generate molecules with desired property values. The core idea is to reformulate regression as a sequence modelling task: instead of training a dedicated regression head, continuous property values are tokenized into sequences of digits and predicted alongside molecular tokens using a cross-entropy loss.</p>
<h2 id="closing-the-gap-between-predictors-and-generators">Closing the Gap Between Predictors and Generators</h2>
<p>Existing transformer-based approaches in computational chemistry develop property predictors and generative models as separate systems. Even when a single architecture like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> (Irwin et al., 2022) addresses both tasks, it does so through task-specific heads. This means the two capabilities remain disjoint, and the generative model cannot use its own property prediction ability during generation.</p>
<p>The RT addresses three specific gaps:</p>
<ol>
<li><strong>No true multitask entanglement</strong>: Prior work either tunes separate heads for prediction and generation or limits communication between modules to a reward signal.</li>
<li><strong>No inductive bias for continuous properties</strong>: Molecular generative models lack mechanisms to condition generation on floating-point property values.</li>
<li><strong>Disconnected workflows</strong>: Property predictors cannot generate molecules, and generators cannot assess whether their outputs satisfy property constraints.</li>
</ol>
<h2 id="core-innovation-regression-as-conditional-sequence-modelling">Core Innovation: Regression as Conditional Sequence Modelling</h2>
<p>The RT&rsquo;s key insight is that regression can be cast as sequential classification over digit tokens while preserving predictive accuracy. This is achieved through three components:</p>
<h3 id="numerical-tokenization">Numerical Tokenization</h3>
<p>Floating-point property values are split into individual digit tokens that preserve decimal order. Each token $t_{v,p}$ encodes a digit value $v \in [0, 9]$ and its decimal place $p \in \mathbb{Z}$. For example, the value 12.3 becomes the token sequence <code>[1_1, 2_0, 3_-1]</code>.</p>
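<p>The digit tokenization can be sketched as follows. The token naming (<code>v_p</code>) mirrors the example above; the helper itself is illustrative and handles only the simple non-negative case.</p>

```python
def tokenize_float(value: float) -> list[str]:
    """Split a non-negative float into digit tokens t_{v,p}.

    Each token pairs a digit value v with its decimal place p,
    so 12.3 becomes ['1_1', '2_0', '3_-1'].
    """
    text = f"{value:g}"  # compact decimal form, e.g. '12.3'
    int_part, _, frac_part = text.partition(".")
    # Integer digits occupy places 10^p with p >= 0, highest place first.
    tokens = [f"{d}_{len(int_part) - 1 - i}" for i, d in enumerate(int_part)]
    # Fractional digits occupy places 10^p with p < 0.
    tokens += [f"{d}_{-(i + 1)}" for i, d in enumerate(frac_part)]
    return tokens
```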
<h3 id="numerical-encodings">Numerical Encodings</h3>
<p>To provide an inductive bias about the semantic proximity of digit tokens (which cross-entropy loss cannot convey), the RT introduces Numerical Encodings (NEs), analogous to positional encodings. For a token $t_{v,p}$ at embedding dimension $j$:</p>
<p>$$
\text{NE}_{\text{Float}}(v, p, j) = (-1)^j \cdot \frac{v \cdot 10^p}{j + 1}
$$</p>
<p>These encodings ensure that pairwise distances between digit tokens decay monotonically with their floating-point proximity. The model can also learn digit orderings from data alone, but NEs provide a useful inductive bias.</p>
<h3 id="alternating-training-with-self-consistency">Alternating Training with Self-Consistency</h3>
<p>The RT uses an <a href="https://en.wikipedia.org/wiki/XLNet">XLNet</a> backbone trained with permutation language modelling (PLM). The key is that the same model serves two roles depending on which tokens are masked:</p>
<ul>
<li><strong>Mask numerical tokens</strong>: the model performs property prediction (regression)</li>
<li><strong>Mask textual tokens</strong>: the model performs conditional sequence generation</li>
</ul>
<p>The base PLM objective is:</p>
<p>$$
\mathcal{L}_{\text{PLM}} = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{i=c+1}^{T} \log p_\theta(x_{z_i} \mid \mathbf{x}_{\mathbf{z}_{&lt; i}}) \right]
$$</p>
<p>This is refined into two specialized objectives: a property prediction objective $\mathcal{L}_P$ that masks only numerical tokens, and a generation objective $\mathcal{L}_G$ that masks only textual tokens. Training alternates between these every 50 steps.</p>
<p>The self-consistency (SC) loss adds a critical feedback loop. After generating a candidate molecule $\hat{\mathbf{x}}$, the model re-evaluates it by predicting the property of the generated sequence:</p>
<p>$$
\mathcal{L}_{\text{SC}} = \mathcal{L}_G(\mathbf{x}) + \alpha \cdot \mathcal{L}_P(\hat{\mathbf{x}})
$$</p>
<p>This rewards generating molecules whose predicted properties match the primed property value, exploiting the RT&rsquo;s dual capability as both predictor and generator.</p>
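<p>Put together, training alternates objectives on a fixed schedule and folds the self-consistency term into the generation phase. A minimal sketch of the schedule and loss combination, with toy scalar losses standing in for $\mathcal{L}_P$ and $\mathcal{L}_G$:</p>

```python
def active_objective(step: int, period: int = 50) -> str:
    """Alternate between the two specialized objectives every `period` steps."""
    return "prediction" if (step // period) % 2 == 0 else "generation"

def self_consistency_loss(l_g: float, l_p_on_generated: float, alpha: float) -> float:
    """L_SC = L_G(x) + alpha * L_P(x_hat): the generation loss plus the
    property-prediction loss re-evaluated on the generated molecule."""
    return l_g + alpha * l_p_on_generated
```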
<h2 id="experiments-across-molecules-proteins-and-reactions">Experiments Across Molecules, Proteins, and Reactions</h2>
<h3 id="drug-likeness-qed">Drug Likeness (QED)</h3>
<p>Initial validation on a synthetic QED dataset (~1.4M molecules from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>) demonstrated that the RT can simultaneously learn to predict QED scores (RMSE &lt; 0.06) and generate novel molecules conditioned on desired QED values (Spearman&rsquo;s $\rho$ up to 0.517 between primers and generated molecule properties). Novelty exceeded 99% across all configurations. The alternating training scheme with SC loss outperformed both single-task models and the vanilla PLM objective.</p>
<p><a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> representations proved comparable to <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> for property prediction and far superior for generation (~100% validity vs. ~40% for SMILES).</p>
<h3 id="moleculenet-regression-benchmarks">MoleculeNet Regression Benchmarks</h3>
<p>On <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks ESOL, FreeSolv, and Lipophilicity, the RT outperformed XGBoost and MPNN baselines despite using only a classification loss. It performed on par with XLNet using a conventional regression head, and was only mildly inferior to models like BERT and BART that used large-scale self-supervised pre-training with regression losses.</p>
<p>Critically, only the RT could also conditionally generate molecules for these tasks. External validation with Grover (a self-supervised Graph Transformer) confirmed high correlation with the RT&rsquo;s own property predictions (0.86, 0.84, and 0.75 for ESOL, FreeSolv, and Lipophilicity respectively).</p>
<h3 id="constrained-property-optimization">Constrained Property Optimization</h3>
<p>On the penalized logP (plogP) benchmark with similarity constraints, the RT outperformed JT-VAE and GCPN by large margins. At similarity threshold $\delta = 0.4$, the RT achieved 3.16 average improvement with 97.1% success rate, while also predicting plogP with PCC of 0.92. Competing methods cannot perform property prediction at all.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Improvement ($\delta$=0.4)</th>
          <th>Success</th>
          <th>Property Prediction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.84</td>
          <td>83.6%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>2.49</td>
          <td>100%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>4.71</td>
          <td>85.7%</td>
          <td>Unfeasible</td>
      </tr>
      <tr>
          <td><strong>RT</strong></td>
          <td><strong>3.16</strong></td>
          <td><strong>97.1%</strong></td>
          <td><strong>PCC = 0.92</strong></td>
      </tr>
  </tbody>
</table>
<p>The comparison is not strictly fair: all competing methods are trained specifically to maximize plogP, and some (GCPN, JT-VAE) apply gradient optimization at inference time. The RT is only trained to reconstruct molecules with similar predicted plogP to the seed, so its training objective is property-agnostic rather than directly optimizing for higher plogP values.</p>
<h3 id="protein-language-modelling">Protein Language Modelling</h3>
<p>On the TAPE benchmark, the RT matched or outperformed conventional transformers on fluorescence and stability prediction tasks, despite those baselines being pre-trained on 24-106 million protein sequences (vs. 2.6 million for the RT). The RT also performed conditional protein generation, a task that none of the TAPE baselines can address.</p>
<h3 id="chemical-reaction-modelling">Chemical Reaction Modelling</h3>
<p>The RT was applied to reaction yield prediction on <a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig amination</a> and <a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki coupling</a> datasets. It matched Yield-BERT performance ($R^2$ = 0.939 and 0.81 respectively) while also enabling novel capabilities: reconstructing missing precursors from partial reactions and decorating existing reactions to achieve higher predicted yields. Across both datasets, over 40% of top-five predicted sequences contained reactions with novel precursors and higher predicted yield.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>Regression can be successfully reformulated as sequential classification over digit tokens without losing predictive accuracy compared to models using regression losses.</li>
<li>The alternating training scheme with self-consistency loss enables cross-task benefits, where the model outperforms single-task variants at both prediction and generation.</li>
<li>A single ~27M parameter model handles property prediction, conditional molecular generation, conditional protein generation, and reaction yield prediction with precursor generation.</li>
<li>The model learns the natural ordering of digits from data: 47% of embedding dimensions for the tenths place directly encode digit ordering even without explicit numerical encodings.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ol>
<li><strong>No large-scale pre-training</strong>: The RT uses ~27M parameters trained from scratch on task-specific datasets, unlike <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/bartsmiles-molecular-representations/">BARTSmiles</a> or MoLFormer which pre-train on billions of molecules. Scaling up could improve results.</li>
<li><strong>Fine-grained regression precision</strong>: The model sometimes struggles with intra-mode precision (e.g., on the fluorescence dataset where predictions cluster around bright/dark modes rather than capturing continuous variation).</li>
<li><strong>Single-property focus</strong>: All reported experiments use a single continuous property, though the framework naturally extends to multi-property settings.</li>
<li><strong>SELFIES validity caveats</strong>: While SELFIES are always syntactically valid, they can produce degenerate short molecules (~1.9% defective generations where the output has less than 50% of the seed&rsquo;s atoms).</li>
<li><strong>XLNet backbone limitations</strong>: Results on MoleculeNet regression are slightly below models using BART or BERT backbones with large-scale pre-training, suggesting the RT framework could benefit from stronger base models.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/regression-transformer">Regression Transformer (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts</td>
      </tr>
      <tr>
          <td><a href="https://github.com/GT4SD/gt4sd-core">GT4SD Integration</a></td>
          <td>Code + Models</td>
          <td>MIT</td>
          <td>Pre-trained model inference pipelines</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></td>
          <td>Demo</td>
          <td>-</td>
          <td>Interactive inference webapp</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug likeness</td>
          <td>ChEMBL (QED)</td>
          <td>~1.4M molecules</td>
          <td>Synthetic QED labels computed with RDKit</td>
      </tr>
      <tr>
          <td>Regression benchmark</td>
          <td>MoleculeNet (ESOL, FreeSolv, Lipo)</td>
          <td>642-4,200 compounds</td>
          <td>16x SMILES augmentation, 3 random splits</td>
      </tr>
      <tr>
          <td>Property optimization</td>
          <td>ZINC (plogP)</td>
          <td>215,381 train / 799 test</td>
          <td>Fixed split from Jin et al. (2018)</td>
      </tr>
      <tr>
          <td>Protein pre-training</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniProt</a> (Boman)</td>
          <td>2,648,205 peptides</td>
          <td>15-45 amino acid peptides</td>
      </tr>
      <tr>
          <td>Protein benchmarks</td>
          <td>TAPE (Fluorescence, Stability)</td>
          <td>21,446-53,416 samples</td>
          <td>Fixed splits</td>
      </tr>
      <tr>
          <td>Reaction pre-training</td>
          <td>USPTO</td>
          <td>2,830,616 reactions</td>
          <td>Molecular weight as numerical property</td>
      </tr>
      <tr>
          <td>Reaction yield</td>
          <td>Buchwald-Hartwig / Suzuki</td>
          <td>3,955 / 5,760 reactions</td>
          <td>Ten 70/30 random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: XLNet (32 hidden layers, 256 hidden dim, 1024 FFN dim, 16 attention heads, 20% dropout)</li>
<li>Parameters: ~27 million</li>
<li>Training: Permutation language modelling pre-training, then alternating objectives (property prediction + conditional generation with SC loss)</li>
<li>Decoding: Greedy for property prediction, beam search for sequence generation</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>RT Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED prediction</td>
          <td>RMSE</td>
          <td>0.037</td>
          <td>Best config (NE + SC)</td>
      </tr>
      <tr>
          <td>QED generation</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.517</td>
          <td>Between primers and generated QED</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>Comparable to XLNet</td>
          <td>Within s.d. of regression-loss XLNet</td>
      </tr>
      <tr>
          <td>plogP optimization ($\delta$=0.4)</td>
          <td>Improvement</td>
          <td>3.16</td>
          <td>Outperforms JT-VAE, GCPN</td>
      </tr>
      <tr>
          <td>Protein fluorescence</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>0.72</td>
          <td>Outperforms TAPE baselines</td>
      </tr>
      <tr>
          <td>BH yield prediction</td>
          <td>$R^2$</td>
          <td>0.939</td>
          <td>Near Yield-BERT (0.951)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All models trained on single GPUs (NVIDIA A100 or V100)</li>
<li>Training time: ~4 days for pre-training, ~1 day for fine-tuning</li>
<li>Framework: PyTorch 1.3.1 with HuggingFace Transformers 3.1.0</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Born, J. &amp; Manica, M. (2023). Regression Transformer enables concurrent sequence regression and generation for molecular language modelling. <em>Nature Machine Intelligence</em>, 5(4), 432-444. <a href="https://doi.org/10.1038/s42256-023-00639-z">https://doi.org/10.1038/s42256-023-00639-z</a></p>
<p><strong>Publication</strong>: Nature Machine Intelligence, April 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/IBM/regression-transformer">Regression Transformer GitHub Repository</a></li>
<li><a href="https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer">GT4SD Integration</a></li>
<li><a href="https://huggingface.co/spaces/GT4SD/regression_transformer">HuggingFace Demo</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{born2023regression,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Regression Transformer enables concurrent sequence regression and generation for molecular language modelling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Born, Jannis and Manica, Matteo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{432--444}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Language Models Learn Complex Molecular Distributions</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/lm-complex-molecular-distributions/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/lm-complex-molecular-distributions/</guid><description>RNN language models trained on SMILES and SELFIES outperform graph models at learning complex, multi-modal, and large-scale molecular distributions.</description><content:encoded><![CDATA[<h2 id="rnn-language-models-as-flexible-molecular-generators">RNN Language Models as Flexible Molecular Generators</h2>
<p>This is an <strong>Empirical</strong> paper that investigates the capacity of simple recurrent neural network (RNN) language models to learn complex molecular distributions. The core finding is that LSTM-based models trained on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> (SM-RNN) or <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (SF-RNN) string representations consistently outperform popular graph generative models (JTVAE, CGVAE) across three increasingly challenging generative modeling tasks. The paper positions language models as flexible, scalable alternatives to graph-based approaches for molecular generation.</p>
<h2 id="scaling-beyond-standard-benchmarks">Scaling Beyond Standard Benchmarks</h2>
<p>Most molecular generative models are evaluated on relatively small, drug-like molecules from datasets like <a href="https://en.wikipedia.org/wiki/ZINC_database">ZINC</a> or <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. These standard benchmarks do not test whether models can handle larger, more structurally diverse molecules or distributions with complex shapes (multi-modal, heavy-tailed). This gap matters because there is increasing interest in larger, more complex molecules for therapeutics, including peptides and natural products.</p>
<p>Graph generative models like JTVAE and CGVAE impose structural constraints (tree decompositions, valency restrictions) that help with validity but limit their ability to scale. Language models, by contrast, only need to generate a single character sequence, making them inherently more flexible.</p>
<h2 id="three-challenging-generative-modeling-tasks">Three Challenging Generative Modeling Tasks</h2>
<p>The paper introduces three benchmark tasks designed to stress-test generative models:</p>
<h3 id="task-1-penalized-logp-distribution">Task 1: Penalized LogP Distribution</h3>
<p>A dataset of approximately 160K molecules from ZINC15 with penalized <a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a> scores exceeding 4.0. The training distribution is sharply peaked around 4.0 to 4.5 with a subtle tail extending above 6.0. Molecules in the tail tend to have long carbon chains and fewer rings. The challenge is learning this skewed distribution rather than just finding individual high-scoring molecules.</p>
<h3 id="task-2-multi-modal-distribution">Task 2: Multi-Modal Distribution</h3>
<p>A composite dataset of approximately 200K molecules drawn from four sources with distinct molecular weight ranges:</p>
<ul>
<li><a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a> (MW $\leq$ 185)</li>
<li>ZINC (185 $\leq$ MW $\leq$ 425)</li>
<li>Harvard Clean Energy Project (460 $\leq$ MW $\leq$ 600)</li>
<li>POLYMERS (MW $&gt;$ 600)</li>
</ul>
<p>Models must learn to generate from all four modes simultaneously, each with very different molecular structures.</p>
<h3 id="task-3-large-scale-molecules">Task 3: Large-Scale Molecules</h3>
<p>The largest molecules in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> with more than 100 heavy atoms, yielding approximately 300K molecules with molecular weights ranging from 1,250 to 5,000. These include small biomolecules, photovoltaics, peptides, and cyclic peptides. This task is particularly challenging because the SMILES/SELFIES strings are very long.</p>
<h2 id="evaluation-by-distributional-fidelity">Evaluation by Distributional Fidelity</h2>
<p>The evaluation framework focuses on how well a model learns the full training distribution rather than generating individual good molecules. The primary quantitative metric is the <a href="https://en.wikipedia.org/wiki/Wasserstein_metric">Wasserstein distance</a> (earth mover&rsquo;s distance) between molecular property distributions of generated and training molecules:</p>
<p>$$W(P, Q) = \inf_{\gamma \in \Gamma(P,Q)} \int |x - y| \, d\gamma(x, y)$$</p>
<p>Properties evaluated include LogP, synthetic accessibility (SA), quantitative estimate of drug-likeness (QED), molecular weight (MW), Bertz complexity (BCT), and natural product likeness (NP). An oracle baseline is computed by measuring the Wasserstein distance between different random samples of the training data itself.</p>
<p>Standard metrics (validity, uniqueness, novelty) are also reported but are secondary to distributional fidelity.</p>
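<p>For equal-sized empirical samples in one dimension, the Wasserstein distance reduces to the mean absolute difference of sorted values, which makes the protocol easy to sketch. The property values here are toy numbers, not the paper&rsquo;s data:</p>

```python
def wasserstein_1d(sample_p: list[float], sample_q: list[float]) -> float:
    """Exact 1-Wasserstein distance between two equal-sized empirical samples."""
    assert len(sample_p) == len(sample_q), "sketch assumes equal sample sizes"
    pairs = zip(sorted(sample_p), sorted(sample_q))
    return sum(abs(x - y) for x, y in pairs) / len(sample_p)

# Oracle baseline: the same distance computed between two random
# subsamples of the training data itself, one per property (LogP, SA, ...).
```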
<h2 id="architecture-lstm-language-models">Architecture: LSTM Language Models</h2>
<p>The language models use standard LSTM architectures trained autoregressively on molecular strings. Two variants are compared:</p>
<ul>
<li><strong>SM-RNN</strong>: Trained on canonical SMILES</li>
<li><strong>SF-RNN</strong>: Trained on SELFIES representations</li>
</ul>
<p>Hyperparameters are tuned via random search over learning rate ($\in [0.0001, 0.001]$), hidden units ($\in [100, 1000]$), layers (1 to 5), and dropout ($\in [0.0, 0.5]$). Model selection uses a combination of standard metrics and Wasserstein distance rankings.</p>
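<p>The quoted search ranges translate directly into a sampler. Whether each range was drawn uniformly or log-uniformly is not stated, so the uniform sampling below is an assumption:</p>

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter configuration from the ranges quoted above."""
    return {
        "learning_rate": rng.uniform(1e-4, 1e-3),
        "hidden_units": rng.randint(100, 1000),
        "num_layers": rng.randint(1, 5),
        "dropout": rng.uniform(0.0, 0.5),
    }
```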
<p>The graph model baselines include JTVAE (junction tree VAE) and CGVAE (constrained graph VAE), along with several additional baselines (MolGAN, GraphNVP, and others).</p>
<h2 id="results-language-models-outperform-graph-models-across-all-tasks">Results: Language Models Outperform Graph Models Across All Tasks</h2>
<h3 id="penalized-logp">Penalized LogP</h3>
<p>Both RNN models learn the sharp training distribution far better than graph models. The SM-RNN achieves the lowest Wasserstein distances across most properties. The graph models produce substantial out-of-distribution mass around penalized LogP scores of 1.75 to 2.25, failing to capture the peaked nature of the training distribution.</p>
<p>Critically, the RNNs also learn the subtle tail above penalized LogP of 6.0, generating molecules with long carbon chains and fewer rings that match the structural characteristics of high-scoring training molecules. CGVAE and JTVAE almost entirely miss this tail.</p>
<h3 id="multi-modal-distribution">Multi-Modal Distribution</h3>
<p>Both RNN models capture all four modes of the training distribution. JTVAE entirely misses the GDB13 mode and poorly learns the ZINC and CEP modes. CGVAE learns GDB13 but misses the CEP mode. The SM-RNN again achieves the best Wasserstein metrics.</p>
<h3 id="large-scale-molecules">Large-Scale Molecules</h3>
<p>This is the most discriminating task. Both JTVAE and CGVAE completely fail to train on these large molecules. JTVAE&rsquo;s tree decomposition produces a vocabulary of approximately 11,000 substructures, making training intractable. Only the RNN models succeed, with the SF-RNN achieving slightly better distributional match due to SELFIES guaranteeing 100% validity even for very long strings.</p>
<p>Both RNN models also learn the bimodal LogP structure within the large-molecule distribution and can generate molecules with substructures resembling peptides, including backbone chains and standard amino acid side chains.</p>
<h3 id="summary-of-wasserstein-distance-results">Summary of Wasserstein Distance Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>MW</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>SM-RNN</td>
          <td>0.095</td>
          <td>0.031</td>
          <td>0.007</td>
          <td>3.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>SF-RNN</td>
          <td>0.177</td>
          <td>0.290</td>
          <td>0.010</td>
          <td>6.3</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>JTVAE</td>
          <td>0.536</td>
          <td>0.289</td>
          <td>0.081</td>
          <td>35.9</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>CGVAE</td>
          <td>1.000</td>
          <td>2.120</td>
          <td>0.115</td>
          <td>69.3</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SM-RNN</td>
          <td>0.081</td>
          <td>0.025</td>
          <td>0.006</td>
          <td>5.5</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>SF-RNN</td>
          <td>0.286</td>
          <td>0.179</td>
          <td>0.023</td>
          <td>11.4</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>JTVAE</td>
          <td>0.495</td>
          <td>0.274</td>
          <td>0.034</td>
          <td>27.7</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>CGVAE</td>
          <td>1.617</td>
          <td>1.802</td>
          <td>0.076</td>
          <td>30.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SM-RNN</td>
          <td>1.367</td>
          <td>0.213</td>
          <td>0.003</td>
          <td>124.5</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>SF-RNN</td>
          <td>1.095</td>
          <td>0.342</td>
          <td>0.010</td>
          <td>67.3</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>JTVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>CGVAE</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="smiles-vs-selfies-trade-off">SMILES vs. SELFIES Trade-off</h3>
<p>A notable finding is that the SMILES and SELFIES RNNs have complementary strengths. The SF-RNN consistently achieves better standard metrics (validity, uniqueness, novelty) across all tasks, while the SM-RNN achieves better Wasserstein distances. The authors suggest that the SELFIES grammar may reduce memorization of the training data, improving novelty but slightly hurting distributional fidelity.</p>
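<p>The standard metrics have conventional definitions: validity is the fraction of generated strings that parse to a molecule, uniqueness the fraction of valid molecules that are distinct, and novelty the fraction of unique molecules absent from the training set. A minimal sketch, assuming a caller-supplied validity predicate (e.g. RDKit parsing) and canonicalized strings so set comparisons are meaningful:</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty as conventionally defined for
    molecular generators. `is_valid` is a caller-supplied predicate (e.g.
    whether RDKit can parse the string); strings are assumed canonicalized
    so that set equality is meaningful. Illustrative sketch only."""
    valid = [s for s in generated if is_valid(s)]
    validity = len(valid) / len(generated)
    unique = set(valid)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(unique - set(training_set)) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

# Toy example with a trivial stand-in validity check:
gen = ["CCO", "CCO", "CCN", "C(("]
print(generation_metrics(gen, {"CCO"}, lambda s: "((" not in s))
```

<p>Note that novelty is computed over unique valid molecules, so a model can trade distributional fidelity for novelty, as the SELFIES results suggest.</p>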
<h2 id="limitations">Limitations</h2>
<p>The authors acknowledge several limitations. Language models cannot account for molecular geometry or 3D information, which is important for many applications. The study evaluates distributional fidelity but does not test downstream utility for specific molecular design tasks (e.g., optimizing for a particular biological target). Additionally, while the graph models (JTVAE, CGVAE) are more interpretable, the language models operate as black boxes over string representations. The comparison is also limited to two specific graph model architectures, and more recent or specialized graph models may close the performance gap. Finally, trained model weights are only available upon request rather than being publicly released.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">danielflamshep/genmoltasks</a></td>
          <td>Dataset</td>
          <td>Apache-2.0</td>
          <td>Processed training data and generated samples</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Three custom datasets constructed from ZINC15, <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a>, Harvard Clean Energy Project, POLYMERS, and PubChem. Processed data available at the GitHub repository.</p>
<p><strong>Code</strong>: LSTM networks implemented in PyTorch using the char-rnn code from the <a href="https://github.com/molecularsets/moses">MOSES repository</a>. Baselines use the official <a href="https://github.com/wengong-jin/icml18-jtnn">JTVAE</a> and <a href="https://github.com/microsoft/constrained-graph-variational-autoencoder">CGVAE</a> implementations. No unified training script is provided in the repository.</p>
<p><strong>Evaluation</strong>: Wasserstein distances computed using SciPy. Molecular properties computed using RDKit. 10K molecules generated from each model for evaluation.</p>
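<p>For equally sized, equally weighted samples, the 1-Wasserstein distance reduces to the mean absolute difference between sorted values, which is what <code>scipy.stats.wasserstein_distance</code> computes in that special case. A dependency-free sketch of the metric as applied to per-molecule property values:</p>

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein (earth mover's) distance between two equal-size,
    equally weighted empirical samples: the mean absolute difference of
    the sorted values. Agrees with scipy.stats.wasserstein_distance in
    this special case; the general unequal-size case needs the full
    quantile-function integral."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

# Illustrative property values (e.g. QED) for training vs. generated molecules:
train = [0.55, 0.60, 0.72, 0.81]
generated = [0.58, 0.61, 0.70, 0.79]
print(wasserstein_1d(train, generated))
```

<p>Lower values mean the generated property distribution sits closer to the training distribution, which is the sense in which the table above ranks the models.</p>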
<p><strong>Hyperparameters</strong>: Task-specific configurations reported. For example, the LogP task SM-RNN uses 2 hidden layers with 400 units, dropout of 0.2, and learning rate of 0.0001.</p>
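<p>A configuration like the reported LogP-task SM-RNN can be captured in a small record. The field names below are illustrative, not taken from the paper's code or the MOSES char-rnn API:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CharRNNConfig:
    """Hypothetical container for the reported LogP-task SM-RNN settings;
    field names are illustrative, not an actual MOSES or paper API."""
    num_layers: int = 2       # reported: 2 hidden LSTM layers
    hidden_size: int = 400    # reported: 400 units per layer
    dropout: float = 0.2      # reported: dropout of 0.2
    learning_rate: float = 1e-4  # reported: learning rate of 0.0001

cfg = CharRNNConfig()
print(cfg)
```
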
<p><strong>Hardware</strong>: Models were trained on Compute Canada computing clusters. Specific GPU types and training times are not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D., Zhu, K., &amp; Aspuru-Guzik, A. (2022). Language models can learn complex molecular distributions. <em>Nature Communications</em>, 13, 3293. <a href="https://doi.org/10.1038/s41467-022-30839-x">https://doi.org/10.1038/s41467-022-30839-x</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/danielflamshep/genmoltasks">GitHub: danielflamshep/genmoltasks</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2022language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can learn complex molecular distributions}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Zhu, Kevin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3293}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-022-30839-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>