<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Application Papers: Transferring Methods to New Domains on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/paper-types/application/</link><description>Recent content in Application Papers: Transferring Methods to New Domains on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 11 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/paper-types/application/index.xml" rel="self" type="application/rss+xml"/><item><title>Fine-Tuning GPT-3 for Predictive Chemistry Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/</guid><description>Fine-tuned GPT-3 matches or outperforms specialized ML models on molecular, materials, and reaction property prediction, especially in low-data regimes.</description><content:encoded><![CDATA[<h2 id="gpt-3-as-a-general-purpose-chemistry-predictor">GPT-3 as a General-Purpose Chemistry Predictor</h2>
<p>This is an <strong>Empirical</strong> paper that systematically benchmarks fine-tuned GPT-3 against dedicated machine learning models across 15 chemistry and materials science prediction tasks. The primary contribution is demonstrating that a general-purpose large language model, with no chemistry-specific architecture or featurization, can match or outperform specialized ML approaches, particularly when training data is limited. The paper also demonstrates inverse molecular design through simple prompt inversion.</p>
<h2 id="why-general-purpose-llms-for-chemistry">Why General-Purpose LLMs for Chemistry</h2>
<p>Machine learning in chemistry typically requires domain-specific feature engineering: molecular fingerprints, graph neural network architectures, or hand-crafted descriptors tailored to each application. Developing these approaches demands specialized expertise and significant effort for each new problem. The small datasets common in experimental chemistry further complicate matters, as many sophisticated ML approaches require large training sets to learn meaningful representations.</p>
<p>Large language models like GPT-3, trained on vast internet text corpora, had shown surprising capability at tasks they were not explicitly trained for. The key question motivating this work was whether these general-purpose models could also answer scientific questions for which we lack answers, given that most chemistry problems can be represented in text form. For example: &ldquo;If I change the metal in my <a href="https://en.wikipedia.org/wiki/Metal%E2%80%93organic_framework">metal-organic framework</a>, will it be stable in water?&rdquo;</p>
<p>Prior chemical language models (e.g., <a href="/notes/chemistry/molecular-design/property-prediction/transformer-cnn-qsar-modeling/">Transformer-CNN</a>, <a href="/notes/chemistry/molecular-design/property-prediction/regression-transformer/">Regression Transformer</a>, <a href="/notes/chemistry/molecular-representations/encoders/selformer/">SELFormer</a>) were pre-trained on chemistry-specific corpora. In contrast, this work investigates models trained primarily on general internet text, examining whether the implicit chemical knowledge encoded during pre-training, combined with task-specific fine-tuning, can substitute for explicit chemical featurization.</p>
<h2 id="language-interfaced-fine-tuning-for-chemistry">Language-Interfaced Fine-Tuning for Chemistry</h2>
<p>The core innovation is &ldquo;language-interfaced fine-tuning&rdquo; (LIFT): reformulating chemistry prediction tasks as natural language question-answering. Training examples take the form of question-completion pairs, where questions describe the chemical system in text and completions provide the target property. For example:</p>
<ul>
<li><strong>Classification</strong>: &ldquo;What is the phase of Co1Cu1Fe1Ni1V1?&rdquo; with completion &ldquo;0&rdquo; (multi-phase)</li>
<li><strong>Regression</strong>: Property values are rounded to a fixed precision, converting continuous prediction into a text generation problem</li>
<li><strong>Inverse design</strong>: Questions and completions are simply swapped, asking &ldquo;What is a molecule with property X?&rdquo; and expecting a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string as completion</li>
</ul>
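<p>The question-completion framing above can be sketched as follows. The helper name and the <code>###</code>/<code>@@@</code> separator strings are illustrative stand-ins, not the paper's exact prompt template:</p>

```python
def to_prompt(question: str, completion: str) -> dict:
    """Render one training example as a prompt/completion pair.

    The "###" / "@@@" separators are illustrative stand-ins for whatever
    stop sequences the fine-tuning setup actually uses.
    """
    return {"prompt": f"{question}###", "completion": f" {completion}@@@"}

# Classification: alloy phase as a one-token label
clf = to_prompt("What is the phase of Co1Cu1Fe1Ni1V1?", "0")

# Inverse design: swap the roles of property and structure
inv = to_prompt(
    "What is a photoswitch with transition wavelength 324 nm?",
    "CN1C=CC(=N1)N=NC2=CC=CC=C2",
)
```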
<p>The fine-tuning uses OpenAI&rsquo;s API with the smallest <code>ada</code> variant of GPT-3, with uniform hyperparameters across all tasks (8 epochs, learning rate multiplier of 0.02). No optimization of prompt structure, tokenization, or training schedule was performed, making the approach deliberately simple.</p>
<p>For regression, since language models generate discrete tokens rather than continuous values, the authors round target values to a fixed precision (e.g., 1% for Henry coefficients). This converts regression into a form of classification over numeric strings, with the assumption that GPT-3 can interpolate between these discretized values.</p>
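<p>A minimal sketch of this encode/decode round trip; the precision and number formatting here are illustrative, not the paper's exact scheme:</p>

```python
def encode_target(value: float, ndigits: int = 2) -> str:
    """Round a continuous label to a fixed precision and render it as a
    short numeric string, turning regression into text generation."""
    return f"{round(value, ndigits):.{ndigits}f}"

def decode_target(completion: str) -> float:
    """Parse the model's generated completion back into a float for scoring."""
    return float(completion.strip())
```

The quantization error introduced by the rounding step is bounded by half the chosen precision, which is the price paid for treating regression as generation over numeric strings.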
<p>The approach also extends to open-source models. The authors demonstrate that GPT-J-6B can be fine-tuned using parameter-efficient techniques (LoRA, 8-bit quantization) on consumer hardware, and provide the <code>chemlift</code> Python package for this purpose.</p>
<h2 id="benchmarks-across-molecules-materials-and-reactions">Benchmarks Across Molecules, Materials, and Reactions</h2>
<h3 id="datasets-and-tasks">Datasets and Tasks</h3>
<p>The evaluation spans three chemical domains with 15 total benchmarks:</p>
<p><strong>Molecules:</strong></p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Photoswitch">Photoswitch</a> transition wavelength prediction (2022)</li>
<li>Free energy of solvation (FreeSolv, 2014)</li>
<li>Aqueous solubility (ESOL, 2004)</li>
<li>Lipophilicity (ChEMBL, 2012)</li>
<li><a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO-LUMO gap</a> (QMugs, 2022)</li>
<li><a href="https://en.wikipedia.org/wiki/Organic_solar_cell">Organic photovoltaic</a> power conversion efficiency (2018)</li>
</ul>
<p><strong>Materials:</strong></p>
<ul>
<li>Coarse-grained surfactant adsorption free energy (2021)</li>
<li>CO2 and CH4 <a href="https://en.wikipedia.org/wiki/Henry%27s_law">Henry coefficients</a> in MOFs (2020)</li>
<li>MOF heat capacity (2022)</li>
<li><a href="https://en.wikipedia.org/wiki/High-entropy_alloy">High-entropy alloy</a> phase prediction (2020)</li>
<li><a href="https://en.wikipedia.org/wiki/Amorphous_metal">Bulk metallic glass</a> formation ability (2006)</li>
<li>Metallic behavior prediction (2018)</li>
</ul>
<p><strong>Reactions:</strong></p>
<ul>
<li>C-N cross-coupling yield (<a href="https://en.wikipedia.org/wiki/Buchwald%E2%80%93Hartwig_amination">Buchwald-Hartwig</a>, 2018)</li>
<li>C-C cross-coupling yield (<a href="https://en.wikipedia.org/wiki/Suzuki_reaction">Suzuki</a>, 2022)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>The baselines include both traditional ML and deep learning approaches:</p>
<ul>
<li><strong>Non-DL</strong>: XGBoost with molecular descriptors/fragprints, Gaussian Process Regression (GPR), random forests, n-Gram models, Automatminer, differential reaction fingerprints (DRFP)</li>
<li><strong>Deep learning</strong>: MolCLR, ModNet, CrabNet, TabPFN</li>
</ul>
<h3 id="data-efficiency-analysis">Data Efficiency Analysis</h3>
<p>To compare data efficiency, the authors fit power-law curves to the learning curves of all models and measure the &ldquo;data efficiency factor&rdquo;: how much more (or less) data the best baseline needs to match GPT-3&rsquo;s performance in the low-data regime.</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Benchmark</th>
          <th>Data Efficiency vs. Non-DL</th>
          <th>vs. DL Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecules</td>
          <td>Photoswitch wavelength</td>
          <td>1.1x (n-Gram)</td>
          <td>1.2x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Solvation free energy</td>
          <td>3.1x (GPR)</td>
          <td>1.3x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Solubility</td>
          <td>1.0x (XGBoost)</td>
          <td>0.002x (MolCLR)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>Lipophilicity</td>
          <td>3.43x (GPR)</td>
          <td>0.97x (TabPFN)</td>
      </tr>
      <tr>
          <td>Molecules</td>
          <td>HOMO-LUMO gap</td>
          <td>4.3x (XGBoost)</td>
          <td>0.62x (TabPFN)</td>
      </tr>
      <tr>
          <td>Materials</td>
          <td>HEA phase</td>
          <td>24x (RF)</td>
          <td>9.0x (CrabNet)</td>
      </tr>
      <tr>
          <td>Materials</td>
          <td>CO2 Henry coeff.</td>
          <td>0.40x (XGBoost)</td>
          <td>12x (TabPFN)</td>
      </tr>
      <tr>
          <td>Reactions</td>
          <td>C-N cross-coupling</td>
          <td>2.9x (DRFP)</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Values &gt;1 indicate GPT-3 is more data-efficient. For the HEA phase prediction task, GPT-3 achieved comparable accuracy to a random forest model trained on 1,126 data points using only about 50 training examples.</p>
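<p>Assuming each learning curve follows a power law $\text{score} = a \cdot n^{b}$, the data efficiency factor can be computed by inverting the baseline's fitted curve. This log-log least-squares sketch is a generic reconstruction, not the authors' exact fitting code:</p>

```python
import numpy as np

def fit_power_law(n_train, scores):
    """Fit score = a * n^b via least squares in log-log space."""
    b, log_a = np.polyfit(np.log(n_train), np.log(scores), 1)
    return np.exp(log_a), b

def data_efficiency_factor(gpt3_curve, baseline_curve, n_ref):
    """How many times more data the baseline needs to match GPT-3's
    score at n_ref training points (values > 1 favor GPT-3)."""
    a_g, b_g = fit_power_law(*gpt3_curve)
    a_b, b_b = fit_power_law(*baseline_curve)
    target = a_g * n_ref ** b_g               # GPT-3 score at n_ref
    n_needed = (target / a_b) ** (1.0 / b_b)  # invert the baseline curve
    return n_needed / n_ref
```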
<h3 id="representation-sensitivity">Representation Sensitivity</h3>
<p>An important finding is that GPT-3 performs well regardless of molecular representation format. The authors tested IUPAC names, SMILES, and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, finding good results across all representations. IUPAC names often produced the best performance, which is notable because it makes the approach accessible to non-specialists who can simply use chemical names rather than learning specialized encodings.</p>
<h3 id="inverse-design">Inverse Design</h3>
<p>For inverse design, the authors fine-tuned GPT-3 with reversed question-completion pairs. On photoswitches:</p>
<ul>
<li>Generated molecules include both training set members and novel structures (some not in PubChem)</li>
<li>Transition wavelengths matched target values within about 10% mean absolute percentage error (validated using the GPR model from Griffiths et al.)</li>
<li>A temperature parameter controls the diversity-validity tradeoff: low temperatures produce training set copies, high temperatures produce diverse but potentially invalid structures</li>
<li>Across all temperatures, generated molecules showed low synthetic accessibility (SA) scores, suggesting synthesizability</li>
</ul>
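<p>The temperature mechanism behind the diversity-validity tradeoff is standard temperature-scaled sampling over token logits, sketched here in isolation (the logits themselves would come from the fine-tuned model):</p>

```python
import numpy as np

def sample_token(logits, temperature, rng):
    """Sample one token index from temperature-scaled logits.

    Low temperature sharpens toward greedy decoding (training-set copies);
    high temperature flattens the distribution (more diverse generations,
    but more likely to yield invalid SMILES).
    """
    z = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    z -= z.max()              # numerical stability before exponentiation
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```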
<p>The authors also demonstrated iterative inverse design for HOMO-LUMO gap optimization: starting from QMugs data, they iteratively fine-tuned GPT-3 to generate molecules with progressively larger bandgaps (&gt;5 eV), successfully shifting the distribution over four generations. This worked even when extrapolating beyond the training distribution (e.g., training only on molecules with gaps &lt;3.5 eV, then generating molecules with gaps &gt;4.0 eV).</p>
<h3 id="coarse-grained-polymer-design">Coarse-Grained Polymer Design</h3>
<p>A striking test involved coarse-grained dispersant polymers with four monomer types and chain lengths of 16-48 units. GPT-3 had no prior knowledge of these abstract representations, yet it outperformed dedicated models for adsorption free energy prediction and successfully performed inverse design, generating monomer sequences with a mean percentage error of about 22% for the desired property.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>
<p><strong>Low-data advantage</strong>: Fine-tuned GPT-3 consistently shows the largest advantages over conventional ML in low-data regimes (tens to hundreds of data points), which is precisely where experimental chemistry datasets typically fall.</p>
</li>
<li>
<p><strong>Representation agnostic</strong>: The model works with IUPAC names, SMILES, SELFIES, and even invented abstract representations, removing the need for chemistry-specific tokenization.</p>
</li>
<li>
<p><strong>No feature engineering</strong>: The approach requires no domain-specific descriptors, fingerprints, or architectural modifications, making it accessible to researchers without ML expertise.</p>
</li>
<li>
<p><strong>Bidirectional design</strong>: Inverse design is achieved by simply reversing the question format, with no architectural changes or separate generative model needed.</p>
</li>
<li>
<p><strong>Extrapolation capability</strong>: The model can generate molecules with properties outside the training distribution, as demonstrated by the HOMO-LUMO gap extrapolation experiments.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>In the <strong>high-data regime</strong>, conventional ML models with chemistry-specific features often catch up to or surpass GPT-3, as the inductive biases encoded in GPT-3 become less necessary with sufficient data.</li>
<li><strong>Regression</strong> is inherently limited by the discretization of continuous values into tokens. This requires more data than classification and introduces quantization error.</li>
<li>The approach relies on the <strong>OpenAI API</strong>, introducing cost and reproducibility concerns (model versions may change). The authors partially address this by providing open-source alternatives via <code>chemlift</code>.</li>
<li>The authors acknowledge that <strong>identified correlations may not represent causal relationships</strong>. GPT-3 finding predictive patterns does not guarantee that the patterns are chemically meaningful.</li>
<li>No optimization of prompts, tokenization, or hyperparameters was performed, suggesting room for improvement but also making it difficult to assess the ceiling of this approach.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All datasets are publicly available and were obtained from published benchmarks.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Classification</td>
          <td>HEA phase (Pei et al.)</td>
          <td>1,252 alloys</td>
          <td>Single-phase vs. multi-phase</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>643 molecules</td>
          <td>Hydration free energies</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>1,128 molecules</td>
          <td>Aqueous solubility</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>QMugs</td>
          <td>665,000 molecules</td>
          <td>HOMO-LUMO gaps via GFN2-xTB</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Lipophilicity (ChEMBL)</td>
          <td>Varies</td>
          <td>LogP classification</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>OPV PCE</td>
          <td>Varies</td>
          <td>Organic photovoltaic efficiency</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>MOF Henry coefficients</td>
          <td>Varies</td>
          <td>CO2/CH4 adsorption</td>
      </tr>
      <tr>
          <td>Inverse design</td>
          <td>Photoswitches (Griffiths et al.)</td>
          <td>392 molecules</td>
          <td>Transition wavelengths</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fine-tuning via OpenAI API: 8 epochs, learning rate multiplier 0.02</li>
<li>GPT-3 <code>ada</code> variant (smallest model) used for all main results</li>
<li>In-context learning also tested with larger GPT-3 models and GPT-4</li>
<li>Open-source alternative: GPT-J-6B with LoRA + 8-bit quantization</li>
<li>Learning curves fit to power laws of the form $-a \exp(-bx) + c$ for data efficiency comparison</li>
<li>Validity checked using RDKit via <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>&rsquo;s <code>is_valid</code> method</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-3 ada (OpenAI API, proprietary)</li>
<li>GPT-J-6B (open-source, fine-tunable on consumer hardware)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>HEA phase</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>$F_1$ macro</td>
          <td>All classification tasks</td>
          <td>Class-balanced</td>
      </tr>
      <tr>
          <td>Cohen&rsquo;s $\kappa$</td>
          <td>Classification</td>
          <td>Used for learning curve thresholds</td>
      </tr>
      <tr>
          <td>MAE / MAPE</td>
          <td>Regression, inverse design</td>
          <td>Property prediction accuracy</td>
      </tr>
      <tr>
          <td>Validity rate</td>
          <td>Inverse design</td>
          <td>Fraction of parseable SMILES</td>
      </tr>
      <tr>
          <td>Frechet ChemNet distance</td>
          <td>Inverse design</td>
          <td>Distribution similarity</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Inverse design</td>
          <td>Synthetic accessibility</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Fine-tuning via OpenAI API (cloud compute, not user-specified)</li>
<li>Open-source experiments: consumer GPU hardware with 8-bit quantization</li>
<li>Quantum chemistry validation: GFN2-xTB for HOMO-LUMO calculations</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/kjappelbaum/gptchem">gptchem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>All experiments with OpenAI API</td>
      </tr>
      <tr>
          <td><a href="https://github.com/lamalab-org/chemlift">chemlift</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source LLM fine-tuning support</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.7806672">Zenodo (gptchem)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10233422">Zenodo (chemlift)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A., &amp; Smit, B. (2024). Leveraging large language models for predictive chemistry. <em>Nature Machine Intelligence</em>, 6(2), 161-169. <a href="https://doi.org/10.1038/s42256-023-00788-1">https://doi.org/10.1038/s42256-023-00788-1</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{jablonka2024leveraging,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Leveraging large language models for predictive chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Jablonka, Kevin Maik and Schwaller, Philippe and Ortega-Guerrero, Andres and Smit, Berend}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{161--169}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00788-1}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Transfer Approaches for Seq-to-Seq Retrosynthesis</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/</guid><description>Systematic comparison of joint training, self-training, and pre-training plus fine-tuning for Transformer-based retrosynthesis on USPTO-50K.</description><content:encoded><![CDATA[<h2 id="systematic-study-of-data-transfer-for-retrosynthesis">Systematic Study of Data Transfer for Retrosynthesis</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares three standard data transfer methods (joint training, self-training, and pre-training plus fine-tuning) applied to a Transformer-based sequence-to-sequence model for single-step retrosynthesis. The primary contribution is demonstrating that pre-training on a large augmented dataset (USPTO-Full, 877K reactions) followed by fine-tuning on the smaller target dataset (USPTO-50K) produces substantial accuracy improvements over the baseline Transformer, achieving results competitive with or superior to contemporaneous state-of-the-art graph-based models at higher values of n-best accuracy.</p>
<h2 id="bridging-the-data-gap-in-retrosynthesis-prediction">Bridging the Data Gap in Retrosynthesis Prediction</h2>
<p><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a>, the problem of predicting reactant compounds needed to synthesize a target product, has seen rapid progress through increasingly sophisticated model architectures: <a href="/notes/chemistry/molecular-design/reaction-prediction/nmt-organic-reaction-prediction/">LSTM seq-to-seq models</a>, <a href="/notes/chemistry/molecular-design/reaction-prediction/molecular-transformer/">Transformer models</a>, and graph-to-graph approaches. However, the authors identify a gap in this research trajectory. While model architecture has received extensive attention, the role of training data strategies has been largely neglected in the retrosynthesis literature.</p>
<p>The core practical problem is that high-quality supervised datasets for retrosynthesis (like USPTO-50K) tend to be small and distribution-skewed, with all samples pre-classified into ten major reaction classes. Meanwhile, larger datasets (USPTO-Full with 877K samples, USPTO-MIT with 479K samples) exist but have different distributional properties. Data transfer techniques are standard practice in computer vision, NLP, and machine translation for exactly this scenario, yet they had not been systematically evaluated for retrosynthesis at the time of this work.</p>
<p>The authors also note a contrast with Zoph et al. (2020), who found that self-training outperforms pre-training in image recognition. They hypothesize that chemical compound strings may have more universal representations than images, making pre-training more effective in the chemistry domain.</p>
<h2 id="three-data-transfer-methods-for-retrosynthesis">Three Data Transfer Methods for Retrosynthesis</h2>
<p>The paper formalizes retrosynthesis as a seq-to-seq problem where both the product $x$ and reactant set $y$ are represented as <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. A retrosynthesis model defines a likelihood $p_{\mathcal{M}}(y \mid x; \theta)$ optimized via maximum log-likelihood:</p>
<p>$$
\theta^{*} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}^{T}_{\text{Train}}} \log p(y_{i} \mid x_{i})
$$</p>
<p>Given a target dataset $\mathcal{D}^{T}$ and an augment dataset $\mathcal{D}^{A}$, three transfer methods are examined:</p>
<p><strong>Joint Training</strong> concatenates the training sets and optimizes over the union:</p>
<p>$$
\theta^{*}_{\text{joint}} = \arg\max_{\theta} \sum_{(x_{i}, y_{i}) \in \mathcal{D}_{\text{joint}}} \log p(y_{i} \mid x_{i}), \quad \mathcal{D}_{\text{joint}} = \mathcal{D}^{T}_{\text{Train}} \cup \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>This requires that both datasets share the same input/output domain (same SMILES canonicalization rules).</p>
<p><strong>Self-Training</strong> (pseudo labeling) first trains a base model on $\mathcal{D}^{T}$ alone, then uses this model to relabel the augment dataset products:</p>
<p>$$
\hat{y}_{i} = \arg\max_{y} \log p(y \mid x_{i}; \theta^{*}_{\text{single}}) \quad \text{for } x_{i} \in \mathcal{D}^{A}_{\text{Train}}
$$</p>
<p>The pseudo-labeled augment set is then combined with $\mathcal{D}^{T}_{\text{Train}}$ for joint training. This approach does not require consistent label domains between datasets.</p>
<p><strong>Pre-training plus Fine-tuning</strong> trains first on the augment dataset to obtain $\theta^{*}_{\text{pretrain}}$, then initializes fine-tuning from this checkpoint:</p>
<p>$$
\theta^{0}_{\text{finetune}} \leftarrow \theta^{*}_{\text{pretrain}}, \quad \theta^{\ell+1}_{\text{finetune}} \leftarrow \theta^{\ell}_{\text{finetune}} - \gamma^{\ell} \nabla \mathcal{L}(\mathcal{D}^{T}_{\text{Train}}) \big|_{{\theta^{\ell}_{\text{finetune}}}}
$$</p>
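<p>The data flow of the three strategies can be sketched with a toy memorization model standing in for the Transformer; <code>train</code> and <code>predict</code> here are illustrative stubs, not the paper's implementation:</p>

```python
def train(model, data):
    """Toy stand-in for seq-to-seq training: memorize product -> reactants."""
    model.update(dict(data))
    return model

def predict(model, product):
    return model.get(product, "")

def joint_training(target, augment):
    # Optimize over the union D_joint = D_target U D_augment.
    return train({}, target + augment)

def self_training(target, augment_products):
    # Train on the target set alone, pseudo-label the augment products,
    # then retrain on the combined (real + pseudo-labeled) data.
    base = train({}, target)
    pseudo = [(x, predict(base, x)) for x in augment_products]
    return train({}, target + pseudo)

def pretrain_finetune(target, augment):
    # Pre-train on the augment set, then continue training from that
    # checkpoint on the target set.
    model = train({}, augment)
    return train(model, target)
```

Note how the three functions differ only in when and with which labels the augment data enters training, which is exactly the axis the paper varies while holding the architecture fixed.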
<h2 id="experimental-setup-on-uspto-benchmarks">Experimental Setup on USPTO Benchmarks</h2>
<p>The experiments use a fixed Transformer architecture (3 self-attention layers, 500-dimensional latent vectors) implemented in OpenNMT-py, evaluated across all three transfer methods.</p>
<p><strong>Datasets:</strong></p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>40K/5K/5K (train/val/test)</td>
          <td>10 reaction classes, curated by Lowe (2012)</td>
      </tr>
      <tr>
          <td>Augment (main)</td>
          <td>USPTO-Full</td>
          <td>844K train (after cleansing)</td>
          <td>Curated by Lowe (2017)</td>
      </tr>
      <tr>
          <td>Augment (smaller)</td>
          <td>USPTO-MIT</td>
          <td>384K train (after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removed all augment dataset samples whose product SMILES appeared in any USPTO-50K subset, preventing data leakage. All datasets were re-canonicalized with a unified <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> version.</p>
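<p>The leakage filter amounts to a set-membership check on canonical product SMILES; this sketch assumes reactions are stored as <code>(product, reactants)</code> string pairs:</p>

```python
def cleanse_augment(augment, target_products):
    """Remove augment reactions whose product SMILES appears in any
    target split (train/val/test), preventing test-set leakage."""
    banned = set(target_products)
    return [(product, reactants) for product, reactants in augment
            if product not in banned]
```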
<p><strong>Evaluation</strong> uses n-best accuracy under beam search with beam width k = 50, computing accuracy at n = 1, 3, 5, 10, 20, 50. Models are selected by best validation perplexity. All experiments report means and standard deviations over 5 runs.</p>
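<p>The n-best accuracy metric reduces to checking whether the gold reactant string appears among the top-n beam candidates:</p>

```python
def n_best_accuracy(gold, beams, ns=(1, 3, 5, 10, 20, 50)):
    """For each n, the fraction of test products whose gold reactant
    string appears among the top-n beam-search candidates."""
    return {n: sum(g in cands[:n] for g, cands in zip(gold, beams)) / len(gold)
            for n in ns}
```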
<p><strong>Optimization</strong> uses Adam with cyclic learning rate scheduling (warm-up) for all methods except fine-tuning, which uses a standard non-cyclic scheduler.</p>
<p><strong>Results comparing data transfer methods (USPTO-Full augment):</strong></p>
<table>
  <thead>
      <tr>
          <th>Training Method</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Single model (No Transfer)</td>
          <td>35.3 +/- 1.4</td>
          <td>52.8 +/- 1.4</td>
          <td>58.9 +/- 1.3</td>
          <td>64.5 +/- 1.2</td>
          <td>68.8 +/- 1.2</td>
          <td>72.1 +/- 1.3</td>
      </tr>
      <tr>
          <td>Joint Training</td>
          <td>39.1 +/- 1.3</td>
          <td>63.4 +/- 0.9</td>
          <td>71.9 +/- 0.5</td>
          <td>80.1 +/- 0.2</td>
          <td>85.4 +/- 0.3</td>
          <td>89.4 +/- 0.2</td>
      </tr>
      <tr>
          <td>Self-Training</td>
          <td>41.5 +/- 1.0</td>
          <td>60.4 +/- 0.7</td>
          <td>66.1 +/- 0.7</td>
          <td>71.8 +/- 0.6</td>
          <td>75.3 +/- 0.5</td>
          <td>78.0 +/- 0.3</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune</td>
          <td>57.4 +/- 0.4</td>
          <td>77.6 +/- 0.4</td>
          <td>83.1 +/- 0.2</td>
          <td>87.4 +/- 0.4</td>
          <td>89.6 +/- 0.3</td>
          <td>90.9 +/- 0.2</td>
      </tr>
  </tbody>
</table>
<p><strong>Comparison with state-of-the-art models:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>n=1</th>
          <th>n=3</th>
          <th>n=5</th>
          <th>n=10</th>
          <th>n=20</th>
          <th>n=50</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GLN (Dai et al., 2019)</td>
          <td>Logic Network</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
          <td>88.5</td>
          <td>92.4</td>
      </tr>
      <tr>
          <td>G2Gs (Shi et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>48.9</td>
          <td>67.6</td>
          <td>72.5</td>
          <td>75.5</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>RetroXpert (Yan et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>65.6</td>
          <td>78.7</td>
          <td>80.8</td>
          <td>83.3</td>
          <td>84.6</td>
          <td>86.0</td>
      </tr>
      <tr>
          <td>GraphRetro (Somnath et al., 2020)</td>
          <td>Graph-to-Graph</td>
          <td>63.8</td>
          <td>80.5</td>
          <td>84.1</td>
          <td>85.9</td>
          <td>N/A</td>
          <td>87.2</td>
      </tr>
      <tr>
          <td>Pre-training + Fine-Tune (ours)</td>
          <td>Seq-to-Seq</td>
          <td>57.4</td>
          <td>77.6</td>
          <td>83.1</td>
          <td>87.4</td>
          <td>89.6</td>
          <td>90.9</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Primary findings:</strong></p>
<ol>
<li>All three data transfer methods improve over the no-transfer baseline across all n-best accuracy levels.</li>
<li>Pre-training plus fine-tuning provides the largest gains, improving top-1 accuracy by 22.1 absolute percentage points (from 35.3% to 57.4%) and achieving the best n=10 and n=20 accuracy among all compared models, including graph-based approaches.</li>
<li>Augment dataset size matters: using USPTO-Full (844K) yields substantially better results than USPTO-MIT (384K) for joint training and pre-training plus fine-tuning, though self-training gains are surprisingly robust to augment dataset size.</li>
<li>Manual inspection of erroneous predictions shows that over 99% of top-1 predictions from the pre-trained/fine-tuned model are chemically appropriate or sensible, even when they do not exactly match the gold-standard reactants.</li>
<li>Pre-training plus fine-tuning shows a distinct advantage in training dynamics: the 1-best and n-best accuracy curves evolve similarly during fine-tuning, unlike the single model where these curves can diverge significantly. This makes early stopping more reliable.</li>
</ol>
<p><strong>Class-wise improvements</strong> are observed across all 10 reaction classes, with the largest gains in heterocycle formation (0.40 to 0.86 at 50-best) and functional group interconversion (0.57 to 0.90).</p>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>The model struggles with compounds containing multiple similar substituents (e.g., long-chain hydrocarbons), occasionally selecting the wrong one.</li>
<li>Some reactions involving rare chemical groups (e.g., <a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a>) still produce invalid SMILES, suggesting the augment dataset lacks sufficient examples of these structures.</li>
<li>Top-1 accuracy (57.4%) lags behind the best graph-based models (RetroXpert at 65.6%), though the gap narrows at higher n values.</li>
<li>The study uses a fixed Transformer architecture without architecture-specific optimization for each transfer method.</li>
</ul>
<p><strong>Future directions</strong> proposed include freezing parts of the network during fine-tuning, applying data transfer to graph-to-graph models, and testing transferability to other retrosynthesis datasets beyond USPTO-50K.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target</td>
          <td>USPTO-50K</td>
          <td>50K reactions</td>
          <td>Curated by Lowe (2012), 10 reaction classes</td>
      </tr>
      <tr>
          <td>Augment</td>
          <td>USPTO-Full</td>
          <td>877K reactions (844K after cleansing)</td>
          <td>Curated by Lowe (2017), available via Figshare</td>
      </tr>
      <tr>
          <td>Augment (alt)</td>
          <td>USPTO-MIT</td>
          <td>479K reactions (384K after cleansing)</td>
          <td>Curated by Jin et al. (2017)</td>
      </tr>
  </tbody>
</table>
<p>Data cleansing removes augment samples whose products appear in any USPTO-50K subset, and a unified RDKit canonicalization is applied to all datasets.</p>
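<p>A minimal sketch of this cleansing rule, with the RDKit canonicalizer replaced by an identity stub so the example stays dependency-free (the real pipeline would canonicalize each product with <code>Chem.MolToSmiles</code>):</p>

```python
# Sketch of the product-overlap cleansing rule described above.
# `canonical` is a placeholder: the actual pipeline applies a unified
# RDKit canonicalization; here it is the identity on strings that are
# assumed to be canonical already.
def canonical(smiles: str) -> str:
    return smiles  # stand-in for RDKit canonicalization

def cleanse_augment(augment, target_splits):
    """Drop augment reactions whose product occurs in any target split.

    augment: list of (product, reactants) SMILES pairs
    target_splits: iterable of lists of (product, reactants) pairs
    """
    seen = {canonical(p) for split in target_splits for p, _ in split}
    return [(p, r) for p, r in augment if canonical(p) not in seen]

# Example: the second augment reaction shares a product with the
# target train split, so it is removed.
train = [("CCO", "CC=O")]
augment = [("CCN", "CC#N"), ("CCO", "CC(=O)O")]
print(cleanse_augment(augment, [train]))  # → [('CCN', 'CC#N')]
```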
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq-to-seq model (3 self-attention layers, 500-dim latent vectors)</li>
<li>Positional encoding enabled</li>
<li>Maximum sequence length: 200 tokens</li>
<li>Adam optimizer</li>
<li>Cyclic learning rate scheduler with warm-up (all methods except fine-tuning)</li>
<li>Non-cyclic scheduler for fine-tuning phase (Klein et al., 2017)</li>
<li>Beam search with k=50 for inference</li>
</ul>
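<p>The n-best accuracies reported throughout follow directly from the ranked beam-search outputs. A short sketch, assuming both candidates and gold reactants are already canonical SMILES strings:</p>

```python
def topn_accuracy(beams, gold, n):
    """Fraction of test reactions whose gold reactant string appears
    among the first n beam-search candidates.

    beams: list of ranked candidate lists (canonical SMILES)
    gold:  list of gold reactant strings (canonical SMILES)
    """
    hits = sum(1 for cands, g in zip(beams, gold) if g in cands[:n])
    return hits / len(gold)

# Two test reactions, beam width 2: the first gold string appears at
# rank 2, the second not at all.
beams = [["CC=O", "CCO"], ["OCC", "C=O"]]
gold = ["CCO", "CO"]
print(topn_accuracy(beams, gold, 1))  # → 0.0
print(topn_accuracy(beams, gold, 2))  # → 0.5
```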
<h3 id="models">Models</h3>
<ul>
<li>Implementation: OpenNMT-py</li>
<li>No pre-trained weights or model checkpoints released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Top-1 accuracy</td>
          <td>57.4%</td>
          <td>35.3% (no transfer)</td>
          <td>Pre-train + fine-tune, USPTO-Full augment</td>
      </tr>
      <tr>
          <td>Top-10 accuracy</td>
          <td>87.4%</td>
          <td>64.5% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-20 accuracy</td>
          <td>89.6%</td>
          <td>68.8% (no transfer)</td>
          <td>Best among all compared models</td>
      </tr>
      <tr>
          <td>Top-50 accuracy</td>
          <td>90.9%</td>
          <td>72.1% (no transfer)</td>
          <td>Competitive with GLN (92.4%)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. The authors mention GPU memory constraints motivating the 200-token sequence length limit.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ishiguro, K., Ujihara, K., Sawada, R., Akita, H., &amp; Kotera, M. (2020). Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis. <em>arXiv preprint arXiv:2010.00792</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ishiguro2020data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Transfer Approaches to Improve Seq-to-Seq Retrosynthesis}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ishiguro, Katsuhiko and Ujihara, Kazuya and Sawada, Ryohto and Akita, Hirotaka and Kotera, Masaaki}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2010.00792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
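<p>The three duplication policies differ only in how the per-compound pool of $m$ random SMILES is resized. A small bookkeeping sketch: the random draws are hard-coded here (real enumeration uses RDKit), and rounding $\sqrt{m}$ up to an integer copy count is an assumption on my part:</p>

```python
import math
from collections import Counter

def apply_strategy(random_smiles, strategy):
    """Resize one compound's list of m randomly generated SMILES.

    'with':    keep all m samples, duplicates included
    'without': keep one copy of each distinct string
    'reduced': keep ceil(sqrt(count)) copies of each distinct string
    """
    if strategy == "with":
        return list(random_smiles)
    counts = Counter(random_smiles)
    if strategy == "without":
        return list(counts)
    if strategy == "reduced":
        return [s for s, c in counts.items()
                for _ in range(math.ceil(math.sqrt(c)))]
    raise ValueError(strategy)

# m = 5 draws for one compound; four of them collide.
draws = ["C(C)O", "C(C)O", "C(C)O", "C(C)O", "OCC"]
print(len(apply_strategy(draws, "with")))     # → 5
print(len(apply_strategy(draws, "without")))  # → 2
print(len(apply_strategy(draws, "reduced")))  # → 3
```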
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
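<p>The aggregation and confidence measure above amount to a mean and a standard deviation over per-SMILES outputs. In this sketch the trained model $M_{\Theta}$ is stubbed with Python&rsquo;s <code>len</code>, standing in for a real network:</p>

```python
from statistics import mean, stdev

def predict_with_confidence(model, smiles_variants):
    """Aggregate per-SMILES predictions for one compound.

    Returns (mean prediction, standard deviation); a large standard
    deviation flags a low-confidence prediction.
    """
    preds = [model(s) for s in smiles_variants]
    return mean(preds), stdev(preds)

# Stub model: "predicts" the string length, in place of M_theta.
y_hat, conf = predict_with_confidence(len, ["CCO", "C(C)O", "OCC"])
print(round(y_hat, 2), round(conf, 2))  # → 3.67 1.15
```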
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and R2 of 0.712, compared to 1.031 RMSE and 0.494 R2 for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 R2) performs comparably, which the authors note without further explanation.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
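<p>A dependency-free sketch of the character-level one-hot encoding with padding; the toy alphabet and the all-zero padding columns are assumptions (the real code derives its alphabet from the dataset and the paper does not state its padding token):</p>

```python
def one_hot_encode(smiles, alphabet, max_len):
    """One-hot encode a SMILES string character by character, padding
    with all-zero rows up to max_len; returns a max_len x |alphabet|
    matrix as a list of lists."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = []
    for pos in range(max_len):
        row = [0] * len(alphabet)
        if pos < len(smiles):
            row[index[smiles[pos]]] = 1
        matrix.append(row)
    return matrix

# Toy alphabet of 8 characters; real pipelines scan the full dataset.
alphabet = sorted(set("CCO(=)N1c"))
enc = one_hot_encode("CCO", alphabet, 5)
print(len(enc), len(enc[0]))  # → 5 8
```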
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti, provided by the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours. Training with 19x augmentation achieves RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RNNs vs Transformers for Molecular Generation Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</guid><description>Empirical comparison of RNN and Transformer architectures for molecular generation using SMILES and SELFIES across three generative tasks.</description><content:encoded><![CDATA[<h2 id="an-empirical-comparison-of-sequence-architectures-for-molecular-generation">An Empirical Comparison of Sequence Architectures for Molecular Generation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares two dominant sequence modeling architectures, recurrent neural networks (RNNs) and the Transformer, for chemical language modeling. The primary contribution is a controlled experimental comparison across three generative tasks of increasing complexity, combined with an evaluation of two molecular string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The paper does not propose a new method; instead, it provides practical guidance on when each architecture is more appropriate for molecular generation.</p>
<h2 id="why-compare-rnns-and-transformers-for-molecular-design">Why Compare RNNs and Transformers for Molecular Design?</h2>
<p>Exploring unknown molecular space and designing molecules with target properties is a central goal in computational drug design. Language models trained on molecular string representations (SMILES, SELFIES) have shown the capacity to learn complex molecular distributions. RNN-based models, including LSTM and GRU variants, were the first widely adopted architectures for this task. Models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">CharRNN</a>, ReLeaSE, and conditional RNNs demonstrated success in generating focused molecular libraries. More recently, self-attention-based Transformer models (Mol-GPT, LigGPT) have gained popularity due to their parallelizability and ability to capture long-range dependencies.</p>
<p>Despite the widespread adoption of Transformers across NLP, it was not clear whether they uniformly outperform RNNs for molecular generation. Prior work by Dollar et al. showed that RNN-based models achieved higher validity than Transformer-based models in some settings. Flam-Shepherd et al. demonstrated that RNN language models could learn complex molecular distributions across challenging generative tasks. This paper extends that comparison by adding the Transformer architecture to the same set of challenging tasks and evaluating both SMILES and SELFIES representations.</p>
<h2 id="experimental-design-three-tasks-two-architectures-two-representations">Experimental Design: Three Tasks, Two Architectures, Two Representations</h2>
<p>The core experimental design uses a 2x2 setup: two architectures (RNN and Transformer) crossed with two molecular representations (SMILES and SELFIES), yielding four model variants: SM-RNN, SF-RNN, SM-Transformer, and SF-Transformer.</p>
<h3 id="three-generative-tasks">Three generative tasks</h3>
<p>The three tasks, drawn from <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">Flam-Shepherd et al.</a>, are designed with increasing complexity:</p>
<ol>
<li>
<p><strong>Penalized LogP task</strong>: Generate molecules with high penalized LogP scores (LogP minus synthetic accessibility and long-cycle penalties). The dataset is built from ZINC15 molecules with penalized LogP &gt; 4.0. Molecule sequences are relatively short (50-75 tokens).</p>
</li>
<li>
<p><strong>Multidistribution task</strong>: Learn a multimodal molecular weight distribution constructed from four distinct subsets: GDB13 (MW &lt;= 185), ZINC (185 &lt;= MW &lt;= 425), Harvard Clean Energy Project (460 &lt;= MW &lt;= 600), and POLYMERS (MW &gt; 600). This tests the ability to capture multiple modes simultaneously.</p>
</li>
<li>
<p><strong>Large-scale task</strong>: Generate large molecules from PubChem with more than 100 heavy atoms and MW ranging from 1250 to 5000. This tests long-sequence generation capability.</p>
</li>
</ol>
<h3 id="model-configuration">Model configuration</h3>
<p>Models are compared with matched parameter counts (5.2-5.3M to 36.4M parameters). Hyperparameter optimization uses random search over learning rate [0.0001, 0.001], hidden units (500-1000 for RNNs, 376-776 for Transformers), layer number [3, 5], and dropout [0.0, 0.5]. A regex-based tokenizer replaces character-by-character tokenization, reducing token lengths from 10,000 to under 3,000 for large molecules.</p>
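<p>A regex tokenizer of the kind described keeps multi-character tokens such as <code>Cl</code>, <code>Br</code>, and bracket atoms intact rather than splitting them into characters. The pattern below follows a commonly used SMILES regex (after Schwaller et al.); the paper&rsquo;s exact expression is not given, so treat it as an assumption:</p>

```python
import re

# Commonly used SMILES tokenization pattern; an assumption standing in
# for the paper's unspecified regex.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles: str):
    """Split a SMILES string into chemistry-aware tokens, so 'Cl',
    'Br', and bracket atoms like '[nH]' stay single tokens."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize("CC(=O)Oc1ccccc1"))
# → ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
print(tokenize("Clc1ccc(Br)cc1"))
# → ['Cl', 'c', '1', 'c', 'c', 'c', '(', 'Br', ')', 'c', 'c', '1']
```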
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>The evaluation covers multiple dimensions:</p>
<ul>
<li><strong>Standard metrics</strong>: validity, uniqueness, novelty</li>
<li><strong>Molecular properties</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, LogP, SA, QED, Bertz complexity (BCT), natural product likeness (NP), molecular weight (MW)</li>
<li><strong>Wasserstein distance</strong>: measures distributional similarity between generated and training molecules for each property</li>
<li><strong>Tanimoto similarity</strong>: structural and scaffold similarity between generated and training molecules</li>
<li><strong>Token length (TL)</strong>: comparison of generated vs. training sequence lengths</li>
</ul>
<p>For each task, 10,000 molecules are generated and evaluated.</p>
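<p>Of these metrics, Tanimoto similarity reduces to Jaccard similarity over fingerprint on-bits. This sketch uses toy bit sets in place of the Morgan fingerprints (e.g., from RDKit) a real evaluation would compare:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given
    as sets of on-bit indices: |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy on-bit sets standing in for a generated and a training molecule.
print(tanimoto({1, 2, 3, 8}, {2, 3, 8, 9}))  # → 0.6
```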
<h2 id="key-results-across-tasks">Key Results Across Tasks</h2>
<h3 id="penalized-logp-task">Penalized LogP task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.56</td>
          <td>0.12</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>16.61</td>
          <td>0.09</td>
          <td>5.90</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.63</td>
          <td>0.25</td>
          <td>0.42</td>
          <td>0.02</td>
          <td>36.43</td>
          <td>0.23</td>
          <td>2.35</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.83</td>
          <td>0.18</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>23.77</td>
          <td>0.09</td>
          <td>7.99</td>
          <td>0.84</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.97</td>
          <td>0.22</td>
          <td>0.47</td>
          <td>0.02</td>
          <td>44.43</td>
          <td>0.28</td>
          <td>5.04</td>
          <td>0.53</td>
      </tr>
  </tbody>
</table>
<p>RNN-based models achieve smaller Wasserstein distances across most properties. The authors attribute this to LogP being computed as a sum of atomic contributions (a local property), which aligns with RNNs&rsquo; strength in capturing local structural features. RNNs also generate ring counts closer to the training distribution (4.10 for SM-RNN vs. 4.04 for SM-Transformer, with the training data at 4.21). The Transformer performs better on global structural similarity (higher Tanimoto similarity to training data).</p>
<h3 id="multidistribution-task">Multidistribution task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.16</td>
          <td>0.07</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>18.34</td>
          <td>0.02</td>
          <td>7.07</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.46</td>
          <td>0.38</td>
          <td>0.55</td>
          <td>0.03</td>
          <td>110.72</td>
          <td>0.24</td>
          <td>10.00</td>
          <td>1.58</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.16</td>
          <td>0.16</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>39.94</td>
          <td>0.02</td>
          <td>10.03</td>
          <td>1.28</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.73</td>
          <td>0.37</td>
          <td>0.63</td>
          <td>0.04</td>
          <td>107.46</td>
          <td>0.30</td>
          <td>17.57</td>
          <td>2.40</td>
      </tr>
  </tbody>
</table>
<p>Both SMILES-based models captured all four modes of the MW distribution well. While RNNs had smaller overall Wasserstein distances, the Transformer fitted the higher-MW modes better. This aligns with the observation that longer molecular sequences (which correlate with higher MW) favor the Transformer&rsquo;s global attention mechanism over the RNN&rsquo;s sequential processing.</p>
<h3 id="large-scale-task">Large-scale task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.46</td>
          <td>1.89</td>
          <td>0.20</td>
          <td>0.01</td>
          <td>307.09</td>
          <td>0.03</td>
          <td>105.29</td>
          <td>12.05</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.65</td>
          <td>1.78</td>
          <td>0.43</td>
          <td>0.01</td>
          <td>456.98</td>
          <td>0.14</td>
          <td>100.79</td>
          <td>15.26</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.36</td>
          <td>1.64</td>
          <td>0.07</td>
          <td>0.01</td>
          <td>172.93</td>
          <td>0.02</td>
          <td>59.04</td>
          <td>7.41</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.91</td>
          <td>2.82</td>
          <td>0.47</td>
          <td>0.01</td>
          <td>464.75</td>
          <td>0.18</td>
          <td>92.91</td>
          <td>11.57</td>
      </tr>
  </tbody>
</table>
<p>The Transformer demonstrates a clear advantage on large molecules. SM-Transformer achieves substantially lower Wasserstein distances than SM-RNN across nearly all properties, with particularly large improvements in BCT (172.93 vs. 307.09) and MW (59.04 vs. 105.29). The Transformer also produces better Tanimoto similarity scores and more accurate token length distributions.</p>
<h3 id="standard-metrics-across-all-tasks">Standard metrics across all tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>SM-RNN</th>
          <th>SF-RNN</th>
          <th>SM-Transformer</th>
          <th>SF-Transformer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>Valid</td>
          <td>0.90</td>
          <td>1.00</td>
          <td>0.89</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Uniqueness</td>
          <td>0.98</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Novelty</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.71</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Valid</td>
          <td>0.95</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Uniqueness</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>1.00</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Novelty</td>
          <td>0.91</td>
          <td>0.98</td>
          <td>0.91</td>
          <td>0.98</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Valid</td>
          <td>0.84</td>
          <td>1.00</td>
          <td>0.88</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Uniqueness</td>
          <td>0.99</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Novelty</td>
          <td>0.85</td>
          <td>0.92</td>
          <td>0.86</td>
          <td>0.94</td>
      </tr>
  </tbody>
</table>
<p>SELFIES achieves 100% validity across all tasks by construction, while SMILES validity drops for large molecules. The Transformer achieves slightly higher validity than the RNN for SMILES-based models, particularly on the large-scale task (0.88 vs. 0.84).</p>
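<p>The three ratios in the table above are typically computed as follows. A minimal sketch (the <code>is_valid</code> predicate is a stand-in of mine; real validity checking requires a cheminformatics parser such as RDKit's <code>MolFromSmiles</code>):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty ratios for generated molecules.

    generated: list of SMILES strings sampled from the model
    training_set: set of SMILES strings the model was trained on
    is_valid: predicate standing in for a real parser (e.g. RDKit)
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)                     # uniqueness is over the valid subset
    novel = unique - set(training_set)      # novelty is over the unique set
    n = len(generated)
    return {
        "validity": len(valid) / n,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

<p>Computing uniqueness over valid molecules and novelty over unique ones follows the common convention in generative-chemistry benchmarks; with this convention, SELFIES models trivially score 1.0 on validity because every SELFIES string decodes to some molecule.</p>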
<h2 id="conclusions-and-practical-guidelines">Conclusions and Practical Guidelines</h2>
<p>The central finding is that neither architecture universally dominates. The choice between RNNs and Transformers should depend on the characteristics of the molecular data:</p>
<ul>
<li>
<p><strong>RNNs are preferred</strong> when molecular properties depend on local structural features (e.g., LogP, ring counts) and when sequences are relatively short. They better capture local fragment distributions.</p>
</li>
<li>
<p><strong>Transformers are preferred</strong> when dealing with large molecules (high MW, long sequences) where global attention can capture the overall distribution more effectively. RNNs progressively lose information from early tokens on long sequences.</p>
</li>
<li>
<p><strong>SMILES outperforms SELFIES</strong> on property distribution metrics across nearly all tasks and models. While SELFIES guarantees 100% syntactic validity, its generated molecules show worse distributional fidelity to training data. The authors argue that validity is a less important concern than property fidelity, since invalid SMILES can be filtered easily.</p>
</li>
</ul>
<p>The authors acknowledge that longer sequences remain challenging for both architectures. For Transformers, the quadratic growth of the attention matrix limits scalability. For RNNs, the vanishing gradient problem limits effective context length.</p>
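<p>The quadratic cost referred to here is easy to make concrete. A back-of-envelope sketch (the function name and the fp32 assumption are mine; half precision or attention approximations change the constant, not the scaling):</p>

```python
def attention_matrix_bytes(seq_len, n_heads, bytes_per_elem=4):
    """Memory for one layer's attention score matrices.

    Each head materializes a seq_len x seq_len score matrix,
    so memory grows quadratically with sequence length.
    """
    return n_heads * seq_len * seq_len * bytes_per_elem

# Doubling the sequence length quadruples the attention memory,
# which is what limits Transformers on very long molecular sequences.
</antml>```

<p>This is why the large-molecule task (MW up to 5000, hence very long SMILES) sits near the practical limit for both architectures: the Transformer pays quadratic memory, while the RNN pays in gradient signal over long ranges.</p>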
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Task 1</td>
          <td>ZINC15 (penalized LogP &gt; 4.0)</td>
          <td>Not specified</td>
          <td>High penalized LogP molecules</td>
      </tr>
      <tr>
          <td>Task 2</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> + ZINC + CEP + POLYMERS</td>
          <td>~200K</td>
          <td>Multimodal MW distribution</td>
      </tr>
      <tr>
          <td>Task 3</td>
          <td>PubChem (&gt;100 heavy atoms)</td>
          <td>Not specified</td>
          <td>MW range 1250-5000</td>
      </tr>
  </tbody>
</table>
<p>Data processing code available at <a href="https://github.com/danielflamshep/genmoltasks">https://github.com/danielflamshep/genmoltasks</a> (from the original Flam-Shepherd et al. study).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenizer (not character-by-character)</li>
<li><strong>Hyperparameter search</strong>: Random search over learning rate [0.0001, 0.001], hidden units, layers [3, 5], dropout [0.0, 0.5]</li>
<li><strong>Selection</strong>: Top 20% by sum of valid + unique + novelty, then final selection on all indicators</li>
<li><strong>Generation</strong>: 10K molecules per model per task</li>
</ul>
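<p>A regex-based tokenizer keeps multi-character chemical units intact instead of splitting them into characters. The sketch below uses a simplified pattern common in this literature; the exact regex the authors used is not given in the paper:</p>

```python
import re

# Multi-character tokens (bracket atoms, Cl, Br, two-digit ring closures)
# must appear before single characters in the alternation, so that "Cl"
# is one token rather than carbon followed by a stray "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[0-9]|[=#\-\+\(\)\\/@\.~\*\$:])"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Round-trip check: every character must belong to some token.
    assert "".join(tokens) == smiles, "untokenizable characters present"
    return tokens
```

<p>For example, <code>tokenize("CCl")</code> yields <code>["C", "Cl"]</code> rather than three characters, which is exactly the distinction between regex-based and character-by-character tokenization.</p>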
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN variants</td>
          <td>5.2M - 36.4M</td>
          <td>RNN (LSTM/GRU)</td>
      </tr>
      <tr>
          <td>Transformer variants</td>
          <td>5.3M - 36.4M</td>
          <td>Transformer decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Wasserstein distance for property distributions (FCD, LogP, SA, QED, BCT, NP, MW, TL), Tanimoto similarity (molecular and scaffold), validity, uniqueness, novelty.</p>
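<p>For one-dimensional property distributions with equal sample sizes, the Wasserstein-1 distance reduces to the mean absolute gap between sorted samples. A minimal sketch (in practice one would call <code>scipy.stats.wasserstein_distance</code>, which also handles unequal sizes and weights):</p>

```python
def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size empirical samples.

    Sorting aligns the empirical quantile functions; W1 is then the
    average gap between matched quantiles.
    """
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)
```

<p>Applied per property (LogP, SA, QED, MW, ...), smaller values mean the generated distribution hews closer to the training distribution, which is how the tables above should be read.</p>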
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/viko-3/language_model">trans_language</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transformer implementation by the authors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">genmoltasks</a></td>
          <td>Code/Data</td>
          <td>Apache-2.0</td>
          <td>Dataset construction from Flam-Shepherd et al.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Wang, Z., Zeng, X., Li, Y., Li, P., Ye, X., &amp; Sakurai, T. (2023). Molecular language models: RNNs or transformer? <em>Briefings in Functional Genomics</em>, 22(4), 392-400. <a href="https://doi.org/10.1093/bfgp/elad012">https://doi.org/10.1093/bfgp/elad012</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2023molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular language models: RNNs or transformer?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Yangyang and Wang, Zixu and Zeng, Xiangxiang and Li, Yayang and Li, Pengyong and Ye, Xiucai and Sakurai, Tetsuya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Functional Genomics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bfgp/elad012}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
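<p>The two ingredients of these metrics are simple to state in code. A sketch of the property filter and the greedy diverse selection, with fingerprints represented as sets of on-bits (function names and the <code>stats</code> layout are mine; the paper's implementation uses ECFP4 fingerprints via RDKit):</p>

```python
def passes_property_filter(mw, logp, stats, k=4.0):
    """Keep a molecule only if MW and LogP lie within mu +/- k*sigma
    of the pre-training data (ZINC250k in the paper).

    stats: {"mw": (mean, std), "logp": (mean, std)}
    """
    (mw_mu, mw_sd), (lp_mu, lp_sd) = stats["mw"], stats["logp"]
    return abs(mw - mw_mu) <= k * mw_sd and abs(logp - lp_mu) <= k * lp_sd

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def diverse_top_k(scored, k=10, max_sim=0.35):
    """Greedily select top-scoring molecules whose similarity to every
    previously selected molecule does not exceed max_sim.

    scored: list of (score, fingerprint_set) pairs.
    """
    picked = []
    for score, fp in sorted(scored, key=lambda t: -t[0]):
        if all(tanimoto(fp, p) <= max_sim for _, p in picked):
            picked.append((score, fp))
        if len(picked) == k:
            break
    return picked
```

<p>The Combined metric simply applies the filter first and then runs the diverse selection over whatever survives.</p>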
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
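<p>The underlying AUC Top-10 quantity is the normalized area under the "mean of the best 10 scores so far" curve over the oracle budget. One reasonable sketch (the official PMO code records the top-10 mean at fixed call intervals, so constants may differ; the early-termination handling here is my assumption):</p>

```python
import heapq

def auc_top10(scores, budget=10_000, k=10):
    """Normalized area under the running top-k-mean curve vs. oracle calls.

    scores: oracle values in the order the model requested them.
    Returns a value in [0, 1] when individual scores are in [0, 1].
    """
    top = []        # min-heap holding the best k scores seen so far
    area = 0.0
    for s in scores[:budget]:
        heapq.heappush(top, s)
        if len(top) > k:
            heapq.heappop(top)
        area += sum(top) / len(top)
    # If the run stopped early, the final top-k mean carries forward.
    if len(scores) < budget and top:
        area += (budget - len(scores)) * (sum(top) / len(top))
    return area / budget
```

<p>Because the area accumulates from the first oracle call, a model that finds good molecules early scores higher than one that reaches the same final top-10 late, which is precisely the sample-efficiency framing.</p>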
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Fine-Tuning GPT-3 for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/fine-tuning-gpt3-molecular-properties/</guid><description>Evaluating fine-tuned GPT-3 ada models for HOMO/LUMO classification of organic semiconductors from SMILES, with ablation and robustness analysis.</description><content:encoded><![CDATA[<h2 id="gpt-3-as-a-molecular-property-classifier">GPT-3 as a Molecular Property Classifier</h2>
<p>This is an <strong>Empirical</strong> paper that evaluates the effectiveness of fine-tuning OpenAI&rsquo;s GPT-3 language model (specifically the &ldquo;ada&rdquo; base model) for predicting electronic and functional properties of organic molecules. Rather than proposing a new architecture, the work systematically tests whether a general-purpose LLM can learn chemically meaningful patterns from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings when fine-tuned on classification tasks. The primary contribution is the empirical characterization of GPT-3&rsquo;s performance, robustness, and limitations for molecular property prediction, including extensive ablation studies.</p>
<h2 id="why-fine-tune-a-general-purpose-llm-for-chemistry">Why Fine-Tune a General-Purpose LLM for Chemistry?</h2>
<p>Machine learning for molecular property prediction typically relies on specialized representations: molecular graphs processed by graph neural networks (GNNs), engineered molecular descriptors, or domain-specific chemical language models trained from scratch on SMILES or <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. These approaches require varying levels of domain expertise to design the inputs and architecture.</p>
<p>GPT-3, pre-trained on vast amounts of general text, already has an internal representation of language structure. SMILES notation, as a text-based molecular representation, can be treated as a &ldquo;language&rdquo; with its own syntax. The authors hypothesize that GPT-3&rsquo;s language understanding capabilities, combined with the human-readable nature of SMILES, may enable the model to recognize significant patterns within chemical structures and capture structure-property dependencies. The key question is whether fine-tuning alone is sufficient, or whether specialized architectures provide fundamental advantages.</p>
<p>Prior work by <a href="/notes/chemistry/llm-applications/leveraging-llms-predictive-chemistry/">Jablonka et al.</a> showed that fine-tuned GPT-3 could perform surprisingly well on low-data chemistry tasks, sometimes surpassing dedicated models. This paper extends that investigation with a focus on electronic properties (<a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO and LUMO</a> energies) of <a href="https://en.wikipedia.org/wiki/Organic_semiconductor">organic semiconductors</a>, with deeper analysis of robustness and failure modes.</p>
<h2 id="smiles-to-classification-via-prompt-completion-fine-tuning">SMILES-to-Classification via Prompt-Completion Fine-Tuning</h2>
<p>The core approach is straightforward. Each training example is a prompt-completion pair in JSONL format:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-json" data-lang="json"><span style="display:flex;"><span>{<span style="color:#f92672">&#34;prompt&#34;</span>: <span style="color:#e6db74">&#34;SMILES_string&#34;</span>, <span style="color:#f92672">&#34;completion&#34;</span>: <span style="color:#e6db74">&#34;class_label&#34;</span>}
</span></span></code></pre></div><p>The SMILES string serves as the prompt, and the fine-tuned model learns to complete it with a class label (0/1 for binary, 0/1/2 for ternary, 0/1/2/3 for quaternary classification). Class thresholds are determined by equally segmenting the property value range. The authors use GPT-3&rsquo;s default tokenizer, which breaks SMILES strings into subword tokens that do not correspond to chemically meaningful units (e.g., &ldquo;c1ccccc1&rdquo; for benzene gets tokenized into arbitrary fragments).</p>
<p>This design choice has important implications. The model must learn chemical semantics from token patterns that are not aligned with atoms or bonds. The authors note this as a limitation and hypothesize that a chemistry-aware tokenizer could improve performance.</p>
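<p>Constructing the fine-tuning file from raw property values is mechanical: bin each value into equal-width classes over the observed range, then emit one JSONL record per molecule. A minimal sketch (function name is mine; degenerate ranges where all values coincide are not handled):</p>

```python
import json

def make_finetune_records(smiles_list, values, n_classes):
    """Build JSONL prompt-completion records by equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_classes
    records = []
    for smi, v in zip(smiles_list, values):
        # Values at the top edge of the range fall into the last bin.
        label = min(int((v - lo) / width), n_classes - 1)
        records.append(json.dumps({"prompt": smi, "completion": str(label)}))
    return records
```

<p>Writing one record per line produces the JSONL format the OpenAI fine-tuning endpoint expects.</p>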
<h2 id="experimental-setup-and-baseline-comparisons">Experimental Setup and Baseline Comparisons</h2>
<h3 id="datasets">Datasets</h3>
<p>The primary dataset is a collection of 48,182 organic semiconductor (OSC) molecules extracted from the <a href="https://en.wikipedia.org/wiki/Cambridge_Structural_Database">Cambridge Structural Database</a> (CSD). Each molecule has a SMILES representation and quantum-chemically computed electronic properties (HOMO and LUMO energies). A secondary dataset of 572 aromatic molecular photocatalysts (AMPs) with experimentally measured <a href="https://en.wikipedia.org/wiki/Hydrogen_evolution_reaction">hydrogen evolution rates</a> (HER) provides an additional test case.</p>
<h3 id="baselines">Baselines</h3>
<p>Three baselines are compared:</p>
<ol>
<li><strong>Directed message-passing neural network (D-MPNN)</strong> via Chemprop, using default molecular graph representations</li>
<li><strong>RDKit molecular descriptors + SVM</strong>, using the top 20 descriptors selected by SelectKBest</li>
<li><strong>Prior ML results</strong> from the original AMP dataset paper (using engineered domain-specific features)</li>
</ol>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Classes</th>
          <th>GPT-3 Accuracy</th>
          <th>GNN Accuracy</th>
          <th>Descriptors Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>3</td>
          <td>0.92</td>
          <td>0.94</td>
          <td>0.87</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>4</td>
          <td>0.68</td>
          <td>0.75</td>
          <td>0.47</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>HOMO</td>
          <td>5</td>
          <td>0.60</td>
          <td>0.68</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>OSCs (48,182)</td>
          <td>LUMO</td>
          <td>3</td>
          <td>0.94</td>
          <td>0.94</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>AMPs (572)</td>
          <td>HER</td>
          <td>2</td>
          <td>0.88</td>
          <td>0.86</td>
          <td>0.87</td>
      </tr>
  </tbody>
</table>
<p>For ternary classification, GPT-3 performs on par with GNNs (0.92 vs. 0.94 for HOMO; 0.94 vs. 0.94 for LUMO). Performance degrades more steeply than GNNs as the number of classes increases: at 5-class HOMO, GPT-3 achieves only 0.60 vs. GNN&rsquo;s 0.68. On the small AMP dataset (572 molecules), GPT-3 slightly outperforms the GNN (0.88 vs. 0.86).</p>
<h3 id="learning-curves">Learning Curves</h3>
<p>The data efficiency analysis reveals that GPT-3 needs at least 20% of the OSC dataset (approximately 9,600 molecules) to reach accuracy above 0.9. Below 1,000 training points, accuracy drops below 0.6. GNNs outperform GPT-3 in this low-data regime, which the authors attribute to (1) the molecular graph being chemically more expressive than SMILES for these tasks, and (2) fine-tuning requiring sufficient data to capture relevant SMILES patterns.</p>
<h3 id="ablation-study-1-single-atom-removal">Ablation Study 1: Single-Atom Removal</h3>
<p>The authors tested robustness by removing individual non-hydrogen, non-carbon atoms from SMILES strings and replacing them with a <code>&lt;missing&gt;</code> token. Out of 45,763 ablation tests on 7,714 correctly predicted molecules, 95.2% retained the same classification. This suggests the model captures redundant structural information rather than relying on any single atom.</p>
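<p>A minimal string-level sketch of this ablation (my own illustration, not the authors' implementation) replaces each heteroatom occurrence in a SMILES string with a <code>&lt;missing&gt;</code> token:</p>

```python
import re

# Match two-letter halogens first, then common single-letter heteroatoms
# (uppercase aliphatic and lowercase aromatic forms). Bracket atoms and
# rarer elements are ignored in this simplified sketch.
HETEROATOM = re.compile(r"Cl|Br|[NOSPFI]|[nos]")

def ablate_atoms(smiles):
    """Yield one variant per heteroatom, with that atom replaced by <missing>."""
    for m in HETEROATOM.finditer(smiles):
        yield smiles[: m.start()] + "<missing>" + smiles[m.end():]

# Acetanilide has two heteroatoms (one O, one N), so two ablated variants:
for variant in ablate_atoms("CC(=O)Nc1ccccc1"):
    print(variant)
```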
<h3 id="ablation-study-2-single-group-removal">Ablation Study 2: Single-Group Removal</h3>
<p>Fifteen chemical groups (nitrile, nitro, enamine, ketone, etc.) were individually ablated. The model attributed the most importance to the acetylene (81% prediction agreement after ablation), enamine (85%), nitro (86%), and ketone (87%) groups: removing any of these altered HOMO predictions in more than 10% of tests. Notably, groups that participate in electronic pi-conjugation tended to be more &ldquo;important&rdquo; to the model&rsquo;s HOMO predictions.</p>
<p>When ablated atoms were replaced with random elements instead of the <code>&lt;missing&gt;</code> token, the model failed in 80% of cases for a representative molecule. This suggests the model may &ldquo;fill in&rdquo; the missing information when seeing the <code>&lt;missing&gt;</code> token but gets confused by incorrect atomic identities.</p>
<h3 id="predicting-unknown-molecular-families">Predicting Unknown Molecular Families</h3>
<p>The authors held out entire families of <a href="https://en.wikipedia.org/wiki/Polycyclic_aromatic_hydrocarbon">polycyclic aromatic hydrocarbons</a> (naphthalene, anthracene, tetracene, pyrene, perylene), quinones, and imides during training, then tested predictions on these unseen families. Results for the first five PAH families:</p>
<table>
  <thead>
      <tr>
          <th>Fragment Family</th>
          <th>Molecules</th>
          <th>GPT-3 HOMO</th>
          <th>GNN HOMO</th>
          <th>GPT-3 LUMO</th>
          <th>GNN LUMO</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Naphthalene</td>
          <td>475</td>
          <td>0.94</td>
          <td>0.95</td>
          <td>0.88</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Anthracene</td>
          <td>577</td>
          <td>0.99</td>
          <td>1.00</td>
          <td>0.93</td>
          <td>0.97</td>
      </tr>
      <tr>
          <td>Tetracene</td>
          <td>72</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>0.90</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Pyrene</td>
          <td>237</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Perylene</td>
          <td>41</td>
          <td>0.98</td>
          <td>1.00</td>
          <td>0.98</td>
          <td>0.95</td>
      </tr>
  </tbody>
</table>
<p>GPT-3 generalizes well to unknown PAH families, though GNNs have a slight edge on HOMO prediction. Performance degrades somewhat for quinones and imides.</p>
<h3 id="canonical-vs-non-canonical-smiles">Canonical vs. Non-Canonical SMILES</h3>
<p>A model fine-tuned only on canonical SMILES performed poorly on non-canonical variants: only 1,622 of 8,578 molecules achieved consistent predictions across all 11 SMILES variants (1 canonical + 10 non-canonical). Augmenting the training data with 5 non-canonical SMILES per molecule dramatically improved consistency to 7,243 of 8,578 molecules and nearly eliminated erroneous (non-class-label) responses. This finding highlights that GPT-3&rsquo;s pattern matching is highly sensitive to surface-level string representation and benefits substantially from <a href="/notes/chemistry/molecular-representations/notations/randomized-smiles-generative-models/">SMILES enumeration</a> <a href="/notes/chemistry/molecular-design/property-prediction/maxsmi-smiles-augmentation-property-prediction/">data augmentation</a>.</p>
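<p>The augmentation step can be sketched with RDKit's <code>doRandom</code> flag, which emits a random (non-canonical) atom ordering on each call. This is a generic sketch of SMILES enumeration under the counts described above, assuming a reasonably recent RDKit:</p>

```python
from rdkit import Chem

def enumerate_smiles(smiles, n=5, max_tries=500):
    """Return up to n distinct randomized SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    tries = 0
    while len(variants) < n and tries < max_tries:
        tries += 1
        # canonical=False + doRandom=True yields a random atom ordering.
        variants.add(Chem.MolToSmiles(mol, canonical=False, doRandom=True))
    return sorted(variants)

# Every variant parses back to the same canonical SMILES:
canonical = Chem.MolToSmiles(Chem.MolFromSmiles("CC(=O)Nc1ccccc1"))
for s in enumerate_smiles("CC(=O)Nc1ccccc1"):
    print(s)
    assert Chem.MolToSmiles(Chem.MolFromSmiles(s)) == canonical
```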
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>The main findings are:</p>
<ol>
<li>Fine-tuned GPT-3 (ada) achieves competitive accuracy with GNNs for coarse-grained (ternary) HOMO/LUMO classification, but performance drops more steeply with finer granularity.</li>
<li>The model shows robustness to single-atom and single-group ablation, suggesting it captures chemically redundant patterns.</li>
<li>Generalization to held-out molecular families is strong, though GNNs maintain a slight advantage.</li>
<li>SMILES augmentation with non-canonical variants is essential for consistent predictions.</li>
</ol>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Black-box nature</strong>: GPT-3 provides no physical insight or interpretability, unlike GNN models where molecular graph features can be augmented with domain knowledge.</li>
<li><strong>Tokenization</strong>: The generic tokenizer does not respect chemical structure. A chemistry-aware tokenizer could improve data efficiency and accuracy.</li>
<li><strong>SELFIES underperformance</strong>: Initial tests with SELFIES did not improve over SMILES, likely because generic tokenization stripped away the extra chemical information SELFIES encodes.</li>
<li><strong>Cost</strong>: Fine-tuning via OpenAI&rsquo;s API cost approximately $500 for the experiments, and the model is closed-source, preventing systematic interpretation of learned representations.</li>
<li><strong>Classification only</strong>: The approach performs coarse-grained classification rather than regression, limiting utility for applications requiring precise numerical predictions.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>OSC molecules from CSD</td>
          <td>48,182</td>
          <td>SMILES + DFT-computed HOMO/LUMO energies</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Aromatic molecular photocatalysts (AMPs)</td>
          <td>572</td>
          <td>Experimental hydrogen evolution rates</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fine-tuning uses OpenAI&rsquo;s GPT-3 &ldquo;ada&rdquo; base model via the API</li>
<li>Prompt-completion pairs in JSONL format</li>
<li>Default GPT-3 tokenizer</li>
<li>80/20 train/test split for OSC; stratified 10-fold CV for AMPs</li>
<li>Non-canonical SMILES generated using RDKit (10 per molecule for testing, 5 per molecule for augmented training)</li>
</ul>
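<p>The prompt-completion records can be sketched as below. The <code>###</code> separator and leading-space completion follow OpenAI's legacy fine-tuning conventions; the paper's exact template is not reproduced in this note, so treat the details as illustrative assumptions.</p>

```python
import json

def make_record(smiles, label):
    """One JSONL fine-tuning record in OpenAI's prompt/completion format.

    The '\\n\\n###\\n\\n' separator and leading-space completion are the
    documented legacy conventions; the paper's exact template may differ.
    """
    return json.dumps({
        "prompt": f"{smiles}\n\n###\n\n",
        "completion": f" {label}",
    })

records = [
    make_record("CC(=O)Nc1ccccc1", 1),
    make_record("c1ccccc1", 0),
]
print("\n".join(records))  # one JSON object per line -> JSONL
```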
<h3 id="models">Models</h3>
<ul>
<li>GPT-3 &ldquo;ada&rdquo; (fine-tuned, closed-source, accessed via OpenAI API)</li>
<li>Chemprop D-MPNN baseline (open-source)</li>
<li>RDKit descriptors + scikit-learn SVM baseline</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best GPT-3 Value</th>
          <th>Best GNN Value</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy</td>
          <td>0.92</td>
          <td>0.94</td>
          <td>3-class HOMO (OSCs)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>0.94</td>
          <td>0.94</td>
          <td>3-class LUMO (OSCs)</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>0.88</td>
          <td>0.86</td>
          <td>2-class HER (AMPs)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify local hardware requirements. All GPT-3 fine-tuning was conducted via OpenAI&rsquo;s cloud API at a total cost of approximately $500.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XieZikai/Chem-GPT-Finetune">Chem-GPT-Finetune</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Python code and datasets for fine-tuning and evaluation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xie, Z., Evangelopoulos, X., Omar, O. H., Troisi, A., Cooper, A. I., &amp; Chen, L. (2024). Fine-tuning GPT-3 for machine learning electronic and functional properties of organic molecules. <em>Chemical Science</em>, 15(2), 500-510.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xie2024finetuning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fine-tuning {GPT-3} for machine learning electronic and functional properties of organic molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Zikai and Evangelopoulos, Xenophon and Omar, {\&#34;O}mer H. and Troisi, Alessandro and Cooper, Andrew I. and Chen, Linjiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{500--510}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D3SC04610A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking LLMs for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/benchmarking-llms-molecule-prediction/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/benchmarking-llms-molecule-prediction/</guid><description>Empirical evaluation of GPT-3.5, GPT-4, and Llama-2 on six OGB molecular property prediction tasks, comparing LLMs against GNNs and language models.</description><content:encoded><![CDATA[<h2 id="empirical-benchmarking-of-llms-on-molecular-tasks">Empirical Benchmarking of LLMs on Molecular Tasks</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates whether large language models (LLMs) can handle molecular property prediction tasks. The primary contribution is a structured benchmarking framework that compares LLMs (GPT-3.5, GPT-4, Llama-2-7b, Llama-2-13b) against conventional ML models (DeBERTa, GCN, GIN) across six standard molecular benchmark datasets from OGB. The study also introduces a collaborative framework where LLM-generated responses augment ML model features.</p>
<h2 id="why-benchmark-llms-on-molecular-property-prediction">Why Benchmark LLMs on Molecular Property Prediction</h2>
<p>LLMs have demonstrated strong capabilities across many NLP tasks, but their effectiveness on structured scientific data, particularly molecular graphs, remains unclear. Prior work has explored LLMs for chemistry tasks such as <a href="/notes/chemistry/molecular-design/reaction-prediction/">reaction prediction</a>, <a href="/notes/chemistry/molecular-representations/name-translation/transformer-chemical-name-to-smiles/">name-to-SMILES translation</a>, and molecule description. However, a systematic evaluation of LLMs on standard molecular property prediction benchmarks (classification and regression) with controlled prompt engineering has been lacking.</p>
<p>The key questions motivating this work:</p>
<ol>
<li>Can LLMs effectively predict molecular properties when given <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings and textual descriptions of molecular structure?</li>
<li>Does encoding geometric structure information as text help LLMs understand molecules?</li>
<li>Can LLM responses serve as useful augmentations for traditional ML models?</li>
</ol>
<h2 id="prompt-engineering-for-molecular-prediction">Prompt Engineering for Molecular Prediction</h2>
<p>The core methodological contribution is a systematic prompt engineering framework for querying LLMs on molecule tasks. Given a molecule $\mathcal{G} = (S, G, D)$ where $S$ is the SMILES string, $G$ is the geometric structure, and $D$ is a generated text description of atom features and graph structure, the authors design several prompt templates:</p>
<p><strong>Zero-shot prompts</strong> (three variants):</p>
<ul>
<li><strong>Input-Feature (IF)</strong>: Asks for general insights about a molecule given its SMILES and description</li>
<li><strong>Input-Prediction (IP)</strong>: Asks for a direct prediction in a specified format</li>
<li><strong>Input-Explanation (IE)</strong>: Asks for both a prediction and an explanation</li>
</ul>
<p>Each zero-shot prompt has a variant with descriptions (IFD, IPD, IED) that encodes atom features and graph structure as additional text following the approach of Fatemi et al. (2023).</p>
<p><strong>Few-shot prompts (FS-k)</strong>: Provide $k$ labeled examples as in-context learning demonstrations before the query. The study uses $k \in \{1, 2, 3\}$.</p>
<p>The authors also explore three predictive model pipelines:</p>
<ul>
<li><strong>Solo</strong>: A single model (LLM, LM, or GNN) makes predictions independently</li>
<li><strong>Duo</strong>: An ML model receives both the original features and LLM-generated responses as input</li>
<li><strong>Trio</strong>: A GNN receives SMILES embeddings from an LM plus LLM response embeddings alongside geometric features</li>
</ul>
<p>The LLM prediction can be formalized as $A = f_{LLM}(Q)$ where $Q$ is the prompt and $A$ is the response. For the ML augmentation pipelines, the LM-based Duo model predicts as:</p>
<p>$$\hat{y} = f_{LM}(S, R)$$</p>
<p>where $R$ is the LLM response, and the GNN-based Trio model predicts as:</p>
<p>$$\hat{y} = f_{GNN}(G, X)$$</p>
<p>where $X$ includes features derived from both SMILES embeddings and LLM response embeddings.</p>
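<p>The Duo idea reduces to concatenating the original molecular features with an embedding of the LLM's response and training one downstream model on the joint input. A minimal sketch with synthetic stand-ins for both feature blocks (not the paper's actual features or embedding model):</p>

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_mols = 300

# Stand-ins: molecular features (e.g., fingerprint-like vectors) and a
# text-embedding of the LLM's response for each molecule. Both synthetic.
mol_features = rng.normal(size=(n_mols, 64))
llm_embeddings = rng.normal(size=(n_mols, 32))
y = (mol_features[:, 0] > 0).astype(int)

# Duo pipeline: concatenate the two feature blocks, then fit one classifier.
X_duo = np.hstack([mol_features, llm_embeddings])
clf = LogisticRegression(max_iter=1000).fit(X_duo, y)
print(f"train accuracy: {clf.score(X_duo, y):.2f}")
```

<p>The Trio variant follows the same pattern, except the joint features feed a GNN alongside the molecular graph rather than a flat classifier.</p>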
<h2 id="experimental-setup-across-six-ogb-benchmarks">Experimental Setup Across Six OGB Benchmarks</h2>
<h3 id="datasets">Datasets</h3>
<p>The study uses six molecular property prediction datasets from OGB and <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Avg. Nodes</th>
          <th>Avg. Edges</th>
          <th>Task Type</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ogbg-molbace</td>
          <td>1,513</td>
          <td>34.1</td>
          <td>73.7</td>
          <td>Binary classification (<a href="https://en.wikipedia.org/wiki/Beta-secretase_1">BACE-1</a> inhibition)</td>
      </tr>
      <tr>
          <td>ogbg-molbbbp</td>
          <td>2,039</td>
          <td>24.1</td>
          <td>51.9</td>
          <td>Binary classification (<a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a> penetration)</td>
      </tr>
      <tr>
          <td>ogbg-molhiv</td>
          <td>41,127</td>
          <td>25.5</td>
          <td>27.5</td>
          <td>Binary classification (HIV inhibition)</td>
      </tr>
      <tr>
          <td>ogbg-molesol</td>
          <td>1,128</td>
          <td>13.3</td>
          <td>27.4</td>
          <td>Regression (water solubility)</td>
      </tr>
      <tr>
          <td>ogbg-molfreesolv</td>
          <td>642</td>
          <td>8.7</td>
          <td>16.8</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Hydration_energy">hydration free energy</a>)</td>
      </tr>
      <tr>
          <td>ogbg-mollipo</td>
          <td>4,200</td>
          <td>27.0</td>
          <td>59.0</td>
          <td>Regression (<a href="https://en.wikipedia.org/wiki/Lipophilicity">lipophilicity</a>)</td>
      </tr>
  </tbody>
</table>
<p>Classification tasks are evaluated by <a href="https://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC-AUC</a> (higher is better) and regression tasks by RMSE (lower is better).</p>
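<p>Both metrics are standard; a quick scikit-learn sketch with toy arrays (not data from the paper):</p>

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Classification: ROC-AUC over predicted scores (higher is better).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_score))  # 3 of 4 pos/neg pairs ranked correctly -> 0.75

# Regression: RMSE (lower is better).
y_reg = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
rmse = mean_squared_error(y_reg, y_pred) ** 0.5
print(rmse)
```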
<h3 id="models-compared">Models Compared</h3>
<ul>
<li><strong>LLMs</strong>: GPT-3.5 (primary), GPT-4, Llama-2-7b, Llama-2-13b, all used as black-box APIs with fixed parameters</li>
<li><strong>Language Model</strong>: DeBERTa, fine-tuned on SMILES strings</li>
<li><strong>GNNs</strong>: GCN and GIN, trained on geometric molecular structure</li>
</ul>
<h3 id="key-results-llms-alone-vs-ml-models">Key Results: LLMs Alone vs. ML Models</h3>
<p>The paper presents five main observations:</p>
<p><strong>Observation 1: GPT models outperform Llama models on molecule tasks.</strong> On the ogbg-molhiv dataset, GPT-3.5 and GPT-4 consistently outperform Llama-2-7b and Llama-2-13b across all prompt variants. GPT-4 offers marginal improvement over GPT-3.5 at 20x the cost and 10x the latency, so GPT-3.5 is used as the default LLM.</p>
<p><strong>Observation 2: LLMs lag behind ML models across all datasets.</strong> Across all six datasets, LLM-based approaches underperform compared to DeBERTa, GCN, and GIN. For example, on ogbg-molhiv, the best LLM achieves 0.5892 ROC-AUC (IP prompt) compared to GIN&rsquo;s 0.7601. On regression tasks, the gap is even larger: GIN achieves 0.9555 RMSE on ogbg-molesol versus the best LLM&rsquo;s 1.9963.</p>
<p><strong>Observation 3: Text descriptions of molecular geometry do not help LLMs.</strong> Adding structural descriptions (the &ldquo;D&rdquo; variants of prompts) generally degrades LLM performance and reduces response consistency. The additional tokens from structure descriptions appear to introduce noise rather than useful geometric information.</p>
<p><strong>Observation 4: Geometric structure is critical for molecular prediction.</strong> GNN models that directly process molecular graphs substantially outperform both LLMs and text-based language models, confirming that geometric information is essential for accurate property prediction.</p>
<p><strong>Observation 5: LLMs can augment ML models effectively.</strong> When LLM responses are used as additional features for GNN models (Duo and Trio pipelines), several configurations show improvements. For example, on ogbg-molbace, GCN with FS-2 augmentation achieves 0.7903 test ROC-AUC versus baseline GCN&rsquo;s 0.7147. GIN with SMILES features (Duo pipeline) achieves 0.7837 on ogbg-molhiv versus the baseline GIN&rsquo;s 0.7601.</p>
<h3 id="response-consistency">Response Consistency</h3>
<p>The study also measures response consistency, defined as the fraction of LLM responses conforming to the required output format. Adding descriptions to prompts reduces consistency, and few-shot prompts generally improve consistency over zero-shot variants.</p>
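<p>Response consistency is just a fraction of format-conforming outputs; the particular <code>Answer: True/False</code> template in this sketch is an assumed example format, not the paper's exact one:</p>

```python
import re

def response_consistency(responses, pattern=r"^Answer:\s*(True|False)\s*$"):
    """Fraction of LLM responses matching the required output format.

    The 'Answer: True/False' template is an assumed example format.
    """
    valid = re.compile(pattern)
    ok = sum(1 for r in responses if valid.match(r.strip()))
    return ok / len(responses)

responses = [
    "Answer: True",
    "Answer: False",
    "The molecule likely inhibits HIV because ...",  # non-conforming
    "Answer: True",
]
print(response_consistency(responses))  # 3 of 4 conform -> 0.75
```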
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>LLMs are not competitive with specialized ML models for molecular property prediction when used directly, with GNNs maintaining clear advantages across all six benchmark datasets.</li>
<li>Converting molecular geometric structure to text descriptions is insufficient for conveying structural information to LLMs, as evidenced by degraded performance and reduced response consistency with description-augmented prompts.</li>
<li>LLMs show the most promise as augmenters of existing ML models rather than as standalone predictors, with the Duo and Trio pipelines yielding improvements over Solo baselines in many configurations.</li>
<li>Among LLMs, GPT-3.5 offers the best cost-performance tradeoff for molecule tasks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<ul>
<li>The study is limited to black-box API access with fixed LLM parameters. Fine-tuning or parameter-efficient adaptation (e.g., LoRA) was not explored due to computational constraints and API limitations.</li>
<li>Advanced prompting techniques (Chain-of-Thought, Tree-of-Thought, Graph-of-Thought, RAG) were tested in preliminary experiments but performed worse, which the authors attribute to the difficulty of designing proper reasoning chains for molecular property prediction.</li>
<li>Only six datasets from OGB/MoleculeNet are evaluated. Other molecular tasks (e.g., reaction prediction, retrosynthesis) are not covered.</li>
<li>The evaluation uses a single random seed for LLM queries, and the stochastic nature of LLM outputs means results may vary across runs.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify three promising avenues: (1) developing methods to better incorporate molecular geometric structure into LLM inputs, (2) designing more sophisticated frameworks for integrating LLMs with traditional ML models, and (3) training domain-specialized chemistry LLMs that can reduce hallucinations in chemical reasoning.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molbace</td>
          <td>1,513 molecules</td>
          <td>Binary classification, BACE-1 inhibition</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molbbbp</td>
          <td>2,039 molecules</td>
          <td>Binary classification, BBB penetration</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molhiv</td>
          <td>41,127 molecules</td>
          <td>Binary classification, HIV inhibition</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molesol</td>
          <td>1,128 molecules</td>
          <td>Regression, water solubility</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-molfreesolv</td>
          <td>642 molecules</td>
          <td>Regression, hydration free energy</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ogbg-mollipo</td>
          <td>4,200 molecules</td>
          <td>Regression, lipophilicity</td>
      </tr>
  </tbody>
</table>
<p>All datasets use standard OGB scaffold splits.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Zero-shot prompts: IF, IP, IE (and description-augmented variants IFD, IPD, IED)</li>
<li>Few-shot prompts: FS-1, FS-2, FS-3</li>
<li>Solo/Duo/Trio integration pipelines for combining LLM outputs with ML models</li>
<li>DeBERTa fine-tuned on SMILES strings</li>
<li>GCN and GIN with OGB benchmark implementations</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-3.5 and GPT-4 via OpenAI API with default hyperparameters</li>
<li>Llama-2-7b and Llama-2-13b via HuggingFace</li>
<li>DeBERTa (DeBERTaV3)</li>
<li>GCN and GIN following OGB leaderboard implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification (molbace, molbbbp, molhiv)</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Regression (molesol, molfreesolv, mollipo)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Response consistency</td>
          <td>All tasks</td>
          <td>Fraction of format-conforming LLM outputs</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the paper. LLM experiments use API calls (OpenAI) and HuggingFace inference. GNN and DeBERTa training uses standard implementations from OGB benchmark leaderboards.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhiqiangzhongddu/LLMaMol">LLMaMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation with prompt templates and evaluation pipeline</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhong, Z., Zhou, K., &amp; Mottin, D. (2024). Benchmarking Large Language Models for Molecule Prediction Tasks. arXiv preprint arXiv:2403.05075.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhong2024benchmarking,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Benchmarking Large Language Models for Molecule Prediction Tasks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhong, Zhiqiang and Zhou, Kuangyu and Mottin, Davide}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2403.05075}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2403.05075}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Benchmarking Chemistry Knowledge in Code-Gen LLMs</title><link>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm-chemistry-code-assessment/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/llm-applications/llm-chemistry-code-assessment/</guid><description>Benchmarking code-generating LLMs on 84 chemistry tasks spanning general chemistry, biochemistry, and computational chemistry with prompt engineering analysis.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: White, A. D., Hocky, G. M., Gandhi, H. A., Ansari, M., Cox, S., Wellawatte, G. P., Sasmal, S., Yang, Z., Liu, K., Singh, Y., &amp; Peña Ccoa, W. J. (2023). Assessment of chemistry knowledge in large language models that generate code. <em>Digital Discovery</em>, 2(2), 368-376. <a href="https://doi.org/10.1039/d2dd00087c">https://doi.org/10.1039/d2dd00087c</a></p>
<p><strong>Publication</strong>: Digital Discovery 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ur-whitelab/nlcc-data">nlcc-data benchmark repository</a></li>
<li><a href="https://ur-whitelab.github.io/nlcc-data/">Evaluation completions website</a></li>
<li><a href="https://doi.org/10.5281/zenodo.6800475">Zenodo evaluation data (DOI: 10.5281/zenodo.6800475)</a></li>
</ul>
<h2 id="benchmarking-chemistry-knowledge-in-code-generating-llms">Benchmarking Chemistry Knowledge in Code-Generating LLMs</h2>
<p>This is an <strong>Empirical</strong> paper that evaluates code-generating large language models on chemistry tasks. The primary contribution is a categorized benchmark of 84 chemistry problems across 10 topics, along with a systematic evaluation of several LLMs (Codex cushman, Codex davinci, text-davinci-003, InCoder, CodeGen) on these tasks. The paper also provides practical guidance on prompt engineering strategies that improve accuracy.</p>
<h2 id="why-evaluate-llms-on-chemistry-coding-tasks">Why Evaluate LLMs on Chemistry Coding Tasks</h2>
<p>As of late 2022, LLMs trained on code (such as Codex and InCoder) had become widely available through tools like GitHub Copilot and Tabnine. An open question was whether these general-purpose code models contained sufficient domain knowledge to solve chemistry problems expressed as coding tasks. Chemistry has specialized language, equations, and conventions (e.g., <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> notation, thermodynamic relationships, molecular simulation methods) that may not be well-represented in general code training data. Prior work had shown that knowledge of the periodic table requires very high parameter counts, but the broader extent of chemistry knowledge in code LLMs was unexplored.</p>
<p>The authors sought to answer a specific question: do code-generating LLMs &ldquo;know&rdquo; chemistry? This means evaluating whether LLMs can correlate natural language descriptions of chemistry problems with correct code implementations, including proper equations, units, and use of domain-specific libraries.</p>
<h2 id="benchmark-design-and-prompt-engineering-strategies">Benchmark Design and Prompt Engineering Strategies</h2>
<p>The benchmark covers 10 topic categories:</p>
<table>
  <thead>
      <tr>
          <th>Topic</th>
          <th>Abbreviation</th>
          <th>N</th>
          <th>Expert-only</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Biochemistry</td>
          <td>bio</td>
          <td>13</td>
          <td>2</td>
      </tr>
      <tr>
          <td>Cheminformatics</td>
          <td>cheminf</td>
          <td>10</td>
          <td>0</td>
      </tr>
      <tr>
          <td>General chemistry</td>
          <td>genchem</td>
          <td>11</td>
          <td>0</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-simulation/">Molecular dynamics</a></td>
          <td>md</td>
          <td>11</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Plotting</td>
          <td>plot</td>
          <td>10</td>
          <td>10</td>
      </tr>
      <tr>
          <td>Quantum mechanics</td>
          <td>qm</td>
          <td>8</td>
          <td>3</td>
      </tr>
      <tr>
          <td>Simulation methods</td>
          <td>sim</td>
          <td>8</td>
          <td>5</td>
      </tr>
      <tr>
          <td>Spectroscopy</td>
          <td>spect</td>
          <td>11</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Statistics</td>
          <td>stats</td>
          <td>11</td>
          <td>1</td>
      </tr>
      <tr>
          <td>Thermodynamics</td>
          <td>thermo</td>
          <td>10</td>
          <td>0</td>
      </tr>
  </tbody>
</table>
<p>Each task is formatted as a Python function with a docstring describing the expected behavior. The LLM must generate a completion that passes automated unit tests. Of the 84 total prompts, 25 require expert evaluation (e.g., plotting tasks) where automated testing is insufficient.</p>
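<p>A hypothetical task in this format might look like the following sketch. The function name, docstring, and unit test here are invented for illustration; actual benchmark prompts differ, but the shape (signature plus docstring shown to the model, completion graded by tests) is the same.</p>

```python
# Hypothetical benchmark task: the model sees PROMPT and must generate the body.
PROMPT = '''def henderson_hasselbalch(pka: float, ratio: float) -> float:
    """Return the buffer pH given the acid's pKa and the ratio
    [A-]/[HA] of conjugate base to acid."""
'''

# A plausible model completion (indented function body):
COMPLETION = """    import math
    return pka + math.log10(ratio)
"""

# Automated grading: assemble the function and run unit tests against it.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
assert abs(namespace["henderson_hasselbalch"](4.76, 1.0) - 4.76) < 1e-9
```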
<p>The key prompt engineering insight is the use of &ldquo;contexts,&rdquo; which are code prepended before prompts. The authors tested several context strategies:</p>
<ul>
<li><strong>Custom context</strong>: Topic-specific imports (e.g., <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a> for cheminformatics) plus a one-line completion example to teach the model how to signal the end of output.</li>
<li><strong>Insert context</strong>: Uses model infilling capabilities instead of completion-based generation. Available for davinci and InCoder.</li>
<li><strong>Copyright context</strong>: Adding a copyright notice at the top of the file, which conditions the model toward higher-quality code patterns.</li>
<li><strong>Authority context</strong>: Adding &ldquo;This is written by an expert Python programmer.&rdquo;</li>
</ul>
<p>The copyright notice improved accuracy at higher temperatures. The intuition is that copyrighted code in training data tends to be higher quality, so the notice acts similarly to lowering the temperature. Because the best model/temperature combination (davinci at T=0.05) already sampled near-deterministically, the copyright trick did not improve it further.</p>
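<p>Mechanically, each context strategy amounts to prepending a string to the task prompt. A minimal sketch (the context wording below paraphrases the strategies described above and is not the benchmark&rsquo;s exact text):</p>

```python
# Illustrative contexts; exact strings in the benchmark may differ.
CONTEXTS = {
    "copyright": "# Copyright (c) 2022. All rights reserved.\n",
    "authority": "# This is written by an expert Python programmer.\n",
    "custom": "from rdkit import Chem  # topic-specific imports\n",
}

def build_prompt(task_prompt, context=None):
    """Prepend the selected context (if any) to the task prompt."""
    prefix = CONTEXTS.get(context, "")
    return prefix + task_prompt

print(build_prompt('def mol_weight(smiles):\n    """..."""\n', "copyright"))
```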
<h2 id="experimental-setup-models-sampling-and-expert-evaluation">Experimental Setup: Models, Sampling, and Expert Evaluation</h2>
<h3 id="models-evaluated">Models evaluated</h3>
<p>The study compared five models, all decoder-only architectures:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Abbreviation</th>
          <th>Parameters</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>code-cushman-001</td>
          <td>cushman</td>
          <td>12B</td>
          <td>OpenAI (GPT-3 fine-tuned on code)</td>
      </tr>
      <tr>
          <td>code-davinci-002</td>
          <td>davinci</td>
          <td>~175B (estimated)</td>
          <td>OpenAI (GPT-3.5 class)</td>
      </tr>
      <tr>
          <td>text-davinci-003</td>
          <td>davinci3</td>
          <td>~175B (estimated)</td>
          <td>OpenAI (RLHF-adapted from davinci)</td>
      </tr>
      <tr>
          <td>InCoder</td>
          <td>incoder</td>
          <td>6B</td>
          <td>Fried et al. 2022</td>
      </tr>
      <tr>
          <td>CodeGen</td>
          <td>codegen</td>
          <td>16B</td>
          <td>Nijkamp et al. 2022</td>
      </tr>
  </tbody>
</table>
<h3 id="sampling-and-evaluation">Sampling and evaluation</h3>
<p>Completions were generated using top-k sampling (k=5) at three temperatures: T=0.05, 0.2, and 0.5. For InCoder-6B, GPU memory limited sampling to k=1. Error bars in all reported results are 95% confidence intervals from <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap resampling</a> across top-k samples.</p>
<p>Accuracy was defined following the HumanEval approach: a completion is correct if the code runs and passes unit tests, regardless of whether it matches a reference implementation.</p>
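<p>The pass-rate accuracy and its error bars can be sketched with a plain bootstrap over per-completion pass/fail outcomes (a generic stdlib implementation, not the authors&rsquo; code):</p>

```python
import random
import statistics

def bootstrap_ci(passes, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for pass-rate accuracy via bootstrap resampling.

    `passes` is a list of 0/1 outcomes (unit tests failed/passed), one per
    sampled completion, mirroring the HumanEval-style grading described above.
    """
    rng = random.Random(seed)
    n = len(passes)
    means = sorted(
        statistics.fmean(rng.choices(passes, k=n)) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g. 7 of 10 completions passed their unit tests:
low, high = bootstrap_ci([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
assert low <= 0.7 <= high
```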
<h3 id="expert-evaluation">Expert evaluation</h3>
<p>Nine co-authors (postdoctoral scholars and Ph.D. students) performed 650 evaluations of davinci completions through a web interface. Each completion was scored on a 5-point scale: Perfect (5), Correct but not perfect (4), Runs and is almost correct (3), Does not run but is almost correct (2), Far from correct (1). Expert-evaluated accuracy counted only &ldquo;Perfect&rdquo; and &ldquo;Correct but not perfect&rdquo; as correct.</p>
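<p>Collapsing the 5-point expert scale to the binary accuracy used in the paper is a one-liner; a small sketch:</p>

```python
# 5-point expert scale as described above; scores 4 and 5 count as correct.
SCALE = {
    5: "Perfect",
    4: "Correct but not perfect",
    3: "Runs and is almost correct",
    2: "Does not run but is almost correct",
    1: "Far from correct",
}

def expert_accuracy(scores):
    """Fraction of completions rated 4 or 5."""
    return sum(s >= 4 for s in scores) / len(scores)

assert expert_accuracy([5, 4, 3, 1]) == 0.5
```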
<h3 id="key-results-by-topic-and-model">Key results by topic and model</h3>
<table>
  <thead>
      <tr>
          <th>Topic</th>
          <th>incoder</th>
          <th>codegen</th>
          <th>davinci</th>
          <th>davinci3</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>bio</td>
          <td>0%</td>
          <td>29%</td>
          <td>43%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td>cheminf</td>
          <td>20%</td>
          <td>20%</td>
          <td>50%</td>
          <td>50%</td>
      </tr>
      <tr>
          <td>genchem</td>
          <td>29%</td>
          <td>86%</td>
          <td>86%</td>
          <td>86%</td>
      </tr>
      <tr>
          <td>md</td>
          <td>0%</td>
          <td>13%</td>
          <td>63%</td>
          <td>88%</td>
      </tr>
      <tr>
          <td>qm</td>
          <td>20%</td>
          <td>60%</td>
          <td>100%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>sim</td>
          <td>0%</td>
          <td>0%</td>
          <td>100%</td>
          <td>100%</td>
      </tr>
      <tr>
          <td>spect</td>
          <td>30%</td>
          <td>20%</td>
          <td>50%</td>
          <td>40%</td>
      </tr>
      <tr>
          <td>stats</td>
          <td>40%</td>
          <td>80%</td>
          <td>70%</td>
          <td>60%</td>
      </tr>
      <tr>
          <td>thermo</td>
          <td>10%</td>
          <td>10%</td>
          <td>80%</td>
          <td>70%</td>
      </tr>
      <tr>
          <td><strong>total</strong></td>
          <td><strong>17%</strong></td>
          <td><strong>35%</strong></td>
          <td><strong>72%</strong></td>
          <td><strong>75%</strong></td>
      </tr>
  </tbody>
</table>
<p>All accuracies reported use the best context for each model (copyright for incoder-6B, authority for codegen-16B, insert for davinci) at T=0.2.</p>
<h2 id="findings-llms-know-chemistry-with-caveats">Findings: LLMs Know Chemistry, With Caveats</h2>
<p>The central finding is that code-generating LLMs do contain substantial chemistry knowledge. The best model (davinci) achieved 72% overall accuracy, with prompt engineering contributing approximately 30 percentage points to this figure. The text-davinci-003 model, which was fine-tuned with RLHF, achieved 75% and showed reduced sensitivity to prompt engineering, suggesting that human feedback alignment partially subsumes the benefits of manual prompt design.</p>
<h3 id="strengths-and-successful-domains">Strengths and successful domains</h3>
<ul>
<li><strong>Quantum mechanics and simulation</strong>: davinci achieved 100% on both categories, indicating strong knowledge of computational chemistry equations and simulation patterns.</li>
<li><strong>General chemistry</strong>: All models except InCoder performed well (86%), suggesting that general chemistry concepts are well-represented in code training data.</li>
<li><strong>Molecular structure generation</strong>: InstructGPT showed some ability to connect natural language descriptions with SMILES strings, generating valid (though not exact) molecular structures from prompts like &ldquo;a phenol derivative.&rdquo;</li>
</ul>
<h3 id="limitations-and-failure-modes">Limitations and failure modes</h3>
<ul>
<li><strong>Lack of reasoning</strong>: The authors emphasize that LLMs demonstrate knowledge correlation, not reasoning. Davinci frequently uses &ldquo;relativistic <a href="https://en.wikipedia.org/wiki/Hartree%E2%80%93Fock_method">Hartree-Fock</a>&rdquo; for any prompt requesting a &ldquo;highly accurate&rdquo; quantum calculation, because it has memorized the association between &ldquo;relativistic&rdquo; and &ldquo;accurate&rdquo; rather than understanding the underlying chemistry.</li>
<li><strong>Hallucinated functions</strong>: When given difficult prompts (e.g., &ldquo;return the <a href="https://en.wikipedia.org/wiki/Residual_dipolar_coupling">residual dipolar couplings</a> given a SMILES string&rdquo;), the model invents non-existent functions like <code>MolToRDC</code>.</li>
<li><strong>API version mismatches</strong>: Many errors in the molecular dynamics category stem from the model using outdated function signatures for packages like MDTraj, likely reflecting the training data cutoff.</li>
<li><strong>Expert-evaluated accuracy is lower</strong>: On topics requiring expert evaluation (generally harder tasks), accuracy drops, and it correlates negatively with perceived difficulty.</li>
</ul>
<h3 id="practical-recommendations">Practical recommendations</h3>
<p>The paper offers several practical tips for using code LLMs in chemistry:</p>
<ol>
<li>Use correctly spelled, precise prompts. If a function should &ldquo;return&rdquo; a value, use the word &ldquo;return&rdquo; rather than &ldquo;compute.&rdquo;</li>
<li>Be explicit about what variables represent (e.g., specify that k is a spring constant, not Boltzmann&rsquo;s constant).</li>
<li>Import only the packages you intend to use, as the model will attempt to use all imported libraries.</li>
<li>Adding a copyright notice or &ldquo;expert programmer&rdquo; statement can improve accuracy, though RLHF-trained models are less sensitive to this.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>nlcc-data benchmark</td>
          <td>84 prompts across 10 chemistry topics</td>
          <td>Open source, community-extensible</td>
      </tr>
      <tr>
          <td>Expert evaluation</td>
          <td>Human evaluations CSV</td>
          <td>650 evaluations</td>
          <td>Available in Supporting Information</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Evaluation uses automated unit testing for 59 of 84 prompts. Expert evaluation covers the remaining 25 prompts through a web-based scoring interface. Five completions per prompt were generated via top-k sampling at three temperatures.</p>
<h3 id="models">Models</h3>
<p>All models evaluated are external (OpenAI API for Codex/davinci, HuggingFace for InCoder/CodeGen). No new models were trained. Python version and packages were pinned to June 2021 to avoid library changes influencing results.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Accuracy is binary: a completion passes all unit tests (1.0) or fails (0.0), averaged across top-k samples and temperatures. Expert evaluation uses a 5-point scale collapsed to binary (Perfect or Correct = 1.0).</p>
<h3 id="hardware">Hardware</h3>
<p>GPU memory limitations are mentioned for InCoder-6B (sampling restricted to k=1 rather than k=5). No other hardware details are specified.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/nlcc-data">nlcc-data benchmark</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Open-source benchmark prompts and solutions</td>
      </tr>
      <tr>
          <td><a href="https://ur-whitelab.github.io/nlcc-data/">Evaluation website</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Web interface showing completions</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.6800475">Zenodo evaluation data</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Expert evaluation completions in HTML</td>
      </tr>
      <tr>
          <td><a href="https://pubs.rsc.org/en/content/articlepdf/2023/dd/d2dd00087c">Paper (open access)</a></td>
          <td>Other</td>
          <td>CC-BY-NC</td>
          <td>Published article</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{white2023assessment,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Assessment of chemistry knowledge in large language models that generate code}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{White, Andrew D. and Hocky, Glen M. and Gandhi, Heta A. and Ansari, Mehrad and Cox, Sam and Wellawatte, Geemi P. and Sasmal, Subarna and Yang, Ziyue and Liu, Kangxin and Singh, Yuvraj and Peña Ccoa, Willmor J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{368--376}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/d2dd00087c}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AMORE: Testing ChemLLM Robustness to SMILES Variants</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-representations/encoders/amore-smiles-robustness-framework/</guid><description>AMORE is a zero-shot framework testing whether chemical language models recognize equivalent SMILES of the same molecule via embedding retrieval.</description><content:encoded><![CDATA[<h2 id="an-empirical-framework-for-probing-chemical-understanding">An Empirical Framework for Probing Chemical Understanding</h2>
<p>This is an <strong>Empirical</strong> paper that introduces Augmented Molecular Retrieval (AMORE), a zero-shot evaluation framework for chemical language models (ChemLMs). The primary contribution is a method to assess whether ChemLMs have learned genuine molecular semantics or simply memorize textual patterns. Rather than relying on traditional NLP metrics like BLEU and ROUGE, AMORE tests whether a model&rsquo;s embedding space treats chemically equivalent <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> representations as similar. The authors evaluate 12 models across multiple architectures (encoder-only, encoder-decoder, decoder-only) on two datasets and five augmentation types, and extend the analysis to downstream MoleculeNet tasks.</p>
<h2 id="why-standard-nlp-metrics-fail-for-chemical-evaluation">Why Standard NLP Metrics Fail for Chemical Evaluation</h2>
<p>Chemical language models are typically evaluated using text-based metrics from NLP (BLEU, ROUGE, METEOR) on tasks like molecule captioning. These metrics compare word overlap and sentence fluency but cannot detect whether a model truly understands molecular structure. A SMILES string like <code>C(=O)O</code> and its canonicalized or kekulized form represent the same molecule, yet text-based metrics would penalize valid reformulations. Embedding-based metrics like BERTScore are also insufficient because they were trained on general text, not chemical notation.</p>
<p>The core research question is direct: do evaluation metrics used on ChemLMs reflect actual chemical knowledge, or do the models simply imitate understanding by learning textual features? This question has practical consequences in pharmaceuticals and healthcare, where missteps in chemical reasoning carry serious risks.</p>
<h2 id="embedding-based-retrieval-as-a-chemical-litmus-test">Embedding-Based Retrieval as a Chemical Litmus Test</h2>
<p>AMORE exploits a fundamental property of molecular representations: a single molecule can be written as multiple valid SMILES strings that are chemically identical. These serve as &ldquo;total synonyms,&rdquo; a concept without a true analogue in natural language.</p>
<p>The framework works in four steps:</p>
<ol>
<li>Take a set $X = (x_1, x_2, \ldots, x_n)$ of $n$ molecular representations.</li>
<li>Apply a transformation $f$ to obtain augmented representations $X' = (x'_1, x'_2, \ldots, x'_n)$, where $x'_i = f(x_i)$. The constraint is that $f$ must not change the underlying molecule.</li>
<li>Obtain vectorized embeddings $e(x_i)$ and $e(x'_j)$ from the model for each original and augmented SMILES.</li>
<li>Evaluate in a retrieval task: given $e(x_i)$, retrieve $e(x&rsquo;_i)$ from the augmented set.</li>
</ol>
<p>The evaluation metrics are top-$k$ accuracy (whether the correct augmented SMILES ranks at position $\leq k$) and <a href="https://en.wikipedia.org/wiki/Mean_reciprocal_rank">Mean Reciprocal Rank</a> (MRR). Retrieval uses <a href="https://en.wikipedia.org/wiki/FAISS">FAISS</a> for efficient nearest-neighbor search. The key insight is that if a model truly understands molecular structure, it should embed different SMILES representations of the same molecule close together.</p>
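<p>These steps reduce to a nearest-neighbor ranking over embeddings. A toy stand-in for the FAISS search (brute-force cosine ranking over small, invented embedding lists; the paper&rsquo;s implementation differs in scale and indexing):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two small dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def retrieval_metrics(originals, augmented, k=1):
    """Acc@k and MRR for retrieving each e(x'_i) given e(x_i)."""
    hits, rr = 0, 0.0
    for i, q in enumerate(originals):
        # Rank all augmented embeddings by similarity to the query.
        ranked = sorted(range(len(augmented)),
                        key=lambda j: -cosine(q, augmented[j]))
        rank = ranked.index(i) + 1  # position of the true match
        hits += rank <= k
        rr += 1.0 / rank
    n = len(originals)
    return hits / n, rr / n

# Identical embeddings retrieve perfectly:
acc1, mrr = retrieval_metrics([[1, 0], [0, 1]], [[1, 0], [0, 1]])
assert acc1 == 1.0 and mrr == 1.0
```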
<h3 id="five-smiles-augmentation-types">Five SMILES Augmentation Types</h3>
<p>The framework uses five identity-preserving augmentations, all executed through <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>:</p>
<ol>
<li><strong>Canonicalization</strong>: Transform SMILES to the standardized RDKit canonical form.</li>
<li><strong>Hydrogen addition</strong>: Explicitly add hydrogen atoms that are normally implied (e.g., <code>C</code> becomes <code>[CH4]</code>). This dramatically increases string length.</li>
<li><strong>Kekulization</strong>: Convert aromatic ring notation to explicit alternating double bonds.</li>
<li><strong>Cycle renumbering</strong>: Replace ring-closure digit identifiers with random valid alternatives.</li>
<li><strong>Random atom order</strong>: Randomize the atom traversal order used to generate the SMILES string.</li>
</ol>
<h2 id="twelve-models-two-datasets-five-augmentations">Twelve Models, Two Datasets, Five Augmentations</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>The authors test 12 publicly available Transformer-based models spanning three architecture families:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Domain</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Text+Chem T5-standard</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>Text+Chem T5-augm</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>Cross-modal</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>MolT5-large</td>
          <td>Cross-modal</td>
          <td>770M</td>
      </tr>
      <tr>
          <td>SciFive</td>
          <td>Text-only</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>PubChemDeBERTa</td>
          <td>Chemical</td>
          <td>86M</td>
      </tr>
      <tr>
          <td>ChemBERT-ChEMBL</td>
          <td>Chemical</td>
          <td>6M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/chemberta/">ChemBERTa</a></td>
          <td>Chemical</td>
          <td>125M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/encoders/bartsmiles-molecular-representations/">BARTSmiles</a></td>
          <td>Chemical</td>
          <td>400M</td>
      </tr>
      <tr>
          <td>ZINC-RoBERTa</td>
          <td>Chemical</td>
          <td>102M</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-representations/multimodal/nach0-multimodal-chemical-language-model/">nach0</a></td>
          <td>Chemical</td>
          <td>220M</td>
      </tr>
      <tr>
          <td>ZINC-GPT</td>
          <td>Chemical</td>
          <td>87M</td>
      </tr>
  </tbody>
</table>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>ChEBI-20 test set</strong>: ~3,300 molecule-description pairs, used for both AMORE retrieval and molecule captioning comparisons.</li>
<li><strong>Isomers</strong> (<a href="/notes/chemistry/datasets/qm9/">QM9</a> subset): 918 molecules that are all isomers of C9H12N2O, making retrieval harder because all molecules share the same molecular formula.</li>
</ul>
<h3 id="key-results-on-chebi-20">Key Results on ChEBI-20</h3>
<p>On the ChEBI-20 dataset (Table 2 from the paper), top-1 accuracy varies enormously by augmentation type. Cycle renumbering is easiest (up to 98.48% Acc@1 for SciFive), while hydrogen addition is hardest (no model exceeds 5.97% Acc@1).</p>
<p>For the cross-modal Text+Chem T5-standard model:</p>
<table>
  <thead>
      <tr>
          <th>Augmentation</th>
          <th>Acc@1</th>
          <th>Acc@5</th>
          <th>MRR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Canonical</td>
          <td>63.03</td>
          <td>82.76</td>
          <td>72.4</td>
      </tr>
      <tr>
          <td>Hydrogen</td>
          <td>5.46</td>
          <td>10.85</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>Kekulization</td>
          <td>76.76</td>
          <td>92.03</td>
          <td>83.8</td>
      </tr>
      <tr>
          <td>Cycle</td>
          <td>96.70</td>
          <td>99.82</td>
          <td>98.2</td>
      </tr>
      <tr>
          <td>Random</td>
          <td>46.94</td>
          <td>74.18</td>
          <td>59.33</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-on-isomers">Key Results on Isomers</h3>
<p>Performance drops substantially on the Isomers dataset, where all molecules share the same formula. The best Acc@1 for hydrogen augmentation is just 1.53% (MolT5-large). Even for the relatively easy cycle augmentation, top scores drop from the high 90s to the low 90s for most models, and some models (BARTSmiles: 41.83%) struggle considerably.</p>
<h3 id="downstream-moleculenet-impact">Downstream MoleculeNet Impact</h3>
<p>The authors also fine-tuned models on original <a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> training data and tested on augmented test sets across 9 tasks (regression, binary classification, multilabel classification). Results confirm that augmentations degrade downstream performance. For example, on ESOL regression, RMSE increased from 0.87 to 7.93 with hydrogen addition. Rankings computed using the Vote&rsquo;n&rsquo;Rank framework (using the <a href="https://en.wikipedia.org/wiki/Copeland%27s_method">Copeland rule</a>) show that hydrogen augmentation is the only one that substantially reshuffles model rankings; other augmentations preserve the original ordering.</p>
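<p>A minimal sketch of the Copeland rule used for the rankings, assuming the common wins-minus-losses formulation (Vote&rsquo;n&rsquo;Rank&rsquo;s exact aggregation may differ):</p>

```python
# Copeland rule: model A "beats" B if A outperforms B on more tasks than
# B outperforms A; each model's score is pairwise wins minus losses.
def copeland(scores):
    """scores: {model: [per-task metric, higher is better]} -> ranked models."""
    models = list(scores)
    points = {m: 0 for m in models}
    for i, a in enumerate(models):
        for b in models[i + 1:]:
            wins_a = sum(x > y for x, y in zip(scores[a], scores[b]))
            wins_b = sum(y > x for x, y in zip(scores[a], scores[b]))
            if wins_a > wins_b:
                points[a] += 1
                points[b] -= 1
            elif wins_b > wins_a:
                points[b] += 1
                points[a] -= 1
    return sorted(models, key=points.get, reverse=True)

ranking = copeland({"m1": [0.9, 0.8, 0.7],
                    "m2": [0.5, 0.9, 0.6],
                    "m3": [0.4, 0.3, 0.2]})
assert ranking == ["m1", "m2", "m3"]
```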
<h3 id="correlation-between-amore-and-captioning-metrics">Correlation Between AMORE and Captioning Metrics</h3>
<p>The differences in ROUGE/METEOR between original and augmented SMILES correlate with AMORE retrieval accuracy (Spearman correlation &gt; 0.7 with p-value = 0.003 for Acc@1). This validates AMORE as a proxy for predicting how augmentations will affect generation quality, without requiring labeled captioning data.</p>
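<p>A correlation check of this kind needs only rank correlation; a minimal tie-free Spearman sketch (generic implementation, not the authors&rsquo; code):</p>

```python
# Spearman correlation without tie handling: Pearson correlation of ranks.
# With distinct values, both rank vectors have identical variance, so the
# denominator simplifies to the variance of the ranks 1..n.
def spearman(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0] * len(vals)
        for rank, idx in enumerate(order):
            r[idx] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

assert spearman([1, 2, 3, 4], [10, 20, 30, 40]) == 1.0
assert spearman([1, 2, 3, 4], [40, 30, 20, 10]) == -1.0
```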
<h2 id="current-chemlms-learn-syntax-not-chemistry">Current ChemLMs Learn Syntax, Not Chemistry</h2>
<p>The central finding is that existing ChemLMs are not robust to identity-preserving SMILES augmentations. Several specific conclusions emerge:</p>
<ol>
<li>
<p><strong>Hydrogen augmentation is catastrophic</strong>: All models fail (&lt; 6% Acc@1 on ChEBI-20, &lt; 2% on Isomers). The authors attribute this to the near-complete absence of explicit hydrogen in pretraining data, creating a distribution shift.</p>
</li>
<li>
<p><strong>Cross-modal models outperform unimodal ones</strong>: Models trained on both text and SMILES (Text+Chem T5, MolT5) consistently achieve higher retrieval accuracy on four of five augmentations.</p>
</li>
<li>
<p><strong>Augmentation difficulty follows a consistent order</strong>: For all models, hydrogen is hardest, followed by random atom ordering, canonicalization, kekulization, and cycle renumbering (easiest), matching the per-augmentation accuracies reported above.</p>
</li>
<li>
<p><strong>Layer-wise analysis reveals instability</strong>: Retrieval accuracy across Transformer layers is correlated across augmentation types, suggesting that representations degrade at the same layers regardless of augmentation.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> partially explains difficulty</strong>: Hydrogen augmentation produces strings ~2x longer than originals (Levenshtein ratio of 1.49), but the low correlation between Levenshtein ratio and downstream metrics (ROUGE1 correlation of -0.05 for hydrogen) suggests string length alone does not explain the failure.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations. Only publicly available HuggingFace models were evaluated, excluding models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/chemformer/">Chemformer</a> and <a href="/notes/chemistry/molecular-representations/encoders/molformer/">Molformer</a> that lack HF checkpoints. The study focuses exclusively on SMILES sequences, not 3D molecular structures or other formats like <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>. The augmentation types, while representative, do not cover all possible identity transformations.</p>
<p>The authors suggest that AMORE could serve as a regularization tool during training, for example by using metric learning to encourage models to embed SMILES variants of the same molecule close together.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Retrieval evaluation</td>
          <td>ChEBI-20 test set</td>
          <td>3,300 molecules</td>
          <td>Standard benchmark for molecule captioning</td>
      </tr>
      <tr>
          <td>Retrieval evaluation</td>
          <td>Isomers (QM9 subset)</td>
          <td>918 molecules</td>
          <td>All isomers of C9H12N2O</td>
      </tr>
      <tr>
          <td>Downstream evaluation</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>Varies</td>
          <td>ESOL, Lipophilicity, FreeSolv, HIV, BBBP, BACE, Tox21, ToxCast, SIDER</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES augmentations via RDKit (canonicalization, hydrogen addition, kekulization, cycle renumbering, random atom ordering)</li>
<li>Nearest-neighbor retrieval using FAISS with L2, cosine, inner product, and HNSW metrics</li>
<li>Model ranking via Vote&rsquo;n&rsquo;Rank (Copeland rule) on MoleculeNet tasks</li>
</ul>
<h3 id="models">Models</h3>
<p>All 12 evaluated models are publicly available on HuggingFace. No custom model training was performed for the AMORE retrieval experiments. MoleculeNet experiments used standard fine-tuning on original training splits.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Acc@1</td>
          <td>Top-1 retrieval accuracy</td>
          <td>Primary AMORE metric</td>
      </tr>
      <tr>
          <td>Acc@5</td>
          <td>Top-5 retrieval accuracy</td>
          <td>Secondary AMORE metric</td>
      </tr>
      <tr>
          <td>MRR</td>
          <td>Mean Reciprocal Rank</td>
          <td>Average rank of correct match</td>
      </tr>
      <tr>
          <td>ROUGE-2</td>
          <td>Bigram overlap for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
      <tr>
          <td>METEOR</td>
          <td>MT evaluation metric for captioning</td>
          <td>Compared against AMORE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Computational resources from HPC facilities at HSE University. Specific GPU types and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ChemistryLLMs/AMORE">AMORE GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Framework code and evaluation data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ganeeva, V., Khrabrov, K., Kadurin, A., &amp; Tutubalina, E. (2025). Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework. <em>Journal of Cheminformatics</em>, 17(1). <a href="https://doi.org/10.1186/s13321-025-01079-0">https://doi.org/10.1186/s13321-025-01079-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ganeeva2025measuring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Measuring Chemical LLM robustness to molecular representations: a SMILES variation-based framework}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ganeeva, Veronika and Khrabrov, Kuzma and Kadurin, Artur and Tutubalina, Elena}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{17}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-025-01079-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>String Representations for Chemical Image Recognition</title><link>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</link><pubDate>Thu, 18 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/optical-structure-recognition/benchmarks/rajan-string-representations-2022/</guid><description>Ablation study comparing SMILES, DeepSMILES, SELFIES, and InChI for OCSR. SMILES achieves highest accuracy; SELFIES guarantees validity.</description><content:encoded><![CDATA[<h2 id="empirical-focus-and-resource-contributions">Empirical Focus and Resource Contributions</h2>
<p>This is an <strong>Empirical Paper</strong> ($\Psi_{\text{Empirical}}$) with a secondary contribution as a <strong>Resource Paper</strong> ($\Psi_{\text{Resource}}$).</p>
<p>It functions as a systematic ablation study, keeping the model architecture (EfficientNet-B3 + Transformer) constant while varying the input/output representation (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, DeepSMILES, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, <a href="/notes/chemistry/molecular-representations/notations/inchi-2013/">InChI</a>) to determine which format yields the best performance for Optical Chemical Structure Recognition (OCSR). It also contributes large-scale benchmarking datasets derived from ChEMBL and PubChem.</p>
<h2 id="the-syntax-challenge-in-chemical-image-recognition">The Syntax Challenge in Chemical Image Recognition</h2>
<p>Optical Chemical Structure Recognition (OCSR) is essential for extracting chemical information buried in scientific literature and patents. While deep learning offers a promising alternative to rule-based approaches, neural networks struggle with the syntax of standard chemical representations like SMILES. Specifically, the tokenization of SMILES strings (where ring closures and branches are marked by single characters potentially far apart in the sequence) creates learning difficulties for sequence-to-sequence models. Newer representations like DeepSMILES and SELFIES were developed to address these syntax issues, but their comparative performance in image-to-text tasks had not been rigorously benchmarked.</p>
<h2 id="isolating-string-representation-variables">Isolating String Representation Variables</h2>
<p>The core novelty is the <strong>comparative isolation of the string representation variable</strong> in an OCSR context. Previous approaches often selected a representation (usually SMILES) without validating if it was optimal for the learning task. This study specifically tests the hypothesis that syntax-robust representations (like SELFIES) improve deep learning performance compared to standard SMILES. It provides empirical evidence on the trade-off between <em>validity</em> (guaranteed by SELFIES) and <em>accuracy</em> (highest with SMILES).</p>
<h2 id="large-scale-image-to-text-translation-experiments">Large-Scale Image-to-Text Translation Experiments</h2>
<p>The authors performed a large-scale image-to-text translation experiment:</p>
<ul>
<li><strong>Task</strong>: Converting 2D chemical structure images into text strings.</li>
<li><strong>Data</strong>:
<ul>
<li><strong>ChEMBL</strong>: ~1.6M molecules, split into two datasets (with and without stereochemistry).</li>
<li><strong>PubChem</strong>: ~3M molecules, split similarly, to test performance scaling with data size.</li>
</ul>
</li>
<li><strong>Representations</strong>: The same chemical structures were converted into four formats: SMILES, DeepSMILES, SELFIES, and InChI.</li>
<li><strong>Metric</strong>: The models were evaluated on:
<ul>
<li><strong>Validity</strong>: Can the predicted string be decoded back to a molecule?</li>
<li><strong>Exact Match</strong>: Is the predicted string identical to the ground truth?</li>
<li><strong>Tanimoto Similarity</strong>: How chemically similar is the prediction to the ground truth (using PubChem fingerprints)? The similarity $\mathcal{T}$ between two molecular fingerprints $A$ and $B$ is calculated as:
$$ \mathcal{T}(A, B) = \frac{A \cdot B}{|A|^2 + |B|^2 - A \cdot B} $$</li>
</ul>
</li>
</ul>
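<p>For binary fingerprints, the Tanimoto coefficient above reduces to the number of shared on-bits over the number of bits set in either fingerprint. A minimal pure-Python sketch (the paper computes PubChem fingerprints via CDK; the bit sets below are purely illustrative):</p>

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two binary fingerprints,
    represented as sets of on-bit indices: |A ∩ B| / |A ∪ B|.
    For bit vectors this equals A·B / (|A|^2 + |B|^2 - A·B)."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints count as identical
    return len(a & b) / len(a | b)

# Toy example: two fingerprints sharing 2 of 4 total on-bits
fp_pred = {1, 5, 9}
fp_true = {1, 5, 17}
print(tanimoto(fp_pred, fp_true))  # 2 shared bits / 4 union bits = 0.5
```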
<h2 id="comparative-performance-and-validity-trade-offs">Comparative Performance and Validity Trade-offs</h2>
<ul>
<li><strong>SMILES is the most accurate</strong>: Contrary to the hypothesis that syntax-robust formats would learn better, SMILES consistently achieved the highest exact match accuracy (up to 88.62% on PubChem data) and average Tanimoto similarity (0.98). This is likely due to SMILES having shorter string lengths and fewer unique tokens compared to SELFIES.</li>
<li><strong>SELFIES guarantees validity</strong>: While slightly less accurate in direct translation, SELFIES achieved 100% structural validity (every prediction could be decoded), whereas SMILES predictions occasionally contained syntax errors.</li>
<li><strong>InChI is unsuitable</strong>: InChI performed markedly worse (approx. 64% exact match), largely because its strings are far longer (maximum lengths up to 273 characters).</li>
<li><strong>Stereochemistry adds difficulty</strong>: Including stereochemistry reduced accuracy across all representations due to increased token count and visual complexity.</li>
<li><strong>Recommendation</strong>: Use SMILES for maximum accuracy; use SELFIES if generating valid structures is the priority (e.g., generative tasks).</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study used curated subsets from ChEMBL and PubChem. Images were generated synthetically.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL (Dataset 1/2)</td>
          <td>~1.5M</td>
          <td>Filtered for MW &lt; 1500, specific elements (C,H,O,N,P,S,F,Cl,Br,I,Se,B).</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem (Dataset 3/4)</td>
          <td>~3.0M</td>
          <td>Same filtering rules, used to test scaling.</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Test Split</td>
          <td>~120k&ndash;250k</td>
          <td>Created using RDKit MaxMin algorithm to ensure chemical diversity.</td>
      </tr>
  </tbody>
</table>
<p><strong>Image Generation</strong>:</p>
<ul>
<li><strong>Tool</strong>: CDK Structure Diagram Generator (SDG).</li>
<li><strong>Specs</strong>: $300 \times 300$ pixels, rotated by random angles ($0-360^{\circ}$), saved as 8-bit PNG.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Tokenization Rules</strong> (Critical for replication):</p>
<ul>
<li><strong>SELFIES</strong>: Split at every <code>][</code> (e.g., <code>[C][N]</code> $\rightarrow$ <code>[C]</code>, <code>[N]</code>).</li>
<li><strong>SMILES / DeepSMILES</strong>: Regex-based splitting:
<ul>
<li>Every heavy atom (e.g., <code>C</code>, <code>N</code>).</li>
<li>Every parenthesis <code>(</code> and <code>)</code>.</li>
<li>Every bond symbol <code>=</code> and <code>#</code>.</li>
<li>Every single-digit number.</li>
<li>Everything inside square brackets <code>[]</code> is kept as a single token.</li>
</ul>
</li>
<li><strong>InChI</strong>: The prefix <code>InChI=1S/</code> was treated as a single token and removed during training, then re-added for evaluation.</li>
</ul>
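<p>The rules above can be sketched as small pure-Python tokenizers. The regex and the handling of two-letter elements (<code>Cl</code>, <code>Br</code>) are my reconstruction of the stated rules, not the authors' code:</p>

```python
import re

# SMILES/DeepSMILES: bracketed atoms stay whole; two-letter halogens
# (Cl, Br) must match before single letters; parentheses, bond symbols
# (= and #), and single digits become individual tokens.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|[A-Za-z]|[()=#]|\d")

def tokenize_smiles(s: str) -> list:
    return SMILES_TOKEN.findall(s)

def tokenize_selfies(s: str) -> list:
    # Split at every '][' boundary, keeping brackets on each token.
    return re.findall(r"\[[^\]]*\]", s)

print(tokenize_smiles("CC(=O)Oc1ccccc1"))  # aspirin-like fragment
print(tokenize_selfies("[C][N][=O]"))      # ['[C]', '[N]', '[=O]']
```

<p>Ordering <code>Br|Cl</code> before the single-letter alternative is essential; otherwise <code>Cl</code> would split into <code>C</code> + <code>l</code>.</p>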
<h3 id="models">Models</h3>
<p>The model follows the <strong>DECIMER</strong> architecture.</p>
<ul>
<li><strong>Encoder</strong>: EfficientNet-B3 (pre-trained with &ldquo;Noisy Student&rdquo; weights).
<ul>
<li>Output: Image feature vectors of shape $10 \times 10 \times 1536$.</li>
</ul>
</li>
<li><strong>Decoder</strong>: Transformer (similar to the &ldquo;Base&rdquo; model from <em>Attention Is All You Need</em>).
<ul>
<li>Layers: 4 encoder-decoder layers.</li>
<li>Attention Heads: 8.</li>
<li>Dimension ($d_{\text{model}}$): 512.</li>
<li>Feed-forward ($d_{\text{ff}}$): 2048.</li>
<li>Dropout: 10%.</li>
</ul>
</li>
<li><strong>Loss</strong>: Sparse categorical cross-entropy.</li>
<li><strong>Optimizer</strong>: Adam with custom learning rate scheduler.</li>
</ul>
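<p>The &ldquo;custom learning rate scheduler&rdquo; is presumably the warmup schedule from <em>Attention Is All You Need</em>; the sketch below uses that paper's default warmup of 4000 steps, which is an assumption rather than a value reported in this study:</p>

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    linear warmup for `warmup` steps, then inverse-sqrt decay."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# The rate peaks at step == warmup and decays afterwards.
assert transformer_lr(2000) < transformer_lr(4000)
assert transformer_lr(8000) < transformer_lr(4000)
```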
<h3 id="evaluation">Evaluation</h3>
<p>Metrics were calculated after converting all predictions back to standard SMILES.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Baseline (SMILES)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Identical Match</strong></td>
          <td>88.62% (PubChem)</td>
          <td>Strict character-for-character equality.</td>
      </tr>
      <tr>
          <td><strong>Valid Structure</strong></td>
          <td>99.78%</td>
          <td>SMILES had rare syntax errors; SELFIES achieved 100%.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto (Avg)</strong></td>
          <td>0.98</td>
          <td>Calculated using PubChem fingerprints via CDK.</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: Google Cloud TPUs (v3-8).</li>
<li><strong>Format</strong>: Data converted to TFRecords (128 image/text pairs per record) for TPU efficiency.</li>
<li><strong>Batch Size</strong>: 1024.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Kohulan/DECIMER_Short_Communication">DECIMER Short Communication</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Training and evaluation scripts (Python, Java)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5155037">Datasets on Zenodo</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>SMILES data and processing scripts</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Steinbeck, C., &amp; Zielesny, A. (2022). Performance of chemical structure string representations for chemical image recognition using transformers. <em>Digital Discovery</em>, 1(2), 84-90. <a href="https://doi.org/10.1039/D1DD00013F">https://doi.org/10.1039/D1DD00013F</a></p>
<p><strong>Publication</strong>: Digital Discovery 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://chemrxiv.org/doi/pdf/10.26434/chemrxiv-2021-7c9wf">ChemRxiv Preprint (PDF)</a></li>
<li><a href="https://github.com/Kohulan/DECIMER_Short_Communication">Official Code Repository</a></li>
<li><a href="https://doi.org/10.5281/zenodo.5155037">Data on Zenodo</a></li>
<li>Related work: <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer/">DECIMER</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/decimer-1.0/">DECIMER 1.0</a>, <a href="/notes/chemistry/optical-structure-recognition/image-to-sequence/img2smi/">IMG2SMI</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanPerformanceChemicalStructure2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Performance of Chemical Structure String Representations for Chemical Image Recognition Using Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Steinbeck, Christoph and Zielesny, Achim}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{84--90}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1039/D1DD00013F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>