<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Autoregressive Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/</link><description>Recent content in Autoregressive Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/index.xml" rel="self" type="application/rss+xml"/><item><title>LSTM Neural Network for Drug-Like Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</guid><description>An LSTM neural network trained on 509K ChEMBL SMILES generates one million novel drug-like molecules with realistic substructures and bioactivity profiles.</description><content:encoded><![CDATA[<h2 id="an-early-method-for-lstm-based-molecular-generation">An Early Method for LSTM-Based Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that applies character-level LSTM networks to the task of de novo drug-like molecule generation. The primary contribution is demonstrating that an LSTM trained on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings from a large bioactive compound database (ChEMBL) can produce novel, diverse molecules whose chemical properties closely match those of known drug-like compounds. The paper also validates the generated molecules through virtual screening with profile QSAR models, showing comparable predicted bioactivity to the training set.</p>
<h2 id="the-challenge-of-exploring-drug-like-chemical-space">The Challenge of Exploring Drug-Like Chemical Space</h2>
<p>The theoretical space of drug-like molecules is astronomically large. Brute-force enumeration approaches such as <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (which catalogued 166 billion molecules) are feasible only for small molecules, and full enumeration of molecules with 25-30 heavy atoms (the typical size of drug molecules) remains computationally intractable. Traditional cheminformatics approaches to sampling this space rely on fragment combination, evolutionary algorithms, or particle swarm optimization.</p>
<p>The authors position LSTM networks as a viable alternative. LSTMs had already demonstrated the ability to learn sequential structure in domains like text and music generation, making them natural candidates for learning SMILES grammar and generating novel valid molecular strings. At the time of writing (late 2017), several groups were exploring this direction, including Bjerrum and Threlfall (ZINC-based generation), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> (VAE-based latent space design), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">Olivecrona et al.</a> (RL-guided generation), and Segler et al. (focused library design). This paper contributes a large-scale empirical study with detailed analysis of the generated molecules&rsquo; chemical quality.</p>
<h2 id="character-level-lstm-with-temperature-based-sampling">Character-Level LSTM with Temperature-Based Sampling</h2>
<p>The core approach is straightforward: train an LSTM to predict the next character in a SMILES string, then sample from the trained model to generate new molecules character by character.</p>
<p>The network architecture consists of:</p>
<ul>
<li>Two stacked LSTM layers (which learn the SMILES grammar)</li>
<li>A dropout layer for regularization</li>
<li>A dense output layer with 23 neurons (one per character in the reduced SMILES alphabet) and softmax activation</li>
</ul>
<p>Training used the RMSProp optimizer, with the learning rate annealed gradually from 0.01 to 0.0002. At generation time, a temperature parameter controls the randomness of character sampling, encouraging more diverse structures rather than close reproductions of training molecules.</p>
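<p>The paper does not spell out the sampling formula; the standard temperature-scaled softmax it alludes to can be sketched as follows (a minimal pure-Python sketch; the function name is illustrative):</p>

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample an index from unnormalized logits after temperature scaling.

    Lower temperatures sharpen the distribution (conservative, training-like
    output); higher temperatures flatten it (more diverse, riskier SMILES).
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```

<p>At temperature 1.0 this is ordinary softmax sampling; as the temperature approaches zero it degenerates to greedy argmax decoding.</p>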
<p>A key preprocessing step reduces the SMILES alphabet to 23 characters. Multi-character atom tokens are replaced with single characters (<code>Cl</code> → <code>L</code>, <code>Br</code> → <code>R</code>, <code>[nH]</code> → <code>A</code>). Only the organic atom subset (<code>H</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>S</code>, <code>P</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, <code>I</code>) is retained. Charged molecules, stereo information, and molecules with more than 5 ring closures are excluded. The training corpus totals 23,664,668 characters, with 40-character windows used as input sequences during training.</p>
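<p>The token recoding is plain string substitution; a minimal sketch of the three replacements named above (helper names are illustrative):</p>

```python
# Single-character recoding of multi-character SMILES tokens, as described
# in the paper (Cl -> L, Br -> R, [nH] -> A). The substitution must be
# reversed before handing generated strings to a chemistry toolkit.
REPLACEMENTS = [("Cl", "L"), ("Br", "R"), ("[nH]", "A")]

def encode_smiles(smiles: str) -> str:
    for token, char in REPLACEMENTS:
        smiles = smiles.replace(token, char)
    return smiles

def decode_smiles(encoded: str) -> str:
    for token, char in reversed(REPLACEMENTS):
        encoded = encoded.replace(char, token)
    return encoded
```

<p>For example, <code>encode_smiles("Clc1ccc(Br)cc1")</code> yields <code>"Lc1ccc(R)cc1"</code>, and decoding restores the original string.</p>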
<h2 id="training-on-chembl-and-generating-one-million-molecules">Training on ChEMBL and Generating One Million Molecules</h2>
<h3 id="training-data">Training Data</h3>
<p>The training set consists of 509,000 bioactive molecules from ChEMBL with reported activity below 10 micromolar on any target.</p>
<h3 id="generation-and-filtering">Generation and Filtering</h3>
<p>The LSTM generates SMILES strings character by character. The generated strings then undergo a two-stage validation:</p>
<ol>
<li><strong>Bracket and ring closure check</strong> (fast, text-based): 54% of generated SMILES are discarded for unpaired brackets or ring closures</li>
<li><strong>Full chemical parsing with RDKit</strong>: a further 14% fail due to unrealistic aromatic systems or incorrect valences</li>
</ol>
<p>The remaining 32% of generated SMILES correspond to valid molecules.</p>
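<p>The first-stage filter needs no chemistry at all; a sketch of such a text-level check (an approximation of the paper's unspecified implementation, ignoring two-digit <code>%nn</code> ring closures, which the reduced training set avoids by excluding molecules with more than 5 ring closures):</p>

```python
def passes_fast_check(smiles: str) -> bool:
    """Cheap text-level filter: balanced (), [] and paired ring-closure digits.

    Mirrors the paper's first validation stage; survivors still need full
    chemical parsing (e.g., RDKit) to catch valence and aromaticity errors.
    """
    depth_paren = depth_brack = 0
    digit_counts = {}
    for ch in smiles:
        if ch == "(":
            depth_paren += 1
        elif ch == ")":
            depth_paren -= 1
            if depth_paren < 0:
                return False
        elif ch == "[":
            depth_brack += 1
        elif ch == "]":
            depth_brack -= 1
            if depth_brack < 0:
                return False
        elif ch.isdigit():
            digit_counts[ch] = digit_counts.get(ch, 0) + 1
    return (depth_paren == 0 and depth_brack == 0
            and all(n % 2 == 0 for n in digit_counts.values()))
```

<p>Benzene (<code>c1ccccc1</code>) passes, while truncated strings such as <code>c1ccccc</code> or <code>C(C(C)C</code> are rejected before any parser is invoked.</p>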
<p>One million valid molecules were generated in under 2 hours on 300 CPUs.</p>
<h3 id="novelty-and-diversity">Novelty and Diversity</h3>
<p>Out of one million generated molecules, only 2,774 (0.28%) were identical to molecules in the training ChEMBL set. The generated set contained 627,000 unique scaffolds compared to 172,000 in ChEMBL, with an overlap of only 18,000 scaffolds. This demonstrates substantial novelty and diversity.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Calculated molecular descriptors (molecular weight, logP, and topological polar surface area) for the generated molecules closely matched the distributions of the ChEMBL training set. The synthetic accessibility score distributions were also practically identical, indicating comparable molecular complexity.</p>
<h3 id="substructure-feature-comparison">Substructure Feature Comparison</h3>
<p>The paper compares substructure features across three molecule sets: ChEMBL training data, LSTM-generated molecules, and a naive SMILES baseline generator. The naive generator uses only character frequency statistics and basic SMILES syntax rules, producing primarily macrocycles with very few fused aromatic systems.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>ChEMBL (%)</th>
          <th>LSTM Generated (%)</th>
          <th>Naive Baseline (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No rings</td>
          <td>0.4</td>
          <td>0.4</td>
          <td>0.1</td>
      </tr>
      <tr>
          <td>1 ring</td>
          <td>2.8</td>
          <td>4.3</td>
          <td>13.2</td>
      </tr>
      <tr>
          <td>2 rings</td>
          <td>14.8</td>
          <td>23.1</td>
          <td>17.7</td>
      </tr>
      <tr>
          <td>3 rings</td>
          <td>32.2</td>
          <td>43.5</td>
          <td>27.3</td>
      </tr>
      <tr>
          <td>4 rings</td>
          <td>32.7</td>
          <td>23.9</td>
          <td>25.2</td>
      </tr>
      <tr>
          <td>&gt;4 rings</td>
          <td>17.2</td>
          <td>4.8</td>
          <td>16.5</td>
      </tr>
      <tr>
          <td>Fused aromatic rings</td>
          <td>38.8</td>
          <td>30.9</td>
          <td>0.2</td>
      </tr>
      <tr>
          <td>Large rings (&gt;8)</td>
          <td>0.4</td>
          <td>1.8</td>
          <td>75.9</td>
      </tr>
      <tr>
          <td>Spiro rings</td>
          <td>1.9</td>
          <td>0.6</td>
          <td>0.6</td>
      </tr>
      <tr>
          <td>Contains N</td>
          <td>96.5</td>
          <td>96.1</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>Contains O</td>
          <td>93.0</td>
          <td>92.0</td>
          <td>85.5</td>
      </tr>
      <tr>
          <td>Contains S</td>
          <td>35.6</td>
          <td>27.9</td>
          <td>39.6</td>
      </tr>
      <tr>
          <td>Contains halogen</td>
          <td>40.7</td>
          <td>38.8</td>
          <td>49.4</td>
      </tr>
  </tbody>
</table>
<p>The LSTM-generated molecules closely mirror the ChEMBL distributions, while the naive generator fails to capture drug-like structural patterns. The LSTM tends to slightly over-represent 2-3 ring systems and under-represent 4+ ring systems relative to ChEMBL. Functional group distributions also closely matched between ChEMBL and the LSTM output.</p>
<h3 id="virtual-screening-validation">Virtual Screening Validation</h3>
<p>The generated molecules were evaluated using profile QSAR models for 159 ChEMBL kinase assays. The six best models (with realistic test set R-squared &gt; 0.75) were used to predict pIC50 values for both actual ChEMBL compounds and generated compounds. The cumulative frequency distributions of predicted activity were nearly identical between the two sets.</p>
<p>Kolmogorov-Smirnov (KS) tests on random samples of 1,000 compounds confirmed this quantitatively:</p>
<table>
  <thead>
      <tr>
          <th>Assay</th>
          <th>KS D</th>
          <th>Distributions Differ?</th>
          <th>Mean (Real)</th>
          <th>Mean (Gen)</th>
          <th>Stdev (Real)</th>
          <th>Stdev (Gen)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>688395</td>
          <td>6.01%</td>
          <td>No</td>
          <td>4.66</td>
          <td>4.69</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>668624</td>
          <td>3.60%</td>
          <td>No</td>
          <td>4.86</td>
          <td>4.86</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>9.90%</td>
          <td>Yes</td>
          <td>5.33</td>
          <td>5.26</td>
          <td>0.34</td>
          <td>0.30</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>4.30%</td>
          <td>No</td>
          <td>5.18</td>
          <td>5.13</td>
          <td>0.47</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>688781</td>
          <td>2.20%</td>
          <td>No</td>
          <td>4.83</td>
          <td>4.82</td>
          <td>0.26</td>
          <td>0.25</td>
      </tr>
      <tr>
          <td>809170</td>
          <td>8.70%</td>
          <td>Yes</td>
          <td>5.12</td>
          <td>5.07</td>
          <td>0.51</td>
          <td>0.46</td>
      </tr>
  </tbody>
</table>
<p>For 4 of 6 models, the null hypothesis that the distributions are the same could not be rejected at the 95% confidence level (critical D = 6.04%). Even for the two assays where the KS test rejected the null hypothesis, the maximum vertical distance between distributions was below 10%.</p>
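<p>The D statistic used here is the maximum vertical gap between two empirical CDFs; a self-contained sketch:</p>

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov D: the maximum vertical distance
    between the empirical CDFs of xs and ys."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in sorted(set(xs) | set(ys)):
        cdf_x = bisect.bisect_right(xs, v) / len(xs)
        cdf_y = bisect.bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d
```

<p>For two samples of 1,000 compounds each, the asymptotic 95% critical value is roughly 1.36&middot;&radic;(2/1000) &asymp; 6.1%, consistent with the critical D of 6.04% quoted above.</p>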
<h2 id="generated-molecules-are-novel-drug-like-and-potentially-bioactive">Generated Molecules Are Novel, Drug-Like, and Potentially Bioactive</h2>
<p>The key findings of this study are:</p>
<ol>
<li><strong>High novelty</strong>: Only 0.28% of generated molecules match training compounds; 627K novel scaffolds were produced versus 172K in ChEMBL</li>
<li><strong>Drug-like quality</strong>: Physicochemical properties, substructure features, functional group distributions, and synthetic accessibility scores all closely match the ChEMBL training distribution, without these being explicit constraints</li>
<li><strong>Predicted bioactivity</strong>: Virtual screening with profile QSAR models shows the generated molecules have comparable predicted activity profiles to known bioactive compounds</li>
<li><strong>Scalability</strong>: One million valid molecules in under 2 hours on 300 CPUs, with the potential to scale to billions with GPU acceleration</li>
<li><strong>LSTM superiority over naive baselines</strong>: A simple statistical SMILES generator using only character frequencies produces chemically unrealistic molecules (mostly macrocycles), demonstrating that the LSTM genuinely learns drug-like chemical patterns</li>
</ol>
<p>The main limitations are the 32% validity rate (68% of generated SMILES are invalid), the exclusion of stereochemistry and charged molecules from the training set, and the lack of any goal-directed generation capability (the model produces unconditional samples from the training distribution). The code was described as &ldquo;available on request&rdquo; from the corresponding author rather than publicly released.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL bioactive molecules</td>
          <td>509,000 molecules</td>
          <td>Activity &lt; 10 uM on any target; organic atoms only; no charges or stereo</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Double-stacked LSTM layers with dropout</li>
<li>Softmax output over 23-character reduced SMILES alphabet</li>
<li>RMSProp optimizer with learning rate annealed from 0.01 to 0.0002</li>
<li>Temperature-based sampling at generation time</li>
<li>40-character input windows during training</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture consists of two LSTM layers, a dropout layer, and a 23-neuron dense output layer. Exact hidden unit counts and dropout rates are not specified in the paper.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES rate</td>
          <td>32%</td>
          <td>After bracket check and RDKit parsing</td>
      </tr>
      <tr>
          <td>Novelty (vs. training)</td>
          <td>99.72%</td>
          <td>Only 2,774 of 1M match ChEMBL</td>
      </tr>
      <tr>
          <td>Unique scaffolds</td>
          <td>627,000</td>
          <td>vs. 172,000 in ChEMBL</td>
      </tr>
      <tr>
          <td>KS test (4/6 assays)</td>
          <td>Not significantly different</td>
          <td>At 95% confidence</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Generation: 300 CPUs for under 2 hours (1 million valid molecules)</li>
<li>Training hardware not specified</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ertl, P., Lewis, R., Martin, E., &amp; Polyakov, V. (2017). In silico generation of novel, drug-like chemical matter using the LSTM neural network. <em>arXiv preprint</em>, arXiv:1712.07449.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ertl2017silico,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{In silico generation of novel, drug-like chemical matter using the LSTM neural network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ertl, Peter and Lewis, Richard and Martin, Eric and Polyakov, Valery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.07449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
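<p>The genotype-to-phenotype mapping can be illustrated with a toy grammar (a tiny made-up fragment for the sketch, not the paper's OpenSMILES subset):</p>

```python
# Toy grammar: non-terminals are the dict keys; everything else is a
# terminal emitted verbatim. Illustrative only.
GRAMMAR = {
    "CHAIN": [["ATOM"], ["ATOM", "CHAIN"], ["ATOM", "(", "CHAIN", ")", "CHAIN"]],
    "ATOM": [["C"], ["N"], ["O"]],
}

def chromosome_to_smiles(chromosome, start="CHAIN"):
    symbols = [start]
    for c in chromosome:
        # find the leftmost non-terminal symbol
        idx = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if idx is None:  # fully terminal: derivation finished early
            break
        rules = GRAMMAR[symbols[idx]]
        chosen = rules[c % len(rules)]  # the ((c mod r) + 1)-th rule
        symbols[idx:idx + 1] = chosen
    if any(s in GRAMMAR for s in symbols):
        return None  # chromosome exhausted before derivation finished
    return "".join(symbols)
```

<p>For instance, the chromosome <code>[1, 0, 0, 0]</code> derives <code>CC</code>, while <code>[2, 0, 0, 0, 0, 0]</code> derives the branched string <code>C(C)C</code>; a chromosome that runs out before eliminating all non-terminals maps to no molecule.</p>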
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
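<p>One generation of this mutation-only loop can be sketched in a few lines (pure Python; <code>fitness</code> stands in for the docking or druglikeness score, and the codon range is an illustrative choice):</p>

```python
import random

def evolve_step(population, fitness, lam, rng, codon_max=255):
    """One (mu + lambda) generation: mutate lambda random parents at a
    single position, score everyone, and keep the best mu."""
    mu = len(population)
    offspring = []
    for _ in range(lam):
        parent = rng.choice(population)
        child = list(parent)
        child[rng.randrange(len(child))] = rng.randrange(codon_max + 1)
        offspring.append(child)
    pool = population + offspring  # parents compete with children (elitism)
    pool.sort(key=fitness, reverse=True)  # invalid molecules score -inf
    return pool[:mu]
```

<p>Because each of the $\lambda$ offspring is independent, all fitness evaluations within a generation can be dispatched to separate docking simulators in parallel.</p>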
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and ring-penalty$(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 +/- 0.24</td>
          <td>5.32 +/- 0.43</td>
          <td>5.73 +/- 0.33</td>
          <td>5.88 +/- 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 +/- 0.33</td>
          <td>4.28 +/- 0.28</td>
          <td>4.40 +/- 0.27</td>
          <td>4.53 +/- 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 +/- 26.91</td>
          <td>-1.39 +/- 2.24</td>
          <td>-0.61 +/- 1.08</td>
          <td>-0.006 +/- 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 +/- 3.14</td>
          <td>-1.29 +/- 1.67</td>
          <td>-0.17 +/- 0.96</td>
          <td>0.25 +/- 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 +/- 0.38</td>
          <td>5.41 +/- 0.51</td>
          <td>5.49 +/- 0.44</td>
          <td>5.58 +/- 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9,466 molecules in total. Among these, 349 achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 +/- 0.34</td>
          <td>ChemTS: 5.58 +/- 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
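<p>This recurrence-convolution equivalence can be checked numerically. The sketch below is illustrative only: random matrices stand in for learned S4 parameters, and a real S4 computes $\overline{\mathbf{K}}$ via the stable Cauchy-kernel reduction rather than explicit matrix powers.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 4, 16                                        # state size, sequence length
A = 0.5 * rng.standard_normal((N, N)) / np.sqrt(N)  # placeholder discretized state matrix
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
D = rng.standard_normal((1, 1))
u = rng.standard_normal(L)

# Recurrent form: x_k = A x_{k-1} + B u_k;  y_k = C x_k + D u_k
x = np.zeros((N, 1))
y_rec = np.zeros(L)
for k in range(L):
    x = A @ x + B * u[k]
    y_rec[k] = (C @ x + D * u[k]).item()

# Convolutional form: y = u * K with K_j = C A^j B, plus the D skip term
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = np.array([sum(K[k - j] * u[j] for j in range(k + 1)) for k in range(L)]) + D.item() * u

assert np.allclose(y_rec, y_conv)
```

<p>The same parameters thus support parallel (convolutional) training and sequential (recurrent) generation.</p>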
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
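<p>A minimal sketch of this ranking step (hypothetical function and toy numbers; the log-likelihoods would come from the fine-tuned and pre-trained CLMs):</p>

```python
def rank_by_score(molecules, ll_finetuned, ll_pretrained):
    """Rank molecules by the bias-corrected log-likelihood score
    L_score(M) = L_ft(M) - L_pt(M): subtracting the pre-training
    likelihood isolates what fine-tuning on the target contributed."""
    scores = {m: ll_finetuned[m] - ll_pretrained[m] for m in molecules}
    return sorted(molecules, key=scores.get, reverse=True)

# Toy example: "b" is generically likely under the pre-trained model,
# but "a" gained the most likelihood from fine-tuning, so it ranks first.
ll_pt = {"a": -40.0, "b": -10.0, "c": -30.0}
ll_ft = {"a": -20.0, "b": -9.0, "c": -28.0}
print(rank_by_score(["a", "b", "c"], ll_ft, ll_pt))  # -> ['a', 'c', 'b']
```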
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals a distinct failure mode per architecture: LSTMs struggle most with branching errors, GPTs with ring and bond-assignment errors, while S4 makes fewer branching and ring errors than either but more bond-assignment errors than the LSTM. This pattern supports the hypothesis that S4's holistic (convolutional) training better captures long-range dependencies (branching, ring opening/closure), while purely recurrent processing better handles local dependencies (bond assignment).</p>
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p = 8.41 \times 10^{-6}$ (top 10), $p = 2.93 \times 10^{-7}$ (top 50), $p = 1.45 \times 10^{-7}$ (top 100)</li>
<li>S4 vs. GPT: $p = 2.33 \times 10^{-3}$ (top 10), $p = 3.72 \times 10^{-3}$ (top 50), $p = 2.61 \times 10^{-2}$ (top 100)</li>
</ul>
<p>TP53 was the most challenging target, where no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
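<p>For intuition, the statistic behind these $p$-values can be sketched in a simplified form (no tie or zero-difference handling; in practice a library routine such as scipy.stats.wilcoxon computes both the statistic and the $p$-value):</p>

```python
def wilcoxon_w(paired_a, paired_b):
    """Simplified Wilcoxon signed-rank statistic: rank the absolute
    paired differences, then sum the ranks where a exceeds b.
    Omits tie handling and the p-value computation for brevity."""
    diffs = [a - b for a, b in zip(paired_a, paired_b) if a != b]
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(r + 1 for r, i in enumerate(ranked) if diffs[i] > 0)

# If one model's paired scores always exceed the other's,
# every rank contributes, giving the maximum statistic:
s4_scores = [0.9, 0.8, 0.7, 0.95]
lstm_scores = [0.5, 0.6, 0.4, 0.70]
print(wilcoxon_w(s4_scores, lstm_scores))  # -> 10 (= 1+2+3+4)
```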
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 as close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
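<p>Temperature sampling rescales the next-token logits by $1/T$ before the softmax; higher $T$ flattens the distribution (more exploration, more syntax errors), lower $T$ sharpens it. A minimal sketch with placeholder logits, not tied to any of the paper's models:</p>

```python
import math, random

def sample_token(logits, temperature=1.0, rng=random):
    """Sample a next-token index from logits scaled by 1/T."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(exps)
    for i, e in enumerate(exps):
        r -= e
        if r <= 0:
            return i
    return len(exps) - 1

random.seed(0)
# At low T, sampling is nearly deterministic (always the top logit);
# at high T, lower-probability tokens are drawn far more often.
cold = [sample_token([2.0, 0.0, -2.0], temperature=0.1) for _ in range(200)]
hot = [sample_token([2.0, 0.0, -2.0], temperature=2.0) for _ in range(200)]
print(cold.count(0), hot.count(0))
```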
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.6 +/- 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (6,000 to 12,000 more than benchmarks) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>In terms of computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and is the fastest of the three architectures at generation.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
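<p>The fine-tuning early-stopping criterion (patience 5, tolerance $10^{-5}$) can be sketched generically; this is a standard pattern, not the authors' code:</p>

```python
class EarlyStopping:
    """Stop fine-tuning when validation loss fails to improve by more than
    `tolerance` for `patience` consecutive epochs (paper: patience=5, tol=1e-5)."""
    def __init__(self, patience=5, tolerance=1e-5):
        self.patience, self.tolerance = patience, tolerance
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if self.best - val_loss > self.tolerance:   # meaningful improvement
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience     # True -> stop training

stopper = EarlyStopping()
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
stops = [stopper.step(l) for l in losses]
print(stops.index(True))  # -> 6: stops after 5 consecutive non-improving epochs
```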
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant ($p = 8.41 \times 10^{-6}$ vs. LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{\&#34;O{}z\c{c}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution via autoregressive factorization:</p>
<p>$$
p(x) = \prod_{i=1}^{n} p(t_i \mid t_{i-1}, \dots, t_1)
$$</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
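<p>The atom+coordinate-level scheme (LM-AC) with coordinate rounding can be sketched as follows; the exact token formatting here is an assumption for illustration, not the paper's specification:</p>

```python
def tokenize_atoms(atoms, precision=2):
    """Atom+coordinate-level tokenization: each atom becomes exactly
    4 tokens -- its element symbol plus three rounded coordinate tokens."""
    tokens = []
    for element, (x, y, z) in atoms:
        tokens.append(element)
        tokens.extend(f"{c:.{precision}f}" for c in (x, y, z))
    return tokens

# Hypothetical water geometry, rounded to 2 decimal places:
water = [("O", (0.0, 0.0, 0.1173)),
         ("H", (0.0, 0.7572, -0.4692)),
         ("H", (0.0, -0.7572, -0.4692))]
print(tokenize_atoms(water))
# -> ['O', '0.00', '0.00', '0.12', 'H', '0.00', '0.76', '-0.47', ...]
```

<p>Shorter sequences come at the cost of a larger vocabulary, since every distinct rounded coordinate value needs its own token.</p>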
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, root mean squared deviation (RMSD) between language model-generated conformers and RDKit-generated conformers shows most molecules fall between 1.0 and 2.0 RMSD, with a heavy tail extending to 4.0.</p>
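<p>For reference, RMSD between two conformers with matched atom ordering reduces to the following; a real conformer comparison would first superimpose the structures (e.g., Kabsch alignment), which this sketch omits:</p>

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two matched-atom conformers.
    Assumes the structures are already superimposed."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(rmsd(a, b))  # -> sqrt(0.5), about 0.7071
```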
<p>Standard metrics include validity, uniqueness, novelty, and earth mover&rsquo;s distance (WA) for molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
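<p>The structural-validity criterion amounts to a pairwise minimum-distance check; this simplified sketch ignores periodic images, which a proper crystal check must also consider:</p>

```python
import itertools, math

def structurally_valid(coords, min_dist=0.5):
    """Structural validity: every pairwise interatomic distance must exceed
    `min_dist` (0.5 angstrom in the paper). Ignores periodic boundary images."""
    for a, b in itertools.combinations(coords, 2):
        if math.dist(a, b) <= min_dist:
            return False
    return True

good = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]
bad = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)]   # atoms 0.2 A apart: too close
print(structurally_valid(good), structurally_valid(bad))  # -> True False
```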
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
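<p>The sequence-length difference between the two tokenization schemes can be illustrated with a toy example. The exact vocabularies in the paper may differ; this sketch only contrasts one-token-per-character against one token per atom symbol and per formatted coordinate:</p>

```python
def tokenize_char(xyz_line):
    """Character-level tokenization (LM-CH style): every character,
    including digits, signs, and spaces, is its own token."""
    return list(xyz_line)

def tokenize_atom_coord(symbol, x, y, z, ndp=2):
    """Atom + coordinate tokenization (LM-AC style, illustrative):
    one token for the element, one per coordinate rounded to ndp places."""
    return [symbol] + [f"{c:.{ndp}f}" for c in (x, y, z)]

line = "C 1.23 -0.45 2.07"
len(tokenize_char(line))                       # 17 tokens for one atom
tokenize_atom_coord("C", 1.23, -0.45, 2.07)    # 4 tokens for the same atom
```

<p>Fewer tokens per atom means fewer sequential decisions per atom placement, which is the likely source of LM-AC's advantage.</p>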
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
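<p>The rotation augmentation amounts to applying a freshly sampled rotation matrix to every structure each epoch. A minimal sketch (one rotation axis shown for brevity; the full augmentation samples a uniform 3D rotation):</p>

```python
import math, random

def random_rotation_z(coords, rng=random):
    """Rotate a structure by a random angle about the z-axis.
    coords: list of (x, y, z) tuples. Distances are preserved, so the
    model sees the same structure at new absolute coordinates."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]
```

<p>Because the model consumes absolute Cartesian coordinates, this augmentation is the only mechanism through which it can learn rotational invariance.</p>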
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained on Compute Canada systems. Specific GPU types, counts, and training times are not reported.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = {(\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y, \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g)}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^*} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^*)
$$</p>
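<p>The three steps above can be sketched schematically, with the model internals abstracted behind callables (the names here are illustrative, not from the paper's code):</p>

```python
def back_translation(train, labeled_pairs, unlabeled_targets, sample_reverse):
    """Schematic back translation.
    train(pairs)         -> a trained model, i.e. a callable x -> y
    sample_reverse(g, y) -> a source sampled from P(. | y; g)"""
    # Step 1: train forward and reverse models on the labeled pairs.
    f = train(labeled_pairs)
    g = train([(y, x) for x, y in labeled_pairs])
    # Step 2: back translate unlabeled targets into synthetic pairs.
    synthetic = [(sample_reverse(g, y), y) for y in unlabeled_targets]
    # Step 3: retrain the forward model on labeled + synthetic data.
    f = train(labeled_pairs + synthetic)
    return f
```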
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
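<p>The filtration step can be sketched as follows, assuming a QED-style target property range and a Dice similarity constraint between source and target (thresholds and helper names are illustrative; the paper computes Dice similarity on Morgan fingerprints via RDKit):</p>

```python
def dice_similarity(fp_a, fp_b):
    """Dice similarity between two fingerprint bit sets."""
    if not fp_a and not fp_b:
        return 1.0
    return 2.0 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))

def filter_synthetic(pairs, fingerprint, prop, sim_min=0.4, prop_range=(0.9, 1.0)):
    """Keep only back-translated pairs that satisfy the same constraints
    as the labeled data: target property in range, source sufficiently
    similar to target."""
    lo, hi = prop_range
    return [(x, y) for x, y in pairs
            if lo <= prop(y) <= hi
            and dice_similarity(fingerprint(x), fingerprint(y)) >= sim_min]
```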
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3\% : 70.4\% : 0.3\%$).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Using 1M unfiltered back-translated molecules can underperform smaller filtered sets (e.g., QED reaches only 75.1% vs. 82.9% with filtering, though still above the 71.9% baseline), while filtering to enforce the same constraints as the labeled data recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, back translation with Transformer takes 11.0h vs. 8.5h for supervised training alone, with the back-translation step itself taking under 1 hour.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
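<p>A minimal sketch of this retrieval heuristic, with constraints as boolean predicates (names and the relaxation order are this sketch's assumptions, not RetMol's API):</p>

```python
def retrieve_exemplars(database, constraints, score, k=10):
    """Build the feasible set of molecules satisfying all constraints,
    then return the K best by property score. If too few molecules
    qualify, progressively drop constraints and retry."""
    active = list(constraints)
    while True:
        feasible = [m for m in database if all(c(m) for c in active)]
        if len(feasible) >= k or not active:
            break
        active.pop()  # relax one constraint and retry
    return sorted(feasible, key=score, reverse=True)[:k]
```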
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
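<p>A single-head sketch of this fusion in NumPy, with the exemplar embeddings concatenated along the sequence axis (projection matrices here are plain parameters; RetMol's actual module details may differ):</p>

```python
import numpy as np

def cross_attention(e_in, E_r, Wq, Wk, Wv):
    """The input embedding queries the exemplar embeddings.
    e_in: (L, D) input embedding; E_r: (K*L, D) stacked exemplar
    embeddings; Wq, Wk, Wv: (D, D) projection matrices."""
    Q, K, V = e_in @ Wq, E_r @ Wk, E_r @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # scaled dot products
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ V                                # fused embedding, (L, D)
```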
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
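<p>The loop structure can be sketched on a toy problem. Here the &ldquo;molecule&rdquo; is just a vector and decoding is the identity, so only the hill-climbing skeleton of steps 3&ndash;5 is visible; real RetMol perturbs the fused embedding and decodes SMILES with the frozen Chemformer.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def refine(x0, score, M=10, sigma=1.0, max_iters=20):
    """Perturb M times per round; keep the best candidate if it improves."""
    x, best = x0, score(x0)
    for _ in range(max_iters):
        cands = [x + sigma * rng.standard_normal(x.shape) for _ in range(M)]
        c = max(cands, key=score)
        if score(c) > best:        # step 5: replace input with the improvement
            x, best = c, score(c)
    return x, best

target = np.array([1.0, -2.0, 0.5])
score = lambda v: -float(np.linalg.norm(v - target))
x_opt, best = refine(np.zeros(3), score)
print(best > score(np.zeros(3)))  # whether refinement improved on the start
```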
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves a 94.5% success rate, compared to 92.8% for the previous best method (QMO).</p>
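<p>The success criterion is easy to state in code. A real evaluation would compute QED and Morgan fingerprints with RDKit; the sketch below instead uses toy fingerprints represented as sets of on-bits, so it is self-contained but not chemically meaningful.</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def task_success(qed, sim, qed_min=0.9, sim_min=0.4):
    """QED >= 0.9 while staying Tanimoto-similar (>= 0.4) to the input."""
    return qed >= qed_min and sim >= sim_min

a, b = {1, 4, 7, 9}, {1, 4, 8}
print(round(tanimoto(a, b), 3))                     # 0.4 (2 shared bits / 5 total)
print(task_success(qed=0.92, sim=tanimoto(a, b)))   # True (boundary counts)
```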
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves an average improvement of 11.55, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves an average binding-affinity improvement of 2.84 kcal/mol versus 1.67 kcal/mol for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably well (84.7% with molecules satisfying only two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence $\{s_0, \ldots, s_{j-1}\}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
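<p>The corruption step itself is straightforward; a sketch (the 15% default rate here is illustrative, not necessarily the paper&rsquo;s masking ratio):</p>

```python
import random

def corrupt(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace SELFIES tokens with [MASK]; the encoder reads the
    corrupted sequence and the decoder reconstructs the original left-to-right."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < mask_prob else t for t in tokens]

tokens = ["[C]", "[C]", "[O]", "[C]", "[Branch1]", "[C]", "[O]"]
print(corrupt(tokens, mask_prob=0.5))
```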
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
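<p>This decomposition is an exact algebraic identity of the softmax over concatenated keys, which a few lines of NumPy can verify for a single query (dimensions below are arbitrary):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, m, n = 8, 5, 12                 # head dim, prefix length, sequence length
q = rng.standard_normal(d)         # a single query row x W_q
Pk, Pv = rng.standard_normal((m, d)), rng.standard_normal((m, d))
K, V = rng.standard_normal((n, d)), rng.standard_normal((n, d))

s_p, s_x = q @ Pk.T / np.sqrt(d), q @ K.T / np.sqrt(d)
w = softmax(np.concatenate([s_p, s_x]))    # attention over [P_k; X W_k]
full = w @ np.concatenate([Pv, V])

lam = w[:m].sum()                           # mass the query puts on prefixes
head = lam * (softmax(s_p) @ Pv) + (1 - lam) * (softmax(s_x) @ V)
print(np.allclose(full, head))  # True: the interpolation form is exact
```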
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the model should satisfy:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^*, \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
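<p>A direct transcription of the rank loss (pure Python; in practice $f(S)$ comes from the model&rsquo;s sequence log-probabilities rather than hand-set numbers):</p>

```python
def rank_loss(logps, gamma=1.0):
    """Pairwise margin loss. logps[i] is f(S_i), with candidates ordered
    so that index 0 has the best property score Ps."""
    loss = 0.0
    for i in range(len(logps)):
        for j in range(i + 1, len(logps)):
            margin = (j - i) * gamma       # gamma_ij grows with the rank gap
            loss += max(0.0, logps[j] - logps[i] + margin)
    return loss

print(rank_loss([0.0, -2.0, -4.0]))  # 0.0: ranking matches, margins satisfied
print(rank_loss([-4.0, -2.0, 0.0]))  # 12.0: fully inverted ranking is penalized
```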
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty saturates as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
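<p>The raw-versus-de-duplicated comparison hinges on keying molecules by canonical form rather than by raw string, since the same molecule admits many SMILES spellings. A minimal sketch, assuming a caller-supplied canonicalizer; the upper-casing stand-in below is for illustration only, and a real pipeline would canonicalize with RDKit&rsquo;s <code>Chem.MolToSmiles</code>:</p>

```python
def deduplicate(smiles_list, canonicalize):
    """Keep the first occurrence of each molecule, keyed by canonical form."""
    seen = set()
    unique = []
    for s in smiles_list:
        key = canonicalize(s)
        if key not in seen:
            seen.add(key)
            unique.append(s)
    return unique

# Toy stand-in canonicalizer (illustration only): real pipelines use RDKit's
# Chem.MolToSmiles so that different SMILES spellings of the same molecule
# collapse to a single key.
toy_canonicalize = lambda s: s.upper()

raw = ["CCO", "cco", "CCN", "CCO"]
print(deduplicate(raw, toy_canonicalize))  # -> ['CCO', 'CCN']
```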
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale-Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than that of classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM baselines like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph- and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approximately 5&ndash;8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
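<p>The Schwaller et al. (2019) tokenizer referenced above is regex-based. The sketch below uses the widely circulated pattern from that work; it illustrates the tokenization step only, and does not reproduce how GP-MoLFormer&rsquo;s 2,362-token vocabulary was built:</p>

```python
import re

# Regex pattern from Schwaller et al. (2019), widely used for SMILES
# tokenization: bracket atoms, two-letter halogens, organic-subset atoms,
# bonds, branches, ring closures (including %NN), and digits.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(smiles):
    tokens = SMILES_PATTERN.findall(smiles)
    # Every input character must be consumed by some token.
    assert "".join(tokens) == smiles, "untokenizable characters in input"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```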
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
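<p>The objective above can be evaluated directly once the frozen model&rsquo;s per-token probabilities are in hand. A minimal sketch of the loss computation only: gradient flow into the prompt embeddings $\phi_T$ and all model details are omitted, and the probabilities are illustrative stand-ins:</p>

```python
import math

def pair_tuning_loss(cond_probs):
    """Cross-entropy objective for pair-tuning (Algorithm 1).

    cond_probs[i] stands for P_theta(b_i | phi_T, a, b_{<i}): the frozen
    base model's probability of the i-th target token given the learned
    soft prompt phi_T, the seed molecule a, and the preceding target
    tokens. During training only phi_T would receive gradients; this
    sketch just evaluates the objective for given probabilities.
    """
    return -sum(math.log(p) for p in cond_probs)

# A 4-token target molecule b with stand-in per-token probabilities:
loss = pair_tuning_loss([0.9, 0.8, 0.95, 0.7])
print(round(loss, 4))  # -> 0.7365
```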
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping; the attention output at position $m$ is:
$$ \text{Attention}(Q, K, V)_m = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
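<p>The causal form of the linear-attention formula above can be computed in $O(N)$ with running sums over the keys and values. A sketch under simplifying assumptions: a plain exponential feature map stands in for the paper&rsquo;s generalized random features, and the RoPE rotations $R_m$ are omitted:</p>

```python
import math

def phi(x):
    # Positive feature map; a simple exp() stand-in for the paper's
    # generalized random feature map.
    return [math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention(Q, K, V):
    """O(N) causal linear attention via running sums:
    out_m = sum_{n<=m} <phi(q_m), phi(k_n)> v_n / sum_{n<=m} <phi(q_m), phi(k_n)>
    """
    feat = len(K[0])   # feature dim (phi preserves length here)
    d = len(V[0])
    S = [[0.0] * d for _ in range(feat)]  # running sum of phi(k_n) outer v_n
    z = [0.0] * feat                      # running sum of phi(k_n)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for i in range(feat):
            z[i] += fk[i]
            for j in range(d):
                S[i][j] += fk[i] * v[j]
        fq = phi(q)
        denom = dot(fq, z)
        out.append([dot(fq, [S[i][j] for i in range(feat)]) / denom
                    for j in range(d)])
    return out

# Two positions, 1-d heads: position 2 attends equally to both values.
print(causal_linear_attention([[0.0], [0.0]], [[0.0], [0.0]], [[1.0], [3.0]]))
# -> [[1.0], [2.0]]
```

Because the running sums <code>S</code> and <code>z</code> are updated once per position, cost grows linearly in sequence length instead of quadratically, which is what makes training on 1.1B SMILES tractable.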
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a> (calculated as $\text{logP} - \text{SA} - \max(\text{largest ring size} - 6, 0)$), Drug-likeness (QED), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
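<p>The novelty-decay fit $y = ae^{-bx}$ can be reproduced with a simple log-linear least-squares regression. A sketch on synthetic data; the authors&rsquo; exact fitting procedure and fitted coefficients are not reproduced here:</p>

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on ln(y) = ln(a) - b*x.

    A closed-form alternative to scipy.optimize.curve_fit, valid when
    all y > 0 (novelty fractions are positive).
    """
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    slope = (sum((x - mx) * (ly - my) for x, ly in zip(xs, lys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope  # (a, b)

# Synthetic novelty fractions drawn exactly from y = 0.32 * exp(-0.15 * x):
xs = [0.0, 2.0, 4.0, 6.0, 8.0]
ys = [0.32 * math.exp(-0.15 * x) for x in xs]
a, b = fit_exp_decay(xs, ys)
print(round(a, 3), round(b, 3))  # recovers a = 0.32, b = 0.15
```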
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to a factor of $8$ as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
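<p>The augment-then-mask corruption can be sketched on token lists. The span-masking parameters below are illustrative rather than Chemformer&rsquo;s exact settings, and the identity stand-in for augmentation replaces what would really be an RDKit-generated randomized SMILES of the same molecule:</p>

```python
import random

def span_mask(tokens, mask_prob=0.15, max_span=3, rng=None):
    """BART-style span corruption: replace short random spans with <mask>.

    The corruption rate and span-length distribution here are illustrative,
    not Chemformer's exact settings.
    """
    rng = rng or random.Random()
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            out.append("<mask>")
            i += 1 + rng.randrange(max_span)  # consume a span of 1..max_span
        else:
            out.append(tokens[i])
            i += 1
    return out

def combined_corrupt(tokens, augment, rng):
    # "Combined" task: augment first, then mask; the model is trained to
    # reconstruct the canonical token sequence from this corrupted input.
    return span_mask(augment(tokens), rng=rng)

# Identity stand-in: real augmentation yields a randomized (non-canonical)
# SMILES of the same molecule, e.g. via RDKit atom-order permutation.
identity = lambda t: list(t)

print(combined_corrupt(list("CCOC(=O)C"), identity, random.Random(0)))
```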
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6&ndash;54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training significantly accelerated training; fine-tuning for just 20 epochs (~30 minutes) outperformed baselines that were trained for significantly longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
<td style="text-align: left">Selected subset (annotated as reactive and purchasable; MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. Used <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> for standard and round-trip validation.</li>
</ul>
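<p>The beam-search decoding above (width 10) can be sketched generically. Here <code>step_probs</code> is a hypothetical stand-in for the decoder&rsquo;s next-token softmax; this is not the <code>molbart</code> implementation:</p>

```python
import math

def beam_search(step_probs, width=10, eos="<eos>", max_len=50):
    """Minimal beam search (the paper decodes with width 10).

    step_probs(prefix) -> {token: prob} is a stand-in for the decoder's
    next-token distribution; beams are ranked by total log-probability.
    """
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, lp in candidates[:width]:
            (finished if seq[-1] == eos else beams).append((seq, lp))
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)

# Toy two-step "decoder": emits C or O, then terminates.
def toy_model(prefix):
    return {"C": 0.7, "O": 0.3} if not prefix else {"<eos>": 1.0}

top = beam_search(toy_model, width=2)
print(top[0][0])  # -> ['C', '<eos>']
```

The augmentation trade-off discussed in the results follows directly from this ranking step: when several SMILES spellings of the same molecule each receive high probability, they crowd out chemically distinct candidates from the top-$k$ beams.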
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20&ndash;40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology 2022</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>