<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Molecular Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/</link><description>Recent content in Molecular Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/index.xml" rel="self" type="application/rss+xml"/><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction rules or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused actives yields high fractions of predicted actives. However, this maximum likelihood fine-tuning cannot use negative or continuous scores and risks catastrophic forgetting.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
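<p>As a minimal illustration, the sequence log-likelihood is just a sum of per-step log-probabilities; the probabilities below are toy stand-ins for the RNN's softmax outputs:</p>

```python
import math

def sequence_log_likelihood(step_probs):
    """log P(A) = sum_t log pi(a_t | s_t), given the probability the model
    assigned to each sampled token a_t."""
    return sum(math.log(p) for p in step_probs)
```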
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
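<p>As a concrete sketch (not the authors' implementation), the per-sequence loss can be computed directly from the two summed log-likelihoods and the score; the default $\sigma$ and the toy values below are illustrative:</p>

```python
def augmented_loss(log_p_prior: float, log_p_agent: float,
                   score: float, sigma: float = 15.0) -> float:
    """REINVENT-style loss for one sampled sequence: the squared difference
    between the augmented likelihood (prior log-likelihood shifted by
    sigma * S(A)) and the agent's own log-likelihood."""
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2

# A neutrally scored sequence (S = 0) that the agent already assigns the
# prior's likelihood incurs zero loss:
print(augmented_loss(-20.0, -20.0, 0.0))  # 0.0
```

<p>Minimizing this over a batch pushes the agent's likelihood up for high-scoring sequences and down for low-scoring ones, relative to the prior.</p>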
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from ${H, B, C, N, O, F, Si, P, S, Cl, Br, I}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128, an initial learning rate of 0.001, and a learning-rate decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
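<p>A minimal sketch of this ternary scoring function; the element check is approximate string matching (a real implementation would parse the molecule, e.g. with RDKit), and <code>is_valid</code> stands in for an actual SMILES validity check:</p>

```python
import re

def contains_sulphur(smiles: str) -> bool:
    """Approximate sulphur check on raw SMILES: aliphatic 'S' (excluding
    bracket elements such as Si/Sn/Sr) or lone aromatic 's' (excluding
    Cs/As/Os and 'se')."""
    return re.search(r"S(?![inr])|(?<![A-Z])s(?!e)", smiles) is not None

def sulphur_score(smiles: str, is_valid) -> float:
    """Experiment 1 scoring: +1 for valid sulphur-free molecules,
    0 for invalid SMILES, -1 for sulphur-containing molecules."""
    if not is_valid(smiles):
        return 0.0
    return -1.0 if contains_sulphur(smiles) else 1.0
```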
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min(J_{i,j},\, k)}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
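<p>The capped similarity reward is a one-liner; in the paper $J_{i,j}$ is the Jaccard similarity of FCFP4 fingerprints (computable with RDKit), supplied here as a plain float:</p>

```python
def similarity_score(j: float, k: float) -> float:
    """Map similarity j in [0, 1] to S(A) in [-1, 1]; the reward saturates
    once j reaches the cap k (k = 1 rewards exact recovery, k < 1 rewards
    analogues without forcing identity)."""
    return -1.0 + 2.0 * min(j, k) / k
```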
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>Test actives recovered (×10<sup>-3</sup>)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model nor the activity prediction model training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
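<p>A sketch of the mixed reward and the rollout-based action-value estimate; <code>discriminator</code>, <code>objective</code>, and <code>rollout_policy</code> are stand-in callables for the trained networks:</p>

```python
def mixed_reward(seq, discriminator, objective, lam=0.5):
    """ORGAN reward: lam * D(Y) + (1 - lam) * O(Y)."""
    return lam * discriminator(seq) + (1.0 - lam) * objective(seq)

def q_estimate(prefix, T, rollout_policy, discriminator, objective,
               lam=0.5, n=8):
    """At full length, score the finished sequence directly; otherwise
    average the mixed reward over n Monte Carlo completions sampled by
    rolling out the current policy from the prefix."""
    if len(prefix) == T:
        return mixed_reward(prefix, discriminator, objective, lam)
    total = sum(mixed_reward(rollout_policy(prefix, T), discriminator,
                             objective, lam) for _ in range(n))
    return total / n
```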
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
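<p>The diversity penalty amounts to a per-batch reweighting (an illustrative sketch, not the authors' code):</p>

```python
from collections import Counter

def diversity_penalty(sequences, rewards):
    """Divide each sequence's reward by its copy count in the batch,
    giving diminishing returns for duplicated outputs."""
    counts = Counter(sequences)
    return [r / counts[s] for s, r in zip(sequences, rewards)]

print(diversity_penalty(["CCO", "CCO", "CCN"], [1.0, 1.0, 0.8]))
# [0.5, 0.5, 0.8]
```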
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the QM9 dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
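<p>The diversity metric can be sketched as a mean pairwise Jaccard distance over fingerprint bit sets (plain Python sets stand in here for RDKit fingerprints):</p>

```python
def jaccard_distance(fp_a, fp_b):
    """1 - |A ∩ B| / |A ∪ B| over fingerprint bit sets."""
    a, b = set(fp_a), set(fp_b)
    if not (a or b):
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def avg_diversity(generated, reference):
    """Mean Jaccard distance between generated and reference fingerprints."""
    pairs = [(g, r) for g in generated for r in reference]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)
```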
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.92</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often achieves higher raw objective scores but at the cost of generating trivial solutions (e.g., simple atom chains for solubility). The Wasserstein variant provides better diversity properties. Multi-objective training via alternating objectives across epochs achieves gains comparable to individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>Using 1,000 melodies from the EsAC folk dataset, each encoded as 36-token sequences where tokens represent sixteenth-note events across three octaves (C3-B5). Two metrics are optimized: tonality (proportion of perfect fifths) and ratio of steps (conjunct melodic motion). Diversity is measured as average pairwise edit distance.</p>
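<p>As an illustrative sketch only (the paper's exact definitions may differ), a step-ratio metric over a pitch sequence could look like:</p>

```python
def ratio_of_steps(pitches):
    """Fraction of consecutive note pairs in conjunct (stepwise) motion,
    taken here as 1-2 semitones apart; a simplified stand-in for the
    paper's metric."""
    if len(pitches) < 2:
        return 0.0
    steps = sum(1 for a, b in zip(pitches, pitches[1:]) if 1 <= abs(b - a) <= 2)
    return steps / (len(pitches) - 1)
```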
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
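<p>The duplicate penalty in the list above admits a one-line reading: each exact copy of a sequence shares the reward rather than multiplying it. A sketch under the assumption that copies are counted per batch:</p>

```python
from collections import Counter

def penalized_rewards(samples, rewards):
    """Divide each sample's reward by its copy count in the batch, so
    generating the same sequence repeatedly cannot farm the same reward."""
    counts = Counter(samples)
    return [r / counts[s] for s, r in zip(samples, rewards)]
```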
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
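<p>The sampling loop implied by these equations can be sketched with the two heads abstracted away (<code>sample_atom</code> and <code>sample_bond</code> are stand-ins for the NodeRNN/NodeMLP and EdgeRNN/EdgeMLP stacks; this is an illustration, not the authors' code):</p>

```python
def generate_graph(sample_atom, sample_bond, max_atoms=50, window=12):
    """Autoregressive graph sampling: draw the next atom type, then bond
    types to up to `window` preceding atoms (the BFS window M = 12).
    sample_atom returning None plays the role of an end-of-graph token."""
    atoms, bonds = [], {}
    for i in range(max_atoms):
        atom = sample_atom(atoms, bonds)
        if atom is None:
            break
        atoms.append(atom)
        for j in range(max(0, i - window), i):
            k = sample_bond(i, j, atoms, bonds)
            if k:  # 0 encodes "no bond"
                bonds[(i, j)] = k
    return atoms, bonds
```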
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
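<p>The acceptance test is a pair of inequalities over running bond-order sums. A sketch using a standard valency table (the exact table the authors used is not given here, so the values below are common defaults):</p>

```python
# Maximum valencies per atom type -- assumed common defaults, not values
# taken from the paper.
VALENCY = {"C": 4, "N": 3, "O": 2, "F": 1, "P": 5, "S": 6,
           "Cl": 1, "Br": 1, "I": 1}

def bond_allowed(atom_i, atom_j, order_sum_i, order_sum_j, k):
    """Accept a proposed bond of order k between atoms i and j only if
    neither atom's total bond order would exceed its allowed valency."""
    return (order_sum_i + k <= VALENCY[atom_i]
            and order_sum_j + k <= VALENCY[atom_j])
```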
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
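<p>The loss is the terminal reward broadcast backward with discounting. A direct transcription, mirroring the sum over steps $i = 1 \dots N$ above:</p>

```python
def reinforce_loss(final_reward, step_log_probs, gamma=0.97):
    """Negative discounted-reward-weighted log-likelihood of a sampled
    trajectory: L = -sum_i r(s_N) * gamma^i * log p(s_i | s_{i-1})."""
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_log_probs, start=1))
```

<p>Note that the single reward $r(s_N)$ is only available at the end of generation, so every step's log-probability is weighted by it, with earlier steps discounted less than later ones.</p>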
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
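<p>Putting the three equations together (with $\sigma$ left as a free parameter; the value used in the paper is not stated in this note, so the default below is a placeholder):</p>

```python
def memory_reinvent_step(log_p_prior, log_p_agent, score, memory_gate,
                         sigma=60.0):
    """Compute the memory-modified augmented likelihood, the reward R(c),
    and the loss -R(c). memory_gate is M(c) in {0, 1}; sigma = 60 is a
    placeholder default, not a value from the paper."""
    log_p_aug = log_p_prior + sigma * score * memory_gate
    reward = (log_p_aug - log_p_agent) ** 2
    return reward, -reward
```

<p>With the gate closed ($M(c) = 0$), the score term vanishes from the augmented likelihood, so the agent receives no pull toward that region of chemical space.</p>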
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
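<p>The four-step procedure above maps onto a small class. A sketch with the similarity function left pluggable (a stand-in for, e.g., ECFP4 Tanimoto or exact scaffold match; the 0.6 cutoff and bucket size 25 are the defaults stated earlier, and the first-matching-index rule is my simplification):</p>

```python
class MemoryUnit:
    """Hash table of index -> bucket pairs; M(c) gates the RL reward."""

    def __init__(self, similarity, cutoff=0.6, bucket_size=25):
        self.similarity = similarity
        self.cutoff = cutoff
        self.bucket_size = bucket_size
        self.buckets = {}  # index structure -> list of similar molecules

    def gate(self, mol):
        """Return M(mol): 1 if mol lands in a non-full (or new) bucket,
        0 if its bucket is already full."""
        for index, bucket in self.buckets.items():
            if self.similarity(mol, index) >= self.cutoff:
                if len(bucket) >= self.bucket_size:
                    return 0  # bucket full: discourage more of the same
                bucket.append(mol)
                return 1
        self.buckets[mol] = [mol]  # no similar index: seed a new bucket
        return 1
```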
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in \{0, 1\}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + e^{-\left(\frac{\frac{\text{compounds in bucket}}{\text{bucket size}} \times 2 - 1}{0.15}\right)}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
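<p>The three output modes, written as functions of the bucket fill ratio (the binary mode gates at a full bucket; the sigmoid uses the 0.15 temperature from the formula above):</p>

```python
import math

def memory_output(fill, mode="binary"):
    """M(c) as a function of fill = compounds_in_bucket / bucket_size."""
    if mode == "binary":
        return 0.0 if fill >= 1.0 else 1.0
    if mode == "linear":
        return 1.0 - fill
    if mode == "sigmoid":
        return 1.0 - 1.0 / (1.0 + math.exp(-(fill * 2 - 1) / 0.15))
    raise ValueError(f"unknown mode: {mode}")
```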
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
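<p>As written, the score is exactly 1 when AlogP equals 2 or 3 and stays above $1 - \tanh(0.5) \approx 0.54$ anywhere inside the window, decaying smoothly outside it. A direct transcription:</p>

```python
import math

def logp_score(alogp):
    """S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)): high near the
    2-3 window, falling off with distance from its nearer edge."""
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))
```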
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models performed well on the held-out test clusters:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs. ECFP6 analogs (Tanimoto &gt;= 0.4 to the training set) increased from 145 to as many as 549, and shared MMP cores increased from 5 to as many as 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. The compound similarity memory showed increasing analogs with higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
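<p>The MinMax kernel on count fingerprints has a simple closed form, generalizing Tanimoto similarity from bit vectors to count vectors. The function below is a minimal sketch of that kernel, not the authors' implementation:</p>

```python
def minmax_kernel(x, y):
    """MinMax similarity between two count fingerprints (sketch).

    k(x, y) = sum_i min(x_i, y_i) / sum_i max(x_i, y_i),
    which reduces to Tanimoto similarity when x and y are 0/1 vectors.
    """
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0  # convention: two all-zero vectors match
```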
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LSTM Neural Network for Drug-Like Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/</guid><description>An LSTM neural network trained on 509K ChEMBL SMILES generates one million novel drug-like molecules with realistic substructures and bioactivity profiles.</description><content:encoded><![CDATA[<h2 id="an-early-method-for-lstm-based-molecular-generation">An Early Method for LSTM-Based Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that applies character-level LSTM networks to the task of de novo drug-like molecule generation. The primary contribution is demonstrating that an LSTM trained on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings from a large bioactive compound database (ChEMBL) can produce novel, diverse molecules whose chemical properties closely match those of known drug-like compounds. The paper also validates the generated molecules through virtual screening with profile QSAR models, showing comparable predicted bioactivity to the training set.</p>
<h2 id="the-challenge-of-exploring-drug-like-chemical-space">The Challenge of Exploring Drug-Like Chemical Space</h2>
<p>The theoretical space of drug-like molecules is astronomically large. Brute-force enumeration approaches such as <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a> (which catalogued 166 billion molecules) are feasible only for small molecules, and full enumeration of molecules with 25-30 heavy atoms (the typical size of drug molecules) remains computationally intractable. Traditional cheminformatics approaches to sampling this space rely on fragment combination, evolutionary algorithms, or particle swarm optimization.</p>
<p>The authors position LSTM networks as a viable alternative. LSTMs had already demonstrated the ability to learn sequential structure in domains like text and music generation, making them natural candidates for learning SMILES grammar and generating novel valid molecular strings. At the time of writing (late 2017), several groups were exploring this direction, including Bjerrum and Threlfall (ZINC-based generation), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> (VAE-based latent space design), <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">Olivecrona et al.</a> (RL-guided generation), and Segler et al. (focused library design). This paper contributes a large-scale empirical study with detailed analysis of the generated molecules&rsquo; chemical quality.</p>
<h2 id="character-level-lstm-with-temperature-based-sampling">Character-Level LSTM with Temperature-Based Sampling</h2>
<p>The core approach is straightforward: train an LSTM to predict the next character in a SMILES string, then sample from the trained model to generate new molecules character by character.</p>
<p>The network architecture consists of:</p>
<ul>
<li>Two stacked LSTM layers (which learn the SMILES grammar)</li>
<li>A dropout layer for regularization</li>
<li>A dense output layer with 23 neurons (one per character in the reduced SMILES alphabet) and softmax activation</li>
</ul>
<p>The RMSProp optimizer was used for training. The learning rate was gradually decreased from 0.01 to 0.0002 during training. At generation time, a temperature parameter controls the randomness of character sampling to produce more diverse structures rather than reproducing training molecules too closely.</p>
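<p>Temperature sampling rescales the model's logits before the softmax; higher temperatures flatten the distribution and increase diversity, lower temperatures sharpen it toward the training distribution. The helper below is a minimal stdlib sketch of the idea (the paper does not give code):</p>

```python
import math
import random


def temperature_softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given sampling temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_char(alphabet, logits, temperature=1.0):
    """Draw the next SMILES character from the tempered distribution."""
    probs = temperature_softmax(logits, temperature)
    return random.choices(alphabet, weights=probs, k=1)[0]
```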
<p>A key preprocessing step reduces the SMILES alphabet to 23 characters. Multi-character atom tokens are replaced with single characters (<code>Cl</code> → <code>L</code>, <code>Br</code> → <code>R</code>, <code>[nH]</code> → <code>A</code>). Only the organic atom subset (<code>H</code>, <code>C</code>, <code>N</code>, <code>O</code>, <code>S</code>, <code>P</code>, <code>F</code>, <code>Cl</code>, <code>Br</code>, <code>I</code>) is retained. Charged molecules, stereo information, and molecules with more than 5 ring closures are excluded. The training corpus totals 23,664,668 characters, with 40-character windows used as input sequences during training.</p>
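<p>The token replacements can be implemented as ordered string substitutions. The snippet below is a sketch covering only the three replacements named above; the paper's full preprocessing additionally filters atoms, charges, stereochemistry, and ring counts:</p>

```python
# Replace multi-character tokens first so later single-character
# substitutions cannot collide with their constituent characters.
REPLACEMENTS = [("[nH]", "A"), ("Cl", "L"), ("Br", "R")]


def reduce_smiles(smiles):
    """Map multi-character SMILES tokens to single characters (sketch)."""
    for token, char in REPLACEMENTS:
        smiles = smiles.replace(token, char)
    return smiles


def expand_smiles(reduced):
    """Invert the reduction to recover standard SMILES."""
    for token, char in reversed(REPLACEMENTS):
        reduced = reduced.replace(char, token)
    return reduced
```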
<h2 id="training-on-chembl-and-generating-one-million-molecules">Training on ChEMBL and Generating One Million Molecules</h2>
<h3 id="training-data">Training Data</h3>
<p>The training set consists of 509,000 bioactive molecules from ChEMBL with reported activity below 10 micromolar on any target.</p>
<h3 id="generation-and-filtering">Generation and Filtering</h3>
<p>The LSTM generates SMILES strings character by character, and each generated string passes through a two-stage validation:</p>
<ol>
<li><strong>Bracket and ring closure check</strong> (fast text-based): 54% of generated SMILES are discarded for unpaired brackets or ring closures</li>
<li><strong>Full chemical parsing with RDKit</strong>: a further 14% fail due to unrealistic aromatic systems or incorrect valences</li>
</ol>
<p>The final yield is 32% of generated SMILES corresponding to valid molecules.</p>
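<p>The fast text-based pre-check can be approximated with a bracket stack plus ring-closure parity counts. The function below is a simplified sketch of that kind of filter, not the authors' exact code; it ignores two-digit <code>%nn</code> ring closures, which the reduced alphabet here does not use anyway:</p>

```python
from collections import Counter


def quick_smiles_check(smiles):
    """Cheap pre-filter: balanced brackets and paired ring closures.

    Passing this check does NOT guarantee chemical validity; surviving
    strings still need a full parser (RDKit in the paper).
    """
    pairs = {")": "(", "]": "["}
    stack = []
    ring_digits = Counter()
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
        elif ch.isdigit():
            ring_digits[ch] += 1
    # every ring-closure label must occur an even number of times
    return not stack and all(n % 2 == 0 for n in ring_digits.values())
```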
<p>One million valid molecules were generated in under 2 hours on 300 CPUs.</p>
<h3 id="novelty-and-diversity">Novelty and Diversity</h3>
<p>Out of one million generated molecules, only 2,774 (0.28%) were identical to molecules in the training ChEMBL set. The generated set contained 627,000 unique scaffolds compared to 172,000 in ChEMBL, with an overlap of only 18,000 scaffolds. This demonstrates substantial novelty and diversity.</p>
<h3 id="physicochemical-properties">Physicochemical Properties</h3>
<p>Calculated molecular descriptors (molecular weight, logP, and topological polar surface area) for the generated molecules closely matched the distributions of the ChEMBL training set. The synthetic accessibility score distributions were also practically identical, indicating comparable molecular complexity.</p>
<h3 id="substructure-feature-comparison">Substructure Feature Comparison</h3>
<p>The paper compares substructure features across three molecule sets: ChEMBL training data, LSTM-generated molecules, and a naive SMILES baseline generator. The naive generator uses only character frequency statistics and basic SMILES syntax rules, producing primarily macrocycles with very few fused aromatic systems.</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>ChEMBL (%)</th>
          <th>LSTM Generated (%)</th>
          <th>Naive Baseline (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No rings</td>
          <td>0.4</td>
          <td>0.4</td>
          <td>0.1</td>
      </tr>
      <tr>
          <td>1 ring</td>
          <td>2.8</td>
          <td>4.3</td>
          <td>13.2</td>
      </tr>
      <tr>
          <td>2 rings</td>
          <td>14.8</td>
          <td>23.1</td>
          <td>17.7</td>
      </tr>
      <tr>
          <td>3 rings</td>
          <td>32.2</td>
          <td>43.5</td>
          <td>27.3</td>
      </tr>
      <tr>
          <td>4 rings</td>
          <td>32.7</td>
          <td>23.9</td>
          <td>25.2</td>
      </tr>
      <tr>
          <td>&gt;4 rings</td>
          <td>17.2</td>
          <td>4.8</td>
          <td>16.5</td>
      </tr>
      <tr>
          <td>Fused aromatic rings</td>
          <td>38.8</td>
          <td>30.9</td>
          <td>0.2</td>
      </tr>
      <tr>
          <td>Large rings (&gt;8)</td>
          <td>0.4</td>
          <td>1.8</td>
          <td>75.9</td>
      </tr>
      <tr>
          <td>Spiro rings</td>
          <td>1.9</td>
          <td>0.6</td>
          <td>0.6</td>
      </tr>
      <tr>
          <td>Contains N</td>
          <td>96.5</td>
          <td>96.1</td>
          <td>92.3</td>
      </tr>
      <tr>
          <td>Contains O</td>
          <td>93.0</td>
          <td>92.0</td>
          <td>85.5</td>
      </tr>
      <tr>
          <td>Contains S</td>
          <td>35.6</td>
          <td>27.9</td>
          <td>39.6</td>
      </tr>
      <tr>
          <td>Contains halogen</td>
          <td>40.7</td>
          <td>38.8</td>
          <td>49.4</td>
      </tr>
  </tbody>
</table>
<p>The LSTM-generated molecules closely mirror the ChEMBL distributions, while the naive generator fails to capture drug-like structural patterns. The LSTM tends to slightly over-represent 2-3 ring systems and under-represent 4+ ring systems relative to ChEMBL. Functional group distributions also closely matched between ChEMBL and the LSTM output.</p>
<h3 id="virtual-screening-validation">Virtual Screening Validation</h3>
<p>The generated molecules were evaluated using profile QSAR models for 159 ChEMBL kinase assays. The six best models (those with test set R-squared above 0.75) were used to predict pIC50 values for both actual ChEMBL compounds and generated compounds. The cumulative frequency distributions of predicted activity were nearly identical between the two sets.</p>
<p>Kolmogorov-Smirnov (KS) tests on random samples of 1,000 compounds confirmed this quantitatively:</p>
<table>
  <thead>
      <tr>
          <th>Assay</th>
          <th>KS D</th>
          <th>Distributions Differ?</th>
          <th>Mean (Real)</th>
          <th>Mean (Gen)</th>
          <th>Stdev (Real)</th>
          <th>Stdev (Gen)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>688395</td>
          <td>6.01%</td>
          <td>No</td>
          <td>4.66</td>
          <td>4.69</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>668624</td>
          <td>3.60%</td>
          <td>No</td>
          <td>4.86</td>
          <td>4.86</td>
          <td>0.25</td>
          <td>0.24</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>9.90%</td>
          <td>Yes</td>
          <td>5.33</td>
          <td>5.26</td>
          <td>0.34</td>
          <td>0.30</td>
      </tr>
      <tr>
          <td>809226</td>
          <td>4.30%</td>
          <td>No</td>
          <td>5.18</td>
          <td>5.13</td>
          <td>0.47</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>688781</td>
          <td>2.20%</td>
          <td>No</td>
          <td>4.83</td>
          <td>4.82</td>
          <td>0.26</td>
          <td>0.25</td>
      </tr>
      <tr>
          <td>809170</td>
          <td>8.70%</td>
          <td>Yes</td>
          <td>5.12</td>
          <td>5.07</td>
          <td>0.51</td>
          <td>0.46</td>
      </tr>
  </tbody>
</table>
<p>For 4 of 6 models, the null hypothesis that the distributions are the same could not be rejected at the 95% confidence level (critical D = 6.04%). Even for the two assays where the KS test rejected the null hypothesis, the maximum vertical distance between distributions was below 10%.</p>
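<p>The KS statistic used here is the maximum vertical gap between two empirical CDFs, and the large-sample critical value at 95% confidence for samples of sizes $n$ and $m$ is approximately $1.36\sqrt{(n+m)/nm}$. The snippet below sketches both in plain Python; in practice <code>scipy.stats.ks_2samp</code> would be the standard tool:</p>

```python
import math


def ks_statistic(sample_a, sample_b):
    """Two-sample KS D: max vertical distance between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum gap occurs at an observed value
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_d(n, m, c_alpha=1.36):
    """Approximate two-sample critical D at 95% confidence."""
    return c_alpha * math.sqrt((n + m) / (n * m))
```

For two samples of 1,000 compounds each this gives a critical D of about 6.1%, consistent with the ~6% threshold quoted in the paper.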
<h2 id="generated-molecules-are-novel-drug-like-and-potentially-bioactive">Generated Molecules Are Novel, Drug-Like, and Potentially Bioactive</h2>
<p>The key findings of this study are:</p>
<ol>
<li><strong>High novelty</strong>: Only 0.28% of generated molecules match training compounds; 627K novel scaffolds were produced versus 172K in ChEMBL</li>
<li><strong>Drug-like quality</strong>: Physicochemical properties, substructure features, functional group distributions, and synthetic accessibility scores all closely match the ChEMBL training distribution, without these being explicit constraints</li>
<li><strong>Predicted bioactivity</strong>: Virtual screening with profile QSAR models shows the generated molecules have comparable predicted activity profiles to known bioactive compounds</li>
<li><strong>Scalability</strong>: One million valid molecules in under 2 hours on 300 CPUs, with the potential to scale to billions with GPU acceleration</li>
<li><strong>LSTM superiority over naive baselines</strong>: A simple statistical SMILES generator using only character frequencies produces chemically unrealistic molecules (mostly macrocycles), demonstrating that the LSTM genuinely learns drug-like chemical patterns</li>
</ol>
<p>The main limitations are the 32% validity rate (68% of generated SMILES are invalid), the exclusion of stereochemistry and charged molecules from the training set, and the lack of any goal-directed generation capability (the model produces unconditional samples from the training distribution). The code was described as &ldquo;available on request&rdquo; from the corresponding author rather than publicly released.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL bioactive molecules</td>
          <td>509,000 molecules</td>
          <td>Activity &lt; 10 uM on any target; organic atoms only; no charges or stereo</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Double-stacked LSTM layers with dropout</li>
<li>Softmax output over 23-character reduced SMILES alphabet</li>
<li>RMSProp optimizer with learning rate annealed from 0.01 to 0.0002</li>
<li>Temperature-based sampling at generation time</li>
<li>40-character input windows during training</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture consists of two LSTM layers, a dropout layer, and a 23-neuron dense output layer. Exact hidden unit counts and dropout rates are not specified in the paper.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES rate</td>
          <td>32%</td>
          <td>After bracket check and RDKit parsing</td>
      </tr>
      <tr>
          <td>Novelty (vs. training)</td>
          <td>99.72%</td>
          <td>Only 2,774 of 1M match ChEMBL</td>
      </tr>
      <tr>
          <td>Unique scaffolds</td>
          <td>627,000</td>
          <td>vs. 172,000 in ChEMBL</td>
      </tr>
      <tr>
          <td>KS test (4/6 assays)</td>
          <td>Not significantly different</td>
          <td>At 95% confidence</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Generation: 300 CPUs for under 2 hours (1 million valid molecules)</li>
<li>Training hardware not specified</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ertl, P., Lewis, R., Martin, E., &amp; Polyakov, V. (2017). In silico generation of novel, drug-like chemical matter using the LSTM neural network. <em>arXiv preprint</em>, arXiv:1712.07449.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ertl2017silico,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{In silico generation of novel, drug-like chemical matter using the LSTM neural network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ertl, Peter and Lewis, Richard and Martin, Eric and Polyakov, Valery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1712.07449}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
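<p>One way to realize the stated schedule (constant $10^{-3}$ for the first 50 epochs, then exponential decay to $10^{-6}$ by epoch 100) is log-linear interpolation between the two endpoints. The function below is an assumed reconstruction, since the paper specifies only the endpoints and the decay shape:</p>

```python
import math

LR_START, LR_END = 1e-3, 1e-6
WARM_EPOCHS, TOTAL_EPOCHS = 50, 100


def learning_rate(epoch):
    """Piecewise schedule: flat warm-up, then exponential decay (sketch)."""
    if epoch <= WARM_EPOCHS:
        return LR_START
    # interpolate log-linearly between the two stated endpoints
    frac = (epoch - WARM_EPOCHS) / (TOTAL_EPOCHS - WARM_EPOCHS)
    return LR_START * math.exp(frac * math.log(LR_END / LR_START))
```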
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
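<p>To see the penalty term concretely, take a toy linear critic $D(x) = w \cdot x$, whose gradient with respect to the input is $w$ everywhere. The sketch below computes an interpolated point $\hat{x} = \epsilon x + (1-\epsilon)\tilde{x}$ and the penalty $\lambda(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2$ without any deep-learning framework; a real WGAN-GP implementation would obtain the gradient via autograd:</p>

```python
import math
import random


def gradient_penalty_linear(w, x_real, x_fake, lam=10.0):
    """WGAN-GP penalty for a toy linear critic D(x) = w . x (sketch).

    For a linear critic the input gradient equals w at every point, so
    the interpolation only illustrates where the penalty is evaluated.
    """
    eps = random.random()
    x_hat = [eps * r + (1 - eps) * f for r, f in zip(x_real, x_fake)]
    grad_norm = math.sqrt(sum(wi * wi for wi in w))  # ||grad D(x_hat)|| = ||w||
    penalty = lam * (grad_norm - 1.0) ** 2
    return x_hat, penalty
```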
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
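<p>The accounting convention matters: outputs are compared after canonicalization, so a different valid SMILES of the same molecule is not an error. The sketch below uses a hypothetical lookup table in place of a real canonicalizer such as RDKit's <code>MolToSmiles</code>.</p>

```python
# Hypothetical canonical-form table (a stand-in for RDKit canonicalization).
CANONICAL = {
    "CCO": "CCO", "C(C)O": "CCO",                        # ethanol, two spellings
    "c1ccccc1": "c1ccccc1", "C1=CC=CC=C1": "c1ccccc1",   # benzene, two spellings
    "CCN": "CCN",
}

def reconstruction_error_rate(pairs):
    """Count an error only when input and output canonicalize differently."""
    errors = sum(CANONICAL[inp] != CANONICAL[out] for inp, out in pairs)
    return errors / len(pairs)

pairs = [("CCO", "C(C)O"),            # same molecule, different SMILES: not an error
         ("c1ccccc1", "C1=CC=CC=C1"), # same molecule: not an error
         ("CCO", "CCN")]              # different molecule: error
print(reconstruction_error_rate(pairs))  # 1 error out of 3 → 0.3333...
```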
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Frechet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both compound and scaffold levels. A probabilistic analysis showed that, even with continued sampling, the RNN model would be very unlikely to cover the LatentGAN output space. This suggests the two architectures can work complementarily in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
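<p>The similarity thresholding above relies on Tanimoto similarity between fingerprint bit sets, $|A \cap B| / |A \cup B|$. A minimal sketch, with hypothetical on-bit indices standing in for real ECFP fingerprints:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit
    indices: |A intersect B| / |A union B|."""
    a, b = set(fp_a), set(fp_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Hypothetical bit sets for a generated scaffold vs. its nearest
# training-set scaffold.
gen_bits   = {3, 17, 42, 101, 256}
train_bits = {3, 17, 99, 300, 512, 777}
sim = tanimoto(gen_bits, train_bits)
print(sim)        # 2 shared bits / 9 total bits → ≈ 0.222
print(sim < 0.4)  # below the paper's novelty threshold → True
```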
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Grammar VAE: Generating Valid Molecules via CFGs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</guid><description>The Grammar VAE encodes and decodes molecular parse trees from context-free grammars, guaranteeing syntactically valid SMILES outputs during generation.</description><content:encoded><![CDATA[<h2 id="a-grammar-constrained-vae-for-discrete-data-generation">A Grammar-Constrained VAE for Discrete Data Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Grammar Variational Autoencoder (GVAE), a variant of the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder</a> that operates directly on parse trees from context-free grammars (CFGs) rather than on raw character sequences. The primary contribution is a decoding mechanism that uses a stack and grammar-derived masks to restrict the output at every timestep to only syntactically valid production rules. This guarantees that every decoded output is a valid string under the grammar, addressing a fundamental limitation of character-level VAEs when applied to structured discrete data such as <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> molecular strings and arithmetic expressions.</p>
<h2 id="why-character-level-vaes-fail-on-structured-discrete-data">Why Character-Level VAEs Fail on Structured Discrete Data</h2>
<p>Generative models for continuous data (images, audio) had achieved impressive results by 2017, but generating structured discrete data remained difficult. The key challenge is that string representations of molecules and mathematical expressions are brittle: small perturbations to a character sequence often produce invalid outputs. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> demonstrated a character-level VAE (CVAE) for SMILES strings that could encode molecules into a continuous latent space and decode them back, enabling latent-space optimization for molecular design. However, the CVAE frequently decoded latent points into strings that were not valid SMILES, particularly when exploring regions of latent space far from training data.</p>
<p>The fundamental issue is that character-level decoders must implicitly learn the syntactic rules of the target language from data alone. For SMILES, this includes matching parentheses, valid atom types, proper bonding, and ring closure notation. The GVAE addresses this by giving the decoder explicit knowledge of the grammar, so it can focus entirely on learning the semantic structure of the data.</p>
<h2 id="core-innovation-stack-based-grammar-masking-in-the-decoder">Core Innovation: Stack-Based Grammar Masking in the Decoder</h2>
<p>The GVAE encodes and decodes sequences of production rules from a context-free grammar rather than sequences of characters.</p>
<p><strong>Encoding.</strong> Given an input string (e.g., a SMILES molecule), the encoder first parses it into a parse tree using the CFG, then performs a left-to-right pre-order traversal of the tree to extract an ordered sequence of production rules. Each rule is represented as a one-hot vector of dimension $K$ (total number of production rules in the grammar). The resulting $T(\mathbf{X}) \times K$ matrix is processed by a convolutional neural network to produce the mean and variance of a Gaussian posterior $q_{\phi}(\mathbf{z} \mid \mathbf{X})$.</p>
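<p>The rule-sequence encoding step can be sketched as follows; the grammar size and rule indices are hypothetical, and in the paper the resulting matrix feeds a CNN encoder.</p>

```python
import numpy as np

K = 4                            # hypothetical grammar with K production rules
rule_sequence = [0, 2, 2, 3, 1]  # pre-order traversal of a parse tree

def one_hot_rules(seq, num_rules):
    """Turn an ordered rule sequence into the T(X) x K one-hot matrix."""
    mat = np.zeros((len(seq), num_rules))
    mat[np.arange(len(seq)), seq] = 1.0
    return mat

X = one_hot_rules(rule_sequence, K)
print(X.shape)        # → (5, 4): one row per timestep, one column per rule
print(X.sum(axis=1))  # → all ones: exactly one rule fires per timestep
```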
<p><strong>Decoding with grammar masks.</strong> The decoder maps a latent vector $\mathbf{z}$ through an RNN to produce a matrix of logits $\mathbf{F} \in \mathbb{R}^{T_{max} \times K}$. The key innovation is a last-in first-out (LIFO) stack that tracks the current parsing state. At each timestep $t$, the decoder:</p>
<ol>
<li>Pops the top non-terminal $\alpha$ from the stack</li>
<li>Applies a fixed binary mask $\mathbf{m}_{\alpha} \in \{0, 1\}^K$ that zeros out all production rules whose left-hand side is not $\alpha$</li>
<li>Samples a production rule from the masked softmax distribution:</li>
</ol>
<p>$$
p(\mathbf{x}_{t} = k \mid \alpha, \mathbf{z}) = \frac{m_{\alpha,k} \exp(f_{tk})}{\sum_{j=1}^{K} m_{\alpha,j} \exp(f_{tj})}
$$</p>
<ol start="4">
<li>Pushes the right-hand-side non-terminals of the selected rule onto the stack (right-to-left, so the leftmost is on top)</li>
</ol>
<p>This process continues until the stack is empty or $T_{max}$ timesteps are reached. Because the mask restricts selection to only those rules applicable to the current non-terminal, every generated sequence of production rules is guaranteed to be a valid derivation under the grammar.</p>
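<p>A minimal sketch of the stack-based masked decoding loop, using a hypothetical four-rule arithmetic grammar and greedy rule selection in place of the paper's sampling from the masked softmax:</p>

```python
import numpy as np

# Toy CFG (hypothetical): (left-hand side, right-hand-side nonterminals).
RULES = [
    ("S", ["S", "T"]),   # 0: S -> S + T
    ("S", ["T"]),        # 1: S -> T
    ("T", ["S"]),        # 2: T -> ( S )
    ("T", []),           # 3: T -> x
]
K = len(RULES)
MASKS = {lhs: np.array([1.0 if r[0] == lhs else 0.0 for r in RULES])
         for lhs in {"S", "T"}}

def decode(logits, t_max=20):
    """At each step, pop the top nonterminal, mask the logits to rules whose
    LHS matches it, pick a rule, and push its RHS nonterminals right-to-left
    so the leftmost ends up on top."""
    stack, derivation = ["S"], []
    for t in range(t_max):
        if not stack:
            break
        alpha = stack.pop()
        masked = np.where(MASKS[alpha] > 0, logits[t], -np.inf)
        k = int(np.argmax(masked))           # greedy stand-in for sampling
        derivation.append(k)
        stack.extend(reversed(RULES[k][1]))
    return derivation, len(stack) == 0       # non-empty stack => invalid

rng = np.random.default_rng(0)
seq, ok = decode(rng.normal(size=(20, K)))
print(seq, ok)  # rule sequence, and whether the stack emptied within t_max
```

<p>By construction, every rule in the emitted sequence is applicable to the nonterminal on top of the stack, so the derivation is always grammatical; only non-termination within $T_{max}$ can invalidate it.</p>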
<p><strong>Training.</strong> The model is trained by maximizing the ELBO:</p>
<p>$$
\mathcal{L}(\phi, \theta; \mathbf{X}) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X})} \left[ \log p_{\theta}(\mathbf{X}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} \mid \mathbf{X}) \right]
$$</p>
<p>where the likelihood factorizes as:</p>
<p>$$
p(\mathbf{X} \mid \mathbf{z}) = \prod_{t=1}^{T(\mathbf{X})} p(\mathbf{x}_{t} \mid \mathbf{z})
$$</p>
<p>During training, the masks at each timestep are determined by the ground-truth production rule sequence, so no stack simulation is needed. The stack-based decoding is only required at generation time.</p>
<p><strong>Syntactic vs. semantic validity.</strong> The grammar guarantees syntactic validity but not semantic validity. The GVAE can still produce chemically implausible molecules (e.g., an oxygen atom with three bonds) because such constraints are not context-free. SMILES ring-bond digit matching is also not context-free, so the grammar cannot enforce it. Additionally, sequences that have not emptied the stack by $T_{max}$ are marked invalid.</p>
<h2 id="experiments-on-symbolic-regression-and-molecular-optimization">Experiments on Symbolic Regression and Molecular Optimization</h2>
<p>The authors evaluate the GVAE on two domains: arithmetic expressions and molecules. Both use Bayesian optimization (BO) over the learned latent space.</p>
<p><strong>Setup.</strong> After training each VAE, the authors encode training data into latent vectors and train a sparse Gaussian process (SGP) with 500 inducing points to predict properties from latent representations. They then run batch BO with expected improvement, selecting 50 candidates per iteration.</p>
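<p>The expected improvement acquisition used in the BO loop can be computed in closed form under a Gaussian posterior. The sketch below ranks a few hypothetical latent candidates and keeps a small batch (the paper selects 50 per iteration from the sparse-GP posterior); candidate names and values are made up.</p>

```python
import math

def expected_improvement(mu, sigma, best, xi=0.0):
    """EI for maximization under N(mu, sigma^2); 'best' is the incumbent."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))           # normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)    # normal PDF
    return (mu - best - xi) * Phi + sigma * phi

# (name, posterior mean, posterior std) for hypothetical latent points.
posterior = [("z1", 1.2, 0.3), ("z2", 0.9, 1.0), ("z3", 1.4, 0.1)]
best_so_far = 1.3
batch = sorted(posterior,
               key=lambda c: -expected_improvement(c[1], c[2], best_so_far))[:2]
print([name for name, _, _ in batch])  # → ['z2', 'z3']
```

<p>Note that the high-variance candidate <code>z2</code> outranks the slightly-above-incumbent <code>z3</code>: EI trades off exploitation against exploration through the posterior uncertainty.</p>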
<h3 id="arithmetic-expressions">Arithmetic Expressions</h3>
<ul>
<li><strong>Data</strong>: 100,000 randomly generated univariate expressions from a simple grammar (3 binary operators, 2 unary operators, 3 constants), each with at most 15 production rules</li>
<li><strong>Target</strong>: Find an expression minimizing $\log(1 + \text{MSE})$ against the true function $1/3 + x + \sin(x \cdot x)$</li>
<li><strong>BO iterations</strong>: 5, averaged over 10 repetitions</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.99 +/- 0.01</td>
          <td>3.47 +/- 0.24</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.86 +/- 0.06</td>
          <td>4.75 +/- 0.25</td>
      </tr>
  </tbody>
</table>
<p>The GVAE&rsquo;s best expression ($x/1 + \sin(3) + \sin(x \cdot x)$, score 0.04) nearly exactly recovers the true function, while the CVAE&rsquo;s best ($x \cdot 1 + \sin(3) + \sin(3/1)$, score 0.39) misses the sinusoidal component.</p>
<h3 id="molecular-optimization">Molecular Optimization</h3>
<ul>
<li><strong>Data</strong>: 250,000 SMILES strings from the ZINC database</li>
<li><strong>Target</strong>: Maximize penalized logP (water-octanol partition coefficient penalized for ring size and synthetic accessibility)</li>
<li><strong>BO iterations</strong>: 10, averaged over 5 trials</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.31 +/- 0.07</td>
          <td>-9.57 +/- 1.77</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.17 +/- 0.05</td>
          <td>-54.66 +/- 2.66</td>
      </tr>
  </tbody>
</table>
<p>The GVAE produces roughly twice as many valid molecules as the CVAE and finds molecules with substantially better penalized logP scores (best: 2.94 vs. 1.98).</p>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>Interpolation experiments show that the GVAE produces valid outputs at every intermediate point when linearly interpolating between two encoded expressions, while the CVAE passes through invalid strings. Grid searches around encoded molecules in the GVAE latent space show smooth transitions where neighboring points differ by single atoms.</p>
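<p>The interpolation itself is plain linear mixing of latent codes, $\mathbf{z}(\alpha) = (1-\alpha)\mathbf{z}_a + \alpha \mathbf{z}_b$, with each intermediate point decoded; decoding is omitted in this sketch and the endpoints are arbitrary.</p>

```python
import numpy as np

def interpolate(z_a, z_b, steps=5):
    """Evenly spaced points on the line segment between two latent codes."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1.0 - a) * z_a + a * z_b for a in alphas]

z_a = np.zeros(56)   # 56-dim latent space, as used for molecules
z_b = np.ones(56)
path = interpolate(z_a, z_b, steps=5)
print([float(p[0]) for p in path])  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```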
<h3 id="predictive-performance">Predictive Performance</h3>
<p>Sparse GP models trained on GVAE latent features achieve better test RMSE and log-likelihood than those trained on CVAE features for both expressions and molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE (Expressions)</th>
          <th>CVAE (Expressions)</th>
          <th>GVAE (Molecules)</th>
          <th>CVAE (Molecules)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test LL</td>
          <td>-1.320 +/- 0.001</td>
          <td>-1.397 +/- 0.003</td>
          <td>-1.739 +/- 0.004</td>
          <td>-1.812 +/- 0.004</td>
      </tr>
      <tr>
          <td>Test RMSE</td>
          <td>0.884 +/- 0.002</td>
          <td>0.975 +/- 0.004</td>
          <td>1.404 +/- 0.006</td>
          <td>1.504 +/- 0.006</td>
      </tr>
  </tbody>
</table>
<h3 id="reconstruction-and-prior-sampling">Reconstruction and Prior Sampling</h3>
<p>On held-out molecules, the GVAE achieves 53.7% reconstruction accuracy vs. 44.6% for the CVAE. When sampling from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, 7.2% of GVAE samples are valid molecules vs. 0.7% for the CVAE.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<p><strong>Key findings.</strong> Incorporating grammar structure into the VAE decoder consistently improves validity rates, latent space smoothness, downstream predictive performance, and Bayesian optimization outcomes across both domains. The approach is general: any domain with a context-free grammar can benefit.</p>
<p><strong>Limitations acknowledged by the authors.</strong></p>
<ul>
<li>The GVAE guarantees syntactic but not semantic validity. For molecules, invalid ring-bond patterns and chemically implausible structures can still be generated.</li>
<li>The molecular validity rate during BO (31%) is substantially higher than the CVAE&rsquo;s (17%) but still means most decoded molecules are invalid, largely due to non-context-free constraints in SMILES.</li>
<li>The approach requires a context-free grammar for the target domain, which limits applicability to well-defined formal languages.</li>
<li>Sequences that do not complete parsing within $T_{max}$ timesteps are discarded as invalid.</li>
</ul>
<p><strong>Impact.</strong> The GVAE was an influential early contribution to constrained molecular generation. It directly inspired the Syntax-Directed VAE (SD-VAE) by Dai et al. (2018), which uses attribute grammars for tighter semantic constraints, and contributed to the broader movement toward structured molecular generation methods including graph-based approaches. The paper demonstrated that encoding domain knowledge into the decoder architecture is more effective than relying on the model to learn structural constraints from data alone.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (expressions)</td>
          <td>Generated arithmetic expressions</td>
          <td>100,000</td>
          <td>Up to 15 production rules each</td>
      </tr>
      <tr>
          <td>Training (molecules)</td>
          <td>ZINC database subset</td>
          <td>250,000 SMILES</td>
          <td>Same subset as <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 1D convolutional neural network over one-hot rule sequences</li>
<li>Decoder: RNN with stack-based grammar masking</li>
<li>Latent space: 56 dimensions (molecules), isotropic Gaussian prior</li>
<li>Property predictor: Sparse Gaussian process with 500 inducing points</li>
<li>Optimization: Batch Bayesian optimization with expected improvement, 50 candidates per iteration, Kriging Believer for batch selection</li>
</ul>
<h3 id="models">Models</h3>
<p>Architecture details follow <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> with modifications for grammar-based encoding/decoding. Specific layer sizes and hyperparameters are described in the supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE</th>
          <th>CVAE</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid (expressions)</td>
          <td>0.99</td>
          <td>0.86</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Fraction valid (molecules)</td>
          <td>0.31</td>
          <td>0.17</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Best penalized logP</td>
          <td>2.94</td>
          <td>1.98</td>
          <td>Best molecule found</td>
      </tr>
      <tr>
          <td>Reconstruction accuracy</td>
          <td>53.7%</td>
          <td>44.6%</td>
          <td>On held-out molecules</td>
      </tr>
      <tr>
          <td>Prior validity</td>
          <td>7.2%</td>
          <td>0.7%</td>
          <td>Sampling from N(0,I)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mkusner/grammarVAE">grammarVAE</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kusner, M. J., Paige, B., &amp; Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. <em>Proceedings of the 34th International Conference on Machine Learning (ICML)</em>, 1945-1954.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kusner2017grammar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Grammar Variational Autoencoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kusner, Matt J. and Paige, Brooks and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 34th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1945--1954}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
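<p>The token-level mixing can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the exact rule for combining the $G_A$ and $G_C$ output distributions is assumed here to be an element-wise average, and all function and parameter names are hypothetical.</p>

```python
import random

def sample_token(probs_agent, probs_crossover, probs_mutation, epsilon, rng=random):
    """One token-generation step of the evolutionary exploration strategy.

    With probability epsilon the token distribution comes from the mutation
    net; otherwise from an (assumed) element-wise average of the agent and
    crossover nets' output distributions."""
    if rng.random() < epsilon:
        dist = probs_mutation
    else:
        # Assumed combination rule: mean of G_A and G_C token probabilities.
        dist = [(a + c) / 2 for a, c in zip(probs_agent, probs_crossover)]
    # Sample a token index from the chosen categorical distribution.
    r, acc = rng.random(), 0.0
    for idx, p in enumerate(dist):
        acc += p
        if r < acc:
            return idx
    return len(dist) - 1
```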
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
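<p>The per-objective score $R_i$ can be written directly from the case definition above. A minimal sketch, assuming predicted $pX$ values outside the stated 3.0-10.0 range are clipped; the function name and keyword arguments are illustrative, not from the paper.</p>

```python
def objective_score(pX, high_affinity_required, valid=True,
                    px_min=3.0, px_max=10.0):
    """Per-objective reward R_i from a predicted bioactivity pX in [3.0, 10.0].

    minmax-normalizes pX to [0, 1]; inverts the score when low affinity is
    desired; an invalid SMILES scores 0, matching the paper's case definition."""
    if not valid:
        return 0.0
    s = (pX - px_min) / (px_max - px_min)  # minmax normalization
    s = min(max(s, 0.0), 1.0)              # assumed clipping outside the range
    return s if high_affinity_required else 1.0 - s

# Multi-target case: high affinity for A1AR and A2AAR, low affinity for hERG.
scores = [objective_score(7.5, True),    # A1AR
          objective_score(8.0, True),    # A2AAR
          objective_score(4.0, False)]   # hERG
```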
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Non-dominated_sorting_genetic_algorithm_II">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
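<p>The non-dominated sorting step can be illustrated with a simple (non-GPU) sketch. This naive version omits the within-front Tanimoto-distance ranking and is quadratic per front; the paper's implementation is a GPU-accelerated variant.</p>

```python
def dominates(a, b):
    """a Pareto-dominates b: at least as good in every objective, strictly
    better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(scores):
    """Return Pareto fronts as lists of molecule indices, best front first."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        # A molecule is on the current front if nothing remaining dominates it.
        front = [i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts
```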
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
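<p>A minimal sketch of the dynamic weighting, assuming each $r_i$ is the ratio of undesired to desired molecules for objective $i$ at the current epoch; function names are illustrative.</p>

```python
def dynamic_weights(undesired_ratios):
    """Weights w_i proportional to the ratio r_i of undesired to desired
    molecules per objective, so under-performing objectives gain importance."""
    total = sum(undesired_ratios)
    return [r / total for r in undesired_ratios]

def weighted_sum_reward(scores, weights):
    """R* = sum_i w_i * R_i for one molecule's per-objective scores."""
    return sum(w * s for w, s in zip(weights, scores))
```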
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
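<p>The metric can be computed from a distance matrix as sketched below. This is a pure-Python illustration (a real implementation would use a linear-algebra library); it solves $F\mathbf{x} = \mathbf{e}$ by Gaussian elimination rather than forming the inverse explicitly, and fails if $F$ is singular (e.g., duplicate molecules).</p>

```python
import math

def solve(F, e):
    """Solve F x = e by Gaussian elimination with partial pivoting."""
    n = len(F)
    A = [row[:] + [e[i]] for i, row in enumerate(F)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

def solow_polasky(distances, theta=1.0):
    """I(A) = (1/|A|) e^T F^{-1} e with F_ij = exp(-theta * d_ij).

    `distances` is a symmetric matrix of Tanimoto distances (d_ii = 0)."""
    n = len(distances)
    F = [[math.exp(-theta * d) for d in row] for row in distances]
    x = solve(F, [1.0] * n)   # x = F^{-1} e
    return sum(x) / n         # (1/n) e^T x
```

With two identical molecules the metric approaches 0.5 per this normalization; as pairwise distances grow, it approaches 1.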
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score on 12 of 20 tasks and placing second overall. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>
<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off: higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found that tuning $\varepsilon$ is needed to balance exploration against objective satisfaction.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data points without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Within-front ranking by average Tanimoto distance over ECFP6 fingerprints, in place of crowding distance.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemGE: Molecule Generation via Grammatical Evolution</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemge-grammatical-evolution-molecule-generation/</guid><description>ChemGE applies grammatical evolution to SMILES strings for population-based de novo molecule generation with inherent parallelism and diversity.</description><content:encoded><![CDATA[<h2 id="grammatical-evolution-for-de-novo-molecular-design">Grammatical Evolution for De Novo Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemGE, a population-based molecular generation approach built on grammatical evolution. Rather than using deep neural networks, ChemGE evolves populations of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings through a context-free grammar, enabling concurrent evaluation by multiple molecular simulators and producing diverse molecular libraries. The method represents an alternative paradigm for de novo drug design: evolutionary optimization over formal grammars rather than learned latent spaces or autoregressive neural models.</p>
<h2 id="limitations-of-sequential-deep-learning-generators">Limitations of Sequential Deep Learning Generators</h2>
<p>At the time of publication, the dominant approaches to de novo molecular generation included Bayesian optimization over VAE latent spaces (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a>), reinforcement learning with recurrent neural networks (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>), sequential Monte Carlo search, and Monte Carlo tree search (ChemTS). These methods share two practical limitations:</p>
<ol>
<li>
<p><strong>Simulation concurrency</strong>: Most methods generate one molecule at a time, making it difficult to run multiple molecular simulations (e.g., <a href="https://en.wikipedia.org/wiki/Molecular_docking">docking</a>) in parallel. This wastes computational resources in high-throughput virtual screening settings.</p>
</li>
<li>
<p><strong>Molecular diversity</strong>: Deep learning generators tend to exploit narrow regions of chemical space. Deep reinforcement learning methods in particular often generate very similar molecules, requiring special countermeasures to maintain diversity. Since drug discovery is a multi-stage pipeline, limited diversity reduces survival rates in downstream <a href="https://en.wikipedia.org/wiki/ADME">ADMET</a> screening.</p>
</li>
</ol>
<p>ChemGE addresses both problems by maintaining a large population of molecules that are evolved and evaluated concurrently.</p>
<h2 id="core-innovation-chromosome-to-smiles-mapping-via-grammar-rules">Core Innovation: Chromosome-to-SMILES Mapping via Grammar Rules</h2>
<p>ChemGE encodes each molecule as a chromosome: a sequence of $N$ integers that deterministically maps to a SMILES string through a context-free grammar. The mapping process works as follows:</p>
<ol>
<li>Start with the grammar&rsquo;s start symbol</li>
<li>At each step $k$, look up the $k$-th integer $c = C[k]$ from the chromosome</li>
<li>Identify the leftmost non-terminal symbol and count its $r$ applicable production rules</li>
<li>Apply the $((c \bmod r) + 1)$-th rule</li>
<li>Repeat until no non-terminal symbols remain or the chromosome is exhausted</li>
</ol>
<p>The context-free grammar is a subset of the OpenSMILES specification, defined formally as $G = (V, \Sigma, R, S)$ where $V$ is the set of non-terminal symbols, $\Sigma$ is the set of terminal symbols, $R$ is the set of production rules, and $S$ is the start symbol.</p>
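<p>The mapping above can be sketched with a toy grammar. The grammar here (a "chain" of C/N/O atoms) is a hypothetical fragment for illustration, not ChemGE's OpenSMILES subset; the code uses 0-based <code>c % r</code> rule indexing, equivalent to the 1-based $((c \bmod r) + 1)$-th rule in the text.</p>

```python
# Toy grammar in the paper's G = (V, Sigma, R, S) form: keys are
# non-terminals V, right-hand sides are production rules R.
GRAMMAR = {
    "chain": [["atom"], ["atom", "chain"]],
    "atom":  [["C"], ["N"], ["O"]],
}
START = "chain"

def chromosome_to_string(chromosome, grammar=GRAMMAR, start=START):
    """Map a list of integers to a string by repeatedly rewriting the
    leftmost non-terminal using rule (c mod r) of its r productions."""
    symbols = [start]
    for c in chromosome:
        # Find the leftmost non-terminal symbol.
        nt_pos = next((i for i, s in enumerate(symbols) if s in grammar), None)
        if nt_pos is None:                 # derivation already complete
            break
        rules = grammar[symbols[nt_pos]]
        symbols[nt_pos:nt_pos + 1] = rules[c % len(rules)]
    if any(s in grammar for s in symbols):
        return None                        # chromosome exhausted: invalid
    return "".join(symbols)
```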
<p>Evolution follows the $(\mu + \lambda)$ evolution strategy:</p>
<ol>
<li>Create $\lambda$ new chromosomes by drawing random chromosomes from the population and mutating one integer at a random position</li>
<li>Translate each chromosome to a SMILES string and evaluate fitness (e.g., docking score). Invalid molecules receive fitness $-\infty$</li>
<li>Select the top $\mu$ molecules from the merged pool of $\mu + \lambda$ candidates</li>
</ol>
<p>The authors did not use crossover, as it did not improve performance. Diversity is inherently maintained because a large fraction of molecules are mutated in each generation.</p>
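<p>The $(\mu + \lambda)$ loop itself is compact. A sketch under stated assumptions: <code>fitness</code> stands in for chromosome-to-SMILES translation plus scoring (returning $-\infty$ for invalid molecules), and <code>n_alleles</code> is an arbitrary gene range for illustration.</p>

```python
import random

def evolve(population, fitness, mu, lam, generations, mutate, rng=random):
    """(mu + lambda) evolution strategy over integer chromosomes, mutation
    only (no crossover), as in ChemGE."""
    pop = list(population)
    for _ in range(generations):
        # Create lambda offspring by mutating randomly drawn parents.
        offspring = [mutate(rng.choice(pop), rng) for _ in range(lam)]
        # Select the top mu from the merged mu + lambda pool.
        merged = pop + offspring
        merged.sort(key=fitness, reverse=True)
        pop = merged[:mu]
    return pop

def point_mutation(chrom, rng, n_alleles=256):
    """Replace one integer at a random position with a random value."""
    child = list(chrom)
    child[rng.randrange(len(child))] = rng.randrange(n_alleles)
    return child
```

Because each generation's $\lambda$ offspring are independent, the fitness calls can be dispatched to parallel simulators, which is the concurrency advantage the paper emphasizes.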
<h2 id="experimental-setup-and-benchmark-comparisons">Experimental Setup and Benchmark Comparisons</h2>
<h3 id="druglikeness-score-benchmark">Druglikeness Score Benchmark</h3>
<p>The first experiment optimized the penalized logP score $J^{\log P}$, an indicator of druglikeness defined as:</p>
<p>$$
J^{\log P}(m) = \log P(m) - \text{SA}(m) - \text{ring-penalty}(m)
$$</p>
<p>where $\log P(m)$ is the <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">octanol-water partition coefficient</a>, $\text{SA}(m)$ is the synthetic accessibility score, and ring-penalty$(m)$ penalizes carbon rings larger than size 6. All terms are normalized to zero mean and unit standard deviation. Initial populations were randomly sampled from the ZINC database (35 million compounds), with fitness set to $-\infty$ for molecules with molecular weight above 500 or duplicate structures.</p>
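<p>The normalization of the three terms can be sketched as below. This assumes z-scoring (zero mean, unit population standard deviation) is applied per term across a batch of molecules; constant terms are mapped to zero to avoid division by zero, an assumption not spelled out in the text.</p>

```python
import statistics

def zscore(values):
    """Normalize a list to zero mean and unit standard deviation."""
    mu = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [0.0] * len(values)  # assumed handling of constant terms
    return [(v - mu) / sd for v in values]

def penalized_logp(logp, sa, ring_penalty):
    """J^logP = z(logP) - z(SA) - z(ring-penalty), each term z-scored."""
    zl, zs, zr = zscore(logp), zscore(sa), zscore(ring_penalty)
    return [l - s - r for l, s, r in zip(zl, zs, zr)]
```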
<p>ChemGE was compared against CVAE, GVAE, and ChemTS across population sizes $(\mu, \lambda) \in \{(10, 20), (100, 200), (1000, 2000), (10000, 20000)\}$.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>2h</th>
          <th>4h</th>
          <th>6h</th>
          <th>8h</th>
          <th>Mol/Min</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemGE (10, 20)</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>4.46 +/- 0.34</td>
          <td>14.5</td>
      </tr>
      <tr>
          <td>ChemGE (100, 200)</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>5.17 +/- 0.26</td>
          <td>135</td>
      </tr>
      <tr>
          <td>ChemGE (1000, 2000)</td>
          <td>4.45 +/- 0.24</td>
          <td>5.32 +/- 0.43</td>
          <td>5.73 +/- 0.33</td>
          <td>5.88 +/- 0.34</td>
          <td>527</td>
      </tr>
      <tr>
          <td>ChemGE (10000, 20000)</td>
          <td>4.20 +/- 0.33</td>
          <td>4.28 +/- 0.28</td>
          <td>4.40 +/- 0.27</td>
          <td>4.53 +/- 0.26</td>
          <td>555</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>-30.18 +/- 26.91</td>
          <td>-1.39 +/- 2.24</td>
          <td>-0.61 +/- 1.08</td>
          <td>-0.006 +/- 0.92</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>GVAE</td>
          <td>-4.34 +/- 3.14</td>
          <td>-1.29 +/- 1.67</td>
          <td>-0.17 +/- 0.96</td>
          <td>0.25 +/- 1.31</td>
          <td>1.38</td>
      </tr>
      <tr>
          <td>ChemTS</td>
          <td>4.91 +/- 0.38</td>
          <td>5.41 +/- 0.51</td>
          <td>5.49 +/- 0.44</td>
          <td>5.58 +/- 0.50</td>
          <td>40.89</td>
      </tr>
  </tbody>
</table>
<p>At $(\mu, \lambda) = (1000, 2000)$, ChemGE achieved the highest final score of 5.88 and generated 527 unique molecules per minute, roughly 13x faster than ChemTS and 3700x faster than CVAE. The small population (10, 20) converged prematurely with insufficient diversity, while the overly large population (10000, 20000) could not run enough generations to optimize effectively.</p>
<h3 id="docking-experiment-with-thymidine-kinase">Docking Experiment with Thymidine Kinase</h3>
<p>The second experiment applied ChemGE to generate molecules with high predicted binding affinity for <a href="https://en.wikipedia.org/wiki/Thymidine_kinase">thymidine kinase</a> (KITH), a well-known antiviral drug target. The authors used rDock for docking simulation, taking the best intermolecular score $S_{\text{inter}}$ from three runs with different initial conformations. Fitness was defined as $-S_{\text{inter}}$ (lower scores indicate higher affinity). The protein structure was taken from PDB ID 2B8T.</p>
<p>With 32 parallel cores and $(\mu, \lambda) = (32, 64)$, ChemGE completed 1000 generations in approximately 26 hours, generating 9466 molecules total. Among these, 349 molecules achieved intermolecular scores better than the best known inhibitor in the DUD-E database.</p>
<h3 id="diversity-analysis">Diversity Analysis</h3>
<p>Molecular diversity was measured using internal diversity based on Morgan fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|^2} \sum_{(x,y) \in A \times A} T_d(x, y)
$$</p>
<p>where $T_d(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$ is the <a href="https://en.wikipedia.org/wiki/Jaccard_index#Tanimoto_similarity_and_distance">Tanimoto distance</a>.</p>
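<p>On fingerprints represented as sets of on-bit indices, the metric is a direct translation of the two formulas above; this sketch includes the self-pairs, as in the $A \times A$ sum.</p>

```python
def tanimoto_distance(x, y):
    """T_d(x, y) = 1 - |x ∩ y| / |x ∪ y| on sets of on-bit indices."""
    union = len(x | y)
    return 1.0 - (len(x & y) / union if union else 1.0)

def internal_diversity(fingerprints):
    """I(A) = (1/|A|^2) * sum of all pairwise Tanimoto distances,
    self-pairs included (each contributes distance 0)."""
    n = len(fingerprints)
    return sum(tanimoto_distance(a, b)
               for a in fingerprints for b in fingerprints) / (n * n)
```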
<p>The 349 &ldquo;ChemGE-active&rdquo; molecules (those scoring better than the best known inhibitor) had an internal diversity of 0.55, compared to 0.46 for known inhibitors and 0.65 for the whole ZINC database. This is a substantial improvement over known actives, achieved without any explicit diversity-promoting mechanism.</p>
<p>ISOMAP visualizations showed that ChemGE populations migrated away from known inhibitors over generations, ultimately occupying a completely different region of chemical space by generation 1000. This suggests ChemGE discovered a novel structural class of potential binders.</p>
<h2 id="high-throughput-and-diversity-without-deep-learning">High Throughput and Diversity Without Deep Learning</h2>
<p>ChemGE demonstrates several notable findings:</p>
<ol>
<li>
<p><strong>Deep learning is not required</strong> for competitive de novo molecular generation. Grammatical evolution over SMILES achieves higher throughput and comparable or better optimization scores than VAE- and RNN-based methods.</p>
</li>
<li>
<p><strong>Population size matters significantly</strong>. Too small a population leads to premature convergence. Too large a population prevents sufficient per-molecule optimization within the computational budget. The $(\mu, \lambda) = (1000, 2000)$ setting provided the best balance.</p>
</li>
<li>
<p><strong>Inherent diversity</strong> is a key advantage of evolutionary methods. Without any explicit diversity loss or penalty, ChemGE maintains diversity comparable to the ZINC database and exceeds that of known active molecules.</p>
</li>
<li>
<p><strong>Parallel evaluation</strong> is naturally supported. Each generation produces $\lambda$ independent molecules that can be evaluated by separate docking simulators simultaneously.</p>
</li>
</ol>
<p>The authors acknowledge several limitations. Synthetic routes and ADMET properties were not evaluated for the generated molecules. The docking scores, while favorable, require confirmation through biological assays. The authors also note that incorporating probabilistic or neural models into the evolutionary process might further improve performance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Initial population</td>
          <td>ZINC</td>
          <td>~35M compounds</td>
          <td>Randomly sampled starting molecules</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2B8T</td>
          <td>1 structure</td>
          <td>Thymidine kinase-ligand complex</td>
      </tr>
      <tr>
          <td>Baseline actives</td>
          <td>DUD-E (KITH)</td>
          <td>57 inhibitors</td>
          <td>Known thymidine kinase inhibitors</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Grammatical evolution with $(\mu + \lambda)$ evolution strategy</li>
<li>Mutation only (no crossover)</li>
<li>Context-free grammar subset of OpenSMILES specification</li>
<li>Chromosome length: $N$ integers per molecule</li>
<li>Fitness set to $-\infty$ for invalid SMILES, MW &gt; 500, or duplicate molecules</li>
</ul>
<h3 id="models">Models</h3>
<p>No neural network models are used. ChemGE is purely evolutionary.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Max $J^{\log P}$ (8h)</td>
          <td>5.88 +/- 0.34</td>
          <td>ChemTS: 5.58 +/- 0.50</td>
          <td>ChemGE (1000, 2000)</td>
      </tr>
      <tr>
          <td>Molecules/min</td>
          <td>527</td>
          <td>ChemTS: 40.89</td>
          <td>~13x throughput improvement</td>
      </tr>
      <tr>
          <td>Docking hits</td>
          <td>349</td>
          <td>Best DUD-E inhibitor</td>
          <td>Molecules with better $S_{\text{inter}}$</td>
      </tr>
      <tr>
          <td>Internal diversity</td>
          <td>0.55</td>
          <td>Known inhibitors: 0.46</td>
          <td>Morgan fingerprint Tanimoto distance</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>CPU: Intel Xeon E5-2630 v3 (benchmark experiments, single core)</li>
<li>Docking: 32 cores in parallel (thymidine kinase experiment, ~26 hours for 1000 generations)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tsudalab/ChemGE">ChemGE</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Yoshikawa, N., Terayama, K., Sumita, M., Homma, T., Oono, K., &amp; Tsuda, K. (2018). Population-based de novo molecule generation, using grammatical evolution. <em>Chemistry Letters</em>, 47(11), 1431-1434. <a href="https://doi.org/10.1246/cl.180665">https://doi.org/10.1246/cl.180665</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{yoshikawa2018chemge,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Population-based De Novo Molecule Generation, Using Grammatical Evolution}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yoshikawa, Naruki and Terayama, Kei and Sumita, Masato and Homma, Teruki and Oono, Kenta and Tsuda, Koji}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemistry Letters}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{47}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1431--1434}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1246/cl.180665}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VAE for Automatic Chemical Design (2018 Seminal)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</guid><description>A variational autoencoder maps SMILES strings to a continuous latent space, enabling gradient-based optimization for molecular design and generation.</description><content:encoded><![CDATA[<h2 id="a-foundational-method-for-continuous-molecular-representation">A Foundational Method for Continuous Molecular Representation</h2>
<p>This is a <strong>Method</strong> paper that introduces a variational autoencoder (VAE) framework for mapping discrete molecular representations (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings) into a continuous latent space. The primary contribution is demonstrating that this continuous representation enables three key capabilities: (1) automatic generation of novel molecules by decoding random or perturbed latent vectors, (2) smooth interpolation between molecules in latent space, and (3) gradient-based optimization of molecular properties using a jointly trained property predictor. This work is widely regarded as one of the earliest and most influential applications of deep generative models to molecular design.</p>
<h2 id="the-challenge-of-searching-discrete-chemical-space">The Challenge of Searching Discrete Chemical Space</h2>
<p>Molecular design is fundamentally an optimization problem: identify molecules that maximize some set of desirable properties. The search space is enormous (estimated $10^{23}$ to $10^{60}$ drug-like molecules) and discrete, making systematic exploration difficult. Prior approaches fell into two categories, each with significant limitations:</p>
<ol>
<li><strong>Virtual screening</strong> over fixed libraries: effective but inflexible, since libraries are monolithic, costly to enumerate, and require hand-crafted rules to avoid impractical chemistries.</li>
<li><strong>Discrete local search</strong> (e.g., genetic algorithms): requires manual specification of mutation and crossover heuristics, and cannot leverage gradient information to guide the search.</li>
</ol>
<p>The core insight is that mapping molecules into a continuous vector space sidesteps these problems entirely. In a continuous space, new compounds can be generated by vector perturbation (no hand-crafted mutation rules), optimization can follow property gradients (enabling larger and more directed jumps), and large unlabeled chemical databases can be leveraged through unsupervised representation learning.</p>
<h2 id="a-vae-architecture-for-smiles-strings-with-joint-property-prediction">A VAE Architecture for SMILES Strings with Joint Property Prediction</h2>
<p>The architecture consists of three coupled neural networks trained jointly:</p>
<ol>
<li>
<p><strong>Encoder</strong>: Converts SMILES character strings into fixed-dimensional continuous vectors (the latent representation). Uses three 1D convolutional layers followed by a fully connected layer. For ZINC molecules, the latent space has 196 dimensions; for QM9, 156 dimensions.</p>
</li>
<li>
<p><strong>Decoder</strong>: Converts latent vectors back into SMILES strings character by character using three layers of gated recurrent units (GRUs). The output is stochastic, as each character is sampled from a probability distribution over the SMILES alphabet.</p>
</li>
<li>
<p><strong>Property Predictor</strong>: A multilayer perceptron that predicts molecular properties directly from the latent representation. Joint training with the autoencoder reconstruction loss organizes the latent space so that molecules with similar properties cluster together.</p>
</li>
</ol>
<h3 id="the-vae-objective">The VAE Objective</h3>
<p>The model uses the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder framework of Kingma and Welling</a>. The training objective combines three terms:</p>
<p>$$\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z)) + \lambda \cdot \mathcal{L}_{prop}$$</p>
<p>where $\mathcal{L}_{recon}$ is the reconstruction loss (cross-entropy over SMILES characters), $D_{KL}$ is the KL divergence regularizer that encourages the latent distribution $q(z|x)$ to match a standard Gaussian prior $p(z)$, and $\mathcal{L}_{prop}$ is the property prediction regression loss. Both the variational loss and the property prediction loss are annealed in via a sigmoid schedule that starts after epoch 29 of the 120 total training epochs.</p>
<p>The KL regularization is critical: it forces the decoder to handle a wider variety of latent points, preventing &ldquo;dead areas&rdquo; in latent space that would decode to invalid molecules.</p>
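<p>As a minimal sketch (plain Python; the sigmoid steepness is an assumed value, since the paper does not report one), the three-term objective and its annealing schedule look like:</p>

```python
import math

def kl_diag_gaussian(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    # summed over the latent dimensions.
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )

def anneal_weight(epoch, start=29, steepness=0.5):
    # Sigmoid schedule that ramps in after `start` epochs.
    # The steepness value is an assumption for illustration.
    return 1.0 / (1.0 + math.exp(-steepness * (epoch - start)))

def vae_objective(recon_loss, mu, log_var, prop_loss, epoch, lam=1.0):
    # L = L_recon + beta(epoch) * D_KL(q(z|x) || p(z)) + lambda * L_prop
    beta = anneal_weight(epoch)
    return recon_loss + beta * kl_diag_gaussian(mu, log_var) + lam * prop_loss
```

With a perfectly matched posterior (zero means, unit variances) the KL term vanishes and the objective reduces to the reconstruction and property losses.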
<h3 id="gradient-based-optimization">Gradient-Based Optimization</h3>
<p>After training, a Gaussian process (GP) surrogate model is fit on top of the latent representations to predict the target property. Optimization proceeds by:</p>
<ol>
<li>Encoding a seed molecule into the latent space</li>
<li>Using the GP model to define a smooth property surface over the latent space</li>
<li>Optimizing the latent vector $z$ to maximize the predicted property via gradient ascent</li>
<li>Decoding the optimized $z$ back into a SMILES string</li>
</ol>
<p>The objective used for demonstration is $5 \times \text{QED} - \text{SAS}$, balancing drug-likeness (QED) against synthetic accessibility (SAS).</p>
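<p>A minimal sketch of the latent-space ascent, with a toy differentiable surrogate standing in for the paper&rsquo;s Gaussian process:</p>

```python
def optimize_latent(z, grad_fn, lr=0.1, steps=200):
    # Gradient ascent on a surrogate property model over the latent space.
    # In the paper the surrogate is a Gaussian process; any differentiable
    # predictor with a gradient function works the same way.
    z = list(z)
    for _ in range(steps):
        g = grad_fn(z)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z

# Toy quadratic surrogate f(z) = -||z - z_star||^2, maximized at z_star.
z_star = [1.0, -2.0]
grad = lambda z: [2.0 * (t - zi) for zi, t in zip(z, z_star)]
z_opt = optimize_latent([0.0, 0.0], grad)  # converges toward z_star
```

The optimized latent vector would then be decoded back to a SMILES string, which may require repeated stochastic decoding attempts in practice.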
<h2 id="experiments-on-zinc-and-qm9-datasets">Experiments on ZINC and QM9 Datasets</h2>
<p>Two autoencoder systems were trained:</p>
<ul>
<li><strong>ZINC</strong>: 250,000 drug-like molecules from the ZINC database, with a 196-dimensional latent space. Properties predicted: logP, QED, SAS.</li>
<li><strong>QM9</strong>: 108,000 molecules with fewer than 9 heavy atoms, with a 156-dimensional latent space. Properties predicted: HOMO energy, LUMO energy, electronic spatial extent ($\langle R^2 \rangle$).</li>
</ul>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>The encoded latent dimensions follow approximately normal distributions as enforced by the variational regularizer. Decoding is stochastic: sampling the same latent point multiple times yields different SMILES strings, with the most frequent decoding tending to be closest to the original point in latent space. Decoding validity rates are 73-79% for points near known molecules but only 4% for randomly selected latent points.</p>
<p>Spherical interpolation (slerp) between molecules in latent space produces smooth structural transitions, accounting for the geometry of high-dimensional Gaussian distributions where linear interpolation would pass through low-probability regions.</p>
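<p>Spherical interpolation itself is a few lines; a plain-Python sketch (the fallback to linear interpolation for nearly parallel vectors is a standard numerical guard, not from the paper):</p>

```python
import math

def slerp(z0, z1, t):
    # Spherical interpolation between two latent vectors. Linear
    # interpolation would cut through low-probability regions of a
    # high-dimensional Gaussian; slerp stays near the typical shell.
    dot = sum(a * b for a, b in zip(z0, z1))
    n0 = math.sqrt(sum(a * a for a in z0))
    n1 = math.sqrt(sum(b * b for b in z1))
    omega = math.acos(max(-1.0, min(1.0, dot / (n0 * n1))))
    so = math.sin(omega)
    if so < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(z0, z1)]
    return [
        (math.sin((1 - t) * omega) / so) * a + (math.sin(t * omega) / so) * b
        for a, b in zip(z0, z1)
    ]
```

Decoding a sequence of slerp points between two encoded molecules yields the smooth structural transitions described above.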
<h3 id="molecular-generation-comparison">Molecular Generation Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Dataset</th>
          <th>Samples</th>
          <th>logP</th>
          <th>SAS</th>
          <th>QED</th>
          <th>% in ZINC</th>
          <th>% in eMolecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>ZINC</td>
          <td>249k</td>
          <td>2.46 (1.43)</td>
          <td>3.05 (0.83)</td>
          <td>0.73 (0.14)</td>
          <td>100</td>
          <td>12.9</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>ZINC</td>
          <td>5303</td>
          <td>2.84 (1.86)</td>
          <td>3.80 (1.01)</td>
          <td>0.57 (0.20)</td>
          <td>6.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>ZINC</td>
          <td>8728</td>
          <td>2.67 (1.46)</td>
          <td>3.18 (0.86)</td>
          <td>0.70 (0.14)</td>
          <td>5.8</td>
          <td>7.0</td>
      </tr>
      <tr>
          <td>Data</td>
          <td>QM9</td>
          <td>134k</td>
          <td>0.30 (1.00)</td>
          <td>4.25 (0.94)</td>
          <td>0.48 (0.07)</td>
          <td>0.0</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>QM9</td>
          <td>5470</td>
          <td>0.96 (1.53)</td>
          <td>4.47 (1.01)</td>
          <td>0.53 (0.13)</td>
          <td>0.018</td>
          <td>3.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>QM9</td>
          <td>2839</td>
          <td>0.30 (0.97)</td>
          <td>4.34 (0.98)</td>
          <td>0.47 (0.08)</td>
          <td>0.0</td>
          <td>8.9</td>
      </tr>
  </tbody>
</table>
<p>The VAE generates molecules whose property distributions closely match the training data, outperforming a genetic algorithm baseline that biases toward higher chemical complexity and decreased drug-likeness. Only 5.8% of VAE-generated ZINC molecules were found in the original ZINC database, indicating genuine novelty.</p>
<h3 id="property-prediction">Property Prediction</h3>
<table>
  <thead>
      <tr>
          <th>Dataset/Property</th>
          <th>Mean Baseline</th>
          <th>ECFP</th>
          <th>Graph Conv.</th>
          <th>1-hot SMILES</th>
          <th>Encoder Only</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC/logP</td>
          <td>1.14</td>
          <td>0.38</td>
          <td>0.05</td>
          <td>0.16</td>
          <td>0.13</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>ZINC/QED</td>
          <td>0.112</td>
          <td>0.045</td>
          <td>0.017</td>
          <td>0.041</td>
          <td>0.037</td>
          <td>0.054</td>
      </tr>
      <tr>
          <td>QM9/HOMO (eV)</td>
          <td>0.44</td>
          <td>0.20</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/LUMO (eV)</td>
          <td>1.05</td>
          <td>0.20</td>
          <td>0.15</td>
          <td>0.11</td>
          <td>0.14</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/Gap (eV)</td>
          <td>1.07</td>
          <td>0.30</td>
          <td>0.18</td>
          <td>0.16</td>
          <td>0.18</td>
          <td>0.21</td>
      </tr>
  </tbody>
</table>
<p>The VAE latent representation achieves property prediction accuracy comparable to graph convolutions for some properties, though graph convolutions generally perform best. The primary purpose of joint training is not to maximize prediction accuracy but to organize the latent space for optimization.</p>
<h3 id="optimization-results">Optimization Results</h3>
<p>Bayesian optimization with a GP model on the jointly trained latent space consistently produces molecules with higher percentile scores on the $5 \times \text{QED} - \text{SAS}$ objective compared to both random Gaussian search and genetic algorithm baselines. Starting from molecules in the bottom 10th percentile of the ZINC dataset, the optimizer reliably discovers molecules in regions of high objective value. Training the GP with 1000 molecules (vs. 2000) produces a wider diversity of solutions by optimizing to multiple local optima rather than a single global optimum.</p>
<h2 id="key-findings-limitations-and-legacy">Key Findings, Limitations, and Legacy</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>A continuous latent representation of molecules enables gradient-based search through chemical space, a qualitatively different approach from discrete enumeration or genetic algorithms.</li>
<li>Joint training with property prediction organizes the latent space by property values, creating smooth gradients that optimization can follow.</li>
<li>The VAE generates novel molecules with realistic property distributions, and the latent space encodes an estimated 7.5 million molecules despite training on only 250,000.</li>
</ul>
<h3 id="acknowledged-limitations">Acknowledged Limitations</h3>
<ul>
<li>The SMILES-based decoder sometimes produces formally valid but chemically undesirable molecules (acid chlorides, anhydrides, cyclopentadienes, aziridines, etc.) because the grammar of valid SMILES does not capture all synthetic or stability constraints.</li>
<li>Character-level SMILES generation is fragile: the decoder must implicitly learn which strings are valid SMILES, making the learning problem harder than necessary.</li>
<li>Decoding validity drops to only 4% for random latent points far from training data, limiting the ability to explore truly novel regions of chemical space.</li>
</ul>
<h3 id="directions-identified">Directions Identified</h3>
<p>The authors point to several extensions that were already underway at the time of publication:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></strong>: Using an explicitly defined SMILES grammar instead of forcing the model to learn one (Kusner et al., 2017).</li>
<li><strong>Graph-based decoders</strong>: Directly outputting molecular graphs to avoid the SMILES validity problem.</li>
<li><strong>Adversarial training</strong>: Using GANs for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN, ORGANIC</a>).</li>
<li><strong>LSTM/RNN generators</strong>: Applying recurrent networks directly to SMILES for generation and reaction prediction.</li>
</ul>
<p>This paper has been cited over 2,900 times and launched a large body of follow-up work in VAE-based, GAN-based, and reinforcement learning-based molecular generation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ZINC (drug-like subset)</td>
          <td>250,000 molecules</td>
          <td>Randomly sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>QM9</td>
          <td>108,000 molecules</td>
          <td>Molecules with fewer than 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ZINC held-out set</td>
          <td>5,000 molecules</td>
          <td>For latent space analysis</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Encoder</strong>: 3 x 1D convolutional layers (ZINC: filters 9,9,10 with kernels 9,9,11; QM9: filters 2,2,1 with kernels 5,5,4), followed by a fully connected layer</li>
<li><strong>Decoder</strong>: 3 x GRU layers (ZINC: hidden dim 488; QM9: hidden dim 500), trained with teacher forcing</li>
<li><strong>Property Predictor</strong>: 2 fully connected layers of 1000 neurons (dropout 0.20) for prediction; smaller 3-layer MLP of 67 neurons (dropout 0.15) for latent space shaping</li>
<li><strong>Variational loss annealing</strong>: Sigmoid schedule after 29 epochs, total 120 epochs</li>
<li><strong>SMILES validation</strong>: Post-hoc filtering with RDKit; invalid outputs discarded</li>
<li><strong>Optimization</strong>: Gaussian process surrogate model trained on 2000 maximally diverse molecules from latent space</li>
</ul>
<h3 id="models">Models</h3>
<p>Built with Keras and TensorFlow. Latent dimensions: 196 (ZINC), 156 (QM9). SMILES alphabet: 35 characters (ZINC), 22 characters (QM9). Maximum string length: 120 (ZINC), 34 (QM9). Only canonicalized SMILES used for training.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>Water-octanol partition coefficient</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimation of Drug-likeness (0-1)</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>Synthetic Accessibility Score</td>
      </tr>
      <tr>
          <td>HOMO/LUMO (eV)</td>
          <td>Frontier orbital energies (QM9)</td>
      </tr>
      <tr>
          <td>Decoding validity</td>
          <td>Fraction of latent points producing valid SMILES</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on the Harvard FAS Odyssey Cluster. Specific GPU types and training times are not reported. The Gaussian process optimization requires only minutes to train on a few thousand molecules.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/chemical_vae">chemical_vae</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with training scripts and pre-trained models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., &amp; Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <em>ACS Central Science</em>, 4(2), 268-276. <a href="https://doi.org/10.1021/acscentsci.7b00572">https://doi.org/10.1021/acscentsci.7b00572</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gomez2018automatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{G{\&#39;o}mez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and S{\&#39;a}nchez-Lengeling, Benjam{\&#39;i}n and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{268--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acscentsci.7b00572}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
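<p>The duality can be checked on a toy scalar state space model: the recurrent rollout and the unrolled convolution give identical outputs. This sketch ignores the HiPPO parameterization and the Cauchy-kernel machinery that make S4 stable at scale:</p>

```python
def ssm_recurrent(u, A, B, C, D):
    # x_k = A x_{k-1} + B u_k ;  y_k = C x_k + D u_k   (scalar state)
    x, ys = 0.0, []
    for uk in u:
        x = A * x + B * uk
        ys.append(C * x + D * uk)
    return ys

def ssm_convolutional(u, A, B, C, D):
    # Unrolled kernel K = (C B, C A B, C A^2 B, ...) plus direct term D.
    L = len(u)
    K = [C * (A ** i) * B for i in range(L)]
    return [
        sum(K[i - j] * u[j] for j in range(i + 1)) + D * u[i]
        for i in range(L)
    ]
```

The convolutional form is what allows parallel training over whole SMILES sequences, while the recurrent form is used for token-by-token generation.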
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
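<p>A minimal sketch of ranking by this score (hypothetical inputs; real log-likelihoods come from the fine-tuned and pre-trained models):</p>

```python
def rank_by_score(molecules, ll_ft, ll_pt):
    # Rank molecules by L_score = L(M_ft) - L(M_pt), descending.
    # Subtracting the pre-training log-likelihood removes generic
    # "drug-likeness" signal and keeps target-specific signal.
    scored = zip(molecules, (f - p for f, p in zip(ll_ft, ll_pt)))
    return [m for m, _ in sorted(scored, key=lambda ms: ms[1], reverse=True)]
```

A molecule whose likelihood rose sharply during fine-tuning ranks above one the pre-trained prior already favored.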
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals that each architecture shows different failure modes: LSTMs struggle most with branching errors, GPTs with ring and bond assignment errors, while S4 generates fewer branching and ring errors but more bond assignment errors than LSTM. This pattern supports the hypothesis that S4 captures long-range dependencies (branching, ring opening/closure) better while local dependencies (bond assignment) are handled better by recurrent processing.</p>
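<p>The valid/unique/novel counts in the table follow the standard nesting for CLM evaluation. A sketch with a stand-in validity check (a real pipeline would parse each SMILES with RDKit instead):</p>

```python
def generation_metrics(generated, training_set, is_valid):
    # Valid / unique / novel, nested as is standard for CLM evaluation:
    # uniqueness is computed over valid SMILES, and novelty over the
    # valid-and-unique set relative to the training data.
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return len(valid), len(unique), len(novel)

# Stand-in validity check for illustration only.
toy_valid = lambda s: not s.endswith("(")
counts = generation_metrics(
    ["CCO", "CCO", "c1ccccc1", "CC("], ["CCO"], toy_valid
)  # -> (3, 2, 1)
```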
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p = 8.41 \times 10^{-6}$ (top 10), $p = 2.93 \times 10^{-7}$ (top 50), $p = 1.45 \times 10^{-7}$ (top 100)</li>
<li>S4 vs. GPT: $p = 2.33 \times 10^{-3}$ (top 10), $p = 3.72 \times 10^{-3}$ (top 50), $p = 2.61 \times 10^{-2}$ (top 100)</li>
</ul>
<p>TP53 was the most challenging target, where no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 a close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
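<p>Temperature sampling rescales the model&rsquo;s token logits before the softmax; a plain-Python sketch:</p>

```python
import math

def temperature_softmax(logits, T=1.0):
    # Temperature-scaled softmax over the token vocabulary:
    # T > 1 flattens the distribution (more exploration, higher risk of
    # invalid SMILES), T < 1 sharpens it toward the argmax token.
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]
```

At $T = 2.0$ the probability mass spreads across more tokens than at $T = 1.0$, which is exactly the regime where GPT validity collapses while S4 and LSTM remain robust.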
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.5 +/- 0.7</td>
          <td>1.6 +/- 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (6,000 to 12,000 more than benchmarks) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>For computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and generates fastest among all architectures.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
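<p>The temperature sampling listed above ($T$ from 1.0 to 2.0) rescales the model's logits before the softmax, trading determinism for exploration. A minimal illustration with made-up logits (function name and shapes are not from the paper):</p>

```python
import math
import random

def sample_token(logits, temperature=1.0):
    """Multinomial sampling with a temperature: T > 1 flattens the softmax
    (more exploration of chemical space), T < 1 sharpens it."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]  # shift for numerical stability
    total = sum(exps)
    threshold = random.random() * total
    cumulative = 0.0
    for index, weight in enumerate(exps):
        cumulative += weight
        if threshold < cumulative:
            return index
    return len(exps) - 1
```

<p>At very low temperature this reduces to greedy argmax decoding; at $T = 2$ rare SMILES tokens are sampled far more often.</p>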
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant ($p = 8.41 \times 10^{-6}$ vs. LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{\&#34;O{}z\c{c}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>REINVENT 4: Open-Source Generative Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/</guid><description>REINVENT 4 is an open-source generative AI framework combining RNNs and transformers with reinforcement and curriculum learning for de novo molecular design.</description><content:encoded><![CDATA[<h2 id="an-open-source-reference-implementation-for-generative-molecular-design">An Open-Source Reference Implementation for Generative Molecular Design</h2>
<p>REINVENT 4 is a <strong>Resource</strong> paper presenting a production-grade, open-source software framework for AI-driven generative molecular design. The primary contribution is a unified codebase that integrates four distinct molecule generators (de novo design, scaffold decoration, linker design, molecular optimization) with three machine learning optimization algorithms (transfer learning, reinforcement learning, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/">curriculum learning</a>). The software is released under the Apache 2.0 license and represents the fourth major version of the REINVENT platform, which has been in continuous production use at AstraZeneca for drug discovery.</p>
<h2 id="bridging-the-gap-between-research-prototypes-and-production-molecular-design">Bridging the Gap Between Research Prototypes and Production Molecular Design</h2>
<p>The motivation for REINVENT 4 stems from several gaps in the generative molecular design landscape. While numerous AI model architectures have been developed for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">VAEs</a>, GANs, RNNs, transformers, flow models, diffusion models), most exist as research prototypes released alongside individual publications rather than as maintained, integrated software. The authors argue that the scientific community needs reference implementations of common generative molecular design algorithms in the public domain to:</p>
<ol>
<li>Enable nuanced debate about the application of AI in drug discovery</li>
<li>Serve as educational tools for practitioners entering the field</li>
<li>Increase transparency around AI-driven molecular design</li>
<li>Provide a foundation for future innovation</li>
</ol>
<p>REINVENT 4 consolidates previously separate codebases (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> v1, v2, LibInvent, LinkInvent, Mol2Mol) into a single repository with a consistent interface, addressing the fragmentation that characterized earlier releases.</p>
<h2 id="unified-framework-for-sequence-based-molecular-generation">Unified Framework for Sequence-Based Molecular Generation</h2>
<p>The core design of REINVENT 4 centers on sequence-based neural network models that generate <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings in an autoregressive manner. All generators model the probability of producing a token sequence, with two formulations.</p>
<p>For unconditional agents (de novo generation), the joint probability of a sequence $T$ with tokens $t_1, t_2, \ldots, t_\ell$ is:</p>
<p>$$
\mathbf{P}(T) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
<p>For conditional agents (scaffold decoration, linker design, molecular optimization), the joint probability given an input sequence $S$ is:</p>
<p>$$
\mathbf{P}(T \mid S) = \prod_{i=1}^{\ell} \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1, S)
$$</p>
<p>The negative log-likelihood for unconditional agents is:</p>
<p>$$
NLL(T) = -\log \mathbf{P}(T) = -\sum_{i=1}^{\ell} \log \mathbf{P}(t_i \mid t_{i-1}, t_{i-2}, \ldots, t_1)
$$</p>
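<p>The factorized probability above translates directly into a per-token sum of log-probabilities. A toy sketch with hypothetical token probability tables (not REINVENT 4's actual API):</p>

```python
import math

def sequence_nll(stepwise_probs, tokens):
    """NLL of a token sequence under an autoregressive model:
    -sum_i log P(t_i | t_<i), with one probability table per step."""
    return -sum(math.log(step[tok]) for step, tok in zip(stepwise_probs, tokens))
```

<p>For a two-token sequence generated with probability 0.5 at each step, the NLL is $-\log(0.25) = 2\log 2 \approx 1.386$.</p>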
<h3 id="reinforcement-learning-with-dap">Reinforcement Learning with DAP</h3>
<p>The key optimization mechanism is reinforcement learning via the &ldquo;Difference between Augmented and Posterior&rdquo; (DAP) strategy. For each generated sequence $T$, the augmented likelihood is defined as:</p>
<p>$$
\log \mathbf{P}_{\text{aug}}(T) = \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)
$$</p>
<p>where $\mathbf{S}(T) \in [0, 1]$ is the scalar score and $\sigma \geq 0$ controls the balance between reward and regularization. The DAP loss is:</p>
<p>$$
\mathcal{L}(T) = \left(\log \mathbf{P}_{\text{aug}}(T) - \log \mathbf{P}_{\text{agent}}(T)\right)^2
$$</p>
<p>The presence of the prior likelihood in the augmented likelihood constrains how far the agent can deviate from chemically plausible space, functioning similarly to proximal policy gradient methods. Because $\log \mathbf{P}_{\text{agent}}(T) \leq 0$ for any sequence, the loss is bounded below by:</p>
<p>$$
\mathcal{L}(T) \geq \max\left(0, \log \mathbf{P}_{\text{prior}}(T) + \sigma \mathbf{S}(T)\right)^2
$$</p>
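<p>For a single sequence, the DAP loss reduces to a few lines. A sketch; the default $\sigma$ below is illustrative, not a value prescribed by the paper:</p>

```python
def dap_loss(logp_prior, logp_agent, score, sigma=128.0):
    """'Difference between Augmented and Posterior' squared loss for one
    sampled sequence. score is the scalar S(T) in [0, 1]; sigma balances
    reward against staying close to the prior."""
    logp_augmented = logp_prior + sigma * score
    return (logp_augmented - logp_agent) ** 2
```

<p>When the score is zero, the agent is pushed back toward the prior's likelihood; a high score pulls the agent's log-likelihood up by $\sigma \mathbf{S}(T)$.</p>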
<h3 id="four-molecule-generators">Four Molecule Generators</h3>
<p>REINVENT 4 supports four generator types:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>Architecture</th>
          <th>Input</th>
          <th>Task</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Reinvent</td>
          <td>RNN</td>
          <td>None</td>
          <td>De novo design from scratch</td>
      </tr>
      <tr>
          <td>LibInvent</td>
          <td>RNN</td>
          <td>Scaffold SMILES</td>
          <td>R-group replacement, library design</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/">LinkInvent</a></td>
          <td>RNN</td>
          <td>Two warhead fragments</td>
          <td>Linker design, scaffold hopping</td>
      </tr>
      <tr>
          <td>Mol2Mol</td>
          <td>Transformer</td>
          <td>Input molecule</td>
          <td>Molecular optimization within similarity bounds</td>
      </tr>
  </tbody>
</table>
<p>All generators are fully integrated with all three optimization algorithms (TL, RL, CL). The Mol2Mol transformer was trained on over 200 billion molecular pairs from PubChem with <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> $\geq 0.50$, using ranking loss to directly link negative log-likelihood to molecular similarity.</p>
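<p>Tanimoto similarity on binary fingerprints is simple to state. A minimal sketch over sets of on-bit indices (in practice RDKit computes this over bit vectors):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    represented here as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)
```

<p>A pair threshold of $\geq 0.50$, as used for the Mol2Mol training data, keeps only pairs sharing at least half of their combined on-bits.</p>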
<h3 id="staged-learning-curriculum-learning">Staged Learning (Curriculum Learning)</h3>
<p>A key new feature is staged learning, which implements curriculum learning as multi-stage RL. Each stage can define a different scoring profile, allowing users to gradually phase in computationally expensive scoring functions. For example, cheap drug-likeness filters can run first, followed by docking in later stages. Stages terminate when a maximum score threshold is exceeded or a step limit is reached.</p>
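<p>The staged-learning loop can be sketched as follows; the stage dictionary keys (<code>score_fn</code>, <code>threshold</code>, <code>max_steps</code>) and the mean-score termination rule are assumptions for illustration, not REINVENT 4's configuration schema:</p>

```python
def run_stages(stages, generate_batch, update_agent):
    """Hypothetical multi-stage RL driver: each stage scores with its own
    (possibly more expensive) function and ends early once the batch's
    mean score clears the stage threshold."""
    history = []
    for stage in stages:
        for step in range(stage["max_steps"]):
            batch = generate_batch()
            scores = [stage["score_fn"](mol) for mol in batch]
            update_agent(batch, scores)
            mean_score = sum(scores) / len(scores)
            history.append((stage["name"], step, mean_score))
            if mean_score >= stage["threshold"]:
                break  # advance to the next stage's scoring profile
    return history
```

<p>An early stage might score with cheap RDKit drug-likeness filters while a later stage swaps in docking, so expensive oracles only see already-plausible molecules.</p>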
<h3 id="scoring-subsystem">Scoring Subsystem</h3>
<p>The scoring subsystem implements a plugin architecture supporting over 25 scoring components, including:</p>
<ul>
<li>Physicochemical descriptors from RDKit (QED, SLogP, TPSA, molecular weight, etc.)</li>
<li>Molecular docking via DockStream (<a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>, rDock, Hybrid, Glide, GOLD)</li>
<li>QSAR models via Qptuna and ChemProp (D-MPNN)</li>
<li>Shape similarity via ROCS</li>
<li>Synthesizability estimation via SA score</li>
<li>Matched molecular pairs via mmpdb</li>
<li>Generic REST and external process interfaces</li>
</ul>
<p>Scores are aggregated via weighted arithmetic or geometric mean. A transform system (sigmoid, step functions, value maps) normalizes individual component scores to $[0, 1]$.</p>
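<p>A sketch of the two pieces just described, a sigmoid transform onto $[0, 1]$ and weighted geometric-mean aggregation; the parameterization is illustrative rather than REINVENT 4's exact implementation:</p>

```python
import math

def sigmoid_transform(x, center, k=1.0):
    """Map a raw component value (e.g. TPSA, SLogP) onto [0, 1];
    center and steepness k are illustrative parameters."""
    return 1.0 / (1.0 + math.exp(-k * (x - center)))

def weighted_geometric_mean(scores, weights):
    """Aggregate normalized component scores: prod_i s_i^(w_i / sum_w),
    clamped to avoid log(0). Any zero-weighted failure drags the total down."""
    total = sum(weights)
    return math.exp(
        sum(w * math.log(max(s, 1e-12)) for s, w in zip(scores, weights)) / total
    )
```

<p>The geometric mean is often preferred over the arithmetic mean because a single near-zero component (e.g. a failed docking score) collapses the aggregate rather than being averaged away.</p>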
<h2 id="pdk1-inhibitor-case-study">PDK1 Inhibitor Case Study</h2>
<p>The paper demonstrates REINVENT 4 through a structure-based drug design exercise targeting <a href="https://en.wikipedia.org/wiki/PDPK1">Phosphoinositide-dependent kinase-1 (PDK1)</a> inhibitors. The experimental setup uses PDB crystal structure 2XCH with DockStream and Glide for docking, defining hits as molecules with docking score $\leq -8$ kcal/mol and QED $\geq 0.7$.</p>
<p><strong>Baseline RL from prior</strong>: 50 epochs of staged learning with batch size 128 produced 119 hits from 6,400 generated molecules (1.9% hit rate), spread across 103 generic Bemis-Murcko scaffolds.</p>
<p><strong>Transfer learning + RL</strong>: After 10 epochs of TL on 315 congeneric pyridinone PDK1 actives from PubChem Assay AID1798002, the same 50-epoch RL run produced 222 hits (3.5% hit rate) across 176 unique generic scaffolds, nearly doubling productivity.</p>
<p>Both approaches generated top-scoring molecules (docking score of -10.1 kcal/mol each) with plausible binding poses reproducing key protein-ligand interactions seen in the native crystal structure, including hinge interactions with ALA 162 and contacts with LYS 111.</p>
<p>The paper also demonstrates the agent&rsquo;s plasticity through a molecular weight switching experiment: after 500 epochs driving generation toward 1500 Da molecules, switching the reward to favor molecules $\leq 500$ Da resulted in rapid adaptation within ~50 epochs, showing that the RL agent can recover from extreme biases.</p>
<h2 id="practical-software-for-ai-driven-drug-discovery">Practical Software for AI-Driven Drug Discovery</h2>
<p>REINVENT 4 represents a mature, well-documented framework that consolidates years of incremental development into a single codebase. Key practical features include TOML/JSON configuration, TensorBoard visualization, multinomial sampling and beam search decoding, diversity filters for scaffold-level novelty, experience replay (inception), and a plugin mechanism for extending the scoring subsystem.</p>
<p>The authors acknowledge that this is one approach among many and that there is no single solution that uniformly outperforms others. REINVENT has demonstrated strong sample efficiency in benchmarks and produced realistic 3D docking poses, but the paper does not claim universal superiority. The focus is on providing a well-engineered, transparent reference implementation rather than advancing a novel algorithm.</p>
<p>Limitations include that only the Mol2Mol prior supports stereochemistry, the training data biases constrain the explorable chemical space, and the SMILES-based representation inherits the known fragility of string-based molecular encodings.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training (Reinvent)</td>
          <td>ChEMBL 25</td>
          <td>~1.7M molecules</td>
          <td>Drug-like compounds</td>
      </tr>
      <tr>
          <td>Prior training (LibInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Scaffold-decoration pairs</td>
      </tr>
      <tr>
          <td>Prior training (LinkInvent)</td>
          <td>ChEMBL 27</td>
          <td>~1.9M molecules</td>
          <td>Fragment-linker pairs</td>
      </tr>
      <tr>
          <td>Prior training (Mol2Mol)</td>
          <td>ChEMBL 28 / PubChem</td>
          <td>~200B pairs</td>
          <td>Tanimoto similarity $\geq 0.50$</td>
      </tr>
      <tr>
          <td>Case study TL</td>
          <td>PubChem AID1798002</td>
          <td>315 compounds</td>
          <td>Congeneric PDK1 actives</td>
      </tr>
      <tr>
          <td>Case study docking</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 crystal structure</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimization</strong>: DAP (recommended), plus three deprecated alternatives (REINFORCE, A2C, MAULI)</li>
<li><strong>Decoding</strong>: Multinomial sampling (default, temperature $K = 1$) and beam search</li>
<li><strong>Diversity filter</strong>: Murcko scaffold, topological scaffold, scaffold similarity, same-SMILES penalty</li>
<li><strong>Experience replay</strong>: Inception memory with configurable size and sampling rate</li>
<li><strong>Gradient descent</strong>: Adam optimizer</li>
</ul>
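<p>The experience-replay (&ldquo;inception&rdquo;) idea can be sketched as a small best-so-far buffer; the class name and parameters below are hypothetical, not REINVENT 4's API:</p>

```python
import random

class InceptionMemory:
    """Hypothetical experience-replay buffer: keep the best-scoring SMILES
    seen so far and mix a few of them into each RL update batch."""

    def __init__(self, size=100, sample_size=10):
        self.size = size
        self.sample_size = sample_size
        self.buffer = []  # list of (score, smiles), best first

    def add(self, smiles, score):
        self.buffer.append((score, smiles))
        self.buffer.sort(key=lambda pair: pair[0], reverse=True)
        del self.buffer[self.size:]  # evict everything below the top `size`

    def sample(self):
        k = min(self.sample_size, len(self.buffer))
        return random.sample(self.buffer, k)
```

<p>Replaying high-reward sequences alongside freshly sampled ones keeps the agent from forgetting good solutions between updates.</p>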
<h3 id="models">Models</h3>
<p>All pre-trained priors are distributed with the repository. RNN-based generators (Reinvent, LibInvent, LinkInvent) and transformer-based generator (Mol2Mol) with multiple similarity-conditioned variants.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Condition</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit rate (RL)</td>
          <td>1.9%</td>
          <td>50 epochs, batch 128</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Hit rate (TL+RL)</td>
          <td>3.5%</td>
          <td>10 TL + 50 RL epochs</td>
          <td>PDK1 case study</td>
      </tr>
      <tr>
          <td>Scaffold diversity (RL)</td>
          <td>103 scaffolds</td>
          <td>From 119 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Scaffold diversity (TL+RL)</td>
          <td>176 scaffolds</td>
          <td>From 222 hits</td>
          <td>Generic Bemis-Murcko</td>
      </tr>
      <tr>
          <td>Best docking score</td>
          <td>-10.1 kcal/mol</td>
          <td>Both methods</td>
          <td>Glide SP</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements. REINVENT 4 supports both GPU and CPU execution. Python 3.10+ is required, with PyTorch 1.x (2.0 also compatible) and RDKit 2022.9+.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/REINVENT4">REINVENT4</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Full framework with pre-trained priors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/DockStream">DockStream</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Docking wrapper for scoring</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Loeffler, H. H., He, J., Tibo, A., Janet, J. P., Voronov, A., Mervin, L. H., &amp; Engkvist, O. (2024). Reinvent 4: Modern AI-driven generative molecule design. <em>Journal of Cheminformatics</em>, 16, 20. <a href="https://doi.org/10.1186/s13321-024-00812-5">https://doi.org/10.1186/s13321-024-00812-5</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{loeffler2024reinvent,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Reinvent 4: Modern AI-driven generative molecule design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Loeffler, Hannes H. and He, Jiazhen and Tibo, Alessandro and Janet, Jon Paul and Voronov, Alexey and Mervin, Lewis H. and Engkvist, Ola}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00812-5}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
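<p>The $\mu \pm 4\sigma$ property filter can be sketched in a few lines; <code>reference</code> stands in for the ZINC250k property values (molecular weight or LogP):</p>

```python
import statistics

def passes_property_filter(value, reference, n_sigma=4.0):
    """Keep a molecule only if its property lies within mu +/- n_sigma
    of the reference (pre-training) distribution."""
    mu = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return abs(value - mu) <= n_sigma * sd
```
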
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
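<p>The iterative diverse-selection rule can be sketched as a greedy pass over score-ranked candidates; <code>similarity</code> is a stand-in for an ECFP4 Tanimoto function:</p>

```python
def select_diverse_top_k(scored_candidates, similarity, k=10, max_sim=0.35):
    """Greedy diverse top-k: walk candidates in descending score order and
    keep a molecule only if it is no more than max_sim similar to every
    molecule already kept."""
    kept = []
    for mol, score in sorted(scored_candidates, key=lambda pair: -pair[1]):
        if all(similarity(mol, other) <= max_sim for other, _ in kept):
            kept.append((mol, score))
            if len(kept) == k:
                break
    return kept
```

<p>With the 0.35 threshold, near-duplicates of an already-selected scaffold are skipped in favor of lower-scoring but structurally distinct candidates.</p>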
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
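<p>The defining move of Augmented Hill-Climb, updating on only the top fraction $K$ of each sampled batch, can be sketched as follows (function name and data layout are illustrative):</p>

```python
def ahc_update_subset(scored_batch, k=0.25):
    """Augmented Hill-Climb keeps only the top-k fraction of the sampled
    batch (ranked by score) for the policy update, discarding the rest."""
    ranked = sorted(scored_batch, key=lambda pair: pair[1], reverse=True)
    n_keep = max(1, int(len(ranked) * k))
    return ranked[:n_keep]
```

<p>With $K = 0.25$ and batch size 256, only the 64 best-scoring SMILES per step contribute to the loss, which is what makes the method sample efficient.</p>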
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
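<p>The diversity-aware variants reduce to a greedy filter: walk down the score-ranked list and keep a molecule only if its Tanimoto similarity to everything already kept stays below 0.35. A toy sketch, with fingerprints as plain Python bit sets standing in for ECFP4:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    union = len(fp_a | fp_b)
    return inter / union if union else 0.0

def top_k_diverse(scored, k=10, max_sim=0.35):
    """Greedy Top-K selection under a diversity constraint, as in the
    AUC Top-10 (Diverse) metric (sketch). `scored` is a list of
    (score, fingerprint_set) pairs."""
    picked = []
    for score, fp in sorted(scored, key=lambda t: t[0], reverse=True):
        if all(tanimoto(fp, p_fp) < max_sim for _, p_fp in picked):
            picked.append((score, fp))
        if len(picked) == k:
            break
    return picked
```

The Combined metric applies the $\mu \pm 4\sigma$ property filters first, then this diversity-constrained selection.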
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ acts as a scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
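<p>The sinusoidal positional encoding is straightforward to reproduce; a minimal pure-Python sketch of the two formulas above:</p>

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in the original Transformer:
    even dimensions get sin, odd dimensions get cos, with wavelengths
    forming a geometric progression."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe
```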
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
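<p>Enforcing the similarity constraint requires a global alignment score for every train/test protein pair. A minimal Needleman-Wunsch scorer (illustrative match/mismatch/gap scores; the paper used EMBOSS, whose default scoring and gap model differ):</p>

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score via the Needleman-Wunsch dynamic program,
    computed row by row to keep memory at O(min-length)."""
    n, m = len(a), len(b)
    prev = [j * gap for j in range(m + 1)]  # DP row for the empty prefix of a
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[m]
```

Percent identity (as used for the 20% threshold) additionally requires a traceback to count matched positions; only the score is sketched here.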
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
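<p>Beam search keeps the <code>beam_size</code> highest-scoring partial sequences at every step. A toy sketch, where the hypothetical <code>next_log_probs</code> callable stands in for the trained Transformer decoder:</p>

```python
import math

def beam_search(next_log_probs, beam_size=4, max_len=20, eos="$"):
    """Beam-search decoding sketch. `next_log_probs(prefix)` returns a dict
    of token -> log-probability for the next position."""
    beams = [("", 0.0)]  # (partial SMILES, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq.endswith(eos):  # finished hypotheses carry over unchanged
                candidates.append((seq, lp))
                continue
            for tok, tok_lp in next_log_probs(seq).items():
                candidates.append((seq + tok, lp + tok_lp))
        beams = sorted(candidates, key=lambda t: t[1], reverse=True)[:beam_size]
    return beams
```

&ldquo;One per one&rdquo; corresponds to returning only <code>beams[0]</code>; &ldquo;ten per one&rdquo; returns all hypotheses from a width-10 beam.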
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
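<p>The criteria in the table reduce to threshold checks over precomputed descriptors. A sketch over a property dict (in practice the values would come from RDKit; the descriptor key names here are made up for illustration):</p>

```python
def passes_drug_likeness(props):
    """Evaluate the drug-likeness thresholds from the table above on a
    dict of precomputed descriptors; returns per-rule pass/fail flags."""
    rules = {
        "logp": lambda v: v < 5,        # octanol-water partition
        "mw": lambda v: v < 500,        # molecular weight, Da
        "hbd": lambda v: v < 5,         # H-bond donors
        "hba": lambda v: v < 10,        # H-bond acceptors
        "rot_bonds": lambda v: v < 10,  # rotatable bonds
        "tpsa": lambda v: v < 140,      # topological polar surface area
        "sas": lambda v: v < 6,         # synthetic accessibility score
    }
    return {name: rule(props[name]) for name, rule in rules.items()}
```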
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
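<p>The novelty analysis boils down to a nearest-neighbor similarity search against the training set. A sketch with fingerprints as bit sets (stand-ins for the actual hashed fingerprints):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets."""
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter) if (fp_a or fp_b) else 0.0

def mean_nn_similarity(generated, training):
    """Mean nearest-neighbor Tanimoto similarity of generated molecules
    to the training set (the novelty statistic discussed above)."""
    nearest = [max(tanimoto(g, t) for t in training) for g in generated]
    return sum(nearest) / len(nearest)
```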
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i', h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
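<p>This decomposition is an exact algebraic identity of the softmax, and it can be checked numerically in a toy one-dimensional setting (scalar queries, keys, and values; $\lambda$ is simply the softmax mass falling on the prefix positions):</p>

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(q, kv):
    """Single-query attention over (key, value) pairs, toy scalar version."""
    weights = softmax([q * k for k, _ in kv])
    return sum(w * v for w, (_, v) in zip(weights, kv))

def prefix_attention(q, prefix_kv, ctx_kv):
    """Attention over the concatenated [PREFIX; context] key/value list,
    returned together with lambda, the total softmax mass on the prefix."""
    kv = prefix_kv + ctx_kv
    weights = softmax([q * k for k, _ in kv])
    lam = sum(weights[: len(prefix_kv)])
    out = sum(w * v for w, (_, v) in zip(weights, kv))
    return out, lam
```

The full attention output equals $(1 - \lambda)$ times context-only attention plus $\lambda$ times prefix-only attention, because restricting and renormalizing a softmax preserves the relative weights.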
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
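<p>The triplet loss is applied element-wise over the condition vector; a direct transcription of the formula (scalar illustration, without the autograd plumbing through $\hat{\mathbf{c}}$):</p>

```python
def triplet_property_loss(c, c_hat, c_dot):
    """Element-wise triplet property loss max((c_hat - c)^2 - (c_hat - c_dot)^2, 0),
    summed over the condition vector. c: input conditions; c_hat: MLP-predicted
    properties; c_dot: properties computed from the generated SMILES."""
    return sum(
        max((h - t) ** 2 - (h - d) ** 2, 0.0)
        for t, h, d in zip(c, c_hat, c_dot)
    )
```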
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
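<p>The relation-matrix construction can be sketched with the same first-order forward difference the paper uses ($\Delta = 1$). Here <code>attention_map</code> is a hypothetical stand-in for the model&rsquo;s prefix attention map as a function of the condition vector; a toy linear map is used so the behavior is easy to verify.</p>

```python
import numpy as np

def relation_matrix(attention_map, c, delta=1.0):
    """Approximate R = sum_i |dA/dc_i| with forward differences over the
    five property conditions (index 0 is the pocket condition, skipped,
    matching the paper's sum over i = 2..6)."""
    A0 = attention_map(c)
    R = np.zeros_like(A0)
    for i in range(1, len(c)):
        c_pert = c.copy()
        c_pert[i] += delta
        R += np.abs(attention_map(c_pert) - A0) / delta
    return R

# Toy linear "attention map" of the 6 conditions, for illustration only.
W = np.arange(24, dtype=float).reshape(6, 2, 2)
toy_attention = lambda c: np.tensordot(c, W, axes=1)

R = relation_matrix(toy_attention, np.ones(6))
```

<p>For the linear toy map the finite difference recovers $|W_i|$ exactly, so <code>R</code> is simply the elementwise sum of the five property slices of <code>W</code>; in the real model the entries quantify how strongly each condition perturbs the attention pattern.</p>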
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PASITHEA: Gradient-Based Molecular Design via Dreaming</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</guid><description>PASITHEA applies inceptionism to molecular design, using gradient-based optimization on SELFIES representations to generate molecules with target properties.</description><content:encoded><![CDATA[<h2 id="inceptionism-applied-to-molecular-inverse-design">Inceptionism Applied to Molecular Inverse Design</h2>
<p>This is a <strong>Method</strong> paper that introduces PASITHEA, a gradient-based approach to de-novo molecular design inspired by inceptionism (deep dreaming) techniques from computer vision. The core contribution is a direct optimization framework that modifies molecular structures by backpropagating through a trained property-prediction network, with the molecular input (rather than weights) serving as the optimizable variable. PASITHEA is enabled by SELFIES, a surjective molecular string representation that guarantees 100% validity of generated molecules.</p>
<h2 id="the-need-for-direct-gradient-based-molecular-optimization">The Need for Direct Gradient-Based Molecular Optimization</h2>
<p>Existing inverse molecular design methods, including variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), and genetic algorithms (GAs), share a common characteristic: they optimize molecules indirectly. VAEs and GANs learn distributions and scan latent spaces. RL agents learn policies from environmental rewards. GAs iteratively apply mutations and selections. None of these approaches directly maximize an objective function in a gradient-based manner with respect to the molecular representation itself.</p>
<p>This indirection has several consequences. VAE-based methods require learning a latent space, and the optimization happens in that space rather than directly on molecular structures. RL and GA methods require expensive function evaluations for each candidate molecule. The authors identify an opportunity to exploit gradients more directly by reversing the learning process of a neural network trained to predict molecular properties, thereby sidestepping latent spaces, policies, and population-based search entirely.</p>
<p>A second motivation is interpretability. By operating directly on the molecular representation (rather than a learned latent space), PASITHEA can reveal what a regression network has learned about structure-property relationships, a capability the authors frame as analogous to how deep dreaming reveals what image classifiers have learned about visual features.</p>
<h2 id="core-innovation-inverting-regression-networks-on-selfies">Core Innovation: Inverting Regression Networks on SELFIES</h2>
<p>PASITHEA&rsquo;s key insight is a two-phase training procedure that repurposes the standard neural network training loop for molecule generation.</p>
<p><strong>Phase 1: Prediction training.</strong> A fully connected neural network is trained to predict a real-valued chemical property (logP) from one-hot encoded SELFIES strings. The standard feedforward and backpropagation process updates the network weights to minimize mean squared error between predicted and ground-truth property values:</p>
<p>$$
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} (f_{\theta}(\mathbf{x}_i) - y_i)^2
$$</p>
<p>where $f_{\theta}$ is the neural network with parameters $\theta$, $\mathbf{x}_i$ is the one-hot encoded SELFIES input, and $y_i$ is the target logP value.</p>
<p><strong>Phase 2: Inverse training (deep dreaming).</strong> The network weights $\theta$ are frozen. For a given input molecule $\mathbf{x}$ and a desired target property value $y_{\text{target}}$, the gradients are computed with respect to the input representation rather than the weights:</p>
<p>$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} \mathcal{L}(f_{\theta}(\mathbf{x}), y_{\text{target}})
$$</p>
<p>This gradient descent on the input incrementally modifies the one-hot encoding of the molecular string, transforming it toward a structure whose predicted property matches the target value. At each step, the argmax function converts the continuous one-hot encoding back to a discrete SELFIES string, which always maps to a valid molecular graph due to the surjective property of SELFIES.</p>
<p><strong>The role of SELFIES.</strong> The surjective mapping from strings to molecular graphs is essential. With SMILES, intermediate strings during optimization can become syntactically invalid (e.g., an unclosed ring like &ldquo;CCCC1CCCCC&rdquo;), producing no valid molecule. SELFIES enforces constraints that guarantee every string maps to a valid molecular graph, making the continuous gradient-based optimization feasible.</p>
<p><strong>Input noise injection.</strong> Because inverse training transforms a one-hot encoding from binary values to real numbers, the discrete-to-continuous transition can cause convergence problems. The authors address this by initializing the input with noise: every zero in the one-hot encoding is replaced by a random number in $[0, k]$, where $k$ is a hyperparameter between 0.5 and 0.95. This smooths the optimization landscape and enables incremental molecular modifications rather than abrupt changes.</p>
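<p>A minimal sketch of the inverse-training loop, assuming a toy linear predictor with an analytic input gradient standing in for the frozen MLP (in PASITHEA itself the gradient comes from backpropagation through the trained network):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def dream(x_onehot, w, y_target, lr=0.01, k=0.9, steps=500):
    """Gradient descent on the *input* of a frozen predictor f(x) = w . x;
    the weights w never change, only the (noised) one-hot input does."""
    x = x_onehot.astype(float)
    x[x == 0] = rng.uniform(0.0, k, size=int((x == 0).sum()))  # noise in [0, k]
    for _ in range(steps):
        pred = w @ x
        grad = 2.0 * (pred - y_target) * w  # d/dx of (f(x) - y_target)^2
        x -= lr * grad
    return x

w = np.array([1.0, -2.0, 0.5, 3.0])
x_opt = dream(np.array([1, 0, 0, 0]), w, y_target=6.0)
# The frozen model's prediction w @ x_opt now sits near the target of 6.0.
```

<p>At convergence, taking the argmax over each token&rsquo;s slice of the optimized input would recover a discrete SELFIES string; that decode step is omitted here.</p>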
<h2 id="experimental-setup-on-qm9-with-logp-optimization">Experimental Setup on QM9 with LogP Optimization</h2>
<h3 id="dataset-and-property">Dataset and Property</h3>
<p>The experiments use a random subset of 10,000 molecules from the QM9 dataset. The target property is the logarithm of the partition coefficient (logP), computed using RDKit. LogP measures lipophilicity, an important drug-likeness indicator that follows an approximately normal distribution in QM9 and has a nearly continuous range, making it suitable for gradient-based optimization.</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>PASITHEA uses a fully connected neural network with four layers, each containing 500 nodes with ReLU activation. The loss function is mean squared error. Data is split 85%/15% for training/testing. The prediction model trains for approximately 1,500 epochs with an Adam optimizer and a learning rate of $1 \times 10^{-6}$.</p>
<p>For inverse training, the authors select a noise upper-bound of 0.9 and a learning rate of 0.01, chosen from hyperparameter tuning experiments that evaluate the percentage of molecules optimized toward the target property.</p>
<h3 id="optimization-targets">Optimization Targets</h3>
<p>Two extreme logP targets are used: $+6$ (high lipophilicity) and $-6$ (low lipophilicity). These values exceed the range of logP values in the QM9 dataset (minimum: $-2.19$, maximum: $3.08$), testing whether the model can extrapolate beyond the training distribution.</p>
<h2 id="distribution-shifts-and-interpretable-molecular-transformations">Distribution Shifts and Interpretable Molecular Transformations</h2>
<h3 id="distribution-level-results">Distribution-Level Results</h3>
<p>Applying deep dreaming to the full set of 10,000 molecules produces a clear shift in the logP distribution:</p>
<table>
  <thead>
      <tr>
          <th>Statistic</th>
          <th>QM9 Original</th>
          <th>Optimized (target +6)</th>
          <th>Optimized (target -6)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean logP</td>
          <td>0.3909</td>
          <td>1.8172</td>
          <td>-0.3360</td>
      </tr>
      <tr>
          <td>Min logP</td>
          <td>-2.1903</td>
          <td>-0.8240</td>
          <td>-2.452</td>
      </tr>
      <tr>
          <td>Max logP</td>
          <td>3.0786</td>
          <td>4.2442</td>
          <td>0.9018</td>
      </tr>
  </tbody>
</table>
<p>The optimized distributions extend beyond the original dataset&rsquo;s property range. The right-shifted distribution (target +6) produces molecules with logP values up to 4.24, exceeding the original maximum of 3.08. The left-shifted distribution (target -6) reaches -2.45, below the original minimum. This indicates that PASITHEA can generate molecules with properties outside the training data bounds.</p>
<p>Additionally, 97.2% of the generated molecules do not exist in the original training set, indicating that the network is not memorizing data but rather using structural features to guide optimization. Some generated molecules contain more heavy atoms than the QM9 maximum of 9, since the SELFIES string length allows for larger structures.</p>
<h3 id="molecule-level-interpretability">Molecule-Level Interpretability</h3>
<p>The stepwise molecular transformations reveal interpretable &ldquo;strategies&rdquo; the network employs:</p>
<ol>
<li>
<p><strong>Nitrogen appendage</strong>: When optimizing for lower logP, the network repeatedly appends nitrogen atoms to the molecule. The authors observe this as a consistent pattern across multiple test molecules, reflecting the known relationship between nitrogen content and reduced lipophilicity.</p>
</li>
<li>
<p><strong>Length modulation</strong>: When optimizing for higher logP, the network tends to increase molecular chain length (e.g., extending a carbon chain). When optimizing for lower logP, it shortens chains. This captures the intuition that larger, more carbon-heavy molecules tend to be more lipophilic.</p>
</li>
<li>
<p><strong>Bond order changes</strong>: The network replaces single bonds with double or triple bonds during optimization, demonstrating an understanding of the relationship between bonding patterns and logP.</p>
</li>
<li>
<p><strong>Consistency across trials</strong>: Because the input initialization includes random noise, repeated trials with the same molecule produce different transformation sequences. Despite this stochasticity, the network applies consistent strategies across trials (e.g., always shortening chains for negative optimization), validating that it has learned genuine structure-property relationships.</p>
</li>
</ol>
<h3 id="thermodynamic-stability">Thermodynamic Stability</h3>
<p>The authors assess thermodynamic stability (as a rough proxy for synthesizability) by computing heats of formation using MOPAC2016 at the PM7 level of theory. Some optimization trajectories move toward thermodynamically stable molecules (negative heats of formation), while others produce less stable structures. The authors acknowledge this limitation and propose multi-objective optimization incorporating stability as a future direction.</p>
<h3 id="comparison-to-vaes">Comparison to VAEs</h3>
<p>The key distinction from VAEs is where gradient computation occurs. In VAEs, a latent space is learned through encoding and decoding, and property optimization happens in that latent space. In PASITHEA, gradients are computed directly with respect to the molecular representation (SELFIES one-hot encoding). The authors argue this makes the approach more interpretable, since we can probe what the network learned about molecular structure without the &ldquo;detour&rdquo; through a latent space.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors are forthright about the preliminary nature of these results:</p>
<ul>
<li>The method is demonstrated only on a small subset of QM9 with a single, computationally inexpensive property (logP).</li>
<li>The simple four-layer architecture may not scale to larger molecular spaces or more complex properties.</li>
<li>Generated molecules are not always thermodynamically stable, requiring additional optimization objectives.</li>
<li>The approach has not been benchmarked against established methods (VAEs, GANs, RL) on standard generative benchmarks.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>QM9 (random subset)</td>
          <td>10,000 molecules</td>
          <td>logP values computed via RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prediction training</strong>: 4-layer fully connected NN, 500 nodes/layer, ReLU activation, MSE loss, Adam optimizer, LR $1 \times 10^{-6}$, ~1,500 epochs, 85/15 train/test split</li>
<li><strong>Inverse training</strong>: Frozen weights, Adam optimizer, LR 0.01, noise upper-bound 0.9, logP targets of +6 and -6</li>
<li><strong>Heats of formation</strong>: MOPAC2016, PM7 level, geometry optimization with eigenvector following (EF)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a simple 4-layer MLP. No pre-trained weights are distributed, but the full code is available.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Novel molecules</td>
          <td>97.2%</td>
          <td>Generated molecules not in training set</td>
      </tr>
      <tr>
          <td>Max logP (target +6)</td>
          <td>4.2442</td>
          <td>Exceeds QM9 max of 3.0786</td>
      </tr>
      <tr>
          <td>Min logP (target -6)</td>
          <td>-2.452</td>
          <td>Below QM9 min of -2.1903</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Pasithea">Pasithea</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <em>Machine Learning: Science and Technology</em>, 2(3), 03LT02. <a href="https://doi.org/10.1088/2632-2153/ac09d6">https://doi.org/10.1088/2632-2153/ac09d6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2021deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Cynthia and Krenn, Mario and Eppel, Sagi and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{03LT02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/ac09d6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LMs Generate 3D Molecules from XYZ, CIF, PDB Files</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/3d-chemical-language-models-xyz-cif-pdb/</guid><description>Transformer language models trained on XYZ, CIF, and PDB sequences generate valid 3D molecules, crystals, and protein binding sites.</description><content:encoded><![CDATA[<h2 id="language-models-as-3d-chemical-structure-generators">Language Models as 3D Chemical Structure Generators</h2>
<p>This is a <strong>Method</strong> paper that demonstrates transformer-based language models can generate molecules, crystalline materials, and protein binding sites directly in three dimensions by training on sequences derived from standard chemical file formats (XYZ, CIF, PDB). The key contribution is showing that unmodified autoregressive language models, using only next-token prediction, achieve performance comparable to domain-specific 3D generative models that incorporate SE(3) equivariance and other geometric inductive biases.</p>
<h2 id="beyond-graphs-and-strings-the-need-for-3d-chemical-generation">Beyond Graphs and Strings: The Need for 3D Chemical Generation</h2>
<p>Molecular design with deep learning has largely relied on two representation paradigms: molecular graphs (processed with graph neural networks) and linearized string representations like <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (processed with sequence models). Both approaches have proven effective for drug-like organic molecules, but they share a fundamental limitation: they cannot represent structures whose identity depends on 3D spatial arrangement.</p>
<p>Crystalline materials, for example, have periodic lattice structures that cannot be reduced to simple graphs. Protein binding sites are defined by the 3D arrangement of hundreds of atoms across multiple residues. For tasks like catalysis design or structure-based drug discovery, the geometric positions of atoms are essential information that graphs and strings discard entirely.</p>
<p>Existing 3D generative models address this gap but typically require specialized architectures with SE(3) equivariance to handle rotational and translational symmetries. This work asks whether the general-purpose sequence modeling capability of transformers is sufficient to learn 3D chemical structure distributions without any domain-specific architectural modifications.</p>
<h2 id="direct-tokenization-of-chemical-file-formats">Direct Tokenization of Chemical File Formats</h2>
<p>The core insight is straightforward: any 3D molecule, crystal, or biomolecule is already stored as text in standard file formats (<a href="https://en.wikipedia.org/wiki/XYZ_file_format">XYZ</a>, <a href="https://en.wikipedia.org/wiki/Crystallographic_Information_File">CIF</a>, <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank_(file_format)">PDB</a>). These files encode atom types and their Cartesian coordinates as sequences of characters and numbers. Rather than designing specialized architectures for point cloud generation, the authors simply tokenize these files and train a standard GPT-style transformer to predict the next token.</p>
<p>A molecule with $n$ atoms is represented as:</p>
<p>$$
\mathcal{M} = (e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>where $e_i$ is the element type and $(x_i, y_i, z_i)$ are Cartesian coordinates. Crystals additionally include lattice parameters:</p>
<p>$$
\mathcal{C} = (\ell_a, \ell_b, \ell_c, \alpha, \beta, \gamma, e_1, x_1, y_1, z_1, \dots, e_n, x_n, y_n, z_n)
$$</p>
<p>Protein binding sites use residue-atom indicators (e.g., HIS-C, CYS-N) instead of bare element symbols:</p>
<p>$$
\mathcal{P} = (a_1, x_1, y_1, z_1, \dots, a_n, x_n, y_n, z_n)
$$</p>
<p>The language model learns the joint distribution over the token sequence via the autoregressive factorization:</p>
<p>$$
p(t_1, \dots, t_m) = \prod_{i=1}^{m} p(t_i \mid t_1, \dots, t_{i-1})
$$</p>
<p>where $m$ is the length of the tokenized sequence (distinct from the atom count $n$).</p>
<p>Two tokenization strategies are explored:</p>
<ol>
<li><strong>Character-level (LM-CH)</strong>: Every character in the file is a token, including digits, minus signs, spaces, and newlines. This produces long sequences but uses a small vocabulary (~30 tokens).</li>
<li><strong>Atom+coordinate-level (LM-AC)</strong>: Each atom placement requires exactly 4 tokens: one element/residue token and three coordinate tokens (e.g., &lsquo;-1.98&rsquo;). The vocabulary is larger (~100-10K tokens) but sequences are shorter.</li>
</ol>
<p>Numerical precision is controlled by rounding coordinates to 1, 2, or 3 decimal places. Since the model lacks rotation and translation invariance, random rotation augmentation during training improves performance.</p>
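<p>A minimal sketch of the two tokenization schemes, assuming the molecule arrives as a list of <code>(element, x, y, z)</code> tuples; the exact token vocabulary and file formatting in the paper may differ:</p>

```python
def tokenize_atom_coord(atoms, decimals=2):
    """LM-AC: exactly 4 tokens per atom -- one element token + three coordinate tokens."""
    tokens = []
    for elem, x, y, z in atoms:
        tokens.append(elem)
        tokens.extend(f"{v:.{decimals}f}" for v in (x, y, z))
    return tokens

def tokenize_char(atoms, decimals=2):
    """LM-CH: every character of the serialized file is its own token."""
    lines = [f"{e} {x:.{decimals}f} {y:.{decimals}f} {z:.{decimals}f}"
             for e, x, y, z in atoms]
    return list("\n".join(lines))

mol = [("C", 0.0, 0.0, 0.0), ("O", 1.23, -0.45, 0.0)]
# tokenize_atom_coord(mol) -> ['C', '0.00', '0.00', '0.00', 'O', '1.23', '-0.45', '0.00']
```

<p>The trade-off is visible directly: the atom+coordinate scheme emits 4 tokens per atom from a larger vocabulary, while the character scheme emits one token per character from a vocabulary of a few dozen symbols.</p>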
<h2 id="experiments-across-molecules-crystals-and-protein-binding-sites">Experiments Across Molecules, Crystals, and Protein Binding Sites</h2>
<h3 id="molecular-generation-zinc">Molecular Generation (ZINC)</h3>
<p>The model is evaluated on 250K commercially available molecules from the ZINC dataset, with an average of 23 heavy atoms. XYZ files are generated using RDKit&rsquo;s conformer tools. Coordinates use 2 decimal places of precision. The authors generate 10K molecules and evaluate both 3D geometry quality and standard generative metrics.</p>
<p>For 3D geometry assessment, the root mean squared deviation (RMSD) between language-model-generated conformers and RDKit-generated conformers mostly falls between 1.0 and 2.0 &#8491;, with a heavy tail extending to 4.0 &#8491;.</p>
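<p>For matched atom orderings, RMSD reduces to the root mean squared per-atom displacement; this sketch omits the optimal rigid alignment that tools such as RDKit&rsquo;s <code>GetBestRMS</code> perform first:</p>

```python
import numpy as np

def rmsd(coords_a: np.ndarray, coords_b: np.ndarray) -> float:
    """RMSD between two (n_atoms, 3) coordinate arrays with matched atom order."""
    diff = coords_a - coords_b
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

a = np.zeros((4, 3))
b = np.full((4, 3), 1.0)  # every atom displaced by (1, 1, 1)
# rmsd(a, b) == sqrt(3), i.e. about 1.732
```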
<p>Standard metrics include validity, uniqueness, novelty, and the earth mover&rsquo;s (Wasserstein, WA) distance between molecular property distributions (QED, SA score, molecular weight).</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>3D</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>WA MW</th>
          <th>WA SA</th>
          <th>WA QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Train</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>0.816</td>
          <td>0.013</td>
          <td>0.002</td>
      </tr>
      <tr>
          <td>SM-LM</td>
          <td>No</td>
          <td>98.35</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.640</td>
          <td>0.049</td>
          <td>0.005</td>
      </tr>
      <tr>
          <td>SF-LM</td>
          <td>No</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.772</td>
          <td>0.085</td>
          <td>0.006</td>
      </tr>
      <tr>
          <td>JTVAE</td>
          <td>No</td>
          <td>100.0</td>
          <td>98.56</td>
          <td>100.0</td>
          <td>22.63</td>
          <td>0.126</td>
          <td>0.023</td>
      </tr>
      <tr>
          <td>ENF</td>
          <td>Yes</td>
          <td>1.05</td>
          <td>96.37</td>
          <td>99.72</td>
          <td>168.5</td>
          <td>1.886</td>
          <td>0.160</td>
      </tr>
      <tr>
          <td>G-SchNet</td>
          <td>Yes</td>
          <td>1.20</td>
          <td>55.96</td>
          <td>98.33</td>
          <td>152.7</td>
          <td>1.126</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>Yes</td>
          <td>77.51</td>
          <td>96.40</td>
          <td>95.30</td>
          <td>101.2</td>
          <td>0.939</td>
          <td>0.093</td>
      </tr>
      <tr>
          <td>LM-CH</td>
          <td>Yes</td>
          <td>90.13</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>3.912</td>
          <td>2.608</td>
          <td>0.077</td>
      </tr>
      <tr>
          <td>LM-AC</td>
          <td>Yes</td>
          <td>98.51</td>
          <td>100.0</td>
          <td>100.0</td>
          <td>1.811</td>
          <td>0.026</td>
          <td>0.004</td>
      </tr>
  </tbody>
</table>
<p>The atom+coordinate tokenization model (LM-AC) achieves 98.51% validity with 100% uniqueness and novelty. Its WA scores for molecular weight (1.811) and QED (0.004) are substantially better than all other 3D generative baselines and competitive with SMILES/SELFIES language models. The character-level model (LM-CH) at 90.13% validity performs comparably to graph-based models but falls short of the string-based language models.</p>
<h3 id="crystal-generation-perov-5-and-mp-20">Crystal Generation (Perov-5 and MP-20)</h3>
<p>Crystal generation uses CIF-derived sequences with 3 decimal places of precision. Two datasets are used: Perov-5 (18,928 <a href="https://en.wikipedia.org/wiki/Perovskite_(structure)">perovskite</a> materials, 5 atoms per unit cell, 56 elements) and MP-20 (45,231 diverse materials, 1-20 atoms per unit cell, 89 elements).</p>
<p>Evaluation metrics include structural validity (minimum interatomic distance &gt; 0.5 angstrom), compositional validity (charge neutrality via SMACT), coverage (recall and precision between generated and test sets), and earth mover&rsquo;s distance for density and number of unique elements.</p>
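<p>For intuition, the earth mover's (Wasserstein-1) distance between two equal-size 1D property samples reduces to the mean absolute difference of their sorted values. A minimal sketch with hypothetical density values (the numbers below are illustrative, not from the paper):</p>

```python
def emd_1d(u, v):
    """Earth mover's distance between two equal-size 1D samples:
    the mean absolute difference of the sorted values, which equals the
    Wasserstein-1 distance between their empirical distributions."""
    assert len(u) == len(v)
    return sum(abs(a - b) for a, b in zip(sorted(u), sorted(v))) / len(u)

# Hypothetical densities (g/cm^3) for generated vs. held-out crystals.
generated = [2.1, 3.4, 5.0, 4.2, 2.8]
reference = [2.0, 3.5, 4.9, 4.0, 3.0]
wa_density = emd_1d(generated, reference)
```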
<table>
  <thead>
      <tr>
          <th>Data</th>
          <th>Model</th>
          <th>Struc. Valid (%)</th>
          <th>Comp. Valid (%)</th>
          <th>COV-R (%)</th>
          <th>COV-P (%)</th>
          <th>WA density</th>
          <th>WA elements</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Perov-5</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>98.59</td>
          <td>99.45</td>
          <td>98.46</td>
          <td>0.126</td>
          <td>0.063</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-CH</td>
          <td>100.0</td>
          <td>98.51</td>
          <td>99.60</td>
          <td>99.42</td>
          <td>0.071</td>
          <td>0.036</td>
      </tr>
      <tr>
          <td>Perov-5</td>
          <td>LM-AC</td>
          <td>100.0</td>
          <td>98.79</td>
          <td>98.78</td>
          <td>99.36</td>
          <td>0.089</td>
          <td>0.028</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>CDVAE</td>
          <td>100.0</td>
          <td>86.70</td>
          <td>99.15</td>
          <td>99.49</td>
          <td>0.688</td>
          <td>1.432</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-CH</td>
          <td>84.81</td>
          <td>83.55</td>
          <td>99.25</td>
          <td>97.89</td>
          <td>0.864</td>
          <td>0.132</td>
      </tr>
      <tr>
          <td>MP-20</td>
          <td>LM-AC</td>
          <td>95.81</td>
          <td>88.87</td>
          <td>99.60</td>
          <td>98.55</td>
          <td>0.696</td>
          <td>0.092</td>
      </tr>
  </tbody>
</table>
<p>On Perov-5, both language models outperform CDVAE across most metrics. On the more diverse MP-20 dataset, LM-AC achieves the best scores on 3 of 6 metrics and remains competitive on the others. LM-CH struggles more with structural validity on MP-20 (84.81%).</p>
<h3 id="protein-binding-site-generation-pdb">Protein Binding Site Generation (PDB)</h3>
<p>The most challenging task involves generating protein binding sites (~200-250 atoms each) from PDB-derived sequences. The dataset contains approximately 180K protein-ligand pairs. Residue-atom tokenization is used (e.g., CYS-C, CYS-N), with 2 decimal places of precision.</p>
<p>Validity is assessed per-residue using xyz2mol, with an additional check for inter-residue atomic overlap (atoms from different residues closer than the minimum bond distance). Approximately 99% of generated pockets pass the residue validity check, while about 5% fail the overlap check. Of generated pockets, 89.8% have unique residue orderings, and 83.6% have novel orderings not seen in training, indicating the model is generating novel binding site structures rather than memorizing.</p>
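<p>The inter-residue overlap check can be sketched as a pairwise distance scan that ignores atom pairs within the same residue; the 0.9 angstrom threshold below is an assumed placeholder for the minimum bond distance, not the paper's exact value:</p>

```python
import math

MIN_BOND_DIST = 0.9  # assumed minimum bond distance (angstrom), illustrative

def has_inter_residue_clash(atoms, min_dist=MIN_BOND_DIST):
    """atoms: list of (residue_id, (x, y, z)) tuples.
    Flags any pair of atoms from *different* residues that sit closer
    than min_dist; intra-residue geometry is left to the per-residue
    xyz2mol validity check."""
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            res_i, pos_i = atoms[i]
            res_j, pos_j = atoms[j]
            if res_i == res_j:
                continue
            if math.dist(pos_i, pos_j) < min_dist:
                return True
    return False
```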
<h2 id="competitive-3d-generation-without-geometric-inductive-biases">Competitive 3D Generation Without Geometric Inductive Biases</h2>
<p>The central finding is that standard transformer language models, without any equivariance or geometric inductive biases, can generate valid 3D chemical structures across three substantially different domains. The atom+coordinate tokenization (LM-AC) consistently outperforms character-level tokenization (LM-CH), likely because it produces shorter sequences and reduces the number of sequential decisions needed per atom placement.</p>
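<p>To make the tokenization contrast concrete, here is a minimal sketch (with hypothetical helper names) of how a single XYZ-file line might be split under the two schemes; the exact vocabularies in the paper differ, but the gap in sequence length is the point:</p>

```python
def tokenize_char(line: str) -> list[str]:
    """Character-level tokenization (LM-CH style): every non-space symbol,
    including digits, signs, and decimal points, is its own token."""
    return [c for c in line if not c.isspace()]

def tokenize_atom_coord(line: str) -> list[str]:
    """Atom+coordinate tokenization (LM-AC style): the element symbol and
    each complete coordinate string become single tokens."""
    return line.split()

xyz_line = "C -1.43 0.28 2.17"
char_tokens = tokenize_char(xyz_line)   # 14 tokens for this line
ac_tokens = tokenize_atom_coord(xyz_line)  # 4 tokens for this line
```

Placing one atom then costs 4 sequential decisions instead of 14, at the price of a much larger vocabulary (one token per distinct coordinate value).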
<p>Several limitations are worth noting. The model generates atoms using absolute Cartesian coordinates, which means it must learn rotation and translation invariance purely from data augmentation rather than having it built into the architecture. The authors acknowledge this becomes increasingly difficult as structure size grows. The vocabulary size also scales with coordinate precision and structure complexity, which could become prohibitive for very large systems.</p>
<p>The paper does not include computational cost comparisons with baseline models, making it difficult to assess the practical tradeoff between the simplicity of the language modeling approach and the efficiency of specialized architectures. The authors also note that further validation through computational simulation and experiment is needed to confirm the physical plausibility of generated structures.</p>
<p>Future directions identified include inverse design of molecules and materials conditioned on target properties, extension to more complex structures (metal-organic frameworks), and exploration of alternative tokenization strategies to handle larger systems.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC</td>
          <td>250K molecules</td>
          <td>~23 heavy atoms avg; XYZ files via RDKit conformer generation</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Perov-5</td>
          <td>18,928 perovskites</td>
          <td>5 atoms/unit cell, 56 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>MP-20</td>
          <td>45,231 materials</td>
          <td>1-20 atoms/unit cell, 89 elements</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>Protein binding sites</td>
          <td>~180K protein-ligand pairs</td>
          <td>Processed to 200-250 atoms per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-style transformer with ~1M to 100M parameters</li>
<li><strong>Layers</strong>: 12</li>
<li><strong>Embedding size</strong>: 128 to 1024</li>
<li><strong>Attention heads</strong>: 4 to 12</li>
<li><strong>Batch size</strong>: 4 to 32 structures</li>
<li><strong>Learning rate</strong>: $10^{-4}$ to $10^{-5}$, decayed to $9 \times 10^{-6}$</li>
<li><strong>Data augmentation</strong>: Random rotation of training structures at each epoch</li>
<li><strong>Numerical precision</strong>: 2 decimal places (molecules, proteins), 3 decimal places (crystals)</li>
</ul>
<h3 id="models">Models</h3>
<p>No pre-trained model weights are publicly available. The paper mentions &ldquo;Example code can be found at&rdquo; but the URL appears to be missing from the published version.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Molecules</td>
          <td>xyz2mol produces valid RDKit Mol object</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>Crystals</td>
          <td>Structural (min distance &gt; 0.5 angstrom) and compositional (charge neutral)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>All</td>
          <td>Fraction of distinct generated structures</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>All</td>
          <td>Fraction not in training set</td>
      </tr>
      <tr>
          <td>Earth mover&rsquo;s distance</td>
          <td>All</td>
          <td>Distribution match for domain-specific properties</td>
      </tr>
      <tr>
          <td>RMSD</td>
          <td>Molecules</td>
          <td>Deviation from RDKit conformer geometries</td>
      </tr>
      <tr>
          <td>Coverage</td>
          <td>Crystals</td>
          <td>Recall and precision between generated and test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Models were trained on Compute Canada systems. Specific GPU types, counts, and training times are not reported.</p>

<h3 id="artifacts">Artifacts</h3>
<p>No public code repository, model weights, or datasets specific to this work were found. The ZINC, Perov-5, and MP-20 datasets used for evaluation are publicly available from their original sources.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flam-Shepherd, D. &amp; Aspuru-Guzik, A. (2023). Language models can generate molecules, materials, and protein binding sites directly in three dimensions as XYZ, CIF, and PDB files. <em>arXiv preprint arXiv:2305.05708</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{flamshepherd2023language,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Language models can generate molecules, materials, and protein binding sites directly in three dimensions as {XYZ}, {CIF}, and {PDB} files}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flam-Shepherd, Daniel and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2305.05708}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, inherently limiting the generalizability of proposed linkers to the pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
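<p>A minimal implementation of this weighted geometric mean, computed in log space for numerical stability (assuming all component scores are strictly positive; in REINVENT-style scoring each component is typically mapped into [0, 1]):</p>

```python
import math

def mpo_score(component_scores, weights):
    """Weighted geometric mean S(x) of per-component scores C_i(x) with
    weights w_i, as in the Link-INVENT scoring function. Assumes every
    component score is > 0 (a zero component zeroes the whole product)."""
    total_w = sum(weights)
    log_sum = sum(w * math.log(c) for c, w in zip(component_scores, weights))
    return math.exp(log_sum / total_w)
```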
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
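<p>The DAP update for a single sampled sequence can be sketched in a few lines; the default value of $\sigma$ below is an assumed placeholder, since the papers treat it as a tunable hyperparameter:</p>

```python
def dap_loss(log_p_prior, log_p_agent, score, sigma=120.0):
    """Difference of Augmented and Posterior likelihoods (DAP) loss for
    one sampled SMILES sequence: the squared gap between the augmented
    log-likelihood (prior anchored, reward shifted) and the agent's
    log-likelihood. sigma=120 is an illustrative default."""
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2
```

When the agent already assigns the augmented likelihood to a sequence, the loss is zero; otherwise the gradient pulls the agent toward high-scoring sequences that remain plausible under the prior.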
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
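<p>A minimal sketch of such a bucket-based filter (the bucket size of 25 follows the reproducibility details below; the class name and scaffold-string interface are illustrative):</p>

```python
from collections import defaultdict

class DiversityFilter:
    """Bucket-based diversity filter: once `bucket_size` molecules sharing
    a Bemis-Murcko scaffold have been scored, any further molecule with
    that scaffold receives a score of zero."""

    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(int)  # scaffold -> count seen so far

    def filter_score(self, scaffold, raw_score):
        if self.buckets[scaffold] >= self.bucket_size:
            return 0.0  # bucket full: push the agent elsewhere
        self.buckets[scaffold] += 1
        return raw_score
```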
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
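<p>The graph-based length components reduce to shortest-path computations on the linker's bond graph. A self-contained sketch using breadth-first search on a toy adjacency list (interpreting &ldquo;maximum graph length&rdquo; as the graph diameter over shortest paths, which is an assumption):</p>

```python
from collections import deque

def shortest_path_bonds(adjacency, start, end):
    """Number of bonds on the shortest path between two atoms.
    adjacency: atom index -> list of bonded atom indices."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == end:
            return dist
        for nbr in adjacency[atom]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # disconnected

# Toy branched linker: backbone 0-1-2-3 with a branch 1-4-5.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1, 5], 5: [4]}
effective = shortest_path_bonds(adj, 0, 3)      # attachment atoms 0 and 3
maximum = max(shortest_path_bonds(adj, a, b)    # longest path in the graph
              for a in adj for b in adj)
length_ratio = effective / maximum              # branching penalizes this ratio
```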
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced linker length ratio &gt;= 70 and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core constrained docking (fragment pose within 0.3 A of reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced linker effective length in [3, 5], length ratio &gt;= 70, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker example molecules, none docked as well as or better than the reference; of the 3 SyntaLinker examples, only one (the recovered reference itself) did</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/Dual_leucine_zipper_kinase">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 A squared, 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrodinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available in the ReinventCommunity GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrodinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
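<p>A toy illustration of the ring-size-annotated atom tokens (ring membership is supplied directly here; in practice it would come from a ring-perception step such as RDKit's, and the full FSMILES grammar also covers fragment connection syntax):</p>

```python
def fsmiles_atom_tokens(atoms, ring_sizes):
    """Annotate each atom symbol with the size of the ring it belongs to
    (0 for acyclic atoms), mimicking FSMILES tokens such as C_6.
    atoms: list of element symbols; ring_sizes: parallel list of ints."""
    return [f"{sym}_{size}" for sym, size in zip(atoms, ring_sizes)]

# Toluene sketch: six aromatic ring carbons plus one acyclic methyl carbon.
tokens = fsmiles_atom_tokens(["C"] * 7, [6, 6, 6, 6, 6, 6, 0])
```

Because the decoder sees <code>C_6</code> rather than bare <code>C</code>, it knows the intended ring size well before it has to emit the ring-closure token.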
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ Å, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
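<p>The local-to-global conversion these predictions imply can be illustrated with the standard natural-extension-reference-frame (NeRF) construction. This is a generic sketch, assuming $\theta$ is the bond angle at root1 and $\phi$ the dihedral about the root2-root1 axis; it is not the paper's exact geometry code:</p>

```python
import math

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def unit(u):
    n = math.sqrt(sum(a * a for a in u))
    return tuple(a / n for a in u)

def place_atom(root3, root2, root1, r, theta, phi):
    """Place a new atom given bond length r (to root1), bond angle theta
    (root2-root1-new), and dihedral phi (root3-root2-root1-new)."""
    bc = unit(sub(root1, root2))                 # root2 -> root1 axis
    n = unit(cross(sub(root2, root3), bc))       # normal of the reference plane
    m = cross(n, bc)                             # completes the local frame
    d = (-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi))
    return tuple(root1[i] + d[0]*bc[i] + d[1]*m[i] + d[2]*n[i]
                 for i in range(3))

# Place an atom 1.5 Å from root1 at a 109.5° bond angle, 0° dihedral:
new = place_atom((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                 1.5, math.radians(109.5), 0.0)
```

<p>Because the local frame is orthonormal, the bond length and angle are reproduced exactly; the model's search over $r \pm 0.1$ Å, $\theta \pm 2°$, $\phi \pm 2°$ amounts to evaluating this construction over a small grid of local coordinates.</p>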
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 Å of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 Å of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ Å, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
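<p>A minimal NumPy sketch of this biased attention, with random matrices standing in for the distance bias $B_D$ and edge-vector bias $B_J$ (the shapes here are assumptions for illustration):</p>

```python
import numpy as np

def biased_attention(Q, K, V, B_D, B_J):
    """Scaled dot-product attention with additive structural bias terms.
    Q, K, V: (n, d_k) per-atom projections; B_D, B_J: (n, n) pairwise biases."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + B_D + B_J
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
n, d = 4, 8
out = biased_attention(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(n, n)), rng.normal(size=(n, n)))
```

<p>Since the biases enter before the softmax, spatially close atom pairs (via $B_D$) and bonded pairs (via $B_J$) can be given systematically higher attention weight without changing the attention mechanism itself.</p>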
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
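<p>The denoising corruption used in pretraining can be mocked up as follows. This is a toy reimplementation of the perturbations described above (deletion, coordinate jitter, element corruption), not the authors' code, and the replacement-element vocabulary is a placeholder:</p>

```python
import random

def corrupt(atoms, coords, rng, element_vocab=("N", "O", "S", "F")):
    """Apply the denoising-pretraining perturbations described in the text:
    drop 25% of atoms, jitter each coordinate by up to +/-0.5 Angstrom, and
    relabel 25% of carbons with a random other element."""
    kept = [i for i in range(len(atoms)) if rng.random() > 0.25]
    new_atoms, new_coords = [], []
    for i in kept:
        sym = atoms[i]
        if sym == "C" and rng.random() < 0.25:
            sym = rng.choice(element_vocab)  # corrupt the carbon element type
        jitter = tuple(c + rng.uniform(-0.5, 0.5) for c in coords[i])
        new_atoms.append(sym)
        new_coords.append(jitter)
    return new_atoms, new_coords

rng = random.Random(0)
atoms = ["C"] * 8
coords = [(float(i), 0.0, 0.0) for i in range(8)]
noisy_atoms, noisy_coords = corrupt(atoms, coords, rng)
```

<p>The pretraining objective is then to reconstruct the clean <code>atoms</code>/<code>coords</code> from the corrupted pair, which forces the model to internalize both valence patterns and realistic geometry.</p>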
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
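<p>The filter itself is a simple threshold on two precomputed scores. In practice QED comes from RDKit's <code>QED</code> module and SAS from the RDKit contrib <code>sascorer</code> script; the score values below are made up for illustration:</p>

```python
def is_drug_like(qed, sas, qed_min=0.3, sas_max=5.0):
    """Drug-likeness filter applied before computing binding metrics."""
    return qed >= qed_min and sas <= sas_max

# (QED, SAS) pairs, e.g. precomputed with RDKit's QED module and the
# contrib sascorer:
candidates = [(0.65, 2.8), (0.25, 3.1), (0.45, 6.2), (0.59, 3.1)]
kept = [c for c in candidates if is_drug_like(*c)]
print(len(kept))  # 2
```

<p>Only molecules passing this gate enter the docking and similarity comparisons, which is why the "% drug-like" row in Table 1 matters so much for the downstream metrics.</p>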
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (Å)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M molecules from public sources are shared). Diversity of generated drug-like molecules is slightly lower than that of the baselines, though the authors argue that the baselines' extra diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ Å), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ Å, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 Å</td>
          <td>1.1 Å (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evolutionary Molecular Design via Deep Learning + GA</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</guid><description>Kwon et al. combine an RNN decoder for SMILES reconstruction with a genetic algorithm operating on ECFP fingerprints for goal-directed molecular design.</description><content:encoded><![CDATA[<h2 id="fingerprint-based-evolutionary-molecular-design">Fingerprint-Based Evolutionary Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces an evolutionary design methodology (EDM) for goal-directed molecular optimization. The primary contribution is a four-stage framework in which (1) molecules are encoded as <a href="https://en.wikipedia.org/wiki/Extended-connectivity_fingerprint">extended-connectivity fingerprint</a> (ECFP) vectors, (2) a genetic algorithm evolves these fingerprint vectors through mutation and crossover, (3) a recurrent neural network (RNN) decodes the evolved fingerprints back into valid SMILES strings, and (4) a deep neural network (DNN) evaluates molecular fitness. The key advantage over prior evolutionary approaches is that no hand-crafted chemical rules or fragment libraries are needed, as the RNN learns valid molecular reconstruction from data.</p>
<h2 id="challenges-in-evolutionary-molecular-optimization">Challenges in Evolutionary Molecular Optimization</h2>
<p>Evolutionary algorithms for molecular design face two core challenges. First, maintaining chemical validity of evolved molecules is difficult when operating on graph or string representations directly. Prior methods rely on predefined chemical rules and fragment libraries to constrain structural modifications (atom/bond additions, deletions, substitutions), but these introduce bias and risk convergence to local optima. Each new application domain requires specifying new chemical rules, which may not exist for emerging areas. Second, fitness evaluation must be both efficient and accurate. Simple evaluation methods like structural similarity indices or semi-empirical quantum chemistry calculations reduce computational cost but may not capture complex property relationships.</p>
<p>High-throughput computational screening (HTCS) is a common alternative, but it depends on the quality of predefined virtual chemical libraries and often requires multiple iterative enumerations, limiting its ability to explore novel chemical space.</p>
<h2 id="core-innovation-evolving-fingerprints-with-neural-decoding">Core Innovation: Evolving Fingerprints with Neural Decoding</h2>
<p>The key insight is to perform genetic operations in fingerprint space rather than in molecular graph or SMILES string space. The framework comprises three core functions, of which the decoder and property predictor are learned (the encoder is a fixed hash):</p>
<p><strong>Encoding function</strong> $e(\cdot)$: Converts a SMILES string $\mathbf{m}$ into a 5000-dimensional ECFP vector $\mathbf{x}$ using Morgan fingerprints with a neighborhood radius of 6. This is a deterministic hash-based encoding (not learned).</p>
<p><strong>Decoding function</strong> $d(\cdot)$: An RNN with three hidden layers of 500 LSTM units that reconstructs a SMILES string from an ECFP vector. The RNN generates SMILES as a sequence of three-character substrings, conditioning each prediction on the current substring and the input ECFP vector:</p>
<p>$$d(\mathbf{x}) = \mathbf{m}, \quad \mathbf{m}_{t+1} \sim p(\mathbf{m}_{t+1} \mid \mathbf{m}_{t}, \mathbf{x})$$</p>
<p>The three-character substring approach reduces the proportion of invalid SMILES by imposing additional constraints on subsequent characters.</p>
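<p>A minimal sketch of the fixed-length substring decomposition (one plausible reading of the paper's three-character scheme; the exact token inventory is not specified in the text):</p>

```python
def to_substrings(smiles, k=3):
    """Split a SMILES string into fixed-length substrings (space-padded),
    the decoding unit the RNN emits one step at a time."""
    padded = smiles + " " * (-len(smiles) % k)
    return [padded[i:i + k] for i in range(0, len(padded), k)]

chunks = to_substrings("CC(=O)Oc1ccccc1C(=O)O")  # aspirin SMILES
```

<p>Emitting three characters per step means each prediction must be jointly consistent across a small window (e.g. a ring-open digit and its neighboring atoms), which is the constraint the authors credit for the lower invalid-SMILES rate.</p>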
<p><strong>Property prediction function</strong> $f(\cdot)$: A five-layer DNN with 250 hidden units per layer that predicts molecular properties from ECFP vectors:</p>
<p>$$\mathbf{t} = f(e(\mathbf{m}))$$</p>
<p>The RNN is trained by minimizing cross-entropy loss between the softmax output and the target SMILES string $\mathbf{m}_{i}$, learning the relationship $d(e(\mathbf{m}_{i})) = \mathbf{m}_{i}$. The DNN is trained by minimizing mean squared error between predicted and computed property values. Both use the Adam optimizer with mini-batch size 100, 500 training epochs, and dropout rate 0.5.</p>
<h3 id="genetic-algorithm-operations">Genetic Algorithm Operations</h3>
<p>The GA evolves ECFP vectors using the DEAP library with the following parameters:</p>
<ul>
<li><strong>Population size</strong>: 50</li>
<li><strong>Crossover rate</strong>: 0.7 (uniform crossover, mixing ratio 0.2)</li>
<li><strong>Mutation rate</strong>: 0.3 (Gaussian mutation, $N(0, 0.2^{2})$, applied to 1% of elements)</li>
<li><strong>Selection</strong>: Tournament selection with size 3, top 3 individuals as parents</li>
<li><strong>Termination</strong>: 500 generations or 30 consecutive generations without fitness improvement</li>
</ul>
<p>The evolutionary loop proceeds as follows: a seed molecule $\mathbf{m}_{0}$ is encoded to $\mathbf{x}_{0}$, mutated to generate a population $\mathbf{P}^{0} = \{\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{L}\}$, each vector is decoded via the RNN, validity is checked with RDKit, fitness is evaluated via the DNN, and the top parents produce the next generation through crossover and mutation.</p>
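<p>The loop with the hyperparameters above can be sketched in plain Python. This is a simplified stand-in for the authors' DEAP setup: selection here is a per-offspring tournament rather than the exact top-3-parents scheme, and RNN decoding plus validity checking are folded into a stand-in fitness function:</p>

```python
import random

def mutate(vec, rng, rate=0.01, sigma=0.2):
    """Gaussian mutation N(0, 0.2^2) applied to ~1% of vector elements."""
    return [x + rng.gauss(0.0, sigma) if rng.random() < rate else x for x in vec]

def uniform_crossover(a, b, rng, mix=0.2):
    """Uniform crossover with mixing ratio 0.2."""
    child = a[:]
    for i in range(len(a)):
        if rng.random() < mix:
            child[i] = b[i]
    return child

def tournament(pop, fits, rng, k=3):
    """Tournament selection of size 3."""
    idx = max(rng.sample(range(len(pop)), k), key=lambda i: fits[i])
    return pop[idx]

def evolve(seed, fitness, rng, pop_size=50, cx_rate=0.7, mut_rate=0.3,
           generations=500, patience=30):
    pop = [mutate(seed, rng, rate=0.05) for _ in range(pop_size)]  # seed the population
    best, stale = max(fitness(v) for v in pop), 0
    for _ in range(generations):
        fits = [fitness(v) for v in pop]
        parents = [tournament(pop, fits, rng) for _ in range(pop_size)]
        nxt = []
        for i in range(pop_size):
            child = parents[i]
            if rng.random() < cx_rate:
                child = uniform_crossover(child, rng.choice(parents), rng)
            if rng.random() < mut_rate:
                child = mutate(child, rng)
            nxt.append(child)
        pop = nxt
        gen_best = max(fitness(v) for v in pop)
        if gen_best > best:
            best, stale = gen_best, 0
        else:
            stale += 1
            if stale >= patience:  # 30 generations without improvement
                break
    return best

rng = random.Random(0)
# Stand-in fitness: negative distance of the vector mean from a target value.
fitness = lambda v: -abs(sum(v) / len(v) - 0.5)
best = evolve([0.0] * 32, fitness, rng, generations=50)
```

<p>In the real pipeline, <code>fitness</code> would decode each vector with the RNN, reject RDKit-invalid SMILES, and score survivors with the DNN property predictor.</p>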
<h2 id="experimental-setup-light-absorbing-wavelength-optimization">Experimental Setup: Light-Absorbing Wavelength Optimization</h2>
<h3 id="training-data-and-deep-learning-performance">Training Data and Deep Learning Performance</h3>
<p>The models were trained on 10,000 to 100,000 molecules randomly sampled from PubChem (molecular weight 200-600 g/mol). Each molecule was labeled with DFT-computed excitation energy ($S_{1}$), <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO, and LUMO</a> energies using B3LYP/6-31G.</p>
<table>
  <thead>
      <tr>
          <th>Training Data</th>
          <th>Validity (%)</th>
          <th>Reconstructability (%)</th>
          <th>$S_{1}$ (R, MAE)</th>
          <th>HOMO (R, MAE)</th>
          <th>LUMO (R, MAE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100,000</td>
          <td>88.8</td>
          <td>62.4</td>
          <td>0.977, 0.185 eV</td>
          <td>0.948, 0.168 eV</td>
          <td>0.960, 0.195 eV</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>86.7</td>
          <td>60.1</td>
          <td>0.973, 0.198 eV</td>
          <td>0.945, 0.172 eV</td>
          <td>0.955, 0.209 eV</td>
      </tr>
      <tr>
          <td>30,000</td>
          <td>85.3</td>
          <td>59.8</td>
          <td>0.930, 0.228 eV</td>
          <td>0.934, 0.191 eV</td>
          <td>0.945, 0.224 eV</td>
      </tr>
      <tr>
          <td>10,000</td>
          <td>83.2</td>
          <td>55.7</td>
          <td>0.913, 0.278 eV</td>
          <td>0.885, 0.244 eV</td>
          <td>0.917, 0.287 eV</td>
      </tr>
  </tbody>
</table>
<p>Validity refers to the proportion of chemically valid SMILES after RDKit inspection. Reconstructability measures how often the RNN reproduces the original molecule from its ECFP, judged by a canonical-SMILES match among 10,000 generated strings (62.4% at 100k training samples).</p>
<h3 id="design-task-1-unconstrained-s1-modification">Design Task 1: Unconstrained S1 Modification</h3>
<p>Fifty seed molecules with $S_{1}$ values between 3.8 eV and 4.2 eV were evolved in both increasing and decreasing directions. With 50,000 training samples, $S_{1}$ increased by approximately 60% on average in the increasing direction and showed slightly lower rates of change in the decreasing direction. The asymmetry is attributed to the skewed $S_{1}$ distribution of training data (average $S_{1}$ of 4.3-4.4 eV, higher than the seed median of 4.0 eV). Performance saturated at approximately 50,000 training samples.</p>
<h3 id="design-task-2-s1-modification-with-homolumo-constraints">Design Task 2: S1 Modification with HOMO/LUMO Constraints</h3>
<p>The same 50 seeds were evolved with constraints: $-7.0 \text{ eV} &lt; \text{HOMO} &lt; -5.0 \text{ eV}$ and $\text{LUMO} &lt; 0.0 \text{ eV}$. In the increasing $S_{1}$ direction, constraints suppressed the rate of change because both HOMO and LUMO bounds limit the achievable HOMO-LUMO gap. In the decreasing direction, constraints had minimal effect because LUMO could freely decrease while HOMO had sufficient room to rise within the allowed range.</p>
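<p>One plausible way to fold the HOMO/LUMO window into the GA's fitness is a linear penalty on constraint violations; the paper does not specify its exact constraint handling, so this is an illustrative sketch:</p>

```python
def constrained_fitness(s1, homo, lumo, direction=1, penalty=10.0):
    """Score S1 in the chosen direction (+1 to increase, -1 to decrease),
    penalizing violations of -7.0 eV < HOMO < -5.0 eV and LUMO < 0.0 eV."""
    score = direction * s1
    violation = (max(0.0, -7.0 - homo)        # HOMO below the lower bound
                 + max(0.0, homo - (-5.0))    # HOMO above the upper bound
                 + max(0.0, lumo - 0.0))      # LUMO above zero
    return score - penalty * violation

# A candidate inside the window keeps its raw score...
ok = constrained_fitness(4.2, -6.0, -1.0)
# ...while one with HOMO above -5.0 eV is penalized.
bad = constrained_fitness(4.2, -4.5, -1.0)
```

<p>With a penalty of this form, the asymmetry the authors observe follows directly: raising $S_{1}$ widens the HOMO-LUMO gap and quickly runs into both bounds, while lowering $S_{1}$ leaves room for LUMO to fall and HOMO to rise within the window.</p>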
<h3 id="design-task-3-extrapolation-beyond-training-data">Design Task 3: Extrapolation Beyond Training Data</h3>
<p>To generate molecules with $S_{1}$ values below 1.77 eV (outside the training distribution, which had mean $S_{1}$ of 4.91 eV), the authors introduced iterative &ldquo;phases&rdquo;: generate molecules, compute their properties via DFT, retrain the models, and repeat. Starting from the 30 lowest-$S_{1}$ seed molecules with 300 generation runs per phase:</p>
<ul>
<li>Phase 1: Average $S_{1}$ = 2.20 eV, 12 molecules below 1.77 eV</li>
<li>Phase 2: Average $S_{1}$ = 2.22 eV, 37 molecules below 1.77 eV</li>
<li>Phase 3: Average $S_{1}$ = 2.31 eV, 58 molecules below 1.77 eV</li>
</ul>
<p>While the average $S_{1}$ rose slightly across phases, variance decreased (from 1.40 to 1.36), indicating the model concentrated its outputs closer to the target range. This active-learning-like loop demonstrates the framework can extend beyond the training distribution.</p>
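<p>The phase loop reads as a simple active-learning cycle. Below is a schematic sketch in which <code>generate</code>, <code>compute_dft</code>, and <code>retrain</code> are hypothetical stand-ins for the GA runs, DFT labeling, and model retraining steps:</p>

```python
def iterative_design(seeds, generate, compute_dft, retrain,
                     target=1.77, phases=3, n_seeds=30):
    """Extrapolation loop: generate candidates, label them with DFT,
    retrain the models on the new labels, and reseed from the lowest-S1
    molecules for the next phase."""
    hits = []
    for _ in range(phases):
        candidates = generate(seeds)
        labeled = [(m, compute_dft(m)) for m in candidates]
        hits.extend(m for m, s1 in labeled if s1 < target)
        retrain(labeled)  # fold the new DFT labels back into RNN and DNN
        seeds = [m for m, _ in sorted(labeled, key=lambda t: t[1])[:n_seeds]]
    return hits

# Toy run where "molecules" are just S1 values that drop 0.1 eV per phase:
hits = iterative_design([2.0] * 30,
                        generate=lambda seeds: [s - 0.1 for s in seeds],
                        compute_dft=lambda m: m,
                        retrain=lambda labeled: None)
```

<p>Each pass pushes the seed pool toward the target region, mirroring the growing hit counts (12, 37, 58 molecules below 1.77 eV) across the three reported phases.</p>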
<h3 id="design-task-4-guacamol-benchmarks">Design Task 4: GuacaMol Benchmarks</h3>
<p>The method was evaluated on the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> goal-directed benchmark suite using the ChEMBL25 training dataset. The RNN model was retrained with three-character substrings.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a></th>
          <th>SMILES GA</th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a></th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph MCTS</a></th>
          <th>cRNN</th>
          <th>EDM (ours)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.607</td>
          <td>1.000</td>
          <td>0.378</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Troglitazone rediscovery</td>
          <td>0.419</td>
          <td>1.000</td>
          <td>0.558</td>
          <td>1.000</td>
          <td>0.312</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Thiothixene rediscovery</td>
          <td>0.456</td>
          <td>1.000</td>
          <td>0.495</td>
          <td>1.000</td>
          <td>0.308</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(-1.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.980</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(8.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.979</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>TPSA(150.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>CNS MPO</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.944</td>
          <td>0.948</td>
          <td>0.948</td>
      </tr>
  </tbody>
</table>
<p>The EDM achieves maximum scores on all eight tasks, matching the cRNN baseline. The 256 highest-scoring molecules from the ChEMBL25 test set were used as seeds, with 500 SMILES strings generated per seed.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="results">Results</h3>
<p>The evolutionary design framework successfully evolved seed molecules toward target properties across all four design tasks. The RNN decoder maintained 88.8% chemical validity at 100k training samples, and the DNN property predictor achieved correlation coefficients above 0.94 for $S_{1}$, HOMO, and LUMO prediction. The iterative retraining procedure enabled exploration outside the training data distribution, generating 58 molecules with $S_{1}$ below 1.77 eV after three phases. On GuacaMol benchmarks, the method achieved maximum scores on all eight tasks, matching <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a>, <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, and cRNN baselines.</p>
<h3 id="limitations">Limitations</h3>
<p>Several limitations are worth noting:</p>
<ol>
<li><strong>Reconstructability ceiling</strong>: Only 62.4% of molecules could be reconstructed from their ECFP vectors, meaning the RNN decoder fails to recover the original molecule approximately 38% of the time. This information loss in the ECFP encoding is a fundamental bottleneck.</li>
<li><strong>Data dependence</strong>: Performance is sensitive to the training data distribution. The asymmetric evolution rates for increasing vs. decreasing $S_{1}$ directly reflect the skewed training data.</li>
<li><strong>Structural constraints</strong>: Three heuristic constraints (fused ring sizes, number of fused rings, alkyl chain lengths) were still needed to maintain reasonable molecular structures, partially undermining the claim of a fully data-driven approach.</li>
<li><strong>DFT reliance</strong>: The extrapolation experiment requires DFT calculations in the loop, which are computationally expensive and may limit scalability.</li>
<li><strong>Limited benchmark scope</strong>: Only 8 GuacaMol tasks were tested, and all achieved perfect scores, making it difficult to differentiate from competing methods. The paper does not report on harder multi-objective benchmarks.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>PubChem random sample</td>
          <td>10,000-100,000 molecules</td>
          <td>MW 200-600 g/mol, labeled with DFT-computed $S_{1}$, HOMO, LUMO</td>
      </tr>
      <tr>
          <td>GuacaMol Benchmark</td>
          <td>ChEMBL25</td>
          <td>Standard split</td>
          <td>Used for retraining RNN; 256 top-scoring seeds</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Genetic algorithm</strong>: DEAP library; population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3</li>
<li><strong>RNN decoder</strong>: 3 hidden layers, 500 LSTM units each, three-character substring generation</li>
<li><strong>DNN predictor</strong>: 5 layers, 250 hidden units, sigmoid activations, linear output</li>
<li><strong>Training</strong>: Adam optimizer, mini-batch 100, 500 epochs, dropout 0.5</li>
</ul>
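<p>The GA loop implied by these settings can be sketched in plain Python (a toy stand-in: the paper uses the DEAP library, and fitness comes from decoding ECFP vectors through the RNN and scoring them with the DNN property predictor; the bit length, mutation probability, and fitness function here are illustrative only):</p>

```python
import random

random.seed(0)
N_BITS = 64                                 # toy fingerprint length for illustration
POP, CXPB, MUTPB, TOURN = 50, 0.7, 0.3, 3   # hyperparameters from the paper

def evaluate(ind):
    # Placeholder fitness: in the paper, the ECFP vector is decoded to a
    # molecule by the RNN and scored by the DNN property predictor.
    return sum(ind)

def tournament(pop):
    # Tournament selection with size 3, as in the paper.
    return max(random.sample(pop, TOURN), key=evaluate)

def crossover(a, b):
    # Single-point crossover between two fingerprint vectors.
    p = random.randrange(1, N_BITS)
    return a[:p] + b[p:]

def mutate(ind):
    # Bit-flip mutation; per-bit rate is an assumption.
    return [bit ^ 1 if random.random() < 0.05 else bit for bit in ind]

pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
for _ in range(20):  # generations
    nxt = []
    while len(nxt) < POP:
        child = tournament(pop)
        if random.random() < CXPB:
            child = crossover(child, tournament(pop))
        if random.random() < MUTPB:
            child = mutate(child)
        nxt.append(child)
    pop = nxt

best = max(pop, key=evaluate)
```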
<h3 id="models">Models</h3>
<p>All neural networks were implemented using Keras with the Theano backend (GPU-accelerated). No pre-trained model weights are publicly available.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>RNN validity</strong>: Proportion of chemically valid SMILES (RDKit check)</li>
<li><strong>Reconstructability</strong>: Fraction of seed molecules recoverable from ECFP (canonical SMILES match in 10,000 generated strings)</li>
<li><strong>DNN accuracy</strong>: Correlation coefficient (R) and MAE via 10-fold cross-validation</li>
<li><strong>Evolutionary performance</strong>: Average rate of $S_{1}$ change across 50 seeds; molecule count in target range</li>
<li><strong>GuacaMol</strong>: Standard rediscovery and property satisfaction benchmarks</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models, training times, or computational requirements for the evolutionary runs. DFT calculations used the Gaussian 09 program suite with B3LYP/6-31G.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained models are available. The paper is published under a CC-BY 4.0 license as open access in Scientific Reports.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.nature.com/articles/s41598-021-96812-8">Paper (Nature)</a></td>
          <td>Paper</td>
          <td>CC-BY 4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Partially Reproducible. The method is described in sufficient detail for reimplementation, but no code, trained models, or preprocessed datasets are released. The DFT calculations require Gaussian 09, a commercial software package.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kwon, Y., Kang, S., Choi, Y.-S., &amp; Kim, I. (2021). Evolutionary design of molecules based on deep learning and a genetic algorithm. <em>Scientific Reports</em>, 11, 17304. <a href="https://doi.org/10.1038/s41598-021-96812-8">https://doi.org/10.1038/s41598-021-96812-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kwon2021evolutionary,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evolutionary design of molecules based on deep learning and a genetic algorithm}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kwon, Youngchun and Kang, Seokho and Choi, Youn-Suk and Kim, Inkoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-021-96812-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v3: Scaffold-Constrained Graph Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v3-scaffold-graph-transformer/</guid><description>DrugEx v3 proposes a Graph Transformer with novel positional encoding for scaffold-constrained molecular generation via multi-objective reinforcement learning.</description><content:encoded><![CDATA[<h2 id="a-graph-transformer-method-for-scaffold-constrained-drug-design">A Graph Transformer Method for Scaffold-Constrained Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces DrugEx v3, a Graph Transformer model for scaffold-constrained de novo drug design. The primary contribution is a novel positional encoding scheme for molecular graphs that allows a Transformer architecture to operate on graph-structured molecular data rather than <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings. The model takes user-provided scaffold fragments as input and generates complete molecules through growing and connecting operations, trained with multi-objective reinforcement learning to optimize for both target affinity and drug-likeness.</p>
<h2 id="from-fixed-objectives-to-user-guided-scaffold-design">From Fixed Objectives to User-Guided Scaffold Design</h2>
<p>Prior versions of DrugEx (v1 and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/">v2</a>) used RNN-based generators trained with reinforcement learning for de novo drug design, but they operated under fixed objectives and could not accept user-provided structural priors. If a medicinal chemist wanted to explore analogs of a specific scaffold, the model had to be retrained from scratch. SMILES-based molecular generators also face inherent limitations for scaffold-constrained design: SMILES is a linear notation, so inserting fragments at multiple positions of a scaffold requires complex grammar handling, and small token changes can produce invalid molecules.</p>
<p>Several approaches had been proposed for scaffold-based generation, including graph generative models (Lim et al., 2019), DeepScaffold (Li et al., 2020), SMILES-based scaffold decorators (Arus-Pous et al., 2020), and SyntaLinker for fragment linking (Yang et al., 2020). DrugEx v3 aims to combine the advantages of graph representations (validity guarantees, local invariance, flexible extension) with the Transformer architecture&rsquo;s ability to handle complex dependencies, while maintaining the multi-objective reinforcement learning framework from DrugEx v2.</p>
<h2 id="graph-positional-encoding-for-molecular-transformers">Graph Positional Encoding for Molecular Transformers</h2>
<p>The core innovation is adapting the Transformer architecture to work directly with molecular graph representations. Two key modifications make this possible.</p>
<p><strong>Graph word encoding.</strong> Since atoms and bonds cannot be processed simultaneously in a graph, the authors combine them into a single index:</p>
<p>$$
W = T_{atom} \times 4 + T_{bond}
$$</p>
<p>where $T_{atom}$ is the atom type index and $T_{bond}$ is the bond type index (four bond types: single, double, triple, and none).</p>
<p><strong>Graph positional encoding.</strong> Standard sequential position encoding does not capture molecular topology. The authors propose an adjacency-matrix-based positional encoding:</p>
<p>$$
P = I_{Atom} \times L_{max} + I_{Connected}
$$</p>
<p>where $I_{Atom}$ is the current atom index, $L_{max}$ is the maximum sequence length, and $I_{Connected}$ is the index of the atom connected by the current bond. This encoding is then processed through the standard sinusoidal positional encoding:</p>
<p>$$
PE_{(p, 2i)} = \sin(p / 10000^{2i / d_{m}})
$$</p>
<p>$$
PE_{(p, 2i+1)} = \cos(p / 10000^{2i / d_{m}})
$$</p>
<p>with $d_{m} = 512$.</p>
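<p>Concretely, the two encodings combine as follows (a minimal sketch; the particular atom/bond indices and the $L_{max}$ value are assumptions for illustration):</p>

```python
import numpy as np

L_MAX = 80  # assumed maximum sequence length; d_m = 512 is from the paper

def graph_word(t_atom, t_bond):
    # W = T_atom * 4 + T_bond (four bond types: single, double, triple, none)
    return t_atom * 4 + t_bond

def graph_position(i_atom, i_connected):
    # P = I_atom * L_max + I_connected
    return i_atom * L_MAX + i_connected

def sinusoidal_pe(pos, d_model=512):
    # Standard sinusoidal encoding applied to the graph position index.
    i = np.arange(d_model // 2)
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

w = graph_word(1, 0)                      # atom type 1 joined by bond type 0
pe = sinusoidal_pe(graph_position(3, 1))  # 512-d vector for atom 3 bonded to atom 1
```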
<p><strong>Molecule generation procedure.</strong> Each molecule in the training data is represented as a five-row matrix encoding atom type, bond type, connected atom index, current atom index, and fragment index. The columns are divided into three sections: fragment (the scaffold), growing (new atoms added to fragments), and linking (bonds connecting grown fragments). The decoder uses a GRU-based recurrent layer to sequentially output atom type, bond type, connected atom index, and current atom index at each step, with chemical valence rules enforced at every generation step to guarantee valid molecules.</p>
<p><strong>Multi-objective reinforcement learning.</strong> The generator is trained with a policy gradient objective:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) | \theta\right] = \sum_{t=1}^{T} \log G(y_{t} | y_{1:t-1}) \cdot R^{\ast}(y_{1:T})
$$</p>
<p>where $R^{*}$ is a Pareto-based reward combining target affinity and QED drug-likeness score:</p>
<p>$$
R^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>with $k$ being the solution&rsquo;s index in the Pareto rank. An exploration strategy uses two networks: an exploitation network $G_{\theta}$ (updated by policy gradient) and an exploration network $G_{\phi}$ (fixed, pre-trained on ChEMBL), with an exploration rate $\varepsilon$ controlling how many scaffolds are routed to $G_{\phi}$ during training.</p>
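<p>The piecewise reward is straightforward to transcribe (a direct sketch of the formula above; the rank-indexing convention in the example is an assumption):</p>

```python
def pareto_reward(k, desired, n_desired, n_undesired):
    # R* maps the Pareto-rank index k to [0, 1]: undesired solutions
    # score below 0.5, desired solutions at or above 0.5.
    if desired:
        return 0.5 + (k - n_undesired) / (2 * n_desired)
    return k / (2 * n_undesired)

# e.g. 6 undesired + 4 desired solutions, ranked worst (k=0) to best:
rewards = [pareto_reward(k, k >= 6, n_desired=4, n_undesired=6) for k in range(10)]
```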
<h2 id="experimental-setup-architecture-comparison-and-rl-optimization">Experimental Setup: Architecture Comparison and RL Optimization</h2>
<h3 id="data">Data</h3>
<p>The ChEMBL set (version 27) contained approximately 1.7 million molecules for pre-training, preprocessed via RDKit (charge neutralization, metal/fragment removal). The LIGAND set comprised 10,828 adenosine receptor ligands for fine-tuning. Each molecule was decomposed into fragments using the BRICS algorithm, creating scaffold-molecule pairs (up to 15 pairs per molecule with four fragments). The ChEMBL set yielded 9.3 million training pairs, and the LIGAND set produced 53,888 training pairs.</p>
<h3 id="architecture-comparison">Architecture comparison</h3>
<p>Four architectures were compared:</p>
<ol>
<li><strong>Graph Transformer</strong>: graph input with novel positional encoding</li>
<li><strong>Sequential Transformer</strong>: SMILES input with standard Transformer</li>
<li><strong>LSTM-BASE</strong>: SMILES encoder-decoder with three recurrent layers</li>
<li><strong>LSTM+ATTN</strong>: LSTM-BASE with an attention mechanism between encoder and decoder</li>
</ol>
<p>All models were pre-trained on ChEMBL and fine-tuned on the LIGAND set. The bioactivity predictor was a random forest regression model using 2048D ECFP6 fingerprints and 19D physicochemical descriptors, with an activity threshold of pX = 6.5 for the A2A adenosine receptor.</p>
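<p>The bioactivity predictor is a standard scikit-learn setup. A sketch with synthetic features standing in for the real inputs (in practice the 2048-bit ECFP6 vectors and 19 descriptors come from RDKit, e.g. Morgan fingerprints with radius 3; the labels here are random, not ChEMBL data):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Stand-ins for the paper's features: 2048-bit ECFP6 fingerprints plus
# 19 physicochemical descriptors.
n = 200
fps = rng.integers(0, 2, size=(n, 2048))
desc = rng.normal(size=(n, 19))
X = np.hstack([fps, desc])
y = rng.normal(loc=6.0, scale=1.0, size=n)  # synthetic pX labels

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = model.predict(X[:5])
active = pred >= 6.5  # activity threshold used in the paper
```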
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>Five metrics were used: validity (parseable molecules), accuracy (scaffold containment), desirability (meeting all objectives), uniqueness, and novelty (not in ChEMBL). Diversity was measured using the Solow-Polasky index with Tanimoto distance on ECFP6 fingerprints:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\intercal} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
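<p>Here $F(\mathbf{s})$ is the pairwise similarity matrix over the generated set and $\mathbf{e}$ a vector of ones; the normalized index approaches 1 for a maximally diverse set. A numpy sketch (the exponential similarity kernel and $\theta$ value follow the usual Solow-Polasky formulation and are assumptions here):</p>

```python
import numpy as np

def solow_polasky(dist, theta=1.0):
    # I(A) = (1/|A|) e^T F^{-1} e, with F_ij = exp(-theta * d_ij).
    # dist: pairwise Tanimoto-distance matrix over the generated molecules.
    F = np.exp(-theta * dist)
    e = np.ones(len(dist))
    return float(e @ np.linalg.solve(F, e)) / len(dist)

d = np.array([[0.0, 0.8, 0.9],
              [0.8, 0.0, 0.7],
              [0.9, 0.7, 0.0]])
diversity = solow_polasky(d)  # higher = more diverse, in (0, 1]
```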
<h3 id="hardware">Hardware</h3>
<p>Models were benchmarked on a server with NVIDIA Tesla P100 GPUs.</p>
<h2 id="key-results-graph-representation-advantages-and-rl-trade-offs">Key Results: Graph Representation Advantages and RL Trade-offs</h2>
<h3 id="pre-training-and-fine-tuning-performance">Pre-training and fine-tuning performance</h3>
<p>The Graph Transformer achieved the best overall performance across all metrics:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Validity (PT)</th>
          <th>Accuracy (PT)</th>
          <th>Validity (FT)</th>
          <th>Accuracy (FT)</th>
          <th>Novelty (FT)</th>
          <th>Uniqueness (FT)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph Transformer (512)</td>
          <td>100.0%</td>
          <td>99.3%</td>
          <td>100.0%</td>
          <td>99.2%</td>
          <td>68.9%</td>
          <td>82.9%</td>
      </tr>
      <tr>
          <td>Seq. Transformer (512)</td>
          <td>96.7%</td>
          <td>74.0%</td>
          <td>99.3%</td>
          <td>92.7%</td>
          <td>8.9%</td>
          <td>28.9%</td>
      </tr>
      <tr>
          <td>LSTM+ATTN (512)</td>
          <td>94.3%</td>
          <td>72.8%</td>
          <td>96.9%</td>
          <td>85.9%</td>
          <td>6.3%</td>
          <td>20.7%</td>
      </tr>
      <tr>
          <td>LSTM-BASE (512)</td>
          <td>93.9%</td>
          <td>52.4%</td>
          <td>98.7%</td>
          <td>81.6%</td>
          <td>3.9%</td>
          <td>19.2%</td>
      </tr>
  </tbody>
</table>
<p>PT = pre-trained, FT = fine-tuned. The Graph Transformer achieved 100% validity due to its explicit valence checking at each generation step. It also produced substantially more novel and unique molecules after fine-tuning compared to SMILES-based methods.</p>
<p>The authors identified four advantages of the graph representation over SMILES: (1) local invariance, where fragment ordering does not affect output; (2) global extendibility, where new atoms can be appended without restructuring existing data; (3) freedom from grammar constraints; and (4) direct accessibility of chemical valence rules for validity enforcement.</p>
<h3 id="reinforcement-learning-results">Reinforcement learning results</h3>
<p>With multi-objective RL (affinity + QED), 74.6% of generated molecules were predicted active at $\varepsilon = 0.0$. The exploration rate $\varepsilon$ trades off desirability against uniqueness:</p>
<table>
  <thead>
      <tr>
          <th>$\varepsilon$</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>0.0</td>
          <td>74.6%</td>
          <td>60.7%</td>
          <td>60.6%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.1</td>
          <td>66.8%</td>
          <td>75.0%</td>
          <td>74.6%</td>
          <td>0.842</td>
      </tr>
      <tr>
          <td>0.2</td>
          <td>61.6%</td>
          <td>80.2%</td>
          <td>79.4%</td>
          <td>0.879</td>
      </tr>
      <tr>
          <td>0.3</td>
          <td>56.8%</td>
          <td>89.8%</td>
          <td>88.8%</td>
          <td>0.874</td>
      </tr>
  </tbody>
</table>
<p>The authors report that $\varepsilon = 0.3$ produced the best balance between desirability and uniqueness, with 56.8% desired molecules and 89.8% uniqueness. Diversity remained above 0.84 across all settings.</p>
<h3 id="limitations">Limitations</h3>
<p>The Graph Transformer produced molecules with worse synthetic accessibility (SA scores) compared to SMILES-based methods, particularly after fine-tuning on the smaller LIGAND set. The authors attribute this to uncommon ring systems generated when the model handles long-distance dependencies. A kekulization issue also causes a small fraction of molecules to fail scaffold matching: aromatic bond inference during sanitization can alter the scaffold substructure. Without single-objective affinity constraint, the model generates molecules with molecular weight exceeding 500 Da, reducing drug-likeness. All bioactivity predictions rely on a random forest model rather than experimental validation, and the t-SNE analysis suggests some generated molecules fall outside the model&rsquo;s applicability domain.</p>
<h3 id="future-directions">Future directions</h3>
<p>The authors propose extending the Graph Transformer to accept protein information as input via proteochemometric modeling, enabling design of ligands for targets without known ligands. Lead optimization, where a &ldquo;hit&rdquo; serves as input to generate improved analogs, is also identified as a natural extension.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v27</td>
          <td>~1.7M molecules (9.3M scaffold-molecule pairs)</td>
          <td>Preprocessed via RDKit</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>LIGAND set (A2A AR ligands from ChEMBL)</td>
          <td>10,828 ligands (53,888 pairs)</td>
          <td>Split 8:1:1 train/val/test</td>
      </tr>
      <tr>
          <td>Bioactivity labels</td>
          <td>ChEMBL A2A AR activity data</td>
          <td>pX threshold = 6.5</td>
          <td>Average pChEMBL values</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Fragment decomposition: BRICS algorithm via RDKit (max 4 fragments per molecule)</li>
<li>Optimizer: Adam with learning rate $10^{-4}$, batch size 256</li>
<li>Pre-training: 20 epochs; fine-tuning: up to 1,000 epochs with early stopping (patience: 100 epochs)</li>
<li>Bioactivity predictor: random forest regression (scikit-learn) with 2048D ECFP6 + 19D physicochemical descriptors</li>
<li>Pareto-based multi-objective ranking with GPU acceleration</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Graph Transformer: 512 hidden units, 8 attention heads, $d_{k} = d_{v} = 64$</li>
<li>Sequential Transformer: same hidden size, sinusoidal positional encoding</li>
<li>LSTM-BASE / LSTM+ATTN: 128 embedding units, 512 hidden units, 3 recurrent layers</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Graph Transformer</th>
          <th>Best SMILES Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (fine-tuned)</td>
          <td>100.0%</td>
          <td>99.6% (LSTM-BASE 1024)</td>
          <td>Valence checking guarantees validity</td>
      </tr>
      <tr>
          <td>Accuracy (fine-tuned)</td>
          <td>99.2%</td>
          <td>94.3% (Seq. Transformer 1024)</td>
          <td>Scaffold containment</td>
      </tr>
      <tr>
          <td>Desirability (RL, $\varepsilon$=0.0)</td>
          <td>74.6%</td>
          <td>N/A</td>
          <td>Only Graph Transformer used for RL</td>
      </tr>
      <tr>
          <td>Diversity (RL)</td>
          <td>0.879</td>
          <td>N/A</td>
          <td>Solow-Polasky index</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware-1">Hardware</h3>
<p>NVIDIA Tesla P100 GPUs. Specific training times not reported, but Transformer models trained faster than LSTM models with the same hidden layer size.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CDDLeiden/DrugEx">CDDLeiden/DrugEx</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (v1, v2, v3)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v27</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Pre-training data source</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., IJzerman, A. P., &amp; van Westen, G. J. P. (2023). DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. <em>Journal of Cheminformatics</em>, 15, 24. <a href="https://doi.org/10.1186/s13321-023-00694-z">https://doi.org/10.1186/s13321-023-00694-z</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2023drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00694-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
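<p>One consequence of the geometric mean is that a single zero-scoring component zeroes the whole score, so every objective must be at least partially satisfied. A minimal sketch:</p>

```python
import math

def desirability(components, weights):
    # S(x) = (prod_i c_i^{w_i})^(1 / sum_i w_i), each c_i in [0, 1]
    return math.prod(c ** w for c, w in zip(components, weights)) ** (1 / sum(weights))

# A compound scoring 0.9 on one component and 0.4 on another, equal weights:
s = desirability([0.9, 0.4], [1.0, 1.0])  # sqrt(0.9 * 0.4) = 0.6
```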
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
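<p>The two-phase control flow reduces to a small loop (a toy sketch: the scalar &ldquo;skill&rdquo; stands in for the RNN agent and <code>train_epoch</code> for a policy-gradient update; the inception reset and diversity filter are noted as comments, and the numeric values are illustrative apart from the 0.8 thresholds used in Experiment 1):</p>

```python
def train_epoch(agent, objective):
    agent["skill"] += 0.05                       # stand-in for a policy update
    return min(1.0, agent["skill"] * objective)  # stand-in for the mean score

def run_curriculum(agent, curriculum, production, production_epochs=10):
    # Curriculum Phase: advance only once the progression criterion is met;
    # no diversity filter, no expensive scoring components.
    for objective, threshold in curriculum:
        while train_epoch(agent, objective) < threshold:
            pass
    # Production Phase: REINVENT resets the inception memory here and applies
    # a Bemis-Murcko scaffold diversity filter before optimizing the full
    # (possibly docking-based) Production Objective.
    for _ in range(production_epochs):
        train_epoch(agent, production)
    return agent

# Three curriculum objectives of increasing difficulty, thresholds 0.8:
agent = run_curriculum({"skill": 0.0},
                       curriculum=[(1.0, 0.8), (0.9, 0.8), (0.8, 0.8)],
                       production=0.7)
```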
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Between the Curriculum Objectives, the &ldquo;High&rdquo; threshold scenario outperforms the &ldquo;Low&rdquo; scenario by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
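<p>As a rough sketch of how such a weighted geometric mean aggregates per-component scores into a single reward in [0, 1] (the component values and weights below are hypothetical, not taken from the paper):</p>

```python
import math

def weighted_geometric_mean(scores, weights):
    """Aggregate per-component scores in [0, 1] into a single value.

    A near-zero component drives the aggregate toward zero, which is the
    intended behavior: every objective must be at least partially satisfied.
    """
    total_w = sum(weights)
    # Work in log space for numerical stability; clamp to avoid log(0).
    log_sum = sum(w * math.log(max(s, 1e-12)) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_w)

# Hypothetical components: docking-derived score, QED, similarity.
score = weighted_geometric_mean([0.8, 0.6, 0.9], [2.0, 1.0, 1.0])
```

Because the aggregation is multiplicative, raising a single component's weight sharpens the penalty for failing that objective, which is why a scoring function like this pairs naturally with threshold-based Curriculum Progression Criteria.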
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD, takes over a decade, and succeeds less than 10% of the time. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
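<p>The pIC50 transform and the shape of the predictor&rsquo;s input can be sketched as follows; the embedding dimensions and the one-hidden-layer regression head are illustrative assumptions, not the paper&rsquo;s exact architecture:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def pic50(ic50_molar):
    """pIC50 = -log10(IC50), with IC50 expressed in mol/L (10 nM -> ~8)."""
    return -np.log10(ic50_molar)

# Hypothetical dimensions: UniRep protein embedding (1900-d) concatenated
# with the VAE molecule latent z (128-d) forms the predictor input.
protein_emb = rng.standard_normal(1900)
z = rng.standard_normal(128)
x = np.concatenate([protein_emb, z])

# A stand-in one-hidden-layer regression head (random, untrained weights).
W1, b1 = rng.standard_normal((256, x.size)) * 0.01, np.zeros(256)
W2, b2 = rng.standard_normal(256) * 0.01, 0.0
pred_pic50 = W2 @ np.maximum(W1 @ x + b1, 0.0) + b2
```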
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes&rsquo; rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.</p>
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also exhibited novel chemical scaffolds, as indicated by a high Fr&eacute;chet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were toxic in 0-1 endpoints out of 13, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds to generate 1000 valid molecules in the case of EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
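<p>A simplified encoder for this sequence layout might look as follows; the exact token vocabulary and decimal precision are assumptions for illustration, not BindGPT&rsquo;s actual tokenizer:</p>

```python
def encode_ligand(smiles, coords, frac_digits=2):
    """Serialize a molecule as <LIGAND> SMILES chars <XYZ> coordinate tokens.

    Each coordinate contributes 2 tokens (integer part, fractional part),
    i.e. 6 tokens per atom; atom order matches the SMILES atom order, so
    element symbols are not repeated in the XYZ section.
    """
    tokens = ["<LIGAND>"] + list(smiles) + ["<XYZ>"]
    for x, y, z in coords:
        for c in (x, y, z):
            sign = "-" if c < 0 else ""
            ip, fp = divmod(abs(c), 1.0)
            tokens.append(f"{sign}{int(ip)}")
            tokens.append(f"{round(fp * 10**frac_digits):0{frac_digits}d}")
    return tokens

# Methanol with two (made-up) heavy-atom positions: 4 header/SMILES tokens
# plus 6 coordinate tokens per atom.
toks = encode_ligand("CO", [(0.0, 0.0, 0.0), (1.43, 0.0, 0.0)])
```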
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
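<p>The second augmentation can be sketched with NumPy: draw one uniformly random rotation and apply it to both point clouds, so the relative binding pose is unchanged while the absolute frame varies (the coordinate shapes below are arbitrary):</p>

```python
import numpy as np

def random_rotation(rng):
    """Uniform random 3x3 rotation via QR decomposition of a Gaussian."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q *= np.sign(np.diag(r))      # fix column signs to make QR unique
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

def augment(pocket_xyz, ligand_xyz, rng):
    """Apply the SAME rotation to pocket and ligand coordinates."""
    rot = random_rotation(rng)
    return pocket_xyz @ rot.T, ligand_xyz @ rot.T

rng = np.random.default_rng(0)
pocket = rng.standard_normal((10, 3))   # e.g. alpha-carbon positions
ligand = rng.standard_normal((4, 3))    # e.g. heavy-atom positions
p2, l2 = augment(pocket, ligand, rng)
```

Because both point clouds are rotated together, all pocket-ligand distances are preserved, which is what makes this a pose-preserving augmentation rather than a perturbation.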
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
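<p>A toy surrogate loss for this objective on a sampled batch (with a mean-reward baseline added for variance reduction, a standard REINFORCE detail assumed here rather than stated in the summary above):</p>

```python
def reinforce_kl_loss(logp_current, logp_sft, rewards, beta=0.1):
    """Batch surrogate loss for REINFORCE with a KL penalty to the SFT model.

    logp_current / logp_sft are log-probabilities of the sampled sequences
    under the current policy and the frozen SFT policy; the KL term uses
    the standard single-sample estimator E[log pi_theta - log pi_sft].
    """
    n = len(rewards)
    baseline = sum(rewards) / n
    # Policy-gradient term: -E[(R - baseline) * log pi_theta(x)].
    pg = -sum((r - b_lp) * 0 + (r - baseline) * lp * -1
              for r, lp, b_lp in zip(rewards, logp_current, logp_current)) / n
    pg = -sum((r - baseline) * lp for r, lp in zip(rewards, logp_current)) / n
    kl = sum(lc - ls for lc, ls in zip(logp_current, logp_sft)) / n
    return pg + beta * kl

loss = reinforce_kl_loss(
    logp_current=[-2.0, -3.0], logp_sft=[-2.5, -2.5],
    rewards=[1.0, 0.0], beta=0.1,
)
```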
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters), with the authors finding this sufficient for current tasks but not exploring larger scales. The RL optimization uses only Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. BindGPT is the first model to explicitly generate hydrogens at scale, though validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
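<p>The RL components above can be sketched as REINFORCE with KL-shaped rewards. This is an illustrative reconstruction under stated assumptions (the shaping coefficient <code>beta</code> and the exact placement of the KL term are not given in the main text):</p>

```python
# Hedged sketch of REINFORCE with a KL penalty toward the frozen SFT model.
# The reward (e.g., a negated QVINA docking score) is shaped by subtracting
# a scaled KL estimate; BindGPT's exact formulation may differ.
def reinforce_kl_loss(logp_agent, logp_ref, reward, beta=0.1):
    """Per-sequence loss.

    logp_agent, logp_ref -- per-token log-probs under the agent and the
                            frozen SFT reference model.
    reward               -- scalar terminal reward for the generated ligand.
    """
    # Sequence-level KL estimate: sum of per-token log-prob differences.
    kl = sum(a - r for a, r in zip(logp_agent, logp_ref))
    shaped = reward - beta * kl
    # REINFORCE: push up the log-likelihood of actions, weighted by reward.
    return -shaped * sum(logp_agent)
```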
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
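<p>The main text states that each 3D position uses 6 coordinate tokens but does not spell out the scheme. A plausible sketch (an assumption, not the paper's definition) splits each axis value into a signed integer token and a two-digit fractional token:</p>

```python
# Illustrative coordinate tokenization: 2 tokens per axis (signed integer
# part, two-digit fraction) gives 6 tokens per (x, y, z) position.
# This exact split is an assumption; BindGPT's scheme may differ.
def tokenize_xyz(x, y, z):
    tokens = []
    for v in (x, y, z):
        whole, frac = f"{v:+.2f}".split(".")  # e.g. -0.50 -> "-0", "50"
        tokens.extend([whole, frac])
    return tokens
```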
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
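<p>The Jensen-Shannon metrics compare histograms of geometric features (bond lengths, angles, dihedrals) between generated and reference molecules. A minimal implementation over normalized discrete distributions (the binning strategy is assumed):</p>

```python
import math

# Jensen-Shannon divergence (base 2, so values lie in [0, 1]) between two
# normalized histograms, as used for the 3D distributional metrics.
def js_divergence(p, q, eps=1e-12):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    def kl(a, b):
        return sum(ai * math.log2(ai / (bi + eps))
                   for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

<p>Lower is better: identical histograms give 0, disjoint histograms give 1.</p>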
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
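<p>The MAD statistic is straightforward to compute from paired scores on the held-out set; a direct sketch of the formula above (variable names assumed):</p>

```python
# Mean Average Difference between optimization and control scores,
# restricted to molecules with optimization score at or above threshold x.
def mad_above(s_opt, s_ctrl, x):
    pairs = [(o, c) for o, c in zip(s_opt, s_ctrl) if o >= x]
    if not pairs:
        return 0.0  # no molecules clear the threshold
    return sum(abs(o - c) for o, c in pairs) / len(pairs)
```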
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}_t[S_{dc}] = \int \mathbb{E}[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
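<p>The tolerance-interval construction can be approximated by binning the held-out pairs and resampling control scores for each generated molecule. A simplified sketch (scores assumed in $[0, 1]$; 25 equal bins as in the paper; the resampling count is an assumption):</p>

```python
import random

# Empirical 95% tolerance interval for the mean control score of a generated
# batch, from held-out (S_opt, S_dc) pairs. Simplified reconstruction; the
# authors' exact procedure may differ.
def expected_control_interval(held_out, generated_opt, n_bins=25,
                              n_resamples=1000, seed=0):
    rng = random.Random(seed)
    # Bin held-out control scores by optimization score.
    bins = [[] for _ in range(n_bins)]
    for s_opt, s_dc in held_out:
        bins[min(int(s_opt * n_bins), n_bins - 1)].append(s_dc)
    # Repeatedly draw a control score for each generated molecule from the
    # matching bin (skipping empty bins) and record the batch mean.
    means = []
    for _ in range(n_resamples):
        draws = [rng.choice(bins[min(int(s * n_bins), n_bins - 1)])
                 for s in generated_opt
                 if bins[min(int(s * n_bins), n_bins - 1)]]
        if draws:
            means.append(sum(draws) / len(draws))
    means.sort()
    # Central 95% of the resampled batch means.
    return means[int(0.025 * len(means))], means[int(0.975 * len(means)) - 1]
```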
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to trend back toward the prior policy. This negates useful learnings, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
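<p>As a concrete sketch (not the authors&rsquo; implementation), the top-k restriction of the REINVENT loss can be written in a few lines; the function name, NumPy types, and default hyperparameters here are illustrative assumptions:</p>

```python
import numpy as np

def ahc_loss(log_p_prior, log_p_agent, rewards, sigma=60.0, topk_frac=0.5):
    """Sketch of the Augmented Hill-Climb loss: the REINVENT squared-error
    loss, restricted to the top-k molecules of the batch ranked by reward."""
    rewards = np.asarray(rewards, dtype=float)
    k = max(1, int(topk_frac * len(rewards)))
    top = np.argsort(rewards)[::-1][:k]            # indices of the k best molecules
    # Augmented likelihood: prior log-likelihood plus sigma-scaled reward.
    log_p_aug = log_p_prior[top] + sigma * rewards[top]
    # Squared difference between augmented and agent log-likelihoods, averaged.
    return np.mean((log_p_aug - log_p_agent[top]) ** 2)
```

<p>Low-scoring molecules never enter the loss at all, so the &ldquo;pull back to the prior&rdquo; term only acts on molecules worth keeping.</p>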
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
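<p>The scaffold-memory mechanism behind DF2 can be sketched as follows; the class name, the exact linear-penalty shape, and the scaffold-keying scheme are assumptions for illustration (the real filter keys bins by ECFP4-based scaffold similarity):</p>

```python
from collections import defaultdict

class DiversityFilter:
    """Sketch of a scaffold-memory diversity filter with DF2-style settings:
    molecules scoring above `min_score` fill their scaffold's bin, and the
    reward is linearly attenuated as the bin fills up."""

    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score
        self.bin_size = bin_size
        self.bins = defaultdict(int)   # scaffold -> occupancy count

    def __call__(self, scaffold, score):
        if score < self.min_score:
            return score               # low scorers neither fill bins nor get penalized
        self.bins[scaffold] += 1
        occupancy = self.bins[scaffold]
        # Linear penalization: full reward for a fresh scaffold, zero once the bin is full.
        penalty = max(0.0, 1.0 - (occupancy - 1) / self.bin_size)
        return score * penalty
```

<p>A stricter filter (smaller <code>bin_size</code>, lower <code>min_score</code>) drives rewards to zero faster for repeated chemotypes, which is exactly the mode-collapse-prevention vs. optimization trade-off the authors describe.</p>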
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) embedding 128 + 3 GRU layers of 512 (REINVENT v1), (2) embedding 256 + 3 LSTM layers of 512 (REINVENT 2.0), (3) 3 LSTM layers of 512 with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while remaining within the property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring function exploitation, e.g., larger molecules getting better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). AHC was particularly superior during early-stage optimization and for harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability. Greater optimization may lead to unreasonable chemistry due to scoring function exploitation rather than the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
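<p>A minimal NumPy sketch of the single-head cross-attention $f_{ca}$ above; batching, masking, and the learned projections $W_i^Q, W_i^K, W_i^V$ are omitted, and the function name mirrors the paper&rsquo;s notation rather than any released code:</p>

```python
import numpy as np

def f_ca(Q_m, K_S, V_S):
    """Scaled dot-product cross-attention as in the LT: ligand queries Q_m
    attend over protein keys/values (K_S, V_S) passed via skip connections."""
    d_k = K_S.shape[-1]
    scores = Q_m @ K_S.T / np.sqrt(d_k)             # (L_mol, L_prot)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over protein positions
    return weights @ V_S                            # (L_mol, d_v)
```

<p>Because each decoder layer calls this with the matching encoder layer&rsquo;s $(K_S, V_S)$, shallow decoder layers see shallow protein features and deep layers see deep ones.</p>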
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward and $U(\tilde{C}, a) = c_{puct} \cdot P(a | \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus based on the LT&rsquo;s predicted probability.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
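<p>The Select phase can be sketched in plain Python; the child-statistics dictionary layout (<code>N</code> visits, <code>W</code> cumulative normalized reward, <code>P</code> LT prior) is an assumption for illustration, not the paper&rsquo;s data structure:</p>

```python
import math

def puct_select(children, c_puct=1.5):
    """Sketch of the PUCT selection rule: pick the SMILES symbol maximizing
    Q (average normalized docking reward) plus the exploration bonus
    U = c_puct * P * sqrt(N_total) / (1 + N_a)."""
    n_total = sum(child["N"] for child in children.values())
    best_action, best_score = None, -math.inf
    for action, child in children.items():
        q = child["W"] / child["N"] if child["N"] > 0 else 0.0
        u = c_puct * child["P"] * math.sqrt(n_total) / (1 + child["N"])
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```

<p>Note how a heavily visited child&rsquo;s bonus shrinks via the $1 + N_a$ denominator, so search effort shifts toward symbols the LT rates highly but that have been tried less.</p>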
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
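<p>The inner sum of $J(\Theta)$ for a single protein-ligand pair is just a summed next-symbol negative log-likelihood; a toy sketch with an assumed two-symbol vocabulary (the real model predicts over the full SMILES alphabet):</p>

```python
import numpy as np

def sequence_nll(probs, targets):
    """Per-pair term of the training objective: summed next-symbol
    cross-entropy over one ligand SMILES. probs[t] is the model's
    distribution over the vocabulary given context C_t; targets[t]
    is the index of the true next symbol."""
    return -sum(np.log(probs[t][a]) for t, a in enumerate(targets))
```

<p>Summing this quantity over the dataset $\mathcal{D}$ recovers $J(\Theta)$; minimizing it is ordinary teacher-forced language-model training.</p>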
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
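<p>The lookup table amounts to memoizing the docking function on (protein, SMILES) keys; a minimal sketch, where <code>dock_fn</code> stands in for the actual SMINA call:</p>

```python
def make_cached_docker(dock_fn):
    """Sketch of the docking lookup table: previously computed
    (protein, SMILES) docking scores are reused instead of re-docked."""
    cache = {}

    def docker(protein, smiles):
        key = (protein, smiles)
        if key not in cache:
            cache[key] = dock_fn(protein, smiles)  # expensive SMINA call
        return cache[key]

    docker.cache = cache  # expose for inspection
    return docker
```

<p>Because MCTS rollouts frequently regenerate identical complete molecules, most of the theoretical $L \times S$ docking calls per molecule hit the cache, consistent with the reported 81-86% reduction.</p>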
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the <em>Mycobacterium tuberculosis</em> ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
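<p>The objective is the standard length-normalized negative log-likelihood. A minimal sketch with hypothetical per-token probabilities:</p>

```python
import math

def smiles_nll(token_probs):
    # Length-normalized negative log-likelihood of one SMILES sequence:
    # (1 / M_y) * sum_i -log P(y_i | y_1, ..., y_{i-1})
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for a short tokenized SMILES.
loss = smiles_nll([0.5, 0.4, 0.2, 0.9])
```

<p>Minimizing this quantity averaged over the pre-training corpus $\mathcal{D}_0$ recovers the objective above.</p>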
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{|r_i - r_j|^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
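<p>As a sanity check on the mechanism, here is a single-head numpy sketch of the distance-aware attention above. The dimensions and random inputs are illustrative assumptions; the actual encoder is a 4-layer multi-head Transformer with hidden dimension 256.</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau=1.0):
    # Raw scores h_i^T W h_j are damped by exp(-|r_i - r_j|^2 / tau)
    # before the softmax, so spatially distant residues get less attention.
    sq_dist = np.sum((r[:, None, :] - r[None, :, :]) ** 2, axis=-1)  # (N, N)
    scores = np.exp(-sq_dist / tau) * (h @ W @ h.T)                  # alpha_hat
    scores = scores - scores.max(axis=-1, keepdims=True)             # stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return alpha @ (h @ Wv.T)                                        # updated h_i

rng = np.random.default_rng(0)
h = rng.standard_normal((5, 8))   # 5 pocket residues, toy hidden dim 8
r = rng.standard_normal((5, 3))   # 3D coordinates
out = distance_aware_attention(h, r,
                               W=rng.standard_normal((8, 8)),
                               Wv=rng.standard_normal((8, 8)))
```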
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder maps each (compound, protein) pair to the mean $\mu$ and standard deviation $\sigma$ of a latent posterior. During training, the model reconstructs the input compound; at inference time, encoding a seed compound enables refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \mathcal{D}_{\text{KL}}(q(z \mid \mathbf{x}, \mathbf{y}) \| p(z))
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
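<p>A minimal numpy sketch of this objective, assuming a diagonal-Gaussian posterior so the KL term takes its standard closed form (the reconstruction NLL is passed in as a precomputed scalar):</p>

```python
import numpy as np

def vae_objective(recon_nll, mu, log_sigma, beta=0.1):
    # Closed-form KL(q || N(0, I)) for a diagonal Gaussian:
    # 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)
    kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1 - 2 * log_sigma)
    return recon_nll + beta * kl

# With mu = 0 and sigma = 1 the posterior equals the prior, so KL = 0.
loss = vae_objective(recon_nll=12.3, mu=np.zeros(4), log_sigma=np.zeros(4))
```

<p>The paper uses $\beta$ of 0.1 or 1.0 depending on the generation stage (see Reproducibility Details below).</p>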
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
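<p>Of these metrics, molecular diversity reduces to pairwise Tanimoto comparisons. A minimal sketch using toy on-bit sets in place of the Morgan fingerprints RDKit would compute:</p>

```python
from itertools import combinations

def tanimoto(a, b):
    # Tanimoto similarity of two fingerprints represented as sets of on-bits.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def mean_diversity(fps):
    # Diversity as the mean pairwise (1 - Tanimoto) over a generated set.
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy on-bit sets standing in for fingerprints of three molecules.
d = mean_diversity([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```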
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted the ClpP protease of <em>Mycobacterium tuberculosis</em>, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
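<p>The pocket definition in the last bullet amounts to a distance filter. A minimal numpy sketch, with toy coordinates in place of parsed PDB structures:</p>

```python
import numpy as np

def pocket_residue_indices(residue_coords, ligand_coords, cutoff=10.0):
    # Keep residues whose representative coordinate lies within `cutoff`
    # Angstroms of the ligand's geometric center.
    center = ligand_coords.mean(axis=0)
    dists = np.linalg.norm(residue_coords - center, axis=1)
    return np.where(dists <= cutoff)[0]

residues = np.array([[0.0, 0.0, 0.0], [4.0, 3.0, 0.0], [20.0, 0.0, 0.0]])
ligand = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
idx = pocket_residue_indices(residues, ligand)
```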
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Failure Modes in Molecule Generation &amp; Optimization</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/</guid><description>Renz et al. show trivial models fool distribution-learning metrics and ML scoring functions introduce exploitable biases in goal-directed molecule generation.</description><content:encoded><![CDATA[<h2 id="an-empirical-critique-of-molecular-generation-evaluation">An Empirical Critique of Molecular Generation Evaluation</h2>
<p>This is an <strong>Empirical</strong> paper that critically examines evaluation practices for molecular generative models. Rather than proposing a new generative method, the paper exposes systematic weaknesses in both distribution-learning metrics and goal-directed optimization scoring functions. The primary contributions are: (1) demonstrating that a trivially simple &ldquo;AddCarbon&rdquo; model can achieve near-perfect scores on widely used distribution-learning benchmarks, and (2) introducing an experimental framework with optimization scores and control scores that reveals model-specific and data-specific biases when ML models serve as scoring functions for goal-directed generation.</p>
<h2 id="evaluation-gaps-in-de-novo-molecular-design">Evaluation Gaps in De Novo Molecular Design</h2>
<p>The rapid growth of deep learning methods for molecular generation (RNN-based SMILES generators, VAEs, GANs, graph neural networks) created a need for standardized evaluation. Benchmarking suites like <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> introduced metrics for validity, uniqueness, novelty, KL divergence over molecular properties, and <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Frechet ChemNet Distance (FCD)</a>. For goal-directed generation, penalized logP became a common optimization target.</p>
<p>However, these metrics leave significant blind spots. Distribution-learning metrics do not detect whether a model merely copies training molecules with minimal modifications. Goal-directed benchmarks often use scoring functions that fail to capture the full requirements of drug discovery (synthetic feasibility, drug-likeness, absence of reactive substructures). When ML models serve as scoring functions, the problem worsens because generated molecules can exploit artifacts of the learned model rather than exhibiting genuinely desirable properties.</p>
<p>At the time of writing, wet-lab validations of generative models remained scarce, with only a handful of studies (Merk et al., Zhavoronkov et al.) demonstrating in vitro activity for generated compounds. The lack of rigorous evaluation left the field unable to distinguish meaningfully innovative methods from those that simply exploit metric weaknesses.</p>
<h2 id="the-copy-problem-and-control-score-framework">The Copy Problem and Control Score Framework</h2>
<p>The paper introduces two key conceptual contributions.</p>
<h3 id="the-addcarbon-model-for-distribution-learning">The AddCarbon Model for Distribution-Learning</h3>
<p>The AddCarbon model is deliberately trivial: it samples a molecule from the training set, inserts a single carbon atom at a random position in its SMILES string, and returns the result if it produces a valid, novel molecule. This model achieves near-perfect scores across most <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> distribution-learning benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>RS</th>
          <th>LSTM</th>
          <th>GraphMCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
          <th>AddCarbon</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
          <td>0.871</td>
      </tr>
  </tbody>
</table>
<p>On the FCD metric, the AddCarbon model outscores every learned generative baseline except the LSTM (only random sampling, which replays training molecules verbatim, also scores higher), despite being practically useless. This exposes what the authors call the &ldquo;copy problem&rdquo;: current metrics check only for exact matches to training molecules, so minimal edits evade novelty detection. The authors argue that likelihood-based evaluation on hold-out test sets, analogous to standard practice in NLP, would provide a more comprehensive metric.</p>
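<p>The AddCarbon procedure itself fits in a few lines. In this sketch the validity check is a stand-in argument; the real model would call RDKit&rsquo;s <code>Chem.MolFromSmiles</code> sanitization:</p>

```python
import random

def add_carbon(training_smiles, is_valid, seed=0):
    # Sample a training SMILES, insert "C" at a random position, and keep
    # the result if it is a valid molecule not already in the training set.
    rng = random.Random(seed)
    train = set(training_smiles)
    while True:
        s = rng.choice(training_smiles)
        i = rng.randrange(len(s) + 1)
        mutated = s[:i] + "C" + s[i:]
        if mutated not in train and is_valid(mutated):
            return mutated

# Stand-in validity check for illustration only.
novel = add_carbon(["CCO", "CCN"], is_valid=lambda s: True)
```

<p>Because each output differs from a training molecule by a single character, it counts as valid, unique, and novel under the benchmark while being chemically uninteresting.</p>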
<h3 id="control-scores-for-goal-directed-generation">Control Scores for Goal-Directed Generation</h3>
<p>For goal-directed generation, the authors introduce a three-score experimental design:</p>
<ul>
<li><strong>Optimization Score (OS)</strong>: Output of a classifier trained on data split 1, used to guide the molecular optimizer.</li>
<li><strong>Model Control Score (MCS)</strong>: Output of a second classifier trained on split 1 with a different random seed. Divergence between OS and MCS quantifies model-specific biases.</li>
<li><strong>Data Control Score (DCS)</strong>: Output of a classifier trained on data split 2. Divergence between OS and DCS quantifies data-specific biases.</li>
</ul>
<p>This mirrors the training/test split paradigm in supervised learning. If a generator truly produces molecules with the desired bioactivity, the control scores should track the optimization score. Divergence between them indicates the optimizer is exploiting artifacts of the specific model or training data rather than learning generalizable chemical properties.</p>
<h2 id="experimental-setup-three-targets-three-generators">Experimental Setup: Three Targets, Three Generators</h2>
<h3 id="targets-and-data">Targets and Data</h3>
<p>The authors selected three biological targets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">Janus kinase 2</a> (JAK2), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">epidermal growth factor receptor</a> (EGFR), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a> (DRD2). For each target, the data was split into two halves (split 1 and split 2) with balanced active/inactive ratios. Random forest classifiers using binary folded ECFP4 fingerprints (radius 2, size 1024) were trained to produce three scoring functions per target: the OS and MCS on split 1 (different random seeds), and the DCS on split 2.</p>
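<p>The three-scoring-function setup can be sketched with scikit-learn. The random bit vectors below stand in for ECFP4 fingerprints of real ChEMBL actives and inactives, and the forests use default hyperparameters rather than the paper&rsquo;s:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 1024))  # stand-in 1024-bit fingerprints
y = rng.integers(0, 2, size=400)          # stand-in active/inactive labels
X1, y1, X2, y2 = X[:200], y[:200], X[200:], y[200:]

os_clf = RandomForestClassifier(random_state=0).fit(X1, y1)   # OS: guides optimization
mcs_clf = RandomForestClassifier(random_state=1).fit(X1, y1)  # MCS: same data, new seed
dcs_clf = RandomForestClassifier(random_state=0).fit(X2, y2)  # DCS: held-out split

candidates = rng.integers(0, 2, size=(10, 1024))
os_score = os_clf.predict_proba(candidates)[:, 1]
mcs_score = mcs_clf.predict_proba(candidates)[:, 1]
dcs_score = dcs_clf.predict_proba(candidates)[:, 1]
# Divergence of OS from MCS flags model-specific bias; from DCS, data-specific bias.
```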
<h3 id="generators">Generators</h3>
<p>Three molecular generators were evaluated:</p>
<ol>
<li><strong>Graph-based Genetic Algorithm (GA)</strong>: Iteratively applies random mutations and crossovers to a population of molecules, retaining the best in each generation. One of the top performers in GuacaMol.</li>
<li><strong>SMILES-LSTM</strong>: An autoregressive model that generates SMILES character by character, optimized via hill climbing (iteratively sampling, keeping top molecules, fine-tuning). Also a top GuacaMol performer.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PS)</strong>: Optimizes molecules in the continuous latent space of a SMILES-based sequence-to-sequence model.</li>
</ol>
<p>Each optimizer was run 10 times per target dataset.</p>
<h2 id="score-divergence-and-exploitable-biases">Score Divergence and Exploitable Biases</h2>
<h3 id="optimization-vs-control-score-divergence">Optimization vs. Control Score Divergence</h3>
<p>Across all three targets and all three generators, the OS consistently outpaced both control scores during optimization. The DCS sometimes stagnated or even decreased while the OS continued to climb. This divergence demonstrates that the generators exploit biases in the scoring function rather than discovering genuinely active compounds.</p>
<p>The MCS also diverged from the OS despite being trained on exactly the same data, confirming model-specific biases: the optimization exploits features unique to the particular random forest instance. The larger gap between OS and DCS (compared to OS and MCS) indicates that data-specific biases contribute more to the divergence than model-specific biases.</p>
<h3 id="chemical-space-migration">Chemical Space Migration</h3>
<p>Optimized molecules migrated toward the region of split 1 actives (used to train the OS), as shown by t-SNE embeddings and nearest-neighbor Tanimoto similarity analysis. Optimized molecules had more similar neighbors in split 1 than in split 2, confirming data-specific bias. By the end of optimization, generated molecules occupied different regions of chemical space than known actives when measured by logP and molecular weight, with compounds from the same optimization run forming distinct clusters.</p>
<h3 id="quality-of-generated-molecules">Quality of Generated Molecules</h3>
<p>High-scoring generated molecules frequently contained problematic substructures: reactive dienes, nitrogen-fluorine bonds, long heteroatom chains that are synthetically infeasible, and highly uncommon functional groups. The LSTM optimizer showed a bias toward high molecular weight, low diversity, and high logP values. These molecules would be rejected by medicinal chemists despite their high optimization scores.</p>
<h3 id="key-takeaways">Key Takeaways</h3>
<p>The authors emphasize several practical implications:</p>
<ol>
<li><strong>Early stopping</strong>: Control scores can indicate when further optimization is exploiting biases rather than finding better molecules. Optimization should stop when control scores plateau.</li>
<li><strong>Scoring function iteration</strong>: In practice, generative models are &ldquo;highly adept at exploiting&rdquo; incomplete scoring functions, necessitating several iterations of generation and scoring function refinement.</li>
<li><strong>Synthetic accessibility</strong>: Even high-scoring molecules are useless if they cannot be synthesized. The authors consider this a major challenge for practical adoption.</li>
<li><strong>Likelihood-based evaluation</strong>: For distribution-learning, the authors recommend reporting test-set likelihoods for likelihood-based models, following standard NLP practice.</li>
</ol>
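<p>The early-stopping recommendation amounts to a plateau test on the control-score trajectory. A minimal sketch in Python (the window size and tolerance here are illustrative choices, not values from the paper):</p>

```python
def plateau_step(scores, window=5, tol=1e-3):
    """Return the first optimization step at which a control score has
    plateaued: its best value over the trailing window exceeds the best
    value seen before that window by less than `tol`. Returns None if
    the score is still improving at the end of the trajectory."""
    for t in range(window, len(scores)):
        prior_best = max(scores[: t - window + 1])
        recent_best = max(scores[t - window + 1 : t + 1])
        if recent_best - prior_best < tol:
            return t  # further optimization likely exploits biases
    return None
```

<p>Monitoring the DCS with such a check would halt optimization near the point where further gains in the OS reflect bias exploitation rather than genuinely better molecules.</p>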
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bioactivity data</td>
          <td>ChEMBL (JAK2, EGFR, DRD2)</td>
          <td>See Table S1</td>
          <td>Binary classification tasks, split 50/50</td>
      </tr>
      <tr>
          <td>Distribution-learning</td>
          <td>GuacaMol training set</td>
          <td>Subset of ChEMBL</td>
          <td>Used as starting population for GA and PS</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Scoring function</strong>: Random forest classifier (scikit-learn) on binary ECFP4 fingerprints (size 1024, radius 2, RDKit)</li>
<li><strong>GA</strong>: Graph-based genetic algorithm from Jensen (2019)</li>
<li><strong>LSTM</strong>: SMILES-LSTM with hill climbing, pretrained model from GuacaMol</li>
<li><strong>PS</strong>: Particle swarm optimization in latent space of a sequence-to-sequence model (Winter et al. 2019)</li>
<li>Each optimizer run 10 times per target</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization Score (OS)</td>
          <td>RF classifier on split 1</td>
          <td>Guides optimization</td>
      </tr>
      <tr>
          <td>Model Control Score (MCS)</td>
          <td>RF on split 1, different seed</td>
          <td>Detects model-specific bias</td>
      </tr>
      <tr>
          <td>Data Control Score (DCS)</td>
          <td>RF on split 2</td>
          <td>Detects data-specific bias</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> metrics</td>
          <td>Validity, uniqueness, novelty, KL div, FCD</td>
          <td>For distribution-learning</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ml-jku/mgenerators-failure-modes">ml-jku/mgenerators-failure-modes</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Data, code, and results</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{renz2019failure,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{On failure modes in molecule generation and optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Renz, Philipp and Van Rompaey, Dries and Wegner, J{\&#34;o}rg Kurt and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Drug Discovery Today: Technologies}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32-33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55--63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ddtec.2020.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &amp; Klambauer, G. (2019). On failure modes in molecule generation and optimization. <em>Drug Discovery Today: Technologies</em>, 32-33, 55-63. <a href="https://doi.org/10.1016/j.ddtec.2020.09.003">https://doi.org/10.1016/j.ddtec.2020.09.003</a></p>
<p><strong>Publication</strong>: Drug Discovery Today: Technologies, Volume 32-33, 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ml-jku/mgenerators-failure-modes">Code and data (GitHub)</a></li>
</ul>
]]></content:encoded></item><item><title>Back Translation for Semi-Supervised Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/back-translation-molecule-generation/</guid><description>A semi-supervised method adapting NLP back translation to molecule generation, improving property optimization and retrosynthesis with unlabeled ZINC data.</description><content:encoded><![CDATA[<h2 id="semi-supervised-data-augmentation-for-molecular-tasks">Semi-Supervised Data Augmentation for Molecular Tasks</h2>
<p>This is a <strong>Method</strong> paper that introduces back translation, a semi-supervised technique from neural machine translation, to the domain of molecular generation. The primary contribution is a general-purpose data augmentation strategy that leverages large pools of unlabeled molecules (from databases like ZINC) to improve the performance of both sequence-based and graph-based models on molecule optimization and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> prediction tasks.</p>
<h2 id="bridging-the-labeled-data-gap-in-molecular-generation">Bridging the Labeled Data Gap in Molecular Generation</h2>
<p>Molecular generation tasks, such as property optimization and retrosynthesis, require paired training data: an input molecule (or property specification) mapped to a desired output molecule. Obtaining these labeled pairs is expensive and labor-intensive. Meanwhile, enormous databases of unlabeled molecules exist. ZINC alone contains over 750 million compounds, and PubChem has 109 million.</p>
<p>Prior approaches to using unlabeled molecular data include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">variational autoencoders (VAEs)</a> for learning latent representations, conditional recurrent neural networks for inverse design, and pretraining techniques borrowed from NLP. However, these methods either focus on representation learning rather than direct generation, or require task-specific architectural modifications. The authors identify back translation, a well-established technique in machine translation, as a natural fit for molecular generation tasks that can be treated as sequence-to-sequence mappings.</p>
<h2 id="back-translation-as-molecular-data-augmentation">Back Translation as Molecular Data Augmentation</h2>
<p>The core idea is straightforward. Given a main task that maps from source domain $\mathcal{X}$ to target domain $\mathcal{Y}$ (e.g., mapping low-QED molecules to high-QED molecules), the method trains a reverse model $g$ that maps from $\mathcal{Y}$ back to $\mathcal{X}$. This reverse model then &ldquo;back translates&rdquo; unlabeled molecules from $\mathcal{Y}$ to generate synthetic source molecules, creating pseudo-labeled training pairs.</p>
<p>The theoretical motivation comes from maximizing the reconstruction probability. Given an unlabeled molecule $y_u \in \mathcal{U}_y$, the logarithmic reconstruction probability through the reverse model $g$ and forward model $f$ is:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) = \log \sum_{\hat{x}_u \in \mathcal{X}} P(\hat{x}_u \mid y_u; g) P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>Since summing over the exponentially large space $\mathcal{X}$ is intractable, the authors apply Jensen&rsquo;s inequality to obtain a lower bound:</p>
<p>$$
\log P(y_u = \hat{y}_u \mid y_u; g, f) \geq \mathbb{E}_{\hat{x}_u \sim P(\cdot \mid y_u; g)} \log P(y_u = \hat{y}_u \mid \hat{x}_u; f)
$$</p>
<p>This lower bound is optimized via Monte Carlo sampling in three steps:</p>
<p><strong>Step 1</strong>: Train both forward model $f$ and reverse model $g$ on the labeled data $\mathcal{L}$:</p>
<p>$$
\begin{aligned}
\min_{\theta_f} \sum_{(x,y) \in \mathcal{L}} -\log P(y \mid x; \theta_f) \\
\min_{\theta_g} \sum_{(x,y) \in \mathcal{L}} -\log P(x \mid y; \theta_g)
\end{aligned}
$$</p>
<p><strong>Step 2</strong>: Use the trained reverse model $g$ to back translate each unlabeled molecule $y_u \in \mathcal{U}_y$, producing synthetic pairs:</p>
<p>$$
\hat{\mathcal{L}} = \{(\hat{x}_u, y_u) \mid y_u \in \mathcal{U}_y,\ \hat{x}_u \text{ sampled from } P(\cdot \mid y_u; \theta_g)\}
$$</p>
<p><strong>Step 3</strong>: Retrain the forward model $f$ on the combined labeled and synthetic data $\mathcal{L} \cup \hat{\mathcal{L}}$, warm-starting from the parameters obtained in Step 1:</p>
<p>$$
\min_{\theta_f^*} \sum_{(x,y) \in \mathcal{L} \cup \hat{\mathcal{L}}} -\log P(y \mid x; \theta_f^*)
$$</p>
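<p>The three steps can be condensed into a short driver routine. This is a schematic sketch, not the paper's implementation; <code>train</code> and <code>sample</code> are hypothetical stand-ins for any seq2seq training and decoding procedure:</p>

```python
def back_translate(labeled_pairs, unlabeled_targets, train, sample):
    """Three-step back translation. `train(pairs)` returns a model fit on
    (input, output) pairs; `sample(model, inp)` draws one output."""
    # Step 1: train forward f (x -> y) and reverse g (y -> x) on labeled data
    f = train(labeled_pairs)
    g = train([(y, x) for x, y in labeled_pairs])
    # Step 2: back translate unlabeled targets into synthetic source molecules
    synthetic = [(sample(g, y_u), y_u) for y_u in unlabeled_targets]
    # Step 3: retrain the forward model on labeled + synthetic pairs
    f = train(labeled_pairs + synthetic)
    return f, synthetic

# Toy stand-ins: "training" memorizes pairs; "sampling" is a dictionary lookup.
toy_train = lambda pairs: dict(pairs)
toy_sample = lambda model, inp: model.get(inp, inp)
```

<p>In the paper, Step 3 warm-starts from the Step 1 parameters rather than training from scratch, and Step 2 may be followed by the filtration described below.</p>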
<p>A key practical finding is that data filtration matters. When using large amounts of unlabeled data (1M molecules), keeping only the synthetic pairs that satisfy the same constraints as the labeled data (e.g., similarity thresholds and property ranges) significantly improves performance over using all back-translated data unfiltered.</p>
<h2 id="experiments-on-property-optimization-and-retrosynthesis">Experiments on Property Optimization and Retrosynthesis</h2>
<h3 id="molecular-property-improvement">Molecular Property Improvement</h3>
<p>The authors evaluate on four tasks from Jin et al. (2019, 2020), each requiring the model to improve a specific molecular property while maintaining structural similarity (measured by Dice similarity on Morgan fingerprints):</p>
<ul>
<li><strong>LogP</strong> (penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">partition coefficient</a>): two settings with similarity thresholds $\delta \geq 0.4$ and $\delta \geq 0.6$</li>
<li><strong>QED</strong> (quantitative estimation of drug-likeness): translate molecules from QED range [0.7, 0.8] to [0.9, 1.0]</li>
<li><strong>DRD2</strong> (<a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine type 2 receptor</a> activity): translate inactive ($P &lt; 0.5$) to active ($P \geq 0.5$)</li>
</ul>
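<p>The similarity constraint uses Dice similarity, which for binary fingerprints represented as sets of on-bit indices reduces to $2|A \cap B| / (|A| + |B|)$. A minimal sketch (in practice one would use RDKit's fingerprint and similarity utilities):</p>

```python
def dice_similarity(bits_a, bits_b):
    """Dice similarity between two binary fingerprints given as
    sets of on-bit indices: 2|A ∩ B| / (|A| + |B|)."""
    if not bits_a and not bits_b:
        return 1.0  # convention for two empty fingerprints
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))
```
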
<p>Two backbone architectures are tested: a Transformer (6 layers, 4 heads, 128-dim embeddings, 512-dim FFN) and HierG2G, a hierarchical graph-to-graph translation model. Unlabeled molecules are sampled from ZINC at 250K and 1M scales.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>LogP ($\delta \geq 0.6$)</th>
          <th>LogP ($\delta \geq 0.4$)</th>
          <th>QED (%)</th>
          <th>DRD2 (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>0.28</td>
          <td>1.03</td>
          <td>8.8</td>
          <td>3.4</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>0.79</td>
          <td>2.49</td>
          <td>9.4</td>
          <td>4.4</td>
      </tr>
      <tr>
          <td>JTNN</td>
          <td>2.33</td>
          <td>3.55</td>
          <td>59.9</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Transformer baseline</td>
          <td>2.45</td>
          <td>3.69</td>
          <td>71.9</td>
          <td>60.2</td>
      </tr>
      <tr>
          <td>+BT (1M, filtered)</td>
          <td>2.86</td>
          <td>4.41</td>
          <td>82.9</td>
          <td>67.4</td>
      </tr>
      <tr>
          <td>HierG2G baseline</td>
          <td>2.49</td>
          <td>3.98</td>
          <td>76.9</td>
          <td>85.9</td>
      </tr>
      <tr>
          <td>+BT (250K, filtered)</td>
          <td>2.75</td>
          <td>4.24</td>
          <td>79.1</td>
          <td>87.3</td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis-prediction">Retrosynthesis Prediction</h3>
<p>On the USPTO-50K benchmark (50K reactions, 10 reaction types, 80/10/10 train/val/test split), the method is applied to Transformer and GLN (Graph Logic Network) backbones. For other approaches to this benchmark, see <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a> and <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/data-transfer-seq-to-seq-retrosynthesis/">Data Transfer for Retrosynthesis</a>. Unlabeled reactant sets are constructed by sampling molecules from ZINC and concatenating them following the training data&rsquo;s reactant count distribution ($N_1 : N_2 : N_3 = 29.3\% : 70.4\% : 0.3\%$).</p>
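<p>Constructing pseudo-reactant sets from this count distribution can be sketched with the standard library; the molecule pool and dot-joining here are illustrative:</p>

```python
import random

REACTANT_COUNT_WEIGHTS = (0.293, 0.704, 0.003)  # N1 : N2 : N3 from the training data

def sample_reactant_set(pool, weights=REACTANT_COUNT_WEIGHTS, rng=random):
    """Draw 1, 2, or 3 distinct molecules from an unlabeled pool, matching
    the training data's reactant-count distribution, and join them with '.'
    (the SMILES separator for disconnected components)."""
    n = rng.choices([1, 2, 3], weights=weights)[0]
    return ".".join(rng.sample(pool, n))
```
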
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Top-1</th>
          <th>Top-3</th>
          <th>Top-5</th>
          <th>Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Reaction type given</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>64.2</td>
          <td>79.1</td>
          <td>85.2</td>
          <td>90.0</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>67.9</td>
          <td>82.5</td>
          <td>87.3</td>
          <td>91.5</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>52.2</td>
          <td>68.2</td>
          <td>72.7</td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>55.9</td>
          <td>72.8</td>
          <td>77.8</td>
          <td>79.7</td>
      </tr>
      <tr>
          <td><strong>Reaction type unknown</strong></td>
          <td></td>
          <td></td>
          <td></td>
          <td></td>
      </tr>
      <tr>
          <td>GLN</td>
          <td>52.5</td>
          <td>69.0</td>
          <td>75.6</td>
          <td>83.7</td>
      </tr>
      <tr>
          <td>Ours + GLN</td>
          <td>54.7</td>
          <td>70.2</td>
          <td>77.0</td>
          <td>84.4</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>37.9</td>
          <td>57.3</td>
          <td>62.7</td>
          <td>68.1</td>
      </tr>
      <tr>
          <td>Ours + Transformer</td>
          <td>43.5</td>
          <td>58.8</td>
          <td>64.6</td>
          <td>69.7</td>
      </tr>
  </tbody>
</table>
<p>The improvements are largest at lower $k$ values (top-1 and top-3), suggesting that back translation helps the model make more precise high-confidence predictions.</p>
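<p>Top-$k$ accuracy here is exact-match rank accuracy: a test case counts as correct if the ground-truth reactant set appears among the model's $k$ highest-ranked candidates. A minimal sketch:</p>

```python
def top_k_accuracy(predictions, targets, k):
    """Fraction of test cases whose ground-truth string appears among the
    model's k highest-ranked predictions. `predictions` is a list of ranked
    candidate lists, aligned index-by-index with `targets`."""
    hits = sum(target in preds[:k] for preds, target in zip(predictions, targets))
    return hits / len(targets)
```
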
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Effect of unlabeled data size</strong>: On retrosynthesis with Transformer, performance improves as unlabeled data increases from 50K to 250K, then plateaus or declines beyond 250K. The authors attribute this to noise in the back-translated data outweighing the benefits at larger scales.</p>
<p><strong>Effect of labeled data size</strong>: With only 5K labeled samples, adding back-translated data hurts performance because the reverse model is too weak to generate useful synthetic data. As labeled data increases (10K, 25K, 50K), the benefit of back translation grows. This confirms that the method requires a reasonably well-trained reverse model to be effective.</p>
<p><strong>Data filtration</strong>: Scaling to 1M unfiltered back-translated molecules can underperform (e.g., on QED, 1M unfiltered data reaches only 75.1%, barely above the 71.9% baseline and well below the 82.9% achieved with filtering), while filtering to enforce the same constraints as the labeled data recovers and exceeds the 250K filtered results.</p>
<h2 id="consistent-gains-across-architectures-and-tasks">Consistent Gains Across Architectures and Tasks</h2>
<p>The method achieves state-of-the-art results on all four molecular property improvement tasks and the USPTO-50K retrosynthesis benchmark at time of publication. Several observations stand out:</p>
<ol>
<li><strong>Architecture agnosticism</strong>: Back translation improves both sequence-based (Transformer) and graph-based (HierG2G, GLN) models, confirming that the approach is independent of the underlying architecture.</li>
<li><strong>Filtration is essential at scale</strong>: Unfiltered 1M back-translated data can degrade performance, but filtered data at the same scale consistently outperforms smaller unfiltered sets.</li>
<li><strong>Training overhead is moderate</strong>: On the DRD2 task, the full back-translation pipeline with Transformer takes about 2.5x the supervised training time: 8.5h of initial training, under 1 hour for the back-translation step itself, and 11.0h of retraining on the augmented data.</li>
<li><strong>Diversity and novelty increase</strong>: Back translation improves both diversity (average pairwise distance among generated molecules) and novelty (fraction of generated molecules not seen in training) across QED and DRD2 tasks.</li>
</ol>
<p>The authors acknowledge limitations: the method does not form a closed loop between forward and reverse models (as in dual learning approaches), and the data filtration strategy is rule-based rather than learned. They suggest joint training of forward and reverse models and learned filtration as future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (property improvement)</td>
          <td>Jin et al. (2019, 2020) datasets</td>
          <td>34K-99K pairs</td>
          <td>LogP, QED, DRD2 tasks</td>
      </tr>
      <tr>
          <td>Training (retrosynthesis)</td>
          <td>USPTO-50K</td>
          <td>40K reactions</td>
          <td>80/10/10 split from Dai et al. (2019)</td>
      </tr>
      <tr>
          <td>Unlabeled molecules</td>
          <td>ZINC</td>
          <td>250K or 1M</td>
          <td>Randomly sampled</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Same as training</td>
          <td>800-1000 test samples</td>
          <td>Per-task test sets</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Back translation with optional data filtration</li>
<li>Beam search with $k=20$ for inference</li>
<li>Random sampling for back-translation step (Equation 5)</li>
<li>Dice similarity on Morgan fingerprints for similarity constraint</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Transformer</strong>: 6 layers, 4 attention heads, 128-dim embeddings, 512-dim FFN (for property improvement); 4 layers, 8 heads, 256-dim embeddings, 2048-dim FFN (for retrosynthesis)</li>
<li><strong>HierG2G</strong>: Settings from Jin et al. (2020)</li>
<li><strong>GLN</strong>: Settings from Dai et al. (2019)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.6$)</td>
          <td>2.86</td>
          <td>2.49 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>LogP improvement</td>
          <td>LogP ($\delta \geq 0.4$)</td>
          <td>4.41</td>
          <td>3.98 (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>QED</td>
          <td>82.9%</td>
          <td>76.9% (HierG2G)</td>
          <td>Transformer + BT(1M, filtered)</td>
      </tr>
      <tr>
          <td>Success rate</td>
          <td>DRD2</td>
          <td>87.3%</td>
          <td>85.9% (HierG2G)</td>
          <td>HierG2G + BT(250K, filtered)</td>
      </tr>
      <tr>
          <td>Top-1 accuracy</td>
          <td>USPTO-50K (known type)</td>
          <td>67.9%</td>
          <td>64.2% (GLN)</td>
          <td>Ours + GLN</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper reports training times (8.5h for Transformer, 16.8h for HierG2G on DRD2 with 1M unlabeled data) but does not specify the GPU hardware used.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/fyabc/BT4MolGen">BT4MolGen</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation in Python</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fan, Y., Xia, Y., Zhu, J., Wu, L., Xie, S., &amp; Qin, T. (2021). Back translation for molecule generation. <em>Bioinformatics</em>, 38(5), 1244-1251. <a href="https://doi.org/10.1093/bioinformatics/btab817">https://doi.org/10.1093/bioinformatics/btab817</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{fan2022back,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Back translation for molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fan, Yang and Xia, Yingce and Zhu, Jinhua and Wu, Lijun and Xie, Shufang and Qin, Tao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{38}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1244--1251}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bioinformatics/btab817}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based generative models produce some percentage of invalid outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
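<p>Bucketing invalid SMILES by RDKit's error messages can be sketched as pattern matching; the patterns below are illustrative guesses at the message text (which varies by RDKit version), not the paper's exact rules:</p>

```python
import re

# Illustrative patterns for RDKit-style parse/sanitization messages.
ERROR_PATTERNS = [
    ("unclosed_ring", re.compile(r"unclosed ring", re.I)),
    ("parentheses", re.compile(r"parenthes", re.I)),
    ("bond_exists", re.compile(r"bond already exists", re.I)),
    ("aromaticity", re.compile(r"aromatic|kekulize", re.I)),
    ("valence", re.compile(r"valence", re.I)),
]

def classify_error(message):
    """Map an RDKit error message to one of the paper's six categories;
    anything unmatched falls back to the generic syntax-error bucket."""
    for label, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            return label
    return "syntax"
```
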
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/computational-chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
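<p>The error-introduction step can be sketched as random character-level edits; the character set and edit operations below are illustrative simplifications of the paper's syntax-aware perturbations (which also draw fragments from GDB-8):</p>

```python
import random

SMILES_CHARS = list("CNOFSPcnos123456()=#[]")  # illustrative character set

def corrupt(smiles, n_errors=1, rng=random):
    """Introduce `n_errors` random character-level perturbations
    (substitution, insertion, or deletion) into a valid SMILES string,
    yielding a likely-invalid input for corrector training."""
    chars = list(smiles)
    for _ in range(n_errors):
        op = rng.choice(["substitute", "insert", "delete"])
        i = rng.randrange(len(chars))
        if op == "substitute":
            chars[i] = rng.choice(SMILES_CHARS)
        elif op == "insert":
            chars.insert(i, rng.choice(SMILES_CHARS))
        elif len(chars) > 1:  # deletion, but never empty the string
            del chars[i]
    return "".join(chars)
```

<p>A synthetic training pair is then <code>(corrupt(s, n_errors=12), s)</code> for each valid SMILES <code>s</code>, matching the best-performing multi-error setting reported below.</p>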
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
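<p>The workflow above can be sketched as a corrupt-then-correct loop, with <code>perturb</code> and <code>corrector</code> as hypothetical stand-ins for the error-introduction routine and the trained transformer:</p>

```python
def explore_analogs(seed_smiles, perturb, corrector, n_samples=1000):
    """Local chemical-space exploration: repeatedly perturb a known
    active's SMILES, repair each variant with the corrector, and keep
    the unique results."""
    analogs = {corrector(perturb(seed_smiles)) for _ in range(n_samples)}
    analogs.discard(seed_smiles)  # the corrector often regenerates the input
    return analogs
```
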
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector performance drops when applied to real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> and <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15, 22.</p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RetMol: Retrieval-Based Controllable Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/</guid><description>RetMol uses retrieval-augmented generation to steer a pre-trained molecular model toward desired properties using only a handful of exemplar molecules.</description><content:encoded><![CDATA[<h2 id="retrieval-augmented-generation-for-molecules">Retrieval-Augmented Generation for Molecules</h2>
<p>This is a <strong>Method</strong> paper that introduces RetMol, a retrieval-based framework for controllable molecule generation. The key idea is to guide a pre-trained generative model using a small set of exemplar molecules that partially satisfy the desired design criteria, retrieved from a task-specific database. The approach requires no task-specific fine-tuning of the generative backbone and works effectively with very few exemplar molecules (as few as 23).</p>
<h2 id="limitations-of-existing-controllable-generation">Limitations of Existing Controllable Generation</h2>
<p>Existing approaches to controllable molecule generation fall into three categories, each with drawbacks:</p>
<ol>
<li><strong>Reinforcement learning (RL)-based methods</strong> require task-specific fine-tuning of the generative model for each new objective</li>
<li><strong>Supervised learning (SL)-based methods</strong> need molecules with desired properties as training data, which may be scarce</li>
<li><strong>Latent optimization-based methods</strong> require training property predictors in the latent space, which is challenging with limited active molecules and incompatible with variable-length latent spaces like those in transformers</li>
</ol>
<p>RetMol addresses all three issues by keeping the generative backbone frozen and using a lightweight, task-agnostic retrieval module that can be applied to new tasks simply by swapping the retrieval database.</p>
<h2 id="the-retmol-framework">The RetMol Framework</h2>
<p>RetMol consists of four components built around a pre-trained encoder-decoder backbone (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>, a BART variant trained on ZINC):</p>
<h3 id="retrieval-database">Retrieval Database</h3>
<p>A task-specific collection of exemplar molecules that at least partially satisfy the design criteria. The database can be very small (e.g., 23 known inhibitors for the SARS-CoV-2 task) and is dynamically updated during inference with newly generated molecules.</p>
<h3 id="molecule-retriever">Molecule Retriever</h3>
<p>A heuristic-based module that selects the $K$ most relevant exemplar molecules (default $K = 10$). It first constructs a feasible set of molecules satisfying all constraints, then selects those with the best property scores. If too few molecules satisfy all constraints, it progressively relaxes constraints until enough candidates are available.</p>
<h3 id="information-fusion-via-cross-attention">Information Fusion via Cross-Attention</h3>
<p>The core trainable component. Retrieved exemplar embeddings are fused with the input molecule embedding using cross-attention:</p>
<p>$$\boldsymbol{e} = f_{\text{CA}}(\boldsymbol{e}_{\text{in}}, \boldsymbol{E}_r; \theta) = \text{Attn}(\text{Query}(\boldsymbol{e}_{\text{in}}), \text{Key}(\boldsymbol{E}_r)) \cdot \text{Value}(\boldsymbol{E}_r)$$</p>
<p>where $\boldsymbol{e}_{\text{in}} = \text{Enc}(x_{\text{in}}) \in \mathbb{R}^{L \times D}$ is the input embedding and $\boldsymbol{E}_r = [\boldsymbol{e}_r^1, \ldots, \boldsymbol{e}_r^K]$ are the retrieved exemplar embeddings. This module adds less than 5% parameter overhead (460K parameters over the 10M base model).</p>
<h3 id="self-supervised-training-nearest-neighbor-prediction">Self-Supervised Training: Nearest Neighbor Prediction</h3>
<p>Rather than reconstructing the input molecule (which would make the retrieval module unnecessary), RetMol trains the fusion module to predict the nearest neighbor of the input:</p>
<p>$$\mathcal{L}(\theta) = \sum_{i=1}^{B} \text{CE}\left(\text{Dec}\left(f_{\text{CA}}(\boldsymbol{e}_{\text{in}}^{(i)}, \boldsymbol{E}_r^{(i)}; \theta)\right), x_{\text{1NN}}^{(i)}\right)$$</p>
<p>The remaining $K - 1$ nearest neighbors serve as the retrieved exemplar molecules. This forces the fusion module to learn how to use exemplar molecules to transform the input toward a related target. Only the fusion module parameters are updated; the encoder and decoder remain frozen.</p>
<h2 id="iterative-refinement-at-inference">Iterative Refinement at Inference</h2>
<p>During inference, RetMol uses an iterative process:</p>
<ol>
<li>Encode the input molecule and retrieved exemplars</li>
<li>Fuse embeddings via cross-attention</li>
<li>Perturb the fused embedding $M$ times with Gaussian noise</li>
<li>Greedily decode $M$ candidate molecules</li>
<li>Replace the input with the best candidate if it improves upon the current score</li>
<li>Add remaining good candidates to the retrieval database</li>
<li>Repeat until convergence or a maximum number of iterations</li>
</ol>
<p>The dynamic update of the retrieval database is critical for extrapolating beyond the initial set of exemplar molecules.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<p>RetMol is evaluated on four tasks of increasing difficulty:</p>
<h3 id="qed-optimization-under-similarity-constraint">QED Optimization Under Similarity Constraint</h3>
<p>Goal: generate molecules with QED $\geq$ 0.9 while maintaining <a href="https://en.wikipedia.org/wiki/Tanimoto_coefficient">Tanimoto similarity</a> $\geq$ 0.4 to the input. RetMol achieves 94.5% success rate, compared to 92.8% for the previous best (QMO).</p>
<h3 id="penalized-logp-optimization">Penalized LogP Optimization</h3>
<p>Goal: maximize penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">LogP</a> while maintaining structural similarity. At $\delta = 0.4$, RetMol achieves 11.55 average improvement, compared to 7.71 for QMO.</p>
<h3 id="gsk3beta--jnk3-dual-inhibitor-design"><a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>$\beta$ + <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> Dual Inhibitor Design</h3>
<p>Goal: simultaneously satisfy four constraints (GSK3$\beta$ inhibition $\geq$ 0.5, JNK3 inhibition $\geq$ 0.5, QED $\geq$ 0.6, SA $\leq$ 4). Results:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Success %</th>
          <th>Novelty</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>47.9</td>
          <td>0.561</td>
          <td>0.621</td>
      </tr>
      <tr>
          <td>RationaleRL</td>
          <td>74.8</td>
          <td>0.568</td>
          <td>0.701</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>92.3</td>
          <td>0.824</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>MolEvol</td>
          <td>93.0</td>
          <td>0.757</td>
          <td>0.681</td>
      </tr>
      <tr>
          <td>RetMol</td>
          <td>96.9</td>
          <td>0.862</td>
          <td>0.732</td>
      </tr>
  </tbody>
</table>
<p>RetMol achieves this without task-specific fine-tuning and requires only 80 iterations compared to MARS&rsquo;s 550.</p>
<h3 id="sars-cov-2-main-protease-inhibitor-optimization"><a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 Main Protease</a> Inhibitor Optimization</h3>
<p>A real-world task using only 23 known inhibitors as the retrieval database and optimizing 8 weakly-binding drugs. Under the milder similarity constraint ($\delta = 0.4$), RetMol achieves 2.84 kcal/mol average binding affinity improvement versus 1.67 for Graph GA. Under the stricter constraint ($\delta = 0.6$), RetMol succeeds on 5/8 molecules versus 3/8 for Graph GA.</p>
<h2 id="key-analysis-findings">Key Analysis Findings</h2>
<ul>
<li><strong>Database size</strong>: Strong performance even with 100 molecules, already outperforming baselines on success rate</li>
<li><strong>Database quality</strong>: Molecules satisfying all four constraints give the best results (96.9%), but partial satisfaction still works reasonably (84.7% with two properties)</li>
<li><strong>Training objective</strong>: The nearest neighbor prediction objective outperforms conventional reconstruction on validity (0.902 vs. 0.834) and uniqueness (0.922 vs. 0.665)</li>
<li><strong>Dynamic database update</strong>: Essential for extrapolating beyond the initial retrieval database, generating molecules with property values exceeding the best in the original database</li>
</ul>
<h2 id="limitations">Limitations</h2>
<p>RetMol requires exemplar molecules that at least partially satisfy the design criteria. When such molecules are entirely unavailable, the framework cannot be applied. The method also relies on property predictors (for scoring and retrieval), whose accuracy directly affects generation quality. The iterative refinement process adds computational overhead at inference time, and the results depend on the Chemformer backbone&rsquo;s generation capabilities.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol</a></td>
          <td>Code</td>
          <td>NVIDIA Source Code License-NC</td>
          <td>Full training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/NVlabs/RetMol">NVlabs/RetMol (checkpoints)</a></td>
          <td>Model</td>
          <td>CC BY-NC-SA 4.0</td>
          <td>Pre-trained model checkpoints</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k and ChEMBL datasets for training. Task-specific retrieval databases constructed from these datasets. COVID-19 task uses 23 known SARS-CoV-2 Mpro inhibitors.</p>
<p><strong>Training</strong>: Information fusion module trained on 4x V100 GPUs (16GB each) for approximately 2 hours. Batch size of 256 per GPU, 50K iterations.</p>
<p><strong>Inference</strong>: Single V100 GPU. Greedy decoding with Gaussian perturbation ($\sigma = 1$) for sampling multiple candidates per iteration.</p>
<p><strong>Backbone</strong>: Chemformer (BART variant) pre-trained on ZINC. Frozen during RetMol training and inference.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wang, Z., Nie, W., Qiao, Z., Xiao, C., Baraniuk, R. G., &amp; Anandkumar, A. (2023). Retrieval-based Controllable Molecule Generation. <em>Proceedings of the Eleventh International Conference on Learning Representations (ICLR 2023)</em>.</p>
<p><strong>Publication</strong>: International Conference on Learning Representations (ICLR) 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/NVlabs/RetMol">GitHub: NVlabs/RetMol</a></li>
<li><a href="https://openreview.net/forum?id=vDFA1tpuLvk">OpenReview</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{wang2023retrieval,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Retrieval-based Controllable Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wang, Zichao and Nie, Weili and Qiao, Zhuoran and Xiao, Chaowei and Baraniuk, Richard G. and Anandkumar, Anima}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=vDFA1tpuLvk}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LIMO: Latent Inceptionism for Targeted Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</guid><description>LIMO uses gradient-based optimization through a VAE latent space and stacked property predictor to generate drug-like molecules with high binding affinity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eckmann, P., Sun, K., Zhao, B., Feng, M., Gilson, M. K., &amp; Yu, R. (2022). LIMO: Latent Inceptionism for Targeted Molecule Generation. <em>Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</em>, PMLR 162, 5777&ndash;5792.</p>
<p><strong>Publication</strong>: ICML 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Rose-STL-Lab/LIMO">GitHub: Rose-STL-Lab/LIMO</a></li>
<li><a href="https://arxiv.org/abs/2206.09010">arXiv: 2206.09010</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{eckmann2022limo,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LIMO: Latent Inceptionism for Targeted Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5777--5792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">organization</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="gradient-based-reverse-optimization-in-molecular-latent-space">Gradient-Based Reverse Optimization in Molecular Latent Space</h2>
<p>This is a <strong>Method</strong> paper that introduces LIMO, a framework for generating molecules with desired properties using gradient-based optimization on a VAE latent space. The key innovation is a stacked architecture where a property predictor operates on the decoded molecular representation rather than directly on the latent space, combined with an inceptionism-like technique that backpropagates through the frozen decoder and predictor to optimize the latent code. This approach is 6-8x faster than RL baselines and 12x faster than sampling-based approaches while producing molecules with higher binding affinities.</p>
<h2 id="slow-property-optimization-in-existing-methods">Slow Property Optimization in Existing Methods</h2>
<p>Generating molecules with high binding affinity to target proteins is a central goal of early drug discovery, but existing computational approaches are slow when optimizing for properties that are expensive to evaluate (such as docking-based binding affinity). RL-based methods require many calls to the property function during training. Sampling-based approaches like MARS need hundreds of iterations. Latent optimization methods that predict properties directly from the latent space suffer from poor prediction accuracy because the mapping from latent space to molecular properties is difficult to learn.</p>
<h2 id="the-limo-framework">The LIMO Framework</h2>
<p>LIMO consists of three components: a VAE for learning a molecular latent space, a property predictor with a novel stacked architecture, and a gradient-based reverse optimization procedure.</p>
<h3 id="selfies-based-vae">SELFIES-Based VAE</h3>
<p>The VAE encodes molecules represented as SELFIES strings into a 1024-dimensional latent space $\mathbf{z} \in \mathbb{R}^m$ and decodes to probability distributions over SELFIES symbols. Since all SELFIES strings correspond to valid molecules, this guarantees 100% chemical validity. The output molecule is obtained by taking the argmax at each position:</p>
<p>$$\hat{x}_i = s_{d_i^*}, \quad d_i^* = \operatorname{argmax}_{d} \{y_{i,1}, \ldots, y_{i,d}\}$$</p>
<p>The VAE uses fully-connected layers (not recurrent), with a 64-dimensional embedding layer, four batch-normalized linear layers (2000-dimensional first layer, 1000-dimensional for the rest) with ReLU activation, and is trained with ELBO loss (0.9 weight on reconstruction, 0.1 on KL divergence).</p>
<h3 id="stacked-property-predictor">Stacked Property Predictor</h3>
<p>The critical architectural choice: the property predictor $g_\theta$ takes the decoded molecular representation $\hat{\mathbf{x}}$ as input rather than the latent code $\mathbf{z}$. The predictor is trained after the VAE is frozen by minimizing MSE on VAE-generated molecules:</p>
<p>$$\ell_0(\theta) = \left\| g_\theta\left(f_{\text{dec}}(\mathbf{z})\right) - \pi\left(f_{\text{dec}}(\mathbf{z})\right) \right\|^2$$</p>
<p>where $\pi$ is the ground-truth property function. This stacking improves prediction accuracy from $r^2 = 0.04$ (predicting from $\mathbf{z}$) to $r^2 = 0.38$ (predicting from $\hat{\mathbf{x}}$) on an unseen test set. The improvement comes because the mapping from molecular space to property is easier to learn than the mapping from latent space to property.</p>
<h3 id="reverse-optimization-inceptionism">Reverse Optimization (Inceptionism)</h3>
<p>After training, the decoder and predictor weights are frozen and $\mathbf{z}$ becomes the trainable parameter. For multiple properties with weights $(w_1, \ldots, w_k)$, the optimization minimizes:</p>
<p>$$\ell_1(\mathbf{z}) = -\sum_{i=1}^{k} w_i \cdot g^i\left(f_{\text{dec}}(\mathbf{z})\right)$$</p>
<p>Since both the decoder and predictor are neural networks, gradients flow through the entire chain, enabling efficient optimization with Adam. This is analogous to the &ldquo;inceptionism&rdquo; (DeepDream) technique from computer vision, where network inputs are optimized to maximize specific outputs.</p>
<h3 id="substructure-constrained-optimization">Substructure-Constrained Optimization</h3>
<p>For lead optimization, LIMO can fix a molecular substructure during optimization by adding a regularization term:</p>
<p>$$\ell_2(\mathbf{z}) = \lambda \sum_{i=1}^{n} \sum_{j=1}^{d} \left(M_{i,j} \cdot \left(f_{\text{dec}}(\mathbf{z})_{i,j} - (\hat{\mathbf{x}}_{\text{start}})_{i,j}\right)\right)^2$$</p>
<p>where $M$ is a binary mask specifying which SELFIES positions must remain unchanged and $\lambda = 1000$. This capability is enabled by the intermediate decoded representation, which most VAE-based methods lack.</p>
<h2 id="experiments-and-results">Experiments and Results</h2>
<h3 id="benchmark-tasks-qed-and-penalized-logp">Benchmark Tasks (QED and Penalized LogP)</h3>
<p>LIMO achieves results competitive with deep generative and RL-based models in 1 hour, compared to 8-24 hours for baselines. Top QED score: 0.947 (maximum possible: 0.948). Top penalized LogP: 10.5 (among length-limited models, comparable to MolDQN&rsquo;s 11.8).</p>
<p>The ablation study (&ldquo;LIMO on z&rdquo;) confirms the value of the stacked predictor architecture: predicting from $\hat{\mathbf{x}}$ yields a top p-logP of 10.5 versus 6.52 when predicting directly from $\mathbf{z}$.</p>
<h3 id="binding-affinity-maximization">Binding Affinity Maximization</h3>
<p>The primary contribution. LIMO generates molecules with substantially higher computed binding affinities (lower $K_D$) than baselines against two protein targets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>ESR1 best $K_D$ (nM)</th>
          <th>ACAA1 best $K_D$ (nM)</th>
          <th>Time (hrs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>6.4</td>
          <td>75</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MolDQN</td>
          <td>373</td>
          <td>240</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>17</td>
          <td>163</td>
          <td>6</td>
      </tr>
      <tr>
          <td>GraphDF</td>
          <td>25</td>
          <td>370</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>37</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p>For ESR1, LIMO&rsquo;s best molecule has a $K_D$ of 0.72 nM from docking, nearly 10x better than the next method (GCPN at 6.4 nM). When corroborated with more rigorous absolute binding free energy (ABFE) calculations, one LIMO compound achieved a predicted $K_D$ of $6 \times 10^{-14}$ M (0.00006 nM), far exceeding the affinities of approved drugs tamoxifen ($K_D$ = 1.5 nM) and raloxifene ($K_D$ = 0.03 nM).</p>
<h3 id="multi-objective-optimization">Multi-Objective Optimization</h3>
<p>Single-objective optimization produces molecules with high affinity but problematic structures (polyenes, large rings). Multi-objective optimization simultaneously targeting binding affinity, QED ($&gt;$ 0.4), and SA ($&lt;$ 5.5) produces drug-like, synthesizable molecules that still have nanomolar binding affinities. Generated molecules satisfy Lipinski&rsquo;s rule of 5 with zero PAINS alerts.</p>
<h2 id="limitations">Limitations</h2>
<p>The LIMO property predictor achieves only moderate prediction accuracy ($r^2$ = 0.38), meaning the optimization relies on gradient direction being correct rather than absolute predictions being accurate. AutoDock-GPU docking scores do not correlate well with the more accurate ABFE results, a known limitation of docking. The fully-connected VAE architecture limits the molecular diversity compared to recurrent or attention-based alternatives (LSTM decoder produced max QED of only 0.3). The greedy fine-tuning step (replacing carbons with heteroatoms) is a heuristic rather than a learned procedure.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Rose-STL-Lab/LIMO">Rose-STL-Lab/LIMO</a></td>
          <td>Code</td>
          <td>UC San Diego Custom (non-commercial)</td>
          <td>Full training, optimization, and evaluation code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k dataset for optimization tasks. MOSES dataset for random generation evaluation. Binding affinities computed with AutoDock-GPU.</p>
<p><strong>Hardware</strong>: Two GTX 1080 Ti GPUs (one for PyTorch, one for AutoDock-GPU), 4 CPU cores, 32 GB memory.</p>
<p><strong>Training</strong>: VAE trained for 18 epochs with learning rate 0.0001. Property predictor uses 3 layers of 1000 units, trained for 5 epochs. Reverse optimization uses learning rate 0.1 for 10 epochs.</p>
<p><strong>Targets</strong>: Human estrogen receptor (ESR1, PDB 1ERR) and human peroxisomal acetyl-CoA acyl transferase 1 (ACAA1, PDB 2IIK).</p>
]]></content:encoded></item><item><title>MolGen: Molecular Generation with Chemical Feedback</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</link><pubDate>Fri, 20 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/</guid><description>MolGen pre-trains on SELFIES molecules and uses chemical feedback to align generated molecules with real-world chemical preferences across domains.</description><content:encoded><![CDATA[<h2 id="a-selfies-based-method-for-molecular-generation">A SELFIES-Based Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on 100M+ molecular SELFIES to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model&rsquo;s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.</p>
<h2 id="challenges-in-language-model-based-molecule-generation">Challenges in Language Model-Based Molecule Generation</h2>
<p>Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:</p>
<ol>
<li><strong>Syntactic invalidity</strong>: <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.</li>
<li><strong>Narrow domain focus</strong>: Most existing models focus exclusively on synthetic molecules and neglect <a href="https://en.wikipedia.org/wiki/Natural_product">natural products</a>, which have distinct structural complexity and scaffold diversity.</li>
<li><strong>Molecular hallucinations</strong>: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that &ldquo;comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.&rdquo;</li>
<li><strong>Limited optimization signals</strong>: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.</li>
</ol>
<h2 id="core-innovation-pre-training-with-selfies-and-chemical-feedback">Core Innovation: Pre-training with SELFIES and Chemical Feedback</h2>
<p>MolGen&rsquo;s novelty rests on three interconnected components.</p>
<h3 id="selfies-based-pre-training">SELFIES-Based Pre-training</h3>
<p>MolGen uses <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.</p>
<p>The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:</p>
<p>$$
\mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{&lt; j}) \log p_{\theta}(s \mid S, S_{&lt; j}; \theta)
$$</p>
<p>where $S_{&lt; j}$ denotes the partial sequence $\{s_0, \ldots, s_{j-1}\}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.</p>
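<p>With a one-hot $p_{\text{true}}$, the objective reduces to ordinary token-level cross-entropy over the decoder's left-to-right predictions. A minimal pure-Python sketch (toy probabilities, not real model outputs):</p>

```python
import math

def reconstruction_loss(target_ids, pred_probs):
    """Cross-entropy of left-to-right predictions against one-hot ground
    truth, i.e. L_ce with p_true a point mass on the target token.
    pred_probs[j] is the model's distribution over the vocabulary at step j."""
    return -sum(math.log(pred_probs[j][t]) for j, t in enumerate(target_ids))

# Toy example: 3-token vocabulary, 2-token target sequence.
probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},  # predicted distribution at step 1
    {0: 0.1, 1: 0.8, 2: 0.1},  # predicted distribution at step 2
]
loss = reconstruction_loss([0, 1], probs)  # = -log(0.7) - log(0.8)
```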
<h3 id="domain-agnostic-molecular-prefix-tuning">Domain-Agnostic Molecular Prefix Tuning</h3>
<p>The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:</p>
<p>$$
\text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right)
$$</p>
<p>This decomposes into a linear interpolation between prefix attention and standard attention:</p>
<p>$$
\text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v)
$$</p>
<p>where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.</p>
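<p>The interpolation identity can be checked numerically: softmax attention over the concatenated prefix and sequence scores equals a $\lambda$-weighted mix of the two separate attentions, with $\lambda$ the total normalized weight on the prefix slots. A scalar-valued sketch with toy numbers (not the paper's implementation):</p>

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attn(scores, values):
    """Softmax-weighted average of scalar values."""
    return sum(w * v for w, v in zip(softmax(scores), values))

# Toy scalar q-k scores for 2 prefix slots and 3 sequence positions.
prefix_scores, prefix_vals = [0.5, -0.2], [1.0, 2.0]
seq_scores, seq_vals = [0.1, 0.3, -0.4], [3.0, 4.0, 5.0]

# Full attention over [prefixes; sequence].
full = attn(prefix_scores + seq_scores, prefix_vals + seq_vals)

# lambda(x): total normalized attention weight landing on the prefix slots.
lam = sum(softmax(prefix_scores + seq_scores)[:2])

# Decomposition: lambda * prefix-only attention + (1 - lambda) * standard attention.
decomposed = lam * attn(prefix_scores, prefix_vals) + (1 - lam) * attn(seq_scores, seq_vals)
```

The identity is exact because the joint softmax normalizer factors into the prefix and sequence partition sums.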
<h3 id="chemical-feedback-paradigm">Chemical Feedback Paradigm</h3>
<p>To address molecular hallucinations, MolGen aligns the model&rsquo;s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the model should satisfy:</p>
<p>$$
p_{\text{true}}(S_i \mid S) &gt; p_{\text{true}}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^* \text{ with } \text{Ps}(S_i) &gt; \text{Ps}(S_j)
$$</p>
<p>This is enforced via a rank loss:</p>
<p>$$
\mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j &gt; i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right)
$$</p>
<p>where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{&lt; t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:</p>
<p>$$
\mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}}
$$</p>
<p>Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.</p>
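<p>A minimal sketch of the rank loss, assuming candidates are pre-sorted by descending property score (toy log-probabilities, not model outputs):</p>

```python
def rank_loss(log_probs, gamma=1.0):
    """Pairwise margin loss over candidates sorted by descending property
    score. log_probs[i] = f(S_i), the model's total sequence log-probability
    of the i-th best candidate; the margin grows with the rank gap,
    gamma_ij = (j - i) * gamma."""
    loss = 0.0
    n = len(log_probs)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, log_probs[j] - log_probs[i] + (j - i) * gamma)
    return loss

# Correctly ordered candidates whose gaps exceed the margins incur zero loss...
assert rank_loss([-1.0, -3.0, -5.0], gamma=1.0) == 0.0
# ...while a mis-ranked pair is penalized by its violation plus the margin.
assert rank_loss([-3.0, -1.0], gamma=1.0) == 3.0
```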
<h2 id="experiments-across-distribution-learning-and-property-optimization">Experiments Across Distribution Learning and Property Optimization</h2>
<h3 id="datasets">Datasets</h3>
<ul>
<li><strong>Stage 1 pre-training</strong>: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)</li>
<li><strong>Stage 2 pre-training</strong>: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains</li>
<li><strong>Downstream evaluation</strong>: MOSES synthetic dataset, ZINC250K, and natural product molecules</li>
</ul>
<h3 id="molecular-distribution-learning">Molecular Distribution Learning</h3>
<p>MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>, and Novelty). Baselines include AAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>, CharRNN, VAE, JT-VAE, LIMO, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Validity</th>
          <th>Frag</th>
          <th>Scaf</th>
          <th>SNN</th>
          <th>IntDiv</th>
          <th>FCD</th>
          <th>Novelty</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chemformer</td>
          <td>.9843</td>
          <td>.9889</td>
          <td>.9248</td>
          <td>.5622</td>
          <td>.8553</td>
          <td>.0061</td>
          <td>.9581</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>1.000</td>
          <td>.9999</td>
          <td>.9999</td>
          <td>.9996</td>
          <td>.8567</td>
          <td>.0015</td>
          <td>1.000</td>
      </tr>
  </tbody>
</table>
<p>On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer&rsquo;s 0.8346.</p>
<h3 id="targeted-molecule-discovery">Targeted Molecule Discovery</h3>
<p>For penalized logP maximization (top-3 scores):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1st</th>
          <th>2nd</th>
          <th>3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MARS (no length limit)</td>
          <td>44.99</td>
          <td>44.32</td>
          <td>43.81</td>
      </tr>
      <tr>
          <td>MolGen (no length limit)</td>
          <td>80.30</td>
          <td>74.70</td>
          <td>69.85</td>
      </tr>
      <tr>
          <td>MolGen (length-limited)</td>
          <td>30.51</td>
          <td>28.98</td>
          <td>28.95</td>
      </tr>
  </tbody>
</table>
<p>For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.</p>
<h3 id="molecular-docking">Molecular Docking</h3>
<p>MolGen optimizes binding affinity for two protein targets (<a href="https://en.wikipedia.org/wiki/Estrogen_receptor_alpha">ESR1</a> and ACAA1), measured by <a href="https://en.wikipedia.org/wiki/Dissociation_constant">dissociation constant</a> $K_D$ (lower is better):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>ESR1 1st</th>
          <th>ESR1 2nd</th>
          <th>ESR1 3rd</th>
          <th>ACAA1 1st</th>
          <th>ACAA1 2nd</th>
          <th>ACAA1 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>0.89</td>
          <td>1.4</td>
          <td>37</td>
          <td>37</td>
          <td>41</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>0.13</td>
          <td>0.35</td>
          <td>0.47</td>
          <td>3.36</td>
          <td>3.98</td>
          <td>8.50</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.</p>
<h3 id="constrained-molecular-optimization">Constrained Molecular Optimization</h3>
<p>Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$\delta = 0.6$</th>
          <th>$\delta = 0.4$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a></td>
          <td>3.78 (3.29)</td>
          <td>11.55 (11.27)</td>
      </tr>
      <tr>
          <td>MolGen</td>
          <td>12.08 (0.82)</td>
          <td>12.35 (1.21)</td>
      </tr>
  </tbody>
</table>
<p>MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<ul>
<li><strong>Chemical feedback</strong>: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.</li>
<li><strong>Prefix tuning</strong>: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.</li>
<li><strong>Label smoothing</strong>: Enhances diversity of generated molecules as measured by Internal Diversity.</li>
<li><strong>Substructure attention</strong>: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen&rsquo;s superior focus.</li>
</ul>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<h3 id="key-findings">Key Findings</h3>
<ol>
<li>SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.</li>
<li>Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.</li>
<li>The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.</li>
<li>MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Computational cost</strong>: Training and fine-tuning on large datasets is computationally intensive.</li>
<li><strong>Model interpretability</strong>: The transformer architecture makes it difficult to understand explicit rationale behind decisions.</li>
<li><strong>Single-target optimization only</strong>: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.</li>
<li><strong>Task specificity</strong>: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.</li>
<li><strong>Reaction prediction</strong>: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Stage 1 pre-training</td>
          <td>ZINC-15</td>
          <td>100M+ molecules</td>
          <td>MW $\leq$ 500 Da, LogP $\leq$ 5</td>
      </tr>
      <tr>
          <td>Stage 2 pre-training</td>
          <td>ZINC + <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> + NPASS</td>
          <td>2.22M molecules</td>
          <td>Synthetic and natural product domains</td>
      </tr>
      <tr>
          <td>Distribution learning (synthetic)</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></td>
          <td>~1.9M molecules</td>
          <td>Standard benchmark split</td>
      </tr>
      <tr>
          <td>Distribution learning (natural)</td>
          <td>NPASS</td>
          <td>30,926 compounds</td>
          <td>30,126 train / 800 test</td>
      </tr>
      <tr>
          <td>Constrained optimization</td>
          <td>ZINC250K</td>
          <td>800 molecules</td>
          <td>Lowest p-logP scores</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)</li>
<li><strong>Prefix length</strong>: 5 tunable vectors per layer</li>
<li><strong>Optimizer</strong>: LAMB (pre-training), AdamW (fine-tuning)</li>
<li><strong>Pre-training</strong>: 600M steps with linear warm-up (180,000 steps) followed by linear decay</li>
<li><strong>Rank loss weight</strong> ($\alpha$): Recommended values of 3 or 5</li>
<li><strong>Candidate generation</strong>: 30 candidates per molecule (synthetic), 8 candidates (natural products)</li>
</ul>
<h3 id="models">Models</h3>
<p>MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Domain</th>
          <th>MolGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a> (lower is better)</td>
          <td>Synthetic</td>
          <td>0.0015</td>
          <td>0.0061 (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a>)</td>
          <td>Distribution learning</td>
      </tr>
      <tr>
          <td>p-logP top-1 (no limit)</td>
          <td>Synthetic</td>
          <td>80.30</td>
          <td>44.99 (MARS)</td>
          <td>Targeted discovery</td>
      </tr>
      <tr>
          <td>QED top-1</td>
          <td>Synthetic</td>
          <td>0.948</td>
          <td>0.948 (several)</td>
          <td>Tied at maximum</td>
      </tr>
      <tr>
          <td>ESR1 $K_D$ top-1</td>
          <td>Docking</td>
          <td>0.13</td>
          <td>0.72 (LIMO)</td>
          <td>Binding affinity</td>
      </tr>
      <tr>
          <td>p-logP improvement ($\delta=0.4$)</td>
          <td>Synthetic</td>
          <td>12.35 (1.21)</td>
          <td>11.55 (11.27) (RetMol)</td>
          <td>Constrained optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>6 NVIDIA V100 GPUs</li>
<li>Pre-training batch size: 256 molecules per GPU</li>
<li>Fine-tuning batch size: 6 (synthetic and natural product)</li>
<li>Training: 100 epochs for fine-tuning tasks</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zjunlp/MolGen">zjunlp/MolGen</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official PyTorch implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/zjunlp">zjunlp/MolGen-large</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Pre-trained weights on Hugging Face</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., &amp; Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. <em>Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024)</em>.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/zjunlp/MolGen">GitHub: zjunlp/MolGen</a></li>
<li><a href="https://huggingface.co/zjunlp">Hugging Face Models</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fang2024domain,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Domain-Agnostic Molecular Generation with Chemical Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Twelfth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=9rPyHyjfwP}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GP-MoLFormer: Molecular Generation via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</link><pubDate>Thu, 25 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/gp-molformer/</guid><description>A 46.8M parameter transformer for molecular generation trained on 1.1B SMILES, introducing pair-tuning for efficient property optimization.</description><content:encoded><![CDATA[<h2 id="contribution-and-taxonomic-focus">Contribution and Taxonomic Focus</h2>
<p>This is primarily a <strong>Methodological</strong> paper, as it proposes a specific neural architecture (GP-MoLFormer) and a novel fine-tuning algorithm (Pair-tuning) for molecular generation. It validates these contributions against standard baselines (e.g., JT-VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b).</p>
<p>It also contains a secondary <strong>Theoretical</strong> contribution by establishing an empirical <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">scaling law</a> that relates inference compute (generation size) to the novelty of the generated molecules.</p>
<h2 id="motivation-data-scale-and-prompt-based-optimization">Motivation: Data Scale and Prompt-Based Optimization</h2>
<p>While large language models (LLMs) have transformed text generation, the impact of training data scale and memorization on <em>molecular</em> generative models remains under-explored. Specifically, there is a need to understand how training on billion-scale datasets affects the novelty of generated molecules and whether biases in public databases (like ZINC and PubChem) perpetuate memorization. Furthermore, existing optimization methods often require computationally expensive property predictors or reinforcement learning loops; there is a practical need for more efficient &ldquo;prompt-based&rdquo; optimization techniques.</p>
<h2 id="core-innovations-architecture-and-pair-tuning">Core Innovations: Architecture and Pair-Tuning</h2>
<ol>
<li><strong>Architecture</strong>: The application of a linear-attention transformer decoder with Rotary Positional Embeddings (RoPE) to generative chemistry, allowing for efficient training on 1.1 billion SMILES.</li>
<li><strong>Pair-Tuning</strong>: A novel, parameter-efficient fine-tuning method that uses property-ordered molecular pairs to learn &ldquo;soft prompts&rdquo; for optimization without updating the base model weights.</li>
<li><strong>Scaling Analysis</strong>: An extensive empirical investigation mapping the trade-off between inference compute (up to 10B generations) and chemical novelty, fitting an exponential decay curve that demonstrates how novelty saturates as generation volume grows.</li>
</ol>
<h2 id="experimental-methodology-and-downstream-tasks">Experimental Methodology and Downstream Tasks</h2>
<p>The authors evaluated GP-MoLFormer on three distinct tasks, though the comparisons highlight the difficulty of evaluating foundation models against classical baselines:</p>
<ol>
<li><strong>De Novo Generation</strong>: Comparing validity, uniqueness, and novelty against baselines (CharRNN, VAE, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/">LIMO</a>, MolGen-7b) on a held-out test set. Notably, this is an unequal comparison; most baselines were trained on the 1.6M molecule <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> dataset, whereas GP-MoLFormer uses up to 1.1B molecules, meaning performance gains are heavily driven by data scale.</li>
<li><strong>Scaffold-Constrained Decoration</strong>: Generating molecules from DRD2 active binder scaffolds and measuring the hit rate of active compounds against specialized scaffold decorators.</li>
<li><strong>Property-Guided Optimization</strong>: Using Pair-tuning to optimize for Drug-likeness (QED), Penalized <a href="https://en.wikipedia.org/wiki/Octanol-water_partition_coefficient">logP</a>, and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> binding activity, comparing the results to graph-based and reinforcement learning benchmarks.</li>
</ol>
<p>Additionally, they performed a <strong>Scaling Study</strong>:</p>
<ul>
<li>Comparing models trained on raw (1.1B) vs. de-duplicated (650M) data.</li>
<li>Generating up to 10 billion molecules to fit empirical scaling laws for novelty.</li>
</ul>
<h2 id="key-findings-and-scaling-laws">Key Findings and Scaling Laws</h2>
<ul>
<li><strong>Scale-Driven Performance</strong>: GP-MoLFormer achieves high internal diversity and validity on generation metrics. However, its baseline novelty percentage (~32%) is considerably lower than that of classical models. The authors attribute this to the massive training scale forcing the model to heavily prioritize matching real-world molecule frequencies over pure exploration. GP-MoLFormer&rsquo;s advantage in generation metrics over LLM baselines like <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/molgen-molecular-generation-chemical-feedback/">MolGen</a>-7b likely stems heavily from its 10x larger training dataset rather than fundamental architectural superiority.</li>
<li><strong>Pair-Tuning Efficacy</strong>: The proposed pair-tuning method effectively optimizes properties (e.g., improving DRD2 activity scores) without requiring full model fine-tuning or external reward loops. While successful, the text-based generation yields ~94.5% validity during optimization, which lags behind graph and SELFIES-based baselines that guarantee 100% structural validity.</li>
<li><strong>Memorization vs. Novelty</strong>: Training on de-duplicated data (GP-MoLFormer-UNIQ) yields higher novelty (approx. 5-8% higher) than training on raw data, confirming that duplication bias in public databases leads directly to memorization.</li>
<li><strong>Inference Scaling Law</strong>: Novelty decays exponentially with generation size ($y = ae^{-bx}$), yet the model maintains generative capability (~16.7% novelty) even after generating an unprecedented 10 billion molecules.</li>
</ul>
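<p>Because $\log y = \log a - bx$, the decay parameters of the novelty scaling law can be recovered by ordinary least squares on log-novelty. A sketch with synthetic numbers (the coefficients below are illustrative, not the paper's reported fit):</p>

```python
import math

def fit_exp_decay(xs, ys):
    """Fit y = a * exp(-b * x) by least squares on log(y) -- a toy recovery
    of a novelty-vs-generation-size curve; the data here is made up."""
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx, my = sum(xs) / n, sum(lys) / n
    slope = (sum((x - mx) * (ly - my) for x, ly in zip(xs, lys))
             / sum((x - mx) ** 2 for x in xs))
    b = -slope
    a = math.exp(my + b * mx)
    return a, b

# Noiseless synthetic data generated from a=0.32, b=0.05 is recovered exactly.
xs = [1, 2, 4, 8]
ys = [0.32 * math.exp(-0.05 * x) for x in xs]
a, b = fit_exp_decay(xs, ys)
```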
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Sources</strong>: A combination of <strong>PubChem</strong> (111M SMILES) and <strong>ZINC</strong> (1B SMILES) databases. Downloading and pre-training instructions are located in the repository&rsquo;s <code>data/README.md</code>.</li>
<li><strong>Preprocessing</strong>:
<ul>
<li>All SMILES were canonicalized using RDKit (no isomeric information).</li>
<li><strong>GP-MoLFormer (Base)</strong>: Trained on the full 1.1B dataset (includes duplicates).</li>
<li><strong>GP-MoLFormer-UNIQ</strong>: Trained on a de-duplicated subset of 650M SMILES.</li>
</ul>
</li>
<li><strong>Tokenization</strong>: Uses the tokenizer from Schwaller et al. (2019) with a vocabulary size of <strong>2,362 tokens</strong>.</li>
<li><strong>Filtering</strong>: Sequences restricted to a maximum length of <strong>202 tokens</strong>.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Pair-Tuning (Algorithm 1)</strong>:</p>
<ul>
<li><strong>Objective</strong>: Learn task-specific soft prompts $\phi_T$ to maximize the conditional probability of target molecule $b$ given a seed molecule $a$, where pair $(a, b)$ satisfies the property condition $b &gt; a$. The base model parameters $\theta$ remain frozen.</li>
<li><strong>Prompt Structure</strong>: Autoregressive training optimizes the continuous embeddings of $n$ enhancement tokens against the cross-entropy loss of the target sequence:
$$ \mathcal{L}(\phi_T) = - \sum_{i=1}^{|b|} \log P_{\theta}(b_i | \phi_T, a, b_{&lt;i}) $$</li>
<li><strong>Hyperparameters</strong>: Trained for 1,000 epochs with a batch size of 35 and a fixed learning rate of $3 \times 10^{-2}$.</li>
<li><strong>Inference</strong>: The learned prompt $\phi_T$ and seed molecule $a$ are prepended as context, and candidates are sampled autoregressively until a termination token is produced.</li>
</ul>
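<p>Schematically, the pair-tuning objective is the negative log-likelihood of the target molecule $b$ conditioned on the learned soft prompt and the seed $a$. The sketch below substitutes a stub conditional distribution for the frozen base model, so only the shape of the computation is real:</p>

```python
import math

def pair_tuning_loss(cond_prob, prompt, a, b):
    """Negative log-likelihood of target molecule b given soft prompt phi_T
    and seed molecule a, with the base model frozen. cond_prob(context, token)
    stands in for P_theta; here it is a toy stub, not a chemical LM."""
    nll = 0.0
    context = list(prompt) + list(a)
    for tok in b:
        nll -= math.log(cond_prob(tuple(context), tok))
        context.append(tok)  # autoregressive: condition on b_{<i} as well
    return nll

# Toy stub: uniform distribution over a 4-symbol vocabulary.
def uniform_prob(context, token):
    return 0.25

loss = pair_tuning_loss(uniform_prob,
                        prompt=["<p1>", "<p2>"],  # hypothetical soft-prompt slots
                        a=["C", "O"], b=["C", "N", "O"])
# Under the uniform stub, the loss equals |b| * log(4).
```

In the actual method, only the continuous embeddings behind the prompt slots receive gradients; everything else is frozen.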
<h3 id="models">Models</h3>
<ul>
<li><strong>Availability</strong>: The model trained on deduplicated data (GP-MoLFormer-UNIQ) is publicly available on <a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">Hugging Face</a>. The full 1.1B base model is not explicitly hosted. The source code repository includes a disclosure that IBM will not maintain the code going forward.</li>
<li><strong>Architecture</strong>: Transformer decoder (~47M parameters: 12 layers, 12 heads, hidden size 768).</li>
<li><strong>Attention Mechanism</strong>: Combines Linear Attention (Generalized Random Feature map, $\phi$) with Rotary Positional Embeddings (RoPE). To avoid the quadratic complexity of standard attention while maintaining relative positional awareness, RoPE is applied to queries ($Q$) and keys ($K$) prior to the random feature mapping:
$$ \text{Attention}(Q, K, V) = \frac{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle v_n}{\sum_{n=1}^N \langle \phi(R_m q_m), \phi(R_n k_n) \rangle} $$</li>
<li><strong>Inference Speed</strong>: ~3ms per forward pass on a single A100 GPU.</li>
</ul>
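<p>A toy 2-dimensional sketch of the mechanism: rotate queries and keys by a position-dependent angle (RoPE), pass them through a positive feature map, and take the normalized similarity-weighted average of the values. The exponential feature map below is a simple stand-in for the paper's generalized random features, not their actual $\phi$:</p>

```python
import math

def rope(vec, pos, theta=10000.0):
    """Rotate a 2-d query/key vector by a position-dependent angle (RoPE)."""
    ang = pos / theta  # single frequency suffices for a 2-d toy head
    c, s = math.cos(ang), math.sin(ang)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def feat(vec):
    """Positive feature map standing in for the random-feature phi."""
    return (math.exp(vec[0]), math.exp(vec[1]))

def linear_attn(q, ks, vs, m, positions):
    """Linear attention with RoPE applied to q and k before the feature map."""
    fq = feat(rope(q, m))
    sims = [fq[0] * fk[0] + fq[1] * fk[1]
            for fk in (feat(rope(k, n)) for k, n in zip(ks, positions))]
    z = sum(sims)  # normalizer: sum of positive similarities
    return sum(s * v for s, v in zip(sims, vs)) / z

out = linear_attn(q=(0.3, -0.1),
                  ks=[(0.2, 0.1), (-0.4, 0.5), (0.0, 0.3)],
                  vs=[1.0, 2.0, 3.0],
                  m=2, positions=[0, 1, 2])
# Because all similarities are positive, the output is a convex combination
# of the values and stays within their range.
```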
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Generation Quality Metrics</strong>: Validity, Uniqueness, Novelty (<a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> suite), <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance (FCD)</a>, Scaffold similarity (Scaf), and Similarity to Nearest Neighbor (SNN).</li>
<li><strong>MoLFormer-Based Metrics</strong>: The authors introduce Fréchet <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a> Distance (FMD) and MoLFormer-space IntDiv2 to measure distributional similarity using their own pre-trained continuous embeddings instead of standard fingerprints.</li>
<li><strong>Optimization Metrics</strong>: Penalized logP (calculated as $\text{logP} - \text{SA} - \max(\text{largest ring size} - 6, 0)$), Drug-likeness (QED), and DRD2 activity scores.</li>
<li><strong>Scaling Metrics</strong>: Empirical fit for novelty decay: $y = ae^{-bx}$.</li>
</ul>
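<p>The penalized logP arithmetic, assuming logP, the synthetic-accessibility score, and the largest ring size have already been computed (e.g. with RDKit), and interpreting the ring term as a penalty on rings larger than six atoms:</p>

```python
def penalized_logp(logp, sa, largest_ring_size):
    """Penalized logP: octanol-water partition coefficient minus the
    synthetic-accessibility score minus a penalty for rings larger than
    6 atoms. Inputs are assumed precomputed; this sketch only performs
    the final arithmetic."""
    return logp - sa - max(largest_ring_size - 6, 0)

# A molecule with logP 2.5, SA 3.0, and no ring larger than 6 atoms:
score = penalized_logp(2.5, 3.0, 6)       # -> -0.5
# An 8-membered macrocycle pays a 2-point ring penalty on top of that:
score_ring = penalized_logp(2.5, 3.0, 8)  # -> -2.5
```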
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 16 x NVIDIA A100 (80 GB) GPUs across 2 nodes connected via EDR Infiniband.</li>
<li><strong>Training Time</strong>:
<ul>
<li>GP-MoLFormer (1.1B data): ~115 hours total (28.75 hours/epoch for 4 epochs).</li>
<li>GP-MoLFormer-UNIQ (650M data): ~80 hours total.</li>
</ul>
</li>
<li><strong>Hyperparameters</strong>: Used a batch size of 1,600 molecules per GPU with a fixed learning rate of $1.6 \times 10^{-4}$ (scaled by up to a factor of 8 as the number of GPUs increased).</li>
<li><strong>Optimization</strong>: Used distributed data-parallel training and adaptive bucketing by sequence length to handle scale.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/gp-molformer/">GP-MoLFormer (GitHub)</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official implementation; IBM will not maintain going forward</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/ibm-research/GP-MoLFormer-Uniq">GP-MoLFormer-Uniq (Hugging Face)</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained on 650M de-duplicated SMILES</td>
      </tr>
  </tbody>
</table>
<p>The full 1.1B base model weights are not publicly hosted. The training data (PubChem and ZINC) is publicly available, and instructions for downloading and pre-processing are in the repository&rsquo;s <code>data/README.md</code>.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ross, J., Belgodere, B., Hoffman, S. C., Chenthamarakshan, V., Navratil, J., Mroueh, Y., &amp; Das, P. (2025). GP-MoLFormer: A Foundation Model For Molecular Generation. <em>Digital Discovery</em>, 4(10), 2684&ndash;2696. <a href="https://doi.org/10.1039/D5DD00122F">https://doi.org/10.1039/D5DD00122F</a></p>
<p><strong>Publication</strong>: Digital Discovery, vol. 4, no. 10, pp. 2684&ndash;2696 (2025)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ross2025gpmolformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GP-MoLFormer: a foundation model for molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ross, Jerret and Belgodere, Brian and Hoffman, Samuel C and Chenthamarakshan, Vijil and Navratil, Jiri and Mroueh, Youssef and Das, Payel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2684--2696}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D5DD00122F}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemformer: A Pre-trained Transformer for Comp Chem</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</link><pubDate>Tue, 23 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/</guid><description>BART-based Transformer pre-trained on 100M molecules using self-supervision to accelerate convergence on chemical sequence tasks.</description><content:encoded><![CDATA[<h2 id="paper-contribution-and-methodological-classification">Paper Contribution and Methodological Classification</h2>
<p>This is a <strong>Methodological ($\Psi_{\text{Method}}$)</strong> paper. It proposes an architecture adaptation (Chemformer based on BART) and a specific pre-training strategy (&ldquo;Combined&rdquo; masking and augmentation). The paper validates this method by benchmarking against established models on multiple tasks, including direct synthesis, retrosynthesis, and molecular optimization. It also includes a secondary <strong>Resource ($\Psi_{\text{Resource}}$)</strong> contribution by making the pre-trained models and code available.</p>
<h2 id="motivation-computational-bottlenecks-in-cheminformatics">Motivation: Computational Bottlenecks in Cheminformatics</h2>
<p>Existing Transformer models for cheminformatics are often developed for single applications and are computationally expensive to train from scratch. For example, training a Molecular Transformer for reaction prediction can take days, limiting hyperparameter exploration. Self-supervised pre-training (like BERT or T5) has significantly advanced NLP by reducing fine-tuning time and improving performance. In chemistry, applications have traditionally focused on task-specific datasets or encoder-only architectures, which perform poorly on sequence generation tasks. The authors aim to use transfer learning on a large unlabelled dataset to create a model that converges quickly and performs well across diverse sequence-to-sequence and discriminative tasks.</p>
<h2 id="core-innovation-bart-architecture-and-combined-pre-training">Core Innovation: BART Architecture and Combined Pre-training</h2>
<p>The primary insight lies in the adaptation of the <strong>BART architecture</strong> for chemistry and the introduction of a <strong>&ldquo;Combined&rdquo; self-supervised pre-training task</strong>.</p>
<ul>
<li><strong>Architecture</strong>: Chemformer uses the BART encoder-decoder structure, allowing it to handle both discriminative (property prediction) and generative (reaction prediction) tasks efficiently. This provides an alternative to encoder-only (BERT) or decoder-only (GPT) models.</li>
<li><strong>Combined Pre-training</strong>: The authors introduce a task that applies both <strong>Span Masking</strong> (randomly replacing tokens with <code>&lt;mask&gt;</code>) and <strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> Augmentation</strong> (permuting atom order, see <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">Randomized SMILES</a>) simultaneously. Formally, given a canonical SMILES sequence $x$, a corrupted sequence $\tilde{x} = \text{Mask}(\text{Augment}(x))$ is generated. The model is trained using an autoregressive cross-entropy loss to reconstruct the canonical sequence from the corrupted input:
$$ \mathcal{L}_{\text{pre-train}} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{&lt;t}, \tilde{x}) $$</li>
<li><strong>Tunable Augmentation</strong>: A downstream augmentation strategy is proposed where the probability of augmenting the input/output SMILES ($p_{aug}$) is a tunable hyperparameter, performed on-the-fly.</li>
</ul>
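<p>A stdlib-only sketch of the Combined corruption described above: in the real pipeline the input is first atom-order augmented (e.g. a randomized SMILES via RDKit) and then span-masked, while the reconstruction target stays canonical. The regex below is a common SMILES tokenization pattern, not the paper's exact 523-token vocabulary, and the span lengths are illustrative:</p>

```python
import random
import re

# Common SMILES tokenization pattern (illustrative; not the paper's vocabulary).
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|@@?|[BCNOSPFIbcnosp]|%[0-9]{2}|[0-9]|[=#\-\+\(\)/\\\.~])"
)

def tokenize(smiles):
    return TOKEN_RE.findall(smiles)

def span_mask(tokens, mask_prob=0.15, rng=random):
    """Replace random short spans of tokens with a single <mask> (BART style)."""
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            out.append("<mask>")
            i += rng.randint(1, 3)   # illustrative span length
        else:
            out.append(tokens[i])
            i += 1
    return out

canonical = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin; the augmentation step is elided
corrupted = span_mask(tokenize(canonical), rng=random.Random(1))
print(corrupted)
```

<p>The model is then trained to reconstruct the canonical sequence autoregressively from the corrupted input under the cross-entropy loss above.</p>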
<h2 id="experimental-setup-and-pre-training-tasks">Experimental Setup and Pre-training Tasks</h2>
<p>The authors pre-trained Chemformer on <strong>100 million molecules</strong> from ZINC-15 and fine-tuned it on three distinct task types:</p>
<ol>
<li><strong>Seq2Seq Reaction Prediction</strong>:
<ul>
<li><em>Direct Synthesis</em>: USPTO-MIT dataset (Mixed and Separated).</li>
<li><em><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a></em>: USPTO-50K dataset (see also <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/tied-two-way-transformers-retrosynthesis/">Tied Two-Way Transformers</a>).</li>
</ul>
</li>
<li><strong>Molecular Optimization</strong>: Generating molecules with improved properties (<a href="https://en.wikipedia.org/wiki/Distribution_coefficient">LogD</a>, solubility, clearance) starting from ChEMBL matched molecular pairs.</li>
<li><strong>Discriminative Tasks</strong>:
<ul>
<li><em><a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a></em>: Predicting properties (ESOL, FreeSolv, Lipophilicity) from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>.</li>
<li><em>Bioactivity</em>: Predicting pXC50 values for 133 genes using ExCAPE data.</li>
</ul>
</li>
</ol>
<p>Ablation studies compared three pre-training strategies (Masking, Augmentation, Combined) against a randomly initialized baseline.</p>
<h2 id="results-trade-offs-and-conclusions">Results, Trade-offs, and Conclusions</h2>
<ul>
<li><strong>Performance</strong>: Chemformer achieved <strong>competitive top-1 accuracy</strong> on USPTO-MIT (91.3% Mixed) and USPTO-50K (53.6&ndash;54.3%), outperforming the Augmented Transformer and graph-based models (GLN, GraphRetro).</li>
<li><strong>Convergence Speed</strong>: Pre-training substantially accelerated convergence; models fine-tuned for just 20 epochs (~30 minutes) outperformed previous baselines that were trained for far longer.</li>
<li><strong>Pre-training Tasks</strong>: The &ldquo;Combined&rdquo; task generally performed best for reaction prediction and bioactivity, while &ldquo;Masking&rdquo; was superior for molecular optimization.</li>
<li><strong>Augmentation Trade-off</strong>: The augmentation strategy improved top-1 accuracy but significantly degraded top-5/10 accuracy because beam search outputs became populated with augmented versions of the same molecule. This presents a considerable limitation for practical applications like retrosynthesis mapping, where retrieving a diverse set of candidate reactions is often critical.</li>
<li><strong>Discriminative Evaluation Caveats</strong>: Chemformer underperformed specialized baselines (like D-MPNN or <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MolBERT</a>) on small discriminative datasets. The authors note that direct comparison is difficult: Chemformer was trained simultaneously on multiple subtasks (multi-task learning), while the literature baselines were trained and tuned on each subtask separately. Additionally, the Chemformer encoder uses fewer than 20M parameters compared to MolBERT&rsquo;s approximately 85M, and Chemformer&rsquo;s pre-training does not include molecular property objectives. For other transfer learning approaches to QSAR, see <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a>.</li>
<li><strong>Pre-training Data Scope</strong>: The 100M pre-training dataset from ZINC-15 was selected with constraints on molecular weight ($\le 500$ Da) and LogP ($\le 5$), focusing the learned representations on small, drug-like molecules.</li>
</ul>
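<p>The top-5/10 degradation can be made concrete: an augmentation-trained decoder's beam fills with alternative SMILES spellings of the same molecule, so recovering distinct candidates requires canonicalizing and deduplicating the outputs (via an RDKit round-trip through <code>Chem.MolToSmiles</code> in practice). A hypothetical toy canonicalizer stands in below:</p>

```python
# Deduplicate ranked beam outputs by canonical form, keeping the
# highest-ranked spelling of each distinct molecule.
def dedupe_beam(beam, canonicalize):
    seen, unique = set(), []
    for smiles in beam:                  # beam is ordered best-first
        key = canonicalize(smiles)
        if key not in seen:
            seen.add(key)
            unique.append(smiles)
    return unique

# Toy canonical map: three spellings of phenol plus ethanol. Real code would
# round-trip each string through RDKit instead.
toy_canon = {"c1ccccc1O": "Oc1ccccc1", "Oc1ccccc1": "Oc1ccccc1",
             "OC1=CC=CC=C1": "Oc1ccccc1", "CCO": "CCO"}
beam = ["c1ccccc1O", "Oc1ccccc1", "OC1=CC=CC=C1", "CCO"]
print(dedupe_beam(beam, toy_canon.get))  # → ['c1ccccc1O', 'CCO']
```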
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><em>Note: The primary GitHub repository for Chemformer was officially archived on February 11, 2026. Pre-trained weights and datasets used in the paper are still hosted externally on <a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Box</a>. Active development of Chemformer models has moved to the <a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels</a> repository.</em></p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/Chemformer">Chemformer (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Archived; original PyTorch implementation</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://github.com/MolecularAI/aizynthmodels">AiZynthModels (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">Apache-2.0</td>
          <td style="text-align: left">Active successor repository</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq">Pre-trained weights (Box)</a></td>
          <td style="text-align: left">Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Base and Large model checkpoints</td>
      </tr>
  </tbody>
</table>
<p>The following datasets were used for pre-training and benchmarking.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Pre-training</strong></td>
          <td style="text-align: left">ZINC-15</td>
          <td style="text-align: left">100M</td>
          <td style="text-align: left">Selected subset (reactive, annotated purchasability, MW $\le 500$, LogP $\le 5$). Split: 99% Train / 0.5% Val / 0.5% Test.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Direct Synthesis</strong></td>
          <td style="text-align: left">USPTO-MIT</td>
          <td style="text-align: left">~470k</td>
          <td style="text-align: left">Evaluated on &ldquo;Mixed&rdquo; and &ldquo;Separated&rdquo; variants.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Retrosynthesis</strong></td>
          <td style="text-align: left">USPTO-50K</td>
          <td style="text-align: left">~50k</td>
          <td style="text-align: left">Standard benchmark for retrosynthesis.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Optimization</strong></td>
          <td style="text-align: left">ChEMBL MMPs</td>
          <td style="text-align: left">~160k Train</td>
          <td style="text-align: left">Matched Molecular Pairs for LogD, solubility, and clearance optimization.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Properties</strong></td>
          <td style="text-align: left">MoleculeNet</td>
          <td style="text-align: left">Small</td>
          <td style="text-align: left">ESOL (1128), FreeSolv (642), Lipophilicity (4200).</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Bioactivity</strong></td>
          <td style="text-align: left">ExCAPE</td>
          <td style="text-align: left">~312k</td>
          <td style="text-align: left">133 gene targets; &gt;1200 compounds per gene.</td>
      </tr>
  </tbody>
</table>
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenization (523 tokens total) derived from ChEMBL 27 canonical SMILES.</li>
<li><strong>Augmentation</strong>: SMILES enumeration (permuting atom order) used for pre-training and on-the-fly during fine-tuning ($p_{aug}=0.5$ for Seq2Seq, $p_{aug}=1.0$ for discriminative).</li>
</ul>
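<p>The on-the-fly augmentation amounts to a per-example coin flip; <code>randomize</code> below is a stand-in for a real SMILES enumerator (e.g. RDKit's <code>MolToSmiles</code> with <code>doRandom=True</code>), and the toy transform shown is not a valid SMILES operation:</p>

```python
import random

# With probability p_aug, replace the example with a randomized SMILES;
# otherwise keep the canonical string (p_aug=0.5 for Seq2Seq fine-tuning,
# p_aug=1.0 for discriminative tasks).
def maybe_augment(smiles, randomize, p_aug, rng=random):
    return randomize(smiles) if rng.random() < p_aug else smiles

rotate = lambda s: s[1:] + s[0]   # toy stand-in, NOT a real SMILES randomizer
print(maybe_augment("CCO", rotate, p_aug=1.0))  # → 'COC'
print(maybe_augment("CCO", rotate, p_aug=0.0))  # → 'CCO'
```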
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pre-training Tasks</strong>:
<ol>
<li><em>Masking</em>: Span masking (BART style).</li>
<li><em>Augmentation</em>: Input is a randomized SMILES; target is canonical SMILES.</li>
<li><em>Combined</em>: Input is augmented <em>then</em> masked; target is canonical SMILES.</li>
</ol>
</li>
<li><strong>Optimization</strong>:
<ul>
<li>Optimizer: Adam ($\beta_1=0.9, \beta_2=0.999$).</li>
<li>Schedule: Linear warm-up (8000 steps) for pre-training; One-cycle schedule for fine-tuning.</li>
</ul>
</li>
<li><strong>Inference</strong>: <a href="https://en.wikipedia.org/wiki/Beam_search">Beam search</a> with width 10 for Seq2Seq tasks. The scripts <code>molbart/inference_score.py</code> and <code>molbart/retrosynthesis/round_trip_inference.py</code> handle standard and round-trip validation, respectively.</li>
</ul>
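<p>A minimal sketch of the pre-training warm-up schedule, assuming the learning rate holds at the base value once the 8000-step ramp completes (the note specifies the warm-up itself, not any subsequent decay; <code>base_lr</code> is a placeholder value):</p>

```python
# Linear warm-up: LR ramps from 0 to base_lr over warmup_steps, then
# (assumption) stays constant. Adam consumes these per-step rates.
def warmup_lr(step, base_lr=1e-4, warmup_steps=8000):
    return base_lr * min(1.0, step / warmup_steps)

print(warmup_lr(2000))   # → 2.5e-05 (a quarter of the way up the ramp)
print(warmup_lr(8000))   # → 0.0001 (base rate reached)
```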
<h3 id="models">Models</h3>
<p>Two model sizes were trained. Both use the Pre-Norm Transformer layout with GELU activation.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hyperparameter</th>
          <th style="text-align: left">Chemformer (Base)</th>
          <th style="text-align: left">Chemformer-Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Layers</strong></td>
          <td style="text-align: left">6</td>
          <td style="text-align: left">8</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Model Dimension</strong></td>
          <td style="text-align: left">512</td>
          <td style="text-align: left">1024</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Feed-forward Dim</strong></td>
          <td style="text-align: left">2048</td>
          <td style="text-align: left">4096</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Attention Heads</strong></td>
          <td style="text-align: left">8</td>
          <td style="text-align: left">16</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Parameters</strong></td>
          <td style="text-align: left">~45M</td>
          <td style="text-align: left">~230M</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Pre-training Task</strong></td>
          <td style="text-align: left">All 3 variants</td>
          <td style="text-align: left">Combined only</td>
      </tr>
  </tbody>
</table>
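<p>A back-of-the-envelope check of the parameter counts in the table, counting only the attention and feed-forward weight matrices (embeddings, biases, and layer norms are ignored; decoder layers carry an extra cross-attention block):</p>

```python
# Rough weight count for a Pre-Norm encoder-decoder Transformer.
def approx_params(layers, d_model, d_ff):
    attn = 4 * d_model * d_model        # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff            # two linear maps
    encoder = layers * (attn + ffn)
    decoder = layers * (2 * attn + ffn) # self- plus cross-attention
    return encoder + decoder

print(f"Base:  ~{approx_params(6, 512, 2048) / 1e6:.0f}M")   # ≈ 44M (table: ~45M)
print(f"Large: ~{approx_params(8, 1024, 4096) / 1e6:.0f}M")  # ≈ 235M (table: ~230M)
```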
<h3 id="evaluation">Evaluation</h3>
<p>Comparisons relied on Top-N accuracy for reaction tasks and validity metrics for optimization.</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Task</th>
          <th style="text-align: left">Key Result</th>
          <th style="text-align: left">Baseline</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Direct Synthesis (Sep)</td>
          <td style="text-align: left"><strong>92.8%</strong> (Large)</td>
          <td style="text-align: left">91.1% (Aug Transformer)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Top-1 Acc</strong></td>
          <td style="text-align: left">Retrosynthesis</td>
          <td style="text-align: left"><strong>54.3%</strong> (Large)</td>
          <td style="text-align: left">53.7% (GraphRetro) / 52.5% (GLN)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Desirable %</strong></td>
          <td style="text-align: left">Mol Optimization</td>
          <td style="text-align: left"><strong>75.0%</strong> (Base-Mask)</td>
          <td style="text-align: left">70.2% (Transformer-R)</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>RMSE</strong></td>
          <td style="text-align: left">Lipophilicity</td>
          <td style="text-align: left">0.598 (Combined)</td>
          <td style="text-align: left">0.555 (D-MPNN)</td>
      </tr>
  </tbody>
</table>
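<p>Top-N accuracy here counts a sample as correct when the ground truth appears among the model's N highest-ranked beam outputs; a minimal sketch with toy data:</p>

```python
# Top-N accuracy over ranked beam outputs.
def top_n_accuracy(ranked_candidates, targets, n):
    hits = sum(target in beam[:n]
               for beam, target in zip(ranked_candidates, targets))
    return hits / len(targets)

beams = [["CCO", "CCC"], ["OCC", "CCN"], ["CNC", "CCO"]]
truth = ["CCO", "CCN", "CCC"]
print(top_n_accuracy(beams, truth, n=1))  # only the first sample hits at rank 1
print(top_n_accuracy(beams, truth, n=2))  # the second sample's truth sits at rank 2
```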
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Compute</strong>: 4 NVIDIA V100 GPUs (batch size 128 per GPU).</li>
<li><strong>Training Time</strong>:
<ul>
<li>Pre-training: 2.5 days (Base) / 6 days (Large) for 1M steps.</li>
<li>Fine-tuning: ~20&ndash;40 epochs for reaction prediction (&lt;12 hours).</li>
</ul>
</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Irwin, R., Dimitriadis, S., He, J., &amp; Bjerrum, E. J. (2022). Chemformer: a pre-trained transformer for computational chemistry. <em>Machine Learning: Science and Technology</em>, 3(1), 015022. <a href="https://doi.org/10.1088/2632-2153/ac3ffb">https://doi.org/10.1088/2632-2153/ac3ffb</a></p>
<p><strong>Publication</strong>: Machine Learning: Science and Technology, vol. 3, no. 1, 015022 (2022)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{irwinChemformerPretrainedTransformer2022,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Chemformer: A Pre-Trained Transformer for Computational Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Chemformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Irwin, Ross and Dimitriadis, Spyridon and He, Jiazhen and Bjerrum, Esben Jannik}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2022</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{015022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{2632-2153}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1088/2632-2153/ac3ffb}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>