<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Drug-Design on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/tags/drug-design/</link><description>Recent content in Drug-Design on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/tags/drug-design/index.xml" rel="self" type="application/rss+xml"/><item><title>REINVENT: Reinforcement Learning for Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/</guid><description>REINVENT uses augmented episodic likelihood to fine-tune a SMILES-based RNN via reinforcement learning for goal-directed molecular generation.</description><content:encoded><![CDATA[<h2 id="augmented-episodic-likelihood-for-goal-directed-generation">Augmented Episodic Likelihood for Goal-Directed Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces REINVENT, a policy-based reinforcement learning framework for molecular de novo design. The primary contribution is a novel cost function, the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented episodic likelihood</a>, that fine-tunes a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based recurrent neural network (RNN) pre-trained on ChEMBL toward generating molecules satisfying user-defined property objectives. The method anchors the agent to the prior distribution of valid drug-like molecules, addressing failure modes of standard REINFORCE algorithms (reward exploitation and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">mode collapse</a> to trivially simple structures).</p>
<h2 id="de-novo-design-needs-flexible-data-driven-approaches">De Novo Design Needs Flexible, Data-Driven Approaches</h2>
<p>Traditional de novo design methods fall into three categories, each with limitations:</p>
<ol>
<li><strong>Structure-based approaches</strong> grow ligands to fit binding pockets but often produce molecules with poor DMPK profiles and synthetic intractability.</li>
<li><strong>Ligand-based virtual library</strong> approaches generate large libraries and score them, but are constrained by pre-defined reaction rules or transformation rules that limit chemical diversity.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Inverse QSAR</a></strong> methods attempt to map favorable activity regions back to molecular structures, but require descriptors suitable for both forward prediction and inverse mapping.</li>
</ol>
<p>RNN-based generative models trained on SMILES offer a data-driven alternative that can learn the underlying distribution of drug-like chemical space without rigid rules. Segler et al. (2017) showed that fine-tuning a pre-trained RNN on focused sets of actives yields high fractions of predicted actives. However, such maximum-likelihood fine-tuning cannot exploit negative examples or continuous scores, and it risks catastrophic forgetting of the pre-trained distribution.</p>
<p>Prior RL approaches had significant issues. Jaques et al. (2016) used Deep Q-learning with prior likelihood regularization for sequence generation, but reported dependence on hand-written rules to penalize undesirable sequences and still observed reward exploitation producing unrealistically simple molecules. Standard REINFORCE algorithms tend to converge on trivial solutions (e.g., generating only &ldquo;C&rdquo; to satisfy a scoring function).</p>
<h2 id="the-augmented-episodic-likelihood-framework">The Augmented Episodic Likelihood Framework</h2>
<p>The core innovation is a formulation where the agent learns a policy that minimizes the squared difference between its own log-likelihood and an augmented target likelihood.</p>
<p>The RNN is first pre-trained on 1.5 million canonical SMILES from ChEMBL via maximum likelihood estimation:</p>
<p>$$
J(\Theta) = -\sum_{t=1}^{T} \log P(x^{t} \mid x^{t-1}, \dots, x^{1})
$$</p>
<p>The pre-trained model (the Prior) is then used as the starting point for the Agent. For a generated SMILES sequence $A = a_1, a_2, \dots, a_T$, the model likelihood is $P(A) = \prod_{t=1}^{T} \pi(a_t \mid s_t)$, and a scoring function $S(A) \in [-1, 1]$ rates desirability.</p>
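<p>As a toy illustration (not the paper&rsquo;s code), the sequence likelihood is simply the product of per-step policy probabilities, accumulated in log space for numerical stability:</p>

```python
import math

def sequence_log_likelihood(step_probs):
    """log P(A) = sum_t log pi(a_t | s_t), given the probability the
    policy assigned to each sampled token. Working in log space avoids
    underflow on long SMILES sequences."""
    return sum(math.log(p) for p in step_probs)

# Two tokens sampled with probability 0.5 each: P(A) = 0.25.
print(sequence_log_likelihood([0.5, 0.5]))  # log(0.25) ~ -1.386
```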
<p>The augmented likelihood combines prior likelihood with the score:</p>
<p>$$
\log P(A)_{\mathbb{U}} = \log P(A)_{Prior} + \sigma S(A)
$$</p>
<p>where $\sigma$ is a scalar coefficient controlling the trade-off between prior fidelity and score optimization.</p>
<p>The return is defined as the negative squared difference between the augmented likelihood and the agent&rsquo;s likelihood:</p>
<p>$$
G(A) = -\left[\log P(A)_{\mathbb{U}} - \log P(A)_{\mathbb{A}}\right]^{2}
$$</p>
<p>The agent minimizes $J(\Theta) = -G$, effectively learning a policy whose sequence likelihoods match the prior modulated by the scoring function. The authors show in supplementary material that this is equivalent to a REINFORCE algorithm with a specific final-step reward formulation.</p>
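<p>Numerically the objective is just a squared gap between log-likelihoods. A minimal sketch with toy numbers ($\sigma$ and the likelihood values are arbitrary, not from the paper):</p>

```python
def augmented_loss(log_p_prior, log_p_agent, score, sigma=15.0):
    """Squared difference between the augmented target likelihood
    (prior log-likelihood shifted by sigma * S(A)) and the agent's
    own log-likelihood for the same sequence."""
    log_p_augmented = log_p_prior + sigma * score
    return (log_p_augmented - log_p_agent) ** 2

# An agent that matches the prior pays nothing on neutral sequences,
# but is pushed to raise its likelihood on high-scoring ones:
print(augmented_loss(-20.0, -20.0, score=0.0))  # 0.0
print(augmented_loss(-20.0, -20.0, score=1.0))  # 225.0
```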
<p>This design has three key advantages over standard REINFORCE:</p>
<ul>
<li>The target policy is explicitly stochastic, preserving diversity in generated molecules</li>
<li>The prior anchoring prevents catastrophic forgetting of SMILES syntax and chemical space coverage</li>
<li>No hand-written rules are needed to penalize degenerate solutions</li>
</ul>
<p>The Agent is trained on-policy with batches of 128 generated sequences, using SGD with learning rate 0.0005 and gradient clipping to $[-3, 3]$.</p>
<h2 id="three-experiments-sulphur-avoidance-celecoxib-analogues-and-drd2-activity">Three Experiments: Sulphur Avoidance, Celecoxib Analogues, and DRD2 Activity</h2>
<h3 id="prior-network-architecture">Prior Network Architecture</h3>
<p>The Prior is a 3-layer RNN with 1024 Gated Recurrent Units per layer, trained on RDKit canonical SMILES from ChEMBL (molecules with 10-50 heavy atoms, elements from $\{H, B, C, N, O, F, Si, P, S, Cl, Br, I\}$). Training used Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) for 50,000 steps with batch size 128, an initial learning rate of 0.001, and a decay of 0.02 every 100 steps. The Prior generates 94% valid SMILES, of which 90% are novel.</p>
<h3 id="experiment-1-learning-to-avoid-sulphur">Experiment 1: Learning to Avoid Sulphur</h3>
<p>A proof-of-principle task where the scoring function assigns $S(A) = 1$ for valid sulphur-free molecules, $S(A) = 0$ for invalid SMILES, and $S(A) = -1$ for sulphur-containing molecules.</p>
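<p>A toy version of this scoring function might look as follows (hedged sketch: the paper determines validity by parsing with RDKit, whereas here validity is passed in and sulphur is detected with a naive character scan that would also flag bracketed symbols like [Se]):</p>

```python
def sulphur_score(smiles, is_valid):
    """S(A) = 1 for a valid sulphur-free molecule, 0 for invalid
    SMILES, -1 when sulphur is present. The 'S'/'s' scan is only an
    approximation; a real scorer should enumerate parsed atoms."""
    if not is_valid:
        return 0.0
    has_sulphur = any(ch in smiles for ch in ("S", "s"))
    return -1.0 if has_sulphur else 1.0

print(sulphur_score("CCO", True))    # 1.0: ethanol, sulphur-free
print(sulphur_score("CCS", True))    # -1.0: ethanethiol
print(sulphur_score("C1CC", False))  # 0.0: unparsable SMILES
```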
<p>The Agent method was compared against three alternatives:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Fraction No S</th>
          <th>Avg MW</th>
          <th>Avg cLogP</th>
          <th>Avg RotBonds</th>
          <th>Avg AromRings</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior</td>
          <td>0.94</td>
          <td>0.66</td>
          <td>371</td>
          <td>3.36</td>
          <td>5.39</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Agent</td>
          <td>0.95</td>
          <td>0.98</td>
          <td>367</td>
          <td>3.37</td>
          <td>5.41</td>
          <td>2.26</td>
      </tr>
      <tr>
          <td>Action basis</td>
          <td>0.95</td>
          <td>0.92</td>
          <td>372</td>
          <td>3.39</td>
          <td>6.08</td>
          <td>2.09</td>
      </tr>
      <tr>
          <td>REINFORCE</td>
          <td>0.98</td>
          <td>0.98</td>
          <td>585</td>
          <td>11.3</td>
          <td>30.0</td>
          <td>0.57</td>
      </tr>
      <tr>
          <td>REINFORCE + Prior</td>
          <td>0.98</td>
          <td>0.92</td>
          <td>232</td>
          <td>3.05</td>
          <td>2.8</td>
          <td>2.11</td>
      </tr>
  </tbody>
</table>
<p>Standard REINFORCE exploited the reward by generating sequences of predominantly &ldquo;C&rdquo; (average MW 585, cLogP 11.3). REINFORCE + Prior avoided this but collapsed to small, simplistic structures (MW 232). The Agent achieved 98% sulphur-free structures while maintaining molecular properties nearly identical to the Prior, demonstrating that augmented episodic likelihood preserves the prior distribution.</p>
<h3 id="experiment-2-similarity-guided-generation-celecoxib-analogues">Experiment 2: Similarity-Guided Generation (Celecoxib Analogues)</h3>
<p>The scoring function uses <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a> on FCFP4 fingerprints:</p>
<p>$$
S(A) = -1 + 2 \times \frac{\min\{J_{i,j}, k\}}{k}
$$</p>
<p>where $k$ caps the rewarded similarity. With $k = 1$ and $\sigma = 15$, the Agent recovers <a href="https://en.wikipedia.org/wiki/Celecoxib">Celecoxib</a> itself within 200 training steps. Even when all structures with $J &gt; 0.5$ to Celecoxib (1,804 molecules) were removed from the Prior training set, the Agent still found Celecoxib after 400 steps, despite a 700-fold reduction in prior likelihood ($\log_e P$ from $-12.7$ to $-19.2$).</p>
<p>With moderate similarity targets ($k = 0.7$, $\sigma = 12$), the Agent generates diverse analogues including scaffold hops where functional groups are rearranged.</p>
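<p>On set-based fingerprints, the capped similarity score can be sketched as follows (illustrative only; the paper computes $J_{i,j}$ on RDKit FCFP4 fingerprints):</p>

```python
def capped_similarity_score(fp_query, fp_gen, k=0.7):
    """S(A) = -1 + 2 * min(J, k) / k: the reward saturates once the
    Jaccard similarity J reaches the cap k, so an exact match is not
    required to earn the maximum score."""
    union = len(fp_query | fp_gen)
    j = len(fp_query & fp_gen) / union if union else 0.0
    return -1.0 + 2.0 * min(j, k) / k

query = {1, 4, 9, 16}                                    # toy fingerprint bits
print(capped_similarity_score(query, query, k=1.0))      # 1.0: exact match
print(capped_similarity_score(query, {2, 3}, k=1.0))     # -1.0: disjoint
print(capped_similarity_score(query, {1, 4, 9}, k=0.7))  # 1.0: J = 0.75 >= k
```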
<h3 id="experiment-3-target-activity-drd2">Experiment 3: Target Activity (DRD2)</h3>
<p>The most drug-discovery-relevant task: generating molecules predicted active against the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2 (DRD2)</a>. An SVM classifier (Gaussian kernel, $C = 2^7$, $\gamma = 2^{-6}$) was trained on bioactivity data from ExCAPE-DB (7,218 actives with pIC50 &gt; 5, 100,000 sampled inactives). The actives were split by Butina clustering (ECFP6, cutoff 0.4) to decrease nearest-neighbor similarity between train and test sets.</p>
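<p>The kernel at these hyperparameters is a one-liner (pure-Python stand-in; the paper trains the classifier with Scikit-learn):</p>

```python
import math

def rbf_kernel(x, y, gamma=2 ** -6):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2), with
    gamma = 2^-6 as selected by the paper's grid search."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))  # 1.0: identical inputs
print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))  # exp(-2/64) ~ 0.969
```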
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Prior</th>
          <th>Agent</th>
          <th>Prior (reduced)</th>
          <th>Agent (reduced)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid SMILES</td>
          <td>0.94</td>
          <td>0.99</td>
          <td>0.94</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Fraction predicted actives</td>
          <td>0.03</td>
          <td>0.97</td>
          <td>0.02</td>
          <td>0.96</td>
      </tr>
      <tr>
          <td>Fraction similar to train active</td>
          <td>0.02</td>
          <td>0.79</td>
          <td>0.02</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>Fraction similar to test active</td>
          <td>0.01</td>
          <td>0.46</td>
          <td>0.01</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>Fraction of test actives recovered (&times;10<sup>-3</sup>)</td>
          <td>13.5</td>
          <td>126</td>
          <td>2.85</td>
          <td>72.6</td>
      </tr>
  </tbody>
</table>
<p>The Agent increased the fraction of predicted actives from 2-3% (Prior) to 96-97%, representing a 250-fold enrichment in the probability of generating a test-set active. The Agent based on the reduced Prior (DRD2 actives removed from ChEMBL) still recovered 7% of test actives, meaning it generated experimentally confirmed actives that appeared in neither the generative model&rsquo;s nor the activity model&rsquo;s training data.</p>
<h2 id="anchored-policy-learning-prevents-reward-exploitation">Anchored Policy Learning Prevents Reward Exploitation</h2>
<p>The key finding is that augmented episodic likelihood successfully balances score optimization with prior distribution preservation. The Agent achieves task objectives (sulphur avoidance, similarity targets, activity prediction) while maintaining the molecular property distributions learned from ChEMBL. This is a significant improvement over standard REINFORCE, which either exploits rewards trivially or collapses to simple structures.</p>
<p>Analysis of the conditional probability distributions between the Prior and Agent (for DRD2 active generation) shows that the policy changes are not drastic: most trends learned by the Prior carry over, with targeted modifications at specific steps that substantially alter sequence likelihoods and generated structure types.</p>
<p>Limitations acknowledged by the authors:</p>
<ul>
<li>All experiments use single-parameter scoring functions; multi-parametric optimization (activity + DMPK + synthetic accessibility) is left for future work</li>
<li>The quality of generated structures depends heavily on the Prior&rsquo;s coverage of chemical space</li>
<li>The activity model (SVM) has limited domain of applicability, and structures outside this domain may be falsely scored</li>
<li>No exhaustive study of how Prior training set size, model size, and regularization affect generation quality</li>
</ul>
<p>Future directions include multi-parametric scoring functions, exploration of token embeddings, and adversarial training where the scoring function is replaced by a discriminator network (GAN-style training).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>1.5M structures</td>
          <td>10-50 heavy atoms, filtered elements</td>
      </tr>
      <tr>
          <td>DRD2 activity model</td>
          <td>ExCAPE-DB</td>
          <td>7,218 actives + 100K inactives</td>
          <td>Butina clustering split (ECFP6, cutoff 0.4)</td>
      </tr>
      <tr>
          <td>Similarity target</td>
          <td>Celecoxib</td>
          <td>1 query structure</td>
          <td>FCFP4 fingerprints for Jaccard similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prior</strong>: 3-layer GRU RNN (1024 units/layer), Adam optimizer, 50K steps, batch size 128, LR 0.001 with 0.02 decay/100 steps</li>
<li><strong>Agent</strong>: Same architecture, SGD with LR 0.0005, gradient clipping [-3, 3], on-policy batches of 128</li>
<li><strong>DRD2 model</strong>: SVM with Gaussian kernel ($C = 2^7$, $\gamma = 2^{-6}$), grid search on validation set</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MarcusOlivecrona/REINVENT">REINVENT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original implementation in TensorFlow/Python 2.7</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.572576">Archived version</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Zenodo archive (DOI: 10.5281/zenodo.572576)</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>SMILES validity rate (RDKit parsing)</li>
<li>Fraction of structures satisfying scoring function</li>
<li>Molecular property distributions (MW, cLogP, rotatable bonds, aromatic rings)</li>
<li>Jaccard similarity on ECFP6/FCFP4 fingerprints</li>
<li>Recovery rate of known actives from test set</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. The implementation uses TensorFlow 1.0.1 with Python 2.7, RDKit, and Scikit-learn.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Olivecrona, M., Blaschke, T., Engkvist, O., &amp; Chen, H. (2017). Molecular de-novo design through deep reinforcement learning. <em>Journal of Cheminformatics</em>, 9(1), 48.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{olivecrona2017molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular de-novo design through deep reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Olivecrona, Marcus and Blaschke, Thomas and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{48}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-017-0235-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharMolixFM: Multi-Modal All-Atom Molecular Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/pharmolixfm-all-atom-foundation-models/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-modeling/pharmolixfm-all-atom-foundation-models/</guid><description>PharMolixFM unifies diffusion, flow matching, and Bayesian flow networks for all-atom molecular modeling and generation with task-specific denoising priors.</description><content:encoded><![CDATA[<h2 id="a-unified-framework-for-all-atom-molecular-foundation-models">A Unified Framework for All-Atom Molecular Foundation Models</h2>
<p>PharMolixFM is a <strong>Method</strong> paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.</p>
<h2 id="challenges-in-multi-modal-atomic-modeling">Challenges in Multi-Modal Atomic Modeling</h2>
<p>Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.</p>
<p>First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.</p>
<p>Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.</p>
<p>PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.</p>
<h2 id="multi-modal-denoising-with-task-specific-priors">Multi-Modal Denoising with Task-Specific Priors</h2>
<p>The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.</p>
<p>The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:</p>
<p>$$
p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)
$$</p>
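<p>A common way to satisfy this constraint is to build on quantities that are unchanged by rigid-body motion, such as pairwise distances. A small 2D sanity check (illustrative only, not PharMolixFM&rsquo;s actual architecture):</p>

```python
import math

def pairwise_distances(coords):
    """All inter-atom distances; invariant under rotation/translation."""
    return [
        math.dist(coords[i], coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    ]

def rigid_transform(coords, angle, tx, ty):
    """Rotate 2D points by `angle`, then translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in coords]

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
moved = rigid_transform(pts, angle=0.7, tx=5.0, ty=-3.0)
print(all(
    math.isclose(a, b)
    for a, b in zip(pairwise_distances(pts), pairwise_distances(moved))
))  # True: the distance set is invariant under the rigid motion
```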
<p>The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:</p>
<p>$$
q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})}) \, q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})}) \, q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})})
$$</p>
<p>A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This &ldquo;Fix&rdquo; mechanism enables multiple training tasks:</p>
<ul>
<li><strong>Docking</strong> ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.</li>
<li><strong>Structure-based drug design</strong> ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.</li>
<li><strong>Robustness augmentation</strong> ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.</li>
</ul>
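<p>The schedule itself is a linear ramp in the step index, gated per atom and modality. A sketch following the formula above (the helper name is mine, not the paper&rsquo;s):</p>

```python
def noise_scale(i, T, fix):
    """sigma_{i,j} = (i / T) * fix_j: noise grows linearly over the T
    steps, scaled per atom/modality by its Fix factor (per the text's
    schedule; a factor of 0 leaves that component noise-free)."""
    return (i / T) * fix

T = 100
print(noise_scale(50, T, fix=1.0))  # 0.5: halfway up the full ramp
print(noise_scale(50, T, fix=0.7))  # 0.35: partially gated component
print(noise_scale(50, T, fix=0.0))  # 0.0: component never noised
```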
<h3 id="three-generative-model-variants">Three Generative Model Variants</h3>
<p><strong>Multi-modal diffusion (PharMolixFM-Diff)</strong> uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:</p>
<p>$$
q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\sqrt{\alpha_{i,j}} \, \mathbf{X}_{0,j}, (1 - \alpha_{i,j}) \mathbf{I}), \quad \alpha_{i,j} = \prod_{k=1}^{i}(1 - \sigma_{k,j}^{(\mathbf{X})})
$$</p>
<p>$$
q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T
$$</p>
<p>The training loss combines coordinate MSE with cross-entropy for discrete variables:</p>
<p>$$
\mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} \| \tilde{\mathbf{X}}_0 - \mathbf{X}_0 \|_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right]
$$</p>
<p><strong>Multi-modal flow matching (PharMolixFM-Flow)</strong> constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.</p>
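<p>The Euler sampler referenced above is generic ODE integration; a minimal version (the true vector field is the trained network, here replaced by a toy function):</p>

```python
def euler_integrate(x0, vector_field, t0=0.0, t1=1.0, steps=1000):
    """Fixed-step Euler integration of dx/dt = v(x, t), the sampling
    scheme used for the flow-matching variant."""
    x, t = x0, t0
    dt = (t1 - t0) / steps
    for _ in range(steps):
        x = x + dt * vector_field(x, t)
        t += dt
    return x

# Sanity check on dx/dt = -x: x(1) should approach x(0) * e^-1 ~ 0.3679.
print(euler_integrate(1.0, lambda x, t: -x, steps=10000))
```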
<p><strong>Bayesian flow networks (PharMolixFM-BFN)</strong> perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:</p>
<p>$$
p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})}
$$</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.</p>
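<p>The kNN graphs both branches operate on are straightforward to construct; a brute-force sketch (real implementations use optimized neighbor search):</p>

```python
import math

def knn_graph(coords, k):
    """Connect each atom to its k nearest neighbors by Euclidean
    distance, returning directed edges (atom -> neighbor)."""
    edges = []
    for i, p in enumerate(coords):
        others = sorted(
            (math.dist(p, q), j) for j, q in enumerate(coords) if j != i
        )
        edges.extend((i, j) for _, j in others[:k])
    return edges

atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
print(knn_graph(atoms, k=2))
# The three clustered atoms link to one another; the distant atom
# still receives edges to its two nearest neighbors.
```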
<h2 id="docking-and-drug-design-experiments">Docking and Drug Design Experiments</h2>
<h3 id="protein-small-molecule-docking">Protein-Small-Molecule Docking</h3>
<p>PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) using the holo docking setting with a known protein structure and a 10 &#8491; binding pocket. The metric is the fraction of predictions with RMSD &lt; 2 &#8491;.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Self-Ranking (%)</th>
          <th>Oracle-Ranking (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffDock</td>
          <td>38.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>RFAA</td>
          <td>42.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Vina</td>
          <td>52.3</td>
          <td>-</td>
      </tr>
      <tr>
          <td>UniMol-Docking V2</td>
          <td>77.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>SurfDock</td>
          <td>78.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>AlphaFold3</td>
          <td>90.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>PocketXMol (50 repeats)</td>
          <td>82.2</td>
          <td>95.3</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (50 repeats)</td>
          <td>83.4</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow (50 repeats)</td>
          <td>73.4</td>
          <td>93.7</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN (50 repeats)</td>
          <td>78.5</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (500 repeats)</td>
          <td>83.9</td>
          <td>98.1</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol by 1.2 percentage points but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU versus approximately 249.0 seconds for AlphaFold3, a roughly 54x speedup. Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.</p>
<h3 id="structure-based-drug-design">Structure-Based Drug Design</h3>
<p>Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score (Avg/Med)</th>
          <th>QED</th>
          <th>SA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-5.14 / -4.70</td>
          <td>0.57</td>
          <td>0.76</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-5.47 / -6.30</td>
          <td>0.48</td>
          <td>0.58</td>
      </tr>
      <tr>
          <td>DecompDiff</td>
          <td>-5.67 / -6.04</td>
          <td>0.45</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>MolCRAFT</td>
          <td>-6.61 / -8.14</td>
          <td>0.46</td>
          <td>0.62</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff</td>
          <td>-6.18 / -6.44</td>
          <td>0.50</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow</td>
          <td>-6.34 / -6.47</td>
          <td>0.49</td>
          <td>0.74</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN</td>
          <td>-6.38 / -6.45</td>
          <td>0.48</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM strikes a better balance between binding affinity and drug-like properties than the diffusion-based baselines. While MolCRAFT achieves the best Vina scores, the PharMolixFM-Diff and -Flow variants show notably higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62) than TargetDiff, DecompDiff, and MolCRAFT, properties that matter for downstream validation and in-vivo application.</p>
<h3 id="inference-scaling-law">Inference Scaling Law</h3>
<p>The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:</p>
<p>$$
\text{Acc} = a \log(bR + c) + d
$$</p>
<p>where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.</p>
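<p>With illustrative coefficients (not values fitted in the paper), the curve makes the plateau behavior concrete: each additional block of repeats buys less accuracy than the last.</p>

```python
import math

def scaling_curve(R, a=0.1, b=1.0, c=1.0, d=0.5):
    """Acc = a * log(b * R + c) + d; the coefficients here are made up
    purely to illustrate the fitted functional form."""
    return a * math.log(b * R + c) + d

# Accuracy gained by 50 extra repeats, starting from ever-larger budgets:
gains = [scaling_curve(r + 50) - scaling_curve(r) for r in (10, 100, 400)]
print(gains)  # strictly positive but shrinking
```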
<h2 id="competitive-docking-with-faster-inference-but-limited-task-scope">Competitive Docking with Faster Inference, but Limited Task Scope</h2>
<p>PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:</p>
<ol>
<li><strong>Diffusion outperforms flow matching and BFN</strong> for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.</li>
<li><strong>Oracle-ranking reveals untapped potential</strong>: the gap between self-ranking (83.4%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.</li>
<li><strong>The three variants show similar performance for drug design</strong>, suggesting that model architecture and training data may matter more than the generative framework for generation tasks.</li>
<li><strong>Inference scaling laws hold</strong> for molecular generative models, paralleling findings in NLP.</li>
</ol>
<p>Limitations include that the framework is evaluated on only two tasks (docking and SBDD); the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, all of which fall within AlphaFold3&rsquo;s scope. The BFN variant underperforms the diffusion model, which the authors attribute to its smaller noise scales at early sampling steps making the training task less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PDBBind, Binding MOAD, CrossDocked2020, PepBDB</td>
          <td>Not specified</td>
          <td>Filtered by PocketXMol criteria</td>
      </tr>
      <tr>
          <td>Docking eval</td>
          <td>PoseBusters benchmark</td>
          <td>428 complexes</td>
          <td>Holo docking with known protein</td>
      </tr>
      <tr>
          <td>SBDD eval</td>
          <td>CrossDocked test set</td>
          <td>100 pockets</td>
          <td>100 molecules per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks</li>
<li>Task-specific noise via Fix mechanism (0, 0.7, or 1.0)</li>
<li>Training tasks selected with equal probability per sample</li>
<li>AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$</li>
<li>Linear warmup to learning rate 0.001 over 1000 steps</li>
<li>180K training steps with batch size 40</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)</li>
<li>kNN graph construction for protein and protein-molecule interactions</li>
<li>Independent prediction heads for coordinates, atom types, bond types</li>
<li>Confidence heads for self-ranking during inference</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharMolixFM-Diff</th>
          <th>AlphaFold3</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSD &lt; 2 Å (self-ranking)</td>
          <td>83.4% (50 rep)</td>
          <td>90.4%</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>RMSD &lt; 2 Å (oracle-ranking)</td>
          <td>98.1% (500 rep)</td>
          <td>-</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>Inference time (per complex)</td>
          <td>~4.6s</td>
          <td>~249.0s</td>
          <td>Single A800 GPU</td>
      </tr>
      <tr>
          <td>Vina score (avg)</td>
          <td>-6.18</td>
          <td>-</td>
          <td>CrossDocked SBDD</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 4x 80GB A800 GPUs</li>
<li>Inference benchmarked on single A800 GPU</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/PharMolix/OpenBioMed">OpenBioMed (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Wang, J., Fan, S., &amp; Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. <em>arXiv preprint arXiv:2503.21788</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2025pharmolixfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2503.21788}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ORGAN: Objective-Reinforced GANs for Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/</guid><description>ORGAN combines GANs with reinforcement learning to steer SMILES-based molecular generation toward drug-likeness, solubility, and synthesizability objectives.</description><content:encoded><![CDATA[<h2 id="combining-gans-and-reinforcement-learning-for-goal-directed-sequence-generation">Combining GANs and Reinforcement Learning for Goal-Directed Sequence Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces ORGAN (Objective-Reinforced Generative Adversarial Network), a framework for generating sequences that are both realistic (close to the training distribution) and optimized for domain-specific objectives. ORGAN extends SeqGAN by adding external reward functions to the reinforcement learning signal, with a tunable parameter $\lambda$ controlling the balance between adversarial (discriminator) and objective-based rewards. The authors demonstrate ORGAN on two domains: molecular generation using <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings (optimizing druglikeness, solubility, and synthesizability) and musical melody generation (optimizing tonality and step ratios).</p>
<h2 id="exposure-bias-and-mode-collapse-in-discrete-sequence-generation">Exposure Bias and Mode Collapse in Discrete Sequence Generation</h2>
<p>Generating discrete sequences with desirable properties presents two intertwined challenges. First, RNNs trained via maximum likelihood estimation (MLE) suffer from exposure bias, where the model sees only ground-truth prefixes during training but must condition on its own (potentially erroneous) outputs at generation time. Second, while <a href="/posts/what-is-a-gan/">GANs</a> can address some of these issues through adversarial training, they were not initially applicable to discrete data due to non-differentiability of the sampling step. SeqGAN resolved this by framing the generator as an RL agent, but it optimizes only for distributional fidelity (fooling the discriminator) without any mechanism to steer generation toward specific property targets.</p>
<p>In drug discovery, simply generating valid, drug-like molecules is insufficient. Practitioners need to optimize for particular pharmaceutical properties (e.g., solubility, synthesizability, druglikeness) while maintaining structural diversity. Naive RL approaches can optimize properties effectively but tend to collapse onto trivial solutions (e.g., repeating &ldquo;CCCCCCC&rdquo; to maximize solubility). The challenge is to combine the distributional regularization of adversarial training with the goal-directedness of RL.</p>
<h2 id="mixed-reward-interpolating-between-adversarial-and-objective-signals">Mixed Reward: Interpolating Between Adversarial and Objective Signals</h2>
<p>ORGAN&rsquo;s core innovation is a reward function that linearly interpolates between the discriminator score and domain-specific objectives:</p>
<p>$$R(Y_{1:T}) = \lambda \cdot D_{\phi}(Y_{1:T}) + (1 - \lambda) \cdot O_{i}(Y_{1:T})$$</p>
<p>When $\lambda = 1$, the model reduces to SeqGAN (pure adversarial training). When $\lambda = 0$, it becomes naive RL optimizing only the objective. Intermediate values allow the adversarial component to regularize the generator, keeping samples within the distribution while the objective component steers toward desired properties.</p>
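<p>The mixed reward is a one-liner. A minimal sketch (function and variable names are ours, not the paper's) showing the two limiting cases:</p>

```python
def organ_reward(disc_score, obj_score, lam):
    """R(Y) = lam * D(Y) + (1 - lam) * O(Y)."""
    return lam * disc_score + (1.0 - lam) * obj_score

r_seqgan = organ_reward(0.8, 0.4, lam=1.0)  # 0.8: pure adversarial (SeqGAN limit)
r_naive = organ_reward(0.8, 0.4, lam=0.0)   # 0.4: pure objective (naive RL limit)
r_mixed = organ_reward(0.8, 0.4, lam=0.5)   # ~0.6: the paper's default balance
```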
<p>The generator $G_{\theta}$ is an LSTM-based RNN that produces sequences token-by-token. Training follows the REINFORCE algorithm, where the expected long-term reward is:</p>
<p>$$J(\theta) = \mathbb{E}\left[R(Y_{1:T}) \mid s_{0}, \theta\right] = \sum_{y_{1} \in Y} G_{\theta}(y_{1} \mid s_{0}) \cdot Q(s_{0}, y_{1})$$</p>
<p>For intermediate timesteps (partial sequences), the action-value function $Q$ is estimated via $N$-time Monte Carlo rollouts:</p>
<p>$$Q(Y_{1:t-1}, y_{t}) = \begin{cases} \frac{1}{N} \sum_{n=1}^{N} R(Y_{1:T}^{n}), &amp; \text{if } t &lt; T \\ R(Y_{1:T}), &amp; \text{if } t = T \end{cases}$$</p>
<p>where $Y_{1:T}^{n}$ are completions sampled by rolling out the current policy $G_{\theta}$ from state $Y_{1:t}$.</p>
<p>The policy gradient is:</p>
<p>$$\nabla_{\theta} J(\theta) \simeq \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{y_{t} \sim G_{\theta}(y_{t} \mid Y_{1:t-1})} \left[\nabla_{\theta} \log G_{\theta}(y_{t} \mid Y_{1:t-1}) \cdot Q(Y_{1:t-1}, y_{t})\right]$$</p>
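<p>The rollout-based $Q$ estimate that feeds this gradient can be sketched with a toy policy. Everything here (the binary alphabet, the fraction-of-ones reward) is an invented stand-in for the SMILES generator and its discriminator/objective scores:</p>

```python
import random

def q_estimate(prefix, rollout, reward, N=16, T=10):
    """N-time Monte Carlo estimate of Q(Y_{1:t-1}, y_t).

    `rollout(prefix, T)` completes `prefix` to length T by sampling from the
    current policy; `reward` scores a full sequence. A finished sequence
    (len == T) is scored directly, matching the t = T case above.
    """
    if len(prefix) == T:
        return reward(prefix)
    return sum(reward(rollout(prefix, T)) for _ in range(N)) / N

random.seed(0)
# Toy stand-ins: binary alphabet, reward = fraction of 1-tokens.
rollout = lambda p, T: p + [random.randint(0, 1) for _ in range(T - len(p))]
reward = lambda seq: sum(seq) / len(seq)

q_full = q_estimate([1] * 10, rollout, reward, T=10)      # finished: reward itself
q_prefix = q_estimate([1, 0], rollout, reward, N=8, T=10)  # averaged over rollouts
```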
<p>Two additional mechanisms improve training:</p>
<ol>
<li><strong>Diversity penalty</strong>: Repeated sequences have their reward divided by their copy count, providing diminishing returns for non-unique outputs.</li>
<li><strong>Wasserstein distance</strong>: The authors also implement a variant (OR(W)GAN) that replaces the standard GAN discriminator loss with the Wasserstein-1 distance via Kantorovich-Rubinstein duality, which can improve training stability and diversity.</li>
</ol>
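<p>The diversity penalty in particular is simple to state in code. A minimal sketch (hypothetical names; the constant reward is only to make the copy-count division visible):</p>

```python
from collections import Counter

def penalized_rewards(batch, reward):
    """Divide each sample's reward by its copy count in the batch, so
    duplicated sequences see diminishing returns."""
    counts = Counter(batch)
    return [reward(s) / counts[s] for s in batch]

batch = ["CCO", "CCO", "c1ccccc1"]
rs = penalized_rewards(batch, lambda s: 1.0)  # duplicates split their credit
```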
<h2 id="molecular-and-musical-melody-generation-experiments">Molecular and Musical Melody Generation Experiments</h2>
<h3 id="architecture">Architecture</h3>
<p>The generator $G_{\theta}$ is an RNN with LSTM cells. The discriminator $D_{\phi}$ is a CNN for text classification following Kim (2014), with 75% dropout and L2 regularization. All optimization uses Adam. Molecular metrics are computed with RDKit.</p>
<h3 id="molecular-generation-setup">Molecular Generation Setup</h3>
<p>Training data consists of 5,000 random molecules from the QM9 dataset (134k stable small molecules with up to 9 heavy atoms), encoded as SMILES strings with maximum sequence length 51 and alphabet size 43. Each generator is pre-trained for 250 MLE epochs, with the discriminator trained for 10 epochs. Adversarial/RL training then proceeds for up to 100 additional epochs. The default $\lambda$ is 0.5.</p>
<p>Three molecular objectives are evaluated:</p>
<ul>
<li><strong>Solubility (LogP)</strong>: water-octanol partition coefficient via RDKit&rsquo;s Crippen function</li>
<li><strong>Synthesizability</strong>: SA score estimating ease of synthesis (0 = hard, 1 = easy)</li>
<li><strong>Druglikeness</strong>: QED score capturing medicinal chemistry aesthetics</li>
</ul>
<p>Diversity is measured using average Jaccard distance of molecular fingerprints relative to a random training subset.</p>
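<p>In the paper this diversity is computed over RDKit molecular fingerprints; the same quantity can be sketched in pure Python by treating each fingerprint as its set of on-bits (the sets below are toy stand-ins, not real fingerprints):</p>

```python
def jaccard_distance(fp_a, fp_b):
    """1 - |A ∩ B| / |A ∪ B| over fingerprint on-bit sets."""
    union = len(fp_a | fp_b)
    return 1.0 - len(fp_a & fp_b) / union if union else 0.0

def avg_diversity(gen_fps, ref_fps):
    """Mean Jaccard distance of each generated fingerprint to a reference set."""
    pairs = [(g, r) for g in gen_fps for r in ref_fps]
    return sum(jaccard_distance(g, r) for g, r in pairs) / len(pairs)

a, b = {1, 2, 3}, {2, 3, 4}
d = jaccard_distance(a, b)        # 1 - 2/4 = 0.5
div = avg_diversity([a, b], [b])  # mean distance to the "training" subset
```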
<h3 id="molecular-generation-results">Molecular Generation Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Validity (%)</th>
          <th>Diversity</th>
          <th>Druglikeness</th>
          <th>Synthesizability</th>
          <th>Solubility</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>75.9</td>
          <td>0.64</td>
          <td>0.48 (0%)</td>
          <td>0.23 (0%)</td>
          <td>0.30 (0%)</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>80.3</td>
          <td>0.61</td>
          <td>0.49 (+2%)</td>
          <td>0.25 (+6%)</td>
          <td>0.31 (+3%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>ORGAN</td>
          <td>88.2</td>
          <td>0.55</td>
          <td>0.52 (+8%)</td>
          <td>0.32 (+38%)</td>
          <td>0.35 (+18%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>OR(W)GAN</td>
          <td>85.0</td>
          <td>0.95</td>
          <td>0.60 (+25%)</td>
          <td>0.54 (+130%)</td>
          <td>0.47 (+57%)</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Naive RL</td>
          <td>97.1</td>
          <td>0.80</td>
          <td>0.57 (+19%)</td>
          <td>0.53 (+126%)</td>
          <td>0.50 (+67%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>ORGAN</td>
          <td>96.5</td>
          <td>0.92</td>
          <td>0.51 (+6%)</td>
          <td>0.83 (+255%)</td>
          <td>0.45 (+52%)</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>OR(W)GAN</td>
          <td>97.6</td>
          <td>1.00</td>
          <td>0.20 (-59%)</td>
          <td>0.75 (+223%)</td>
          <td>0.84 (+184%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>ORGAN</td>
          <td>94.7</td>
          <td>0.76</td>
          <td>0.50 (+4%)</td>
          <td>0.63 (+171%)</td>
          <td>0.55 (+85%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>OR(W)GAN</td>
          <td>94.1</td>
          <td>0.90</td>
          <td>0.42 (-12%)</td>
          <td>0.66 (+185%)</td>
          <td>0.54 (+81%)</td>
      </tr>
      <tr>
          <td>Solubility</td>
          <td>Naive RL</td>
          <td>92.7</td>
          <td>0.75</td>
          <td>0.49 (+3%)</td>
          <td>0.70 (+200%)</td>
          <td>0.78 (+162%)</td>
      </tr>
      <tr>
          <td>All (alternated)</td>
          <td>ORGAN</td>
          <td>96.1</td>
          <td>0.923</td>
          <td>0.52 (+9%)</td>
          <td>0.71 (+206%)</td>
          <td>0.53 (+79%)</td>
      </tr>
  </tbody>
</table>
<p>Key observations: OR(W)GAN consistently achieves higher diversity than standard ORGAN. Naive RL often reaches higher raw objective scores, but at the cost of collapsing onto trivial solutions (e.g., simple atom chains for solubility). Multi-objective training via alternating objectives across epochs achieves gains comparable to individually optimized models.</p>
<h3 id="music-generation-setup">Music Generation Setup</h3>
<p>Using 1,000 melodies from the EsAC folk dataset, each encoded as 36-token sequences where tokens represent sixteenth-note events across three octaves (C3-B5). Two metrics are optimized: tonality (proportion of perfect fifths) and ratio of steps (conjunct melodic motion). Diversity is measured as average pairwise edit distance.</p>
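<p>The summary does not spell out the metric formulas, so the following is an assumed reading (our interpretation, not the paper's code): tonality as the fraction of consecutive intervals spanning a perfect fifth (7 semitones), and ratio of steps as the fraction that are conjunct (1&ndash;2 semitones), over MIDI-style pitch numbers:</p>

```python
def interval_metrics(notes):
    """Toy versions of the two melody metrics (assumed definitions):
    tonality   = fraction of consecutive intervals equal to a perfect fifth,
    step_ratio = fraction that are conjunct steps (1 or 2 semitones)."""
    intervals = [abs(b - a) for a, b in zip(notes, notes[1:])]
    if not intervals:
        return 0.0, 0.0
    tonality = sum(i == 7 for i in intervals) / len(intervals)
    step_ratio = sum(i in (1, 2) for i in intervals) / len(intervals)
    return tonality, step_ratio

t, s = interval_metrics([60, 67, 69, 70])  # a fifth up, then two steps up
```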
<h3 id="music-results">Music Results</h3>
<table>
  <thead>
      <tr>
          <th>Objective</th>
          <th>Algorithm</th>
          <th>Diversity</th>
          <th>Tonality</th>
          <th>Ratio of Steps</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>None</td>
          <td>MLE</td>
          <td>0.221</td>
          <td>0.007</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>None</td>
          <td>SeqGAN</td>
          <td>0.187</td>
          <td>0.005</td>
          <td>0.010</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Naive RL</td>
          <td>0.100</td>
          <td>0.478</td>
          <td>2.9E-05</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>ORGAN</td>
          <td>0.268</td>
          <td>0.372</td>
          <td>1.78E-04</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>OR(W)GAN</td>
          <td>0.268</td>
          <td>0.177</td>
          <td>2.4E-04</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Naive RL</td>
          <td>0.321</td>
          <td>0.001</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>ORGAN</td>
          <td>0.433</td>
          <td>0.001</td>
          <td>0.632</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>OR(W)GAN</td>
          <td>0.134</td>
          <td>5.95E-05</td>
          <td>0.622</td>
      </tr>
  </tbody>
</table>
<p>ORGAN outperforms SeqGAN and MLE on all metrics. Naive RL achieves higher raw scores but with lower diversity, producing simpler, less interesting outputs.</p>
<h2 id="capacity-ceilings-trade-offs-and-future-directions">Capacity Ceilings, Trade-offs, and Future Directions</h2>
<p>The authors identify several limitations and findings:</p>
<p><strong>Capacity ceiling</strong>: GAN-based models tend to generate sequences matching the training set&rsquo;s average length (15.42 characters). RL-only approaches can break this constraint, generating shorter (9.4) or longer (21.3) sequences depending on the objective. The upper bound of optimized properties also matches the training data&rsquo;s maximum, suggesting dataset-dependent limits.</p>
<p><strong>Lambda trade-off</strong>: Varying $\lambda$ reveals an optimal balance between objective optimization and distributional fidelity. This optimum depends on the model, dataset, and metric, suggesting that hyperparameter search over $\lambda$ is important in practice.</p>
<p><strong>Tonality vs. steps inverse relationship</strong>: In the music task, optimizing for tonality (perfect fifths) inherently conflicts with optimizing for step ratios (consecutive notes), since consecutive scale notes do not form perfect fifths.</p>
<p><strong>Limitations</strong>: The paper evaluates on relatively small datasets (5k molecules, 1k melodies) and short sequences. The molecular experiments use QM9 (small molecules with up to 9 heavy atoms), which limits the scope of conclusions for drug-like chemical space. The Wasserstein variant sometimes lags behind the standard GAN loss in raw metric scores, though it offers better diversity.</p>
<p><strong>Future directions</strong>: The authors propose extending ORGAN to non-sequential data (images, audio) by framing GANs as RL problems more broadly, and investigating how different heuristic choices affect performance. They also suggest exploring other discrete GAN formulations (MaliGAN, BGAN) with RL extensions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecular training</td>
          <td>QM9 subset</td>
          <td>5,000 molecules</td>
          <td>Random subset from 134k stable small molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Music training</td>
          <td>EsAC folk dataset</td>
          <td>1,000 melodies</td>
          <td>36-token sequences, processed following Chen et al. (2017)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Generator pre-trained for 250 epochs via MLE; discriminator for 10 epochs</li>
<li>Adversarial/RL training for up to 100 epochs</li>
<li>Default $\lambda = 0.5$ for reward mixing</li>
<li>Monte Carlo rollouts for intermediate reward estimation</li>
<li>Duplicate penalty: reward divided by copy count</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Generator</strong>: RNN with LSTM cells</li>
<li><strong>Discriminator</strong>: CNN for text classification (Kim, 2014) with 75% dropout, L2 regularization</li>
<li><strong>Optimizer</strong>: Adam for all gradient descent steps</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Domain</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (%)</td>
          <td>Fraction of generated SMILES that decode to valid molecules</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Average Jaccard distance of fingerprints to training subset</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Druglikeness (QED)</td>
          <td>Quantitative Estimate of Drug-likeness</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Synthesizability (SA)</td>
          <td>Synthetic accessibility score</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Solubility (LogP)</td>
          <td>Water-octanol partition coefficient</td>
          <td>Molecules</td>
      </tr>
      <tr>
          <td>Tonality</td>
          <td>Proportion of perfect fifths</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Ratio of Steps</td>
          <td>Proportion of conjunct melodic intervals</td>
          <td>Music</td>
      </tr>
      <tr>
          <td>Diversity (edit)</td>
          <td>Average pairwise edit distance</td>
          <td>Music</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gablg1/ORGAN">ORGAN</a></td>
          <td>Code</td>
          <td>GPL-2.0</td>
          <td>Official implementation including metrics for molecules and music</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guimaraes, G. L., Sánchez-Lengeling, B., Outeiral, C., Farias, P. L. C., &amp; Aspuru-Guzik, A. (2017). Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models. <em>arXiv preprint arXiv:1705.10843</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guimaraes2017organ,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Objective-Reinforced Generative Adversarial Networks (ORGAN) for Sequence Generation Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guimaraes, Gabriel Lima and Sanchez-Lengeling, Benjamin and Outeiral, Carlos and Farias, Pedro Luis Cunha and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1705.10843}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolecularRNN: Graph-Based Molecular Generation and RL</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/</guid><description>MolecularRNN extends GraphRNN with atom and bond type predictions, valency-based rejection sampling, and policy gradient optimization for molecular generation.</description><content:encoded><![CDATA[<h2 id="a-graph-recurrent-model-for-molecular-generation-with-property-optimization">A Graph Recurrent Model for Molecular Generation with Property Optimization</h2>
<p>This is a <strong>Method</strong> paper that introduces MolecularRNN, a graph-based recurrent generative model for molecular structures. The model extends GraphRNN to handle typed nodes (atom types) and typed edges (bond types), enabling direct generation of molecular graphs rather than working through string representations like SMILES. Three key contributions are combined: (1) the MolecularRNN architecture for autoregressive graph generation, (2) valency-based rejection sampling for guaranteed 100% validity at inference, and (3) policy gradient reinforcement learning for shifting molecular property distributions toward desired ranges.</p>
<h2 id="why-generate-molecules-as-graphs-rather-than-strings">Why Generate Molecules as Graphs Rather Than Strings</h2>
<p>Computational de novo molecular design aims to create novel molecules with desired properties, a task central to drug discovery. At the time of this work, most deep generative models for molecules operated on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, inheriting the complications of SMILES grammar and the problem that structurally similar molecules can have very different string representations. Graph-based representations are more natural for molecules, with atoms mapping to nodes and bonds to edges, and they allow direct enforcement of chemical constraints during generation.</p>
<p>Existing graph-based methods had their own limitations. Junction tree VAE (JT-VAE) generates molecules from structural fragments, which introduces ambiguity when converting junction trees back to molecules, particularly problematic during property optimization since molecules sharing a junction tree can have very different property values. The GCPN model uses graph convolutional networks with reinforcement learning but was evaluated only on top-3 generated molecules, making it difficult to assess overall distribution quality. Prior atom-level graph generation models like Li et al. (2018a) were restricted to molecules with at most 20 heavy atoms, limiting practical applicability.</p>
<h2 id="core-innovation-extending-graphrnn-with-chemical-constraints-and-rl">Core Innovation: Extending GraphRNN with Chemical Constraints and RL</h2>
<p>MolecularRNN builds on the GraphRNN architecture by introducing atom type predictions alongside edge type predictions. The model generates molecular graphs sequentially: at each step, a NodeRNN predicts the type of the next atom, then an EdgeRNN predicts bond types to all preceding atoms within a BFS-ordered window.</p>
<h3 id="autoregressive-graph-generation">Autoregressive Graph Generation</h3>
<p>The joint likelihood over atom types $C^{\pi}$ and adjacency vectors $S^{\pi}$ under BFS ordering $\pi$ is factorized as:</p>
<p>$$
p\left(S^{\pi}, C^{\pi}\right) = \prod_{i=1}^{n+1} p\left(C_{i}^{\pi} \mid S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right) p\left(S_{i}^{\pi} \mid C_{i}^{\pi}, S_{&lt;i}^{\pi}, C_{&lt;i}^{\pi}\right)
$$</p>
<p>NodeRNN processes embeddings of previous atom types and adjacency vectors to produce a hidden state, from which a two-layer MLP with softmax predicts the next atom type $\psi_{i}$:</p>
<p>$$
h_{i}^{\text{node}} = \text{NodeRNN}\left(h_{i-1}^{\text{node}}, \left[\text{emb}(S_{i-1}^{\pi}), \text{emb}(C_{i-1}^{\pi})\right]\right)
$$</p>
<p>$$
\psi_{i} = \text{NodeMLP}\left(h_{i}^{\text{node}}\right)
$$</p>
<p>EdgeRNN then unrolls across preceding atoms to predict bond types $\phi_{i,j}$, initialized with the NodeRNN hidden state:</p>
<p>$$
h_{i,j}^{\text{edge}} = \text{EdgeRNN}\left(h_{i,j-1}^{\text{edge}}, \text{emb}(S_{i,j-1}^{\pi})\right), \quad h_{i,0}^{\text{edge}} = h_{i}^{\text{node}}
$$</p>
<p>$$
\phi_{i,j} = \text{EdgeMLP}\left(h_{i,j}^{\text{edge}}\right)
$$</p>
<p>Bond types are categorical over {no bond, single, double, triple}, and molecules are represented in kekulized form. BFS ordering limits the EdgeRNN window to $M = 12$ preceding atoms.</p>
<h3 id="valency-based-rejection-sampling">Valency-Based Rejection Sampling</h3>
<p>During inference, each proposed bond of order $k$ between atoms $i$ and $j$ is accepted only if both atoms remain within their allowed valencies:</p>
<p>$$
\sum_{j} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{i}^{\pi}} \quad \text{and} \quad \sum_{i} A_{i,j}^{\pi} + k \leq \text{valency}_{C_{j}^{\pi}}
$$</p>
<p>Atoms that do not fill their valencies are complemented with hydrogens. This constraint can be enforced directly on graphs (unlike SMILES, where intermediate substrings are not chemically meaningful), yielding 100% valid molecules.</p>
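<p>The acceptance test can be sketched directly on an adjacency matrix whose entries are bond orders. The valence table below is a small illustrative subset, not the model's full nine-atom alphabet:</p>

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}  # illustrative subset of the atom types

def bond_allowed(adj, atom_types, i, j, order):
    """Accept a proposed bond of `order` between atoms i and j only if
    neither atom would exceed its allowed valency."""
    return (sum(adj[i]) + order <= MAX_VALENCE[atom_types[i]] and
            sum(adj[j]) + order <= MAX_VALENCE[atom_types[j]])

# Atom 0 is a carbon already holding two single bonds (entries are bond orders):
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
atoms = ["C", "O", "C"]
single_ok = bond_allowed(adj, atoms, 0, 1, 1)  # True: C at 3/4, O at 2/2
triple_ok = bond_allowed(adj, atoms, 0, 1, 3)  # False: both atoms would overflow
```

<p>Rejected bonds are simply resampled, which is what guarantees the 100% validity figures reported below.</p>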
<h3 id="property-optimization-via-policy-gradient">Property Optimization via Policy Gradient</h3>
<p>For property optimization, MolecularRNN is formulated as a policy network in a Markov Decision Process. The loss function uses REINFORCE with a discounted final reward:</p>
<p>$$
L(\theta) = -\sum_{i=1}^{N} r(s_{N}) \cdot \gamma^{i} \cdot \log p(s_{i} \mid s_{i-1}; \theta)
$$</p>
<p>where $r(s_{N})$ is the reward from a property critic and $\gamma$ is a discount factor. The authors also introduce a structural penalty during RL training that assigns a penalty of $-10$ to atoms violating valency constraints, providing a learning signal from invalid intermediate molecules.</p>
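<p>A minimal sketch of this loss for one generated sequence, assuming per-step log-probabilities have already been collected (in a real implementation these would be differentiable tensors, not floats):</p>

```python
import math

def reinforce_loss(step_logps, final_reward, gamma=0.97):
    """L(theta) = -sum_i r(s_N) * gamma**i * log p(s_i | s_{i-1}; theta).

    `final_reward` is the critic's score of the finished molecule; during RL
    training a structural penalty (e.g. -10) is substituted when the finished
    molecule violates valency constraints.
    """
    return -sum(final_reward * gamma ** i * lp
                for i, lp in enumerate(step_logps, start=1))

good = reinforce_loss([math.log(0.5)] * 3, final_reward=1.0)   # positive loss
bad = reinforce_loss([math.log(0.5)] * 3, final_reward=-10.0)  # sign flips
```

<p>Minimizing the loss raises the log-probability of steps that led to a high reward and lowers it for penalized trajectories.</p>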
<h2 id="experimental-setup-pretraining-and-property-optimization">Experimental Setup: Pretraining and Property Optimization</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MolecularRNN is pretrained on three datasets: ChEMBL (~1.5M bioactive molecules), <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC 250k</a> (250K randomly selected commercially available compounds), and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> (~1.9M drug-like molecules from ZINC). The model considers 9 atom types (C, N, O, F, P, S, Cl, Br, I), 3 bond types (single, double, triple), and molecules with 10-50 heavy atoms. Architecture: NodeRNN with 4 GRU layers (hidden size 256), EdgeRNN with 4 GRU layers (hidden size 128), node embedding size 128, edge embedding size 16. Training uses Adam with learning rate 0.001 and multiplicative decay on 4 GPUs with batch size 512 per GPU for 250 epochs.</p>
<h3 id="generation-quality-at-scale">Generation Quality at Scale</h3>
<p>The pretrained model generates 1 million molecules per dataset (far larger than prior work: JT-VAE used 5K samples, Li et al. used 100K). Results with valency-based rejection sampling:</p>
<table>
  <thead>
      <tr>
          <th>Training Set</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>IntDiv (p=1)</th>
          <th>IntDiv (p=2)</th>
          <th>SA Score</th>
          <th>QED</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>100%</td>
          <td>99.2%</td>
          <td>99.3%</td>
          <td>0.895</td>
          <td>0.890</td>
          <td>3.67 +/- 1.20</td>
          <td>0.56 +/- 0.20</td>
      </tr>
      <tr>
          <td>ZINC 250k</td>
          <td>100%</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>0.892</td>
          <td>0.887</td>
          <td>3.60 +/- 1.01</td>
          <td>0.68 +/- 0.16</td>
      </tr>
      <tr>
          <td>MOSES</td>
          <td>100%</td>
          <td>99.4%</td>
          <td>100%</td>
          <td>0.881</td>
          <td>0.876</td>
          <td>3.24 +/- 0.97</td>
          <td>0.74 +/- 0.14</td>
      </tr>
  </tbody>
</table>
<p>Comparison with baselines on ZINC 250k (30K samples):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
          <th>SA Score</th>
          <th>QED</th>
          <th>IntDiv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JT-VAE</td>
          <td>99.8%</td>
          <td>100%</td>
          <td>100%</td>
          <td>3.37</td>
          <td>0.76</td>
          <td>0.85</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>100%</td>
          <td>99.97%</td>
          <td>100%</td>
          <td>4.62</td>
          <td>0.61</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>100%</td>
          <td>99.89%</td>
          <td>100%</td>
          <td>3.59</td>
          <td>0.68</td>
          <td>0.89</td>
      </tr>
  </tbody>
</table>
<p>GCPN generates overly complex molecules (high SA score of 4.62), while MolecularRNN produces more realistic structures with higher internal diversity than JT-VAE.</p>
<h3 id="property-optimization-results">Property Optimization Results</h3>
<p>Policy gradient optimization is run for 300 iterations with batch size 512 and constant learning rate $10^{-5}$, discount factor $\gamma = 0.97$. Top-3 scores for penalized logP and QED:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>logP 1st</th>
          <th>logP 2nd</th>
          <th>logP 3rd</th>
          <th>QED 1st</th>
          <th>QED 2nd</th>
          <th>QED 3rd</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></td>
          <td>3.63</td>
          <td>3.49</td>
          <td>3.44</td>
          <td>0.896</td>
          <td>0.824</td>
          <td>0.820</td>
      </tr>
      <tr>
          <td>JT-VAE</td>
          <td>5.30</td>
          <td>4.93</td>
          <td>4.49</td>
          <td>0.925</td>
          <td>0.911</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>GCPN</td>
          <td>7.98</td>
          <td>7.85</td>
          <td>7.80</td>
          <td>0.948</td>
          <td>0.947</td>
          <td>0.946</td>
      </tr>
      <tr>
          <td>MolecularRNN</td>
          <td>10.34</td>
          <td>10.19</td>
          <td>10.14</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.947</td>
      </tr>
  </tbody>
</table>
<p>MolecularRNN achieves the highest penalized logP scores (10.34 vs. GCPN&rsquo;s 7.98) while matching GCPN on QED. The authors also demonstrate melting temperature optimization using a GCN-based property predictor as the critic (RMSE 39.5 degrees C), showing that the RL framework generalizes to properties that cannot be computed directly from molecular graphs.</p>
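<p>The policy-gradient setup above uses a discount factor of $\gamma = 0.97$. As a minimal sketch of what that means in practice (illustrative rewards and names, not the authors' code), here is how a terminal reward on a completed molecule is propagated back to earlier generation steps:</p>

```python
# Sketch: discounted returns in a REINFORCE-style setup with gamma = 0.97,
# matching the hyperparameter above. Rewards and names are illustrative.

def discounted_returns(rewards, gamma=0.97):
    """Return G_t = sum_{k >= t} gamma^(k - t) * r_k for each step t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# Terminal-reward episode: the property score arrives only at the final step,
# and earlier steps receive geometrically discounted credit.
print(discounted_returns([0.0, 0.0, 1.0]))
```

<p>With $\gamma$ close to 1, early generation steps still receive most of the credit for a high-scoring final molecule.</p>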
<h2 id="distribution-level-evaluation-and-learned-chemical-patterns">Distribution-Level Evaluation and Learned Chemical Patterns</h2>
<p>The authors emphasize that reporting only top-3 scores is not informative, and they compare full property distributions. MolecularRNN shifts the QED distribution further toward maximum values compared to GCPN. They also note that during melting temperature optimization, the model rediscovered two chemical phenomena: fusing aromatic rings increases melting point, and the presence of polar groups (C=O, OH, NH2, heterocyclic nitrogens) enhances dipole-dipole interactions and raises melting temperature.</p>
<p>Without valency-based rejection sampling, the pretrained model achieves 65% validity. After structural penalty training (assigning -10 to valency-violating atoms and optimizing with policy gradient), validity increases to 90%. Enabling rejection sampling then achieves 100%.</p>
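<p>The valency-based rejection sampling step can be sketched as masking and resampling bond types whose order would exceed an atom's remaining valence budget. The valence table and sampler below are illustrative stand-ins, not the paper's implementation:</p>

```python
# Sketch of valency-based rejection sampling: bond types that would violate
# an atom's maximum valence are masked out and the remaining probabilities
# are renormalized. All names and the valence table are illustrative.
import random

MAX_VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"none": 0, "single": 1, "double": 2, "triple": 3}

def sample_bond(probs, used_valence, atom, rng):
    """Sample a bond type, rejecting any that would exceed the valence limit."""
    budget = MAX_VALENCE[atom] - used_valence
    allowed = {b: p for b, p in probs.items() if BOND_ORDER[b] <= budget}
    total = sum(allowed.values())
    # Renormalize the surviving probabilities and draw once.
    r, acc = rng.random() * total, 0.0
    for bond, p in allowed.items():
        acc += p
        if r <= acc:
            return bond
    return "none"

rng = random.Random(0)
# An oxygen with one bond already used can only accept "none" or "single".
bond = sample_bond({"none": 0.1, "single": 0.3, "double": 0.6}, 1, "O", rng)
assert bond in ("none", "single")
```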
<p>Several limitations are worth noting. The BFS ordering introduces an arbitrary sequencing over equivalent graph traversals (the node order permutation problem is not addressed). The evaluation uses top-3 scores for property optimization, though the authors do advocate for distributional evaluation. The molecule size is capped at 50 heavy atoms. The paper does not report training time or wall-clock generation speed. Future directions mentioned include multi-objective property optimization and scaffold completion (graph completion from a given core structure).</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL</td>
          <td>~1.5M molecules</td>
          <td>Bioactive molecules with experimental measurements</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>ZINC 250k</td>
          <td>250K molecules</td>
          <td>Random subset of ZINC database</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>MOSES</td>
          <td>~1.9M molecules</td>
          <td>Drug-like subset of ZINC</td>
      </tr>
      <tr>
          <td>Melting point critic</td>
          <td>Custom split</td>
          <td>37,940 train / 9,458 test</td>
          <td>Melting temperatures from -196 to 517 degrees C</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Pretraining</strong>: Maximum likelihood with Adam optimizer, learning rate 0.001 with multiplicative decay to $10^{-5}$, 250 epochs</li>
<li><strong>Structural penalty</strong>: Policy gradient with -10 penalty per valency-violating atom</li>
<li><strong>Property optimization</strong>: REINFORCE (policy gradient), 300 iterations, batch size 512, learning rate $10^{-5}$, discount factor $\gamma = 0.97$</li>
<li><strong>Melting point critic</strong>: GCN regression (4 layers, hidden size 128), Adam with learning rate 0.001, exponential decay $\gamma = 0.8$, 30 epochs, batch size 32</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>NodeRNN</strong>: 4 GRU layers, hidden size 256, node embedding 128</li>
<li><strong>EdgeRNN</strong>: 4 GRU layers, hidden size 128, edge embedding 16</li>
<li><strong>NodeMLP/EdgeMLP</strong>: 2-layer MLP with 128 hidden units, ReLU activation, softmax output</li>
<li><strong>BFS window</strong>: $M = 12$ preceding atoms</li>
<li><strong>Atom types</strong>: 9 (C, N, O, F, P, S, Cl, Br, I)</li>
<li><strong>Bond types</strong>: 3 (single, double, triple) + no bond</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>% chemically valid molecules (RDKit)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>% unique in generated pool (up to 1M)</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>% not in training set</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Average pairwise Tanimoto distance</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility (2-4 optimal range)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Drug-likeness score (0-1)</td>
      </tr>
      <tr>
          <td>Penalized logP</td>
          <td>Lipophilicity with ring and SA penalties</td>
      </tr>
  </tbody>
</table>
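<p>The uniqueness and novelty metrics in the table reduce to simple set operations over canonical SMILES strings. A minimal sketch (real pipelines canonicalize with RDKit first; the toy strings here are assumed already canonical):</p>

```python
# Sketch of the uniqueness and novelty metrics over canonical SMILES.
# Toy inputs are illustrative, not from the paper.

def uniqueness(generated):
    """Fraction of generated molecules that are distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training_set):
    """Fraction of unique generated molecules absent from the training set."""
    unique = set(generated)
    return len(unique - set(training_set)) / len(unique)

gen = ["CCO", "CCO", "CCN", "c1ccccc1"]
train = {"CCO"}
print(uniqueness(gen))      # 3 distinct out of 4 generated
print(novelty(gen, train))  # 2 of the 3 unique molecules are not in training
```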
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 GPUs (NVIDIA, specific model not stated)</li>
<li>Per-GPU batch size of 512 for pretraining</li>
<li>Training time not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Popova, M., Shvets, M., Oliva, J., &amp; Isayev, O. (2019). MolecularRNN: Generating realistic molecular graphs with optimized properties. <em>arXiv preprint arXiv:1905.13372</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{popova2019molecularrnn,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolecularRNN: Generating realistic molecular graphs with optimized properties}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Popova, Mariya and Shvets, Mykhailo and Oliva, Junier and Isayev, Olexandr}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:1905.13372}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Memory-Assisted RL for Diverse De Novo Mol. Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/</guid><description>A memory unit for REINVENT-based RL that tracks generated scaffolds and penalizes repeated solutions, increasing molecular diversity up to fourfold.</description><content:encoded><![CDATA[<h2 id="a-memory-module-for-diverse-molecular-generation-via-rl">A Memory Module for Diverse Molecular Generation via RL</h2>
<p>This is a <strong>Method</strong> paper that introduces a memory unit for reinforcement learning (RL)-based molecular generation. The primary contribution is a hash-table-based memory mechanism that integrates into the REINVENT framework&rsquo;s scoring function. By tracking previously generated high-scoring molecules and penalizing the reward when new molecules are too similar to those already stored, the memory unit forces the generative model to explore different regions of chemical space rather than collapsing onto a single scaffold family.</p>
<h2 id="policy-collapse-limits-rl-based-de-novo-design">Policy Collapse Limits RL-Based De Novo Design</h2>
<p>Recurrent neural networks (RNNs) trained with reinforcement learning can generate novel molecules optimized for desired properties. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> algorithm and related approaches (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, GENTRL) demonstrated the viability of coupling a pretrained SMILES-based generative model with a scoring function via RL. However, a persistent problem is <strong>policy collapse</strong> (also called mode collapse): once the model discovers a high-scoring region of chemical space, it continues to exploit that region, producing structurally similar compounds with minor substitution differences. This severely limits the practical utility of RL-based generation in drug design, where medicinal chemists need diverse scaffolds to explore structure-activity relationships and manage intellectual property concerns.</p>
<p>Prior work by Liu et al. [31] attempted to address this by engineering an explorative RNN alongside the standard generative RNN, but it did not substantially increase diversity compared to standard REINVENT. Other approaches like Generative Examination Networks (GEN) performed statistical analysis during training but were not evaluated in optimization scenarios.</p>
<h2 id="core-innovation-hash-table-memory-unit-for-reward-modification">Core Innovation: Hash-Table Memory Unit for Reward Modification</h2>
<p>The key insight is to dynamically modify the reward surface during RL by maintaining a memory of previously explored chemical space. The memory unit is a hash table of index-bucket pairs. Each bucket stores up to a fixed number of high-scoring molecules (default: 25) that are chemically similar to a seed molecule (the index).</p>
<h3 id="integration-with-reinvent">Integration with REINVENT</h3>
<p>The memory unit modifies the augmented likelihood used in REINVENT. For a generated compound $c$, the augmented log-likelihood becomes:</p>
<p>$$
\log P(c)_{Aug} = \log P(c)_{PriorNetwork} + \sigma \times S(c) \times M(c)
$$</p>
<p>where $\sigma$ is a scalar coefficient, $S(c)$ is the scoring function output, and $M(c)$ is the memory unit output (either 0 or 1). The reward is:</p>
<p>$$
R(c) = \left(\log P(c)_{Aug} - \log P(c)_{AgentNetwork}\right)^2
$$</p>
<p>and the loss is $\text{loss} = -R(c)$.</p>
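<p>A small numeric sketch of these two equations shows how a full bucket mutes the score term. The values and the $\sigma = 60$ coefficient are illustrative assumptions, not numbers from the paper:</p>

```python
# Numeric sketch of the memory-modified augmented likelihood and reward.
# All values, including sigma = 60, are hypothetical.

def augmented_loglik(logp_prior, score, memory_out, sigma=60.0):
    """log P(c)_Aug = log P(c)_Prior + sigma * S(c) * M(c)."""
    return logp_prior + sigma * score * memory_out

def reward(logp_aug, logp_agent):
    """R(c) = (log P(c)_Aug - log P(c)_Agent)^2."""
    return (logp_aug - logp_agent) ** 2

logp_prior, logp_agent, score = -30.0, -25.0, 0.5
full = reward(augmented_loglik(logp_prior, score, 1.0), logp_agent)   # bucket open: M(c) = 1
muted = reward(augmented_loglik(logp_prior, score, 0.0), logp_agent)  # bucket full: M(c) = 0
print(full, muted)  # 625.0 25.0
```

<p>Setting $M(c) = 0$ removes the score contribution entirely, leaving only the prior term, which is what discourages the agent from revisiting an exhausted region.</p>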
<h3 id="memory-unit-operation">Memory Unit Operation</h3>
<p>When a high-scoring molecule is generated:</p>
<ol>
<li>Its fingerprint or scaffold is compared against all index structures in the memory</li>
<li>If it is similar to an index (above a Tanimoto cutoff, default 0.6) and the corresponding bucket is not full, $M(c) = 1$ and the molecule is added to the bucket</li>
<li>If the bucket is full, $M(c) = 0$, effectively zeroing the reward contribution and discouraging the model from generating similar molecules</li>
<li>If no similar index exists, a new index-bucket pair is created</li>
</ol>
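<p>The four steps above can be sketched as a small class. Tanimoto similarity on binary fingerprints is Jaccard similarity on the sets of on-bits, so plain Python sets stand in for ECFP fingerprints here; this is an illustrative sketch, not the authors' implementation:</p>

```python
# Sketch of the memory unit's index/bucket logic (steps 1-4 above).
# Sets of "on bits" stand in for real fingerprints; Tanimoto on binary
# fingerprints is exactly Jaccard similarity on these sets.

class MemoryUnit:
    def __init__(self, bucket_size=25, cutoff=0.6):
        self.bucket_size = bucket_size
        self.cutoff = cutoff
        self.buckets = {}  # index fingerprint (frozenset) -> list of members

    @staticmethod
    def tanimoto(a, b):
        return len(a & b) / len(a | b)

    def __call__(self, fp):
        """Return M(c) in {0, 1} and record the molecule if accepted."""
        for index, bucket in self.buckets.items():
            if self.tanimoto(index, fp) >= self.cutoff:
                if len(bucket) < self.bucket_size:
                    bucket.append(fp)
                    return 1  # similar to an index, bucket has room
                return 0      # similar, bucket full: zero the reward term
        self.buckets[frozenset(fp)] = [fp]  # novel region: new index-bucket pair
        return 1

mem = MemoryUnit(bucket_size=2)
fp = {1, 2, 3, 4}
assert [mem(fp), mem(fp), mem(fp)] == [1, 1, 0]  # third near-duplicate is penalized
```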
<h3 id="four-similarity-criteria">Four Similarity Criteria</h3>
<p>The authors evaluate four criteria for grouping molecules in the memory:</p>
<ol>
<li><strong>Compound similarity</strong>: ECFP4 Tanimoto similarity at the whole-molecule level</li>
<li><strong>Identical Bemis-Murcko (BM) scaffold</strong>: exact match of Bemis-Murcko frameworks</li>
<li><strong>Identical carbon skeleton</strong>: exact match of carbon skeletons (BM scaffolds with all heteroatoms replaced by carbon and bonds set to single)</li>
<li><strong>Scaffold similarity</strong>: atom pair fingerprint Tanimoto similarity between carbon skeletons (fuzzy matching)</li>
</ol>
<h3 id="alternative-output-modes">Alternative Output Modes</h3>
<p>Beyond the binary output ($M(c) \in {0, 1}$), the authors also explored smooth output functions. The linear mode:</p>
<p>$$
M(c) = 1 - \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>And the sigmoid mode:</p>
<p>$$
M(c) = 1 - \frac{1}{1 + \exp\left(-\frac{2f - 1}{0.15}\right)}, \qquad f = \frac{\text{compounds in bucket}}{\text{bucket size}}
$$</p>
<p>Both smooth modes yielded slightly fewer analogs than the binary mode and were not pursued further.</p>
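<p>The three output modes, side by side, as functions of the bucket fill fraction $f$ (function names are illustrative):</p>

```python
# The binary, linear, and sigmoid memory output modes as functions of the
# bucket fill fraction f = compounds in bucket / bucket size.
import math

def m_binary(f):
    return 0.0 if f >= 1.0 else 1.0

def m_linear(f):
    return 1.0 - f

def m_sigmoid(f):
    return 1.0 - 1.0 / (1.0 + math.exp(-((2.0 * f - 1.0) / 0.15)))

for f in (0.0, 0.5, 1.0):
    print(f, m_binary(f), m_linear(f), round(m_sigmoid(f), 3))
# The sigmoid fades from ~1 to ~0 around a half-full bucket (f = 0.5),
# penalizing repeated scaffolds earlier than the binary mode does.
```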
<h2 id="experimental-setup-logp-optimization-and-target-activity-prediction">Experimental Setup: LogP Optimization and Target Activity Prediction</h2>
<h3 id="case-study-1-logp-optimization">Case Study 1: LogP Optimization</h3>
<p>As a proof of concept, the authors optimized LogP values for known DRD2 inhibitors. Starting from 487 DRD2 compounds with LogP &gt;= 5 (from ExCAPE-DB), they applied transfer learning to the prior model for 20 epochs, then ran RL for 150 iterations (100 compounds per iteration, 15,000 total). The scoring function was:</p>
<p>$$
S = 1 - \tanh\left(\min\left(|2 - \text{AlogP}|, |3 - \text{AlogP}|\right)\right)
$$</p>
<p>targeting LogP values between 2.0 and 3.0.</p>
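<p>The scoring function above, implemented as written (score 1.0 when AlogP sits exactly at either window edge, decaying with tanh of the distance to the nearer edge):</p>

```python
# The LogP scoring function S = 1 - tanh(min(|2 - AlogP|, |3 - AlogP|)),
# transcribed directly from the formula above.
import math

def logp_score(alogp):
    return 1.0 - math.tanh(min(abs(2.0 - alogp), abs(3.0 - alogp)))

print(logp_score(2.0))            # maximal at a window edge
print(round(logp_score(5.0), 3))  # far outside [2, 3]: strongly penalized
```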
<h3 id="case-study-2-htr1a-and-drd2-activity-prediction">Case Study 2: HTR1A and DRD2 Activity Prediction</h3>
<p>For a more complex scenario, the authors trained SVM classifiers (with <a href="https://en.wikipedia.org/wiki/Platt_scaling">Platt scaling</a> for probabilistic output) on bioactivity data from ExCAPE-DB to predict activity against two neurotransmitter receptors:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/5-HT1A_receptor">HTR1A</a></strong>: 3,599 actives (pIC50 &gt;= 7) and 66,684 inactives</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a></strong>: 2,981 actives (pIC50 &gt;= 7) and 346,206 inactives (100,000 sampled)</li>
</ul>
<p>Data was split using Butina clustering on ECFP6 at a 0.4 Tanimoto cutoff (60/20/20 train/val/test). The SVM models achieved excellent performance:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Set</th>
          <th>Balanced Accuracy</th>
          <th>ROC AUC</th>
          <th>F1</th>
          <th>MCC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>Test</td>
          <td>0.96</td>
          <td>0.99</td>
          <td>0.75</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Test</td>
          <td>0.95</td>
          <td>0.99</td>
          <td>0.71</td>
          <td>0.72</td>
      </tr>
  </tbody>
</table>
<p>RL was run for 300 iterations (100 compounds each, 30,000 total). Compounds with predicted activity &gt;= 0.7 were considered active.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN prior model followed the REINVENT architecture: an embedding layer, three GRU layers with 256 dimensions, and a linear output layer. It was pretrained on ~1.5 million ChEMBL 25 compounds (filtered to remove known HTR1A actives and DRD2 analogs) for 10 epochs using Adam with a learning rate of 0.01.</p>
<h3 id="comparisons">Comparisons</h3>
<p>The authors compared memory-assisted RL against:</p>
<ul>
<li>Standard REINVENT RL (no memory)</li>
<li>Experience replay (re-presenting 8 high-scoring compounds per iteration)</li>
<li>Temperature scaling (values from 1.0 to 10.0)</li>
<li>Memory + experience replay combined</li>
</ul>
<h2 id="results-up-to-fourfold-increase-in-diverse-active-compounds">Results: Up to Fourfold Increase in Diverse Active Compounds</h2>
<h3 id="logp-optimization-results">LogP Optimization Results</h3>
<p>Memory-assisted RL increased the number of optimized compounds (LogP 2-3) by roughly threefold:</p>
<table>
  <thead>
      <tr>
          <th>Memory Type</th>
          <th>Optimized Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>No memory</td>
          <td>938</td>
          <td>727</td>
          <td>396</td>
      </tr>
      <tr>
          <td>Compound similarity</td>
          <td>3,451</td>
          <td>2,963</td>
          <td>1,472</td>
      </tr>
      <tr>
          <td>Identical BM Scaffold</td>
          <td>3,428</td>
          <td>2,865</td>
          <td>1,398</td>
      </tr>
      <tr>
          <td>Identical Carbon Skeleton</td>
          <td>3,315</td>
          <td>3,002</td>
          <td>1,799</td>
      </tr>
      <tr>
          <td>Scaffold Similarity</td>
          <td>3,591</td>
          <td>3,056</td>
          <td>1,538</td>
      </tr>
  </tbody>
</table>
<p>The memory unit also increased the generation of relevant analogs. ECFP6 analogs (Tanimoto &gt;= 0.4 to training set) increased from 145 to up to 549, and shared MMP cores increased from 5 to up to 19, confirming that the memory unit promoted exploration of chemically relevant space rather than random drift.</p>
<h3 id="htr1a-and-drd2-activity-optimization-results">HTR1A and DRD2 Activity Optimization Results</h3>
<p>The improvements were even more pronounced for target activity optimization:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Memory Type</th>
          <th>Active Compounds</th>
          <th>Unique BM Scaffolds</th>
          <th>Unique Carbon Skeletons</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HTR1A</td>
          <td>No memory</td>
          <td>9,323</td>
          <td>7,312</td>
          <td>5,446</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Compound similarity</td>
          <td>16,779</td>
          <td>13,304</td>
          <td>9,887</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>Identical Carbon Skeleton</td>
          <td>17,597</td>
          <td>15,531</td>
          <td>12,408</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>No memory</td>
          <td>5,143</td>
          <td>2,635</td>
          <td>1,949</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Compound similarity</td>
          <td>21,486</td>
          <td>17,844</td>
          <td>12,749</td>
      </tr>
      <tr>
          <td>DRD2</td>
          <td>Scaffold Similarity</td>
          <td>22,784</td>
          <td>20,712</td>
          <td>16,434</td>
      </tr>
  </tbody>
</table>
<p>For DRD2, the effect was particularly striking: standard RL showed clear policy collapse with only 576 ECFP6 analogs to the training set, while memory-assisted RL generated up to 6,315. The compound similarity memory unit produced the most MMP analogs (217 to the training set vs. 7 without memory).</p>
<h3 id="parameter-sensitivity">Parameter Sensitivity</h3>
<p>Bucket size had a modest effect: larger buckets (allowing more compounds before penalization) slightly increased analog generation. The Tanimoto similarity threshold of 0.6 was near-optimal for the scaffold similarity memory; higher thresholds reduced diversity gains. The compound similarity memory showed increasing analogs with higher thresholds, but BM scaffold and carbon skeleton counts plateaued above 0.6.</p>
<h3 id="comparison-with-experience-replay-and-temperature-scaling">Comparison with Experience Replay and Temperature Scaling</h3>
<ul>
<li><strong>Experience replay alone</strong> increased diversity compared to vanilla RL but was less effective than the memory unit alone</li>
<li><strong>Memory + experience replay</strong> achieved the best results overall, as experience replay provided the model with diverse starting points for exploration after the memory unit altered the reward landscape</li>
<li><strong>Temperature scaling</strong> was largely ineffective: only a value of 1.25 showed improvement, and even then it achieved only about 50% of the analogs generated by memory-assisted RL. Temperatures above 2.0 degraded SMILES validity, and above 4.0 prevented valid molecule generation entirely</li>
</ul>
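<p>Why high temperatures destroy SMILES validity is easy to see from the sampling distribution itself. A sketch of temperature-scaled softmax over next-token logits (toy logits, illustrative names):</p>

```python
# Sketch of temperature scaling on next-token logits: dividing by T > 1
# flattens the distribution, moving probability mass onto unlikely tokens,
# which is why very high temperatures yield syntactically invalid SMILES.
import math

def softmax_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 4.0)
assert max(sharp) > max(flat)  # higher T spreads mass onto rare tokens
```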
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>All evaluations are retrospective; no synthesized compounds were experimentally tested</li>
<li>The SVM activity models, while accurate, may have applicability domain limitations for highly novel scaffolds</li>
<li>The binary memory output mode was found to work best, but the transition from exploration to exploitation is abrupt</li>
<li>The method was only tested with two biological targets and one physicochemical property</li>
<li>Computational overhead of the memory unit is not discussed</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior model training</td>
          <td>ChEMBL 25</td>
          <td>~1.5M compounds</td>
          <td>Filtered: max 50 heavy atoms, no stereochemistry, removed HTR1A actives and DRD2 analogs</td>
      </tr>
      <tr>
          <td>HTR1A activity data</td>
          <td>ExCAPE-DB</td>
          <td>3,599 actives + 66,684 inactives</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
      <tr>
          <td>DRD2 activity data</td>
          <td>ExCAPE-DB</td>
          <td>2,981 actives + 100,000 inactives (sampled)</td>
          <td>pIC50 &gt;= 7 threshold for actives</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Generative model</strong>: RNN with embedding + 3 GRU layers (256 dim) + linear output (REINVENT architecture)</li>
<li><strong>RL</strong>: Augmented likelihood formulation with sigma scaling coefficient</li>
<li><strong>SVM classifiers</strong>: Non-linear SVM with MinMax kernel, Platt scaling, ECFP6 count-based fingerprints (2048 dim)</li>
<li><strong>Butina clustering</strong>: ECFP6 Tanimoto cutoff 0.4 for train/val/test splitting</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Unique compounds</td>
          <td>Number of distinct valid SMILES generated</td>
      </tr>
      <tr>
          <td>Unique BM scaffolds</td>
          <td>Bemis-Murcko framework diversity</td>
      </tr>
      <tr>
          <td>Unique carbon skeletons</td>
          <td>Carbon skeleton diversity (stripped BM scaffolds)</td>
      </tr>
      <tr>
          <td>ECFP6 analogs</td>
          <td>Compounds with Tanimoto &gt;= 0.4 to known actives</td>
      </tr>
      <tr>
          <td>MMP analogs</td>
          <td>Matched molecular pair relationships with known actives</td>
      </tr>
      <tr>
          <td>Shared MMP cores</td>
          <td>Scaffold cores shared between generated and known compounds</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/tblaschke/reinvent-memory">reinvent-memory</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with prepared datasets</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Blaschke, T., Engkvist, O., Bajorath, J., &amp; Chen, H. (2020). Memory-assisted reinforcement learning for diverse molecular de novo design. <em>Journal of Cheminformatics</em>, 12, 68. <a href="https://doi.org/10.1186/s13321-020-00473-0">https://doi.org/10.1186/s13321-020-00473-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{blaschke2020memory,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Memory-assisted reinforcement learning for diverse molecular de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Blaschke, Thomas and Engkvist, Ola and Bajorath, J{\&#34;u}rgen and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00473-0}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
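<p>One closed form consistent with that schedule (constant $10^{-3}$ for 50 epochs, then exponential decay reaching $10^{-6}$ at epoch 100) is geometric interpolation between the two endpoints. This is an assumption matching the stated endpoints, not the authors' exact implementation:</p>

```python
# Sketch of the decoder learning-rate schedule: constant 1e-3 for the first
# 50 epochs, then exponential decay to 1e-6 by epoch 100. The closed form
# is an assumption consistent with those endpoints.

def heteroencoder_lr(epoch, warm=50, total=100, lr0=1e-3, lr_final=1e-6):
    if epoch <= warm:
        return lr0
    frac = (epoch - warm) / (total - warm)  # 0 at epoch 50, 1 at epoch 100
    return lr0 * (lr_final / lr0) ** frac   # geometric interpolation

for e in (1, 50, 75, 100):
    print(e, heteroencoder_lr(e))
```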
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Fréchet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both compound and scaffold levels. A probabilistic analysis showed that the RNN model would be very unlikely to eventually cover the LatentGAN output space. This suggests the two architectures can work complementarily in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugEx v2: Pareto Multi-Objective RL for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/drugex-v2-pareto-multi-objective-rl/</guid><description>DrugEx v2 extends RNN-based de novo drug design with Pareto ranking and evolutionary exploration for multi-objective molecule generation.</description><content:encoded><![CDATA[<h2 id="multi-objective-de-novo-drug-design-with-pareto-optimization">Multi-Objective De Novo Drug Design with Pareto Optimization</h2>
<p>This is a <strong>Method</strong> paper that extends the DrugEx framework (v1) to handle multi-objective optimization in de novo drug design. The primary contribution is integrating Pareto-based ranking with evolutionary algorithm concepts (crossover and mutation) into an RNN-based reinforcement learning pipeline. The system generates <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>-based molecules optimized simultaneously for activity toward multiple protein targets while avoiding off-targets, addressing polypharmacology scenarios where drugs must bind multiple specific receptors.</p>
<h2 id="polypharmacology-and-the-limits-of-single-objective-generation">Polypharmacology and the Limits of Single-Objective Generation</h2>
<p>Traditional drug discovery follows the &ldquo;one drug, one target, one disease&rdquo; paradigm, but drug molecules interact with an average of six protein targets. Off-target binding causes side effects that remain a leading cause of clinical failure and post-approval drug withdrawals (over 500 drugs withdrawn due to fatal toxicity). Complex diseases often require modulating multiple targets simultaneously, making polypharmacology an important design objective.</p>
<p>Prior deep learning approaches for de novo design, including DrugEx v1, focused on generating molecules active against a single target. Extending these methods to multiple objectives introduces fundamental challenges: objectives are often contradictory (high affinity for one target may correlate with high affinity for an undesired off-target), and naive weighted-sum approaches can collapse diversity by over-optimizing a single dominant objective. The authors specifically target the <a href="https://en.wikipedia.org/wiki/Adenosine_receptor">adenosine receptor</a> system, where $A_1AR$ and $A_{2A}AR$ selectivity profiles matter for therapeutic efficacy, and <a href="https://en.wikipedia.org/wiki/HERG">hERG</a> channel binding must be avoided to prevent cardiac toxicity.</p>
<h2 id="evolutionary-exploration-and-pareto-ranking-in-rl">Evolutionary Exploration and Pareto Ranking in RL</h2>
<p>The core innovation of DrugEx v2 has two components: an evolutionary exploration strategy and Pareto-based reward assignment.</p>
<h3 id="evolutionary-exploration-strategy">Evolutionary Exploration Strategy</h3>
<p>The generation process uses three RNN networks with identical LSTM architectures:</p>
<ul>
<li><strong>Agent net</strong> ($G_A$): the primary generator, updated at each training epoch via policy gradient</li>
<li><strong>Crossover net</strong> ($G_C$): initialized from the fine-tuned model, updated iteratively from $G_A$ after each convergence period</li>
<li><strong>Mutation net</strong> ($G_M$): initialized from the pre-trained model, parameters fixed throughout training</li>
</ul>
<p>At each token-generation step, a random number determines whether the token probability comes from the combination of $G_A$ and $G_C$ (with probability $1 - \varepsilon$) or from $G_M$ (with probability $\varepsilon$). This mirrors crossover and mutation operations from evolutionary algorithms, maintaining diversity while steering toward desired properties.</p>
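<p>The per-token mixing rule can be sketched as follows. How $G_A$ and $G_C$ are combined when the mutation net is not selected is an assumption here (simple averaging of the two distributions, which keeps the result normalized); the paper's exact combination rule may differ:</p>

```python
import random

def mixed_token_probs(p_agent, p_cross, p_mut, eps, rng):
    """With probability eps, take the mutation net's token distribution;
    otherwise blend the agent and crossover nets (averaged here, an
    assumption -- the average of two distributions still sums to 1)."""
    if rng.random() < eps:
        return list(p_mut)
    return [(a + c) / 2.0 for a, c in zip(p_agent, p_cross)]
```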
<h3 id="pareto-front-reward-scheme">Pareto Front Reward Scheme</h3>
<p>For $n$ objectives (three in this study: $A_1AR$, $A_{2A}AR$, hERG), each molecule receives a score $R_i$ based on its predicted bioactivity:</p>
<p>$$
R_{i} = \begin{cases} \text{minmax}(pX_{i}), &amp; \text{if high affinity required} \\ 1 - \text{minmax}(pX_{i}), &amp; \text{if low affinity required} \\ 0, &amp; \text{if SMILES invalid} \end{cases}
$$</p>
<p>where $pX_i$ is the predicted bioactivity (range 3.0 to 10.0), normalized to [0, 1].</p>
<p>For the multi-target case, high affinity is required for both $A_1AR$ and $A_{2A}AR$ while low affinity is required for hERG. For the target-specific case, high affinity is required only for $A_{2A}AR$ while low affinity is required for both $A_1AR$ and hERG.</p>
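<p>The per-objective score can be sketched directly from the definition; clamping $pX$ to the stated 3.0&ndash;10.0 range before normalizing is an assumption:</p>

```python
def objective_reward(pX, high_affinity, valid=True, lo=3.0, hi=10.0):
    """Minmax-normalize predicted bioactivity pX (range 3.0-10.0) to [0, 1];
    invert the score when *low* affinity is the goal (e.g. hERG)."""
    if not valid:
        return 0.0  # invalid SMILES score zero on every objective
    s = (min(max(pX, lo), hi) - lo) / (hi - lo)
    return s if high_affinity else 1.0 - s
```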
<p>Molecules are ranked using a <a href="https://en.wikipedia.org/wiki/Non-dominated_sorting_genetic_algorithm_II">non-dominated sorting</a> algorithm to construct Pareto fronts. Within each front, molecules are ranked by average Tanimoto distance (using ECFP6 fingerprints) rather than crowding distance, favoring chemically diverse solutions. The final reward is:</p>
<p>$$
R_i^{*} = \begin{cases} 0.5 + \frac{k - N_{undesired}}{2N_{desired}}, &amp; \text{if desired} \\ \frac{k}{2N_{undesired}}, &amp; \text{if undesired} \end{cases}
$$</p>
<p>where $k$ is the molecule&rsquo;s index in the Pareto rank. Rewards for undesired and desired solutions are distributed in $(0, 0.5]$ and $(0.5, 1.0]$, respectively.</p>
<p>The agent is trained via policy gradient:</p>
<p>$$
J(\theta) = \mathbb{E}\left[R^{*}(y_{1:T}) \middle|\theta\right] = \sum_{t=1}^{T} \log G(y_t | y_{1:t-1}) \cdot R^{*}(y_{1:T})
$$</p>
<h3 id="weighted-sum-alternative">Weighted Sum Alternative</h3>
<p>The authors also implement a weighted sum (WS) scheme with dynamic weights proportional to the ratio of undesired to desired molecules per objective:</p>
<p>$$
w_i = \frac{r_i}{\sum_{k=1}^{M} r_k}, \quad R^{*} = \sum_{i=1}^{n} w_i R_i
$$</p>
<p>This auto-adjusts importance toward under-performing objectives during training.</p>
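<p>A minimal sketch of the dynamic weighting, assuming each $r_i$ is computed as the ratio of undesired to desired molecule counts for objective $i$ in the current batch (the exact counting scheme is not spelled out in this summary):</p>

```python
def dynamic_weights(undesired_counts, desired_counts):
    # r_i = (# undesired) / (# desired) per objective, normalized to sum to 1,
    # so under-performing objectives receive proportionally larger weights
    ratios = [u / d for u, d in zip(undesired_counts, desired_counts)]
    total = sum(ratios)
    return [r / total for r in ratios]

def weighted_reward(scores, weights):
    # R* = sum_i w_i * R_i
    return sum(w * s for w, s in zip(weights, scores))
```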
<h3 id="molecular-diversity-metric">Molecular Diversity Metric</h3>
<p>Diversity is measured using the Solow-Polasky metric adapted from ecological biodiversity:</p>
<p>$$
I(A) = \frac{1}{|A|} \mathbf{e}^{\top} F(\mathbf{s})^{-1} \mathbf{e}
$$</p>
<p>where $F(\mathbf{s})$ is a distance matrix with entries $f(d_{ij}) = e^{-\theta d_{ij}}$ and $d_{ij}$ is the Tanimoto distance between ECFP6 fingerprints of molecules $s_i$ and $s_j$.</p>
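<p>The metric reduces to one linear solve. A minimal numpy sketch, assuming the pairwise Tanimoto distances are precomputed (fingerprinting would require a cheminformatics toolkit such as RDKit); note that exact duplicates make $F$ singular, so duplicate molecules must be removed first:</p>

```python
import numpy as np

def solow_polasky_diversity(dist, theta=1.0):
    """dist: symmetric pairwise Tanimoto-distance matrix with zero diagonal.
    Returns I(A) = (1/|A|) e^T F^{-1} e, where F_ij = exp(-theta * d_ij)."""
    F = np.exp(-theta * np.asarray(dist, dtype=float))
    e = np.ones(len(F))
    return float(e @ np.linalg.solve(F, e)) / len(F)
```

<p>Two identical molecules drive the value toward its minimum, while a set of mutually distant molecules pushes it toward 1.</p>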
<h2 id="multi-target-and-target-specific-experiments">Multi-Target and Target-Specific Experiments</h2>
<h3 id="qsar-environment">QSAR Environment</h3>
<p>Four ML algorithms were benchmarked for the bioactivity prediction environment: Random Forest (RF), SVM, PLS, and Multi-task DNN (MT-DNN). Input features combined 2048-bit ECFP6 fingerprints with 19 physicochemical descriptors (2067D total). The training data came from ChEMBL v26: 25,731 ligands with bioactivity measurements toward $A_1AR$, $A_{2A}AR$, and hERG. RF was selected as the final predictor based on superior performance in temporal-split independent testing ($R^2$ and RMSE), prioritizing robustness over cross-validation metrics.</p>
<h3 id="generative-model-architecture">Generative Model Architecture</h3>
<p>The RNN generator uses six layers: input, embedding (128D), three LSTM recurrent layers (512 hidden units), and output. LSTM was chosen over GRU based on higher valid SMILES rates (97.5% vs. 93.1% for pre-trained, 97.9% vs. 95.7% for fine-tuned). Pre-training used 1.7M molecules from ChEMBL; fine-tuning used the 25,731 LIGAND set molecules.</p>
<h3 id="baselines">Baselines</h3>
<p>DrugEx v2 was compared against DrugEx v1, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a>, all using the same RNN architecture and pre-trained/fine-tuned models, with only the RL framework differing. Both Pareto front (PF) and weighted sum (WS) reward schemes were tested.</p>
<h3 id="multi-target-results">Multi-Target Results</h3>
<p>In the multi-target case (high affinity for $A_1AR$ and $A_{2A}AR$, low affinity for hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.57%</td>
          <td>80.81%</td>
          <td>87.29%</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.80%</td>
          <td><strong>97.45%</strong></td>
          <td>89.08%</td>
          <td>0.49</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>PF</td>
          <td>99.54%</td>
          <td>57.43%</td>
          <td><strong>98.84%</strong></td>
          <td><strong>0.77</strong></td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.84%</td>
          <td>66.01%</td>
          <td>82.67%</td>
          <td>0.65</td>
      </tr>
      <tr>
          <td>DrugEx v1</td>
          <td>PF</td>
          <td>98.28%</td>
          <td>43.27%</td>
          <td>88.96%</td>
          <td>0.71</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 achieved the highest desirability under both schemes. The WS scheme maximized desirability (97.45%) but at the cost of diversity (0.49). The PF scheme maintained higher diversity (0.70) with still-strong desirability (80.81%).</p>
<h3 id="target-specific-results">Target-Specific Results</h3>
<p>In the target-specific case (high $A_{2A}AR$, low $A_1AR$ and hERG):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Scheme</th>
          <th>Validity</th>
          <th>Desirability</th>
          <th>Uniqueness</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugEx v2</td>
          <td>PF</td>
          <td>99.53%</td>
          <td><strong>89.49%</strong></td>
          <td>90.55%</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>DrugEx v2</td>
          <td>WS</td>
          <td>99.62%</td>
          <td><strong>97.86%</strong></td>
          <td>90.54%</td>
          <td>0.31</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>WS</td>
          <td>99.55%</td>
          <td>81.27%</td>
          <td>98.87%</td>
          <td>0.34</td>
      </tr>
      <tr>
          <td>ORGANIC</td>
          <td>PF</td>
          <td>98.29%</td>
          <td>86.98%</td>
          <td>80.30%</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>DrugEx v2 with PF achieved high desirability (89.49%) while maintaining diversity (0.73), outperforming both the WS scheme&rsquo;s diversity collapse (0.31) and competing methods.</p>
<h3 id="chemical-space-coverage">Chemical Space Coverage</h3>
<p>t-SNE visualization with ECFP6 descriptors showed that the PF scheme guided generators to cover chemical space more broadly than the WS scheme. DrugEx v1 and v2 covered nearly all of the chemical space occupied by known active ligands, while REINVENT and ORGANIC covered only partial regions in the target-specific case.</p>
<h3 id="substructure-distribution">Substructure Distribution</h3>
<p>Generated molecules were evaluated for purine ring, furan ring, and benzene ring frequencies. DrugEx v2 with PF produced substructure distributions closest to the LIGAND set, suggesting it better preserves the chemical characteristics of known active molecules compared to REINVENT (which over-represented benzene rings) and ORGANIC.</p>
<h3 id="guacamol-benchmark">GuacaMol Benchmark</h3>
<p>DrugEx v2 was tested on 20 goal-directed tasks from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark, achieving the best score in 12 of 20 tasks and an overall second place. The method struggled with tasks requiring contradictory objectives in narrow chemical spaces (e.g., the Sitagliptin MPO task), reflecting its emphasis on diverse feasible molecules rather than optimal individual solutions.</p>
<h2 id="diversity-desirability-trade-off-and-limitations">Diversity-Desirability Trade-off and Limitations</h2>
<p>The key finding is that the Pareto front scheme and weighted sum scheme offer complementary strengths: PF produces molecules with higher diversity and more realistic substructure distributions, while WS achieves higher raw desirability scores. The Pareto front scheme is preferred for polypharmacology applications where chemical diversity matters for lead optimization.</p>
<p>The mutation rate $\varepsilon$ controls the diversity-desirability trade-off. Higher $\varepsilon$ increases diversity at the cost of desirability. The authors tested $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$ and found that appropriate tuning is important.</p>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The method is less effective for tasks with contradictory objectives in narrow chemical spaces</li>
<li>Emphasis is on generating diverse feasible molecules rather than individual optimal solutions</li>
<li>REINVENT 2.0 did not converge with the PF scheme, suggesting the Pareto approach may not be universally compatible with all RL frameworks</li>
<li>Bioactivity predictions rely on QSAR models (RF), which may not generalize perfectly to novel chemical scaffolds</li>
</ul>
<p>Future directions mentioned include adopting newer architectures (BERT, Transformer, GPT-2), handling graph and fragment representations, and integrating additional objectives like stability and synthesizability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v26 (ChEMBL set)</td>
          <td>1.7M molecules</td>
          <td>SMILES syntax learning, drug-like molecules</td>
      </tr>
      <tr>
          <td>Fine-tuning / Environment</td>
          <td>LIGAND set</td>
          <td>25,731 ligands</td>
          <td>Bioactivities for $A_1AR$, $A_{2A}AR$, hERG from ChEMBL</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>GuacaMol</td>
          <td>20 tasks</td>
          <td>Goal-directed generation tasks</td>
      </tr>
  </tbody>
</table>
<p>Active/inactive thresholds: $pX \geq 6.5$ (active), $pX &lt; 6.5$ (inactive). Low-quality data points without an exact pX value were assigned $pX = 3.99$ with a sample weight of 0.1.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>QSAR predictor</strong>: Random Forest, 1000 trees, Gini criterion. Input: 2048-bit ECFP6 + 19 physicochemical properties (2067D). MinMax normalization.</li>
<li><strong>Generator</strong>: 6-layer RNN with LSTM cells (512 hidden units), embedding dim 128, vocabulary 84 tokens. Adam optimizer, lr $10^{-3}$, batch size 512, 1000 epochs.</li>
<li><strong>RL training</strong>: Policy gradient with Pareto-based or weighted-sum reward. Mutation rates tested: $\varepsilon \in \{10^{-2}, 10^{-3}, 10^{-4}, 0\}$.</li>
<li><strong>Pareto ranking</strong>: GPU-accelerated non-dominated sorting via PyTorch. Tanimoto-based crowding distance with ECFP6 fingerprints.</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Generator</td>
          <td>LSTM (3 layers, 512 hidden)</td>
          <td>Embedding 128D, vocab 84</td>
      </tr>
      <tr>
          <td>Predictor</td>
          <td>Random Forest</td>
          <td>1000 trees, 2067D input</td>
      </tr>
      <tr>
          <td>MT-DNN (alternative)</td>
          <td>3 hidden layers (4000, 2000, 1000)</td>
          <td>ReLU, 20% dropout</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>Fraction of generated SMILES that parse to valid molecules</td>
      </tr>
      <tr>
          <td>Desirability</td>
          <td>Fraction of molecules meeting all activity thresholds ($pX \geq 6.5$ on-targets, $pX &lt; 6.5$ off-targets)</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>Fraction of non-duplicate molecules</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Solow-Polasky metric on ECFP6 Tanimoto distances</td>
      </tr>
      <tr>
          <td>SA score</td>
          <td>Synthetic accessibility (1-10, lower is easier)</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative estimate of drug-likeness (0-1, higher is better)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>GPU acceleration was used for Pareto optimization via PyTorch. Specific hardware details (GPU model, training time) are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/XuhanLiu/DrugEx">DrugEx GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation (Python, PyTorch)</td>
      </tr>
      <tr>
          <td><a href="https://www.ebi.ac.uk/chembl/">ChEMBL v26</a></td>
          <td>Dataset</td>
          <td>CC BY-SA 3.0</td>
          <td>Source of training molecules and bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, X., Ye, K., van Vlijmen, H. W. T., Emmerich, M. T. M., IJzerman, A. P., &amp; van Westen, G. J. P. (2021). DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology. <em>Journal of Cheminformatics</em>, 13(1), 85. <a href="https://doi.org/10.1186/s13321-021-00561-9">https://doi.org/10.1186/s13321-021-00561-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liu2021drugex,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugEx v2: de novo design of drug molecules by Pareto-based multi-objective reinforcement learning in polypharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Xuhan and Ye, Kai and van Vlijmen, Herman W. T. and Emmerich, Michael T. M. and IJzerman, Adriaan P. and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{85}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-021-00561-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugChat: Conversational QA on Drug Molecule Graphs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugchat-chatgpt-drug-molecule-graphs/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugchat-chatgpt-drug-molecule-graphs/</guid><description>DrugChat connects a GNN molecular encoder with Vicuna-13B via a linear adaptor, enabling multi-turn conversational QA about drug compound graphs.</description><content:encoded><![CDATA[<h2 id="a-prototype-for-conversational-drug-compound-analysis">A Prototype for Conversational Drug Compound Analysis</h2>
<p><strong>Method ($\Psi_{\text{Method}}$)</strong></p>
<p>DrugChat is a prototype system that enables ChatGPT-like conversational interaction with drug molecule graphs. Users upload a compound&rsquo;s molecular graph and ask free-form, multi-turn questions about its properties, mechanism of action, or therapeutic applications. The system generates natural language answers by combining a graph neural network (GNN) encoder, a large language model (LLM), and a lightweight linear adaptor that bridges the two modalities. The primary contribution is the architecture and the accompanying instruction tuning datasets (10,834 drug compounds, 143,517 QA pairs) that make this graph-to-language interaction possible.</p>
<h2 id="why-conversational-interfaces-for-drug-molecules">Why Conversational Interfaces for Drug Molecules?</h2>
<p>Drug discovery is time-intensive and expensive, often requiring years and billions of dollars to bring a single compound to market. Traditional computational chemistry tools provide specialized outputs but lack the ability to support open-ended, interactive exploration of molecular properties. Researchers working with drug compound data frequently need quick answers to diverse questions: What is the mechanism of action? Are there known drug interactions? What structural modifications could improve efficacy?</p>
<p>At the time of this work, large language models had demonstrated strong conversational capabilities for text, and multimodal extensions (MiniGPT-4, LLaVA) had connected vision encoders to LLMs. However, no system had bridged graph-structured molecular data with LLMs for interactive dialogue. DrugChat addresses this gap by proposing the first system (to the authors&rsquo; knowledge) that connects molecular graph representations directly to an LLM for multi-turn question answering.</p>
<h2 id="architecture-gnn-adaptor-llm-pipeline">Architecture: GNN-Adaptor-LLM Pipeline</h2>
<p>The core innovation is the three-component architecture and its training strategy:</p>
<p><strong>Graph Neural Network (GNN)</strong>: A pre-trained GNN from Hu et al. (2020) processes the compound&rsquo;s molecular graph. At each layer $k$, node representations are updated by aggregating features from neighboring nodes:</p>
<p>$$
h_{v}^{k} = \sigma\left(h_{v}^{k-1}, \text{AGG}\left(\left\{h_{u}^{k-1}, u \in \mathcal{N}(v)\right\}\right)\right)
$$</p>
<p>A permutation-invariant pooling function produces the graph-level representation:</p>
<p>$$
h_{G} = f\left(\left\{h_{v}^{K}, v \in G\right\}\right)
$$</p>
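<p>The two updates above can be sketched schematically. The actual encoder is the pre-trained GNN of Hu et al. (2020); here the sum aggregation, the <code>update</code> callable, and mean pooling are illustrative placeholders for AGG, $\sigma$, and $f$:</p>

```python
def gnn_layer(h, adjacency, update):
    """One message-passing layer: each node sums its neighbors'
    feature vectors (AGG), then `update` combines the aggregate with
    the node's previous representation h_v^{k-1}."""
    new_h = {}
    for v, h_v in h.items():
        agg = [0.0] * len(h_v)
        for u in adjacency[v]:
            agg = [a + x for a, x in zip(agg, h[u])]
        new_h[v] = update(h_v, agg)
    return new_h

def mean_pool(h):
    """Permutation-invariant readout f: average the final node vectors
    into a single graph-level representation h_G."""
    vecs = list(h.values())
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

<p>On a triangle graph with scalar features, one layer with an additive update gives every node the sum of the whole graph's features, and pooling collapses that to a single graph vector.</p>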
<p><strong>Linear Adaptor</strong>: A single linear transformation matrix converts the GNN graph representation into a soft prompt vector compatible with the LLM&rsquo;s input space. This is the only component whose weights are updated during training.</p>
<p><strong>Large Language Model (Vicuna-13B)</strong>: The pre-trained Vicuna-13B model takes the transformed graph prompt vector along with user questions and generates answers. Both the GNN and LLM weights remain frozen during training.</p>
<p>The prompt template follows the Vicuna conversational format:</p>
<p>$$
\mathbf{Q}: \langle\text{Graph}\rangle\langle\text{GraphFeature}\rangle\langle/\text{Graph}\rangle\langle\text{Instruction}\rangle \quad \mathbf{A}: \langle\text{Desc}\rangle
$$</p>
<p>During training, the system minimizes a negative log-likelihood loss between generated and ground-truth answers. The entire training procedure updates only the adaptor&rsquo;s parameters, making the approach computationally lightweight compared to full fine-tuning.</p>
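<p>Mechanically, the trainable part of the pipeline is tiny: a single affine map from the frozen GNN's graph embedding into the LLM's token-embedding space, whose output is prepended as a soft prompt. A schematic sketch (dimensions and names are illustrative, not from the paper):</p>

```python
def linear_adaptor(W, b, h_G):
    """The only trainable component: project the frozen GNN's graph
    embedding h_G (dim d_g) into the LLM embedding space (dim d_llm)
    via a single linear transformation W h_G + b."""
    return [sum(w_ij * x for w_ij, x in zip(row, h_G)) + b_i
            for row, b_i in zip(W, b)]

def build_llm_input(soft_prompt, instruction_embeddings):
    """Prepend the projected graph vector as a soft-prompt token ahead
    of the embedded instruction, mirroring the
    <Graph><GraphFeature></Graph><Instruction> template."""
    return [soft_prompt] + instruction_embeddings
```

<p>Because gradients flow only into <code>W</code> and <code>b</code>, training cost is dominated by frozen forward passes through the GNN and Vicuna-13B rather than by parameter updates.</p>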
<h2 id="instruction-tuning-datasets-from-chembl-and-pubchem">Instruction Tuning Datasets from ChEMBL and PubChem</h2>
<p>The authors constructed two instruction tuning datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Drug Compounds</th>
          <th>QA Pairs</th>
          <th>Source</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>3,892</td>
          <td>129,699</td>
          <td>ChEMBL database (Feb 2023)</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>6,942</td>
          <td>13,818</td>
          <td>PubChem (May 2023)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td><strong>10,834</strong></td>
          <td><strong>143,517</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<p><strong>ChEMBL Dataset</strong>: Starting from 2,354,965 compounds in <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, the authors identified 14,816 with drug information and filtered to 3,892 with sufficient descriptive content. For each drug, they gathered <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, molecular features (formula, acid/base classification), and drug-specific properties (mechanism of action, therapeutic applications). They manually crafted QA pairs covering topics like rotatable bond count, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski rule</a> violations, <a href="https://en.wikipedia.org/wiki/Chirality_(chemistry)">chirality</a>, <a href="https://en.wikipedia.org/wiki/Polar_surface_area">polar surface area</a>, development stage, approval year, and <a href="https://en.wikipedia.org/wiki/United_States_Adopted_Name">USAN</a> classification.</p>
<p><strong>PubChem Dataset</strong>: From 66,469,244 compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, 19,319 had drug information, and 6,942 were retained after filtering for detailed descriptions. Descriptions were sourced from <a href="https://en.wikipedia.org/wiki/ChEBI">ChEBI</a>, LOTUS, and YMDB databases, yielding 13,818 QA pairs primarily asking for drug descriptions.</p>
<p>The QA pairs are formulaic: the ChEMBL set covers up to 34 question types per drug (an example drug in the paper shows all 34), while PubChem questions ask for descriptive summaries from different source databases.</p>
<h2 id="qualitative-demonstrations-only">Qualitative Demonstrations Only</h2>
<p>The paper presents only qualitative results. Two demonstration examples show DrugChat answering multi-turn questions about test compounds not seen during training. Questions like &ldquo;what makes this compound unique?&rdquo; and &ldquo;what diseases can this compound potentially treat?&rdquo; are answered in natural language.</p>
<p>No systematic quantitative evaluation is reported. The authors state they &ldquo;will perform a systematic quantitative evaluation by collaborating with pharmaceutical scientists,&rdquo; but this evaluation is not included in the technical report.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors identify <strong>language hallucination</strong> as the primary limitation. Since DrugChat incorporates an LLM, it may produce convincing but incorrect text descriptions about drugs, which could mislead decision-makers in real drug discovery pipelines.</p>
<p>Proposed mitigations include:</p>
<ul>
<li>Higher-quality training data and filtering strategies</li>
<li>More advanced GNN encoders and LLMs</li>
<li>Reinforcement learning from human feedback (RLHF) as the user base grows</li>
</ul>
<p>Several additional limitations are worth noting:</p>
<ul>
<li>The QA pairs are largely factoid-style questions with short, formulaic answers, which may not capture the nuanced reasoning needed for real drug discovery tasks</li>
<li>The evaluation is entirely qualitative, with no comparison to baselines or quantitative metrics</li>
<li>The linear adaptor is a minimal alignment mechanism; it remains unclear how much molecular structural information is preserved through this single linear transformation</li>
<li>The training data covers only a small fraction of known chemical space (10,834 compounds out of millions)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL Drug Instruction Tuning</td>
          <td>3,892 drugs, 129,699 QA pairs</td>
          <td>From ChEMBL (Feb 2023 dump)</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>PubChem Drug Instruction Tuning</td>
          <td>6,942 drugs, 13,818 QA pairs</td>
          <td>From PubChem (May 2023)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>GNN</strong>: Pre-trained model from Hu et al. (2020), &ldquo;Strategies for Pre-training Graph Neural Networks&rdquo;</li>
<li><strong>Adaptor</strong>: Single linear transformation matrix (only trainable component)</li>
<li><strong>Loss</strong>: Negative log-likelihood between generated and ground-truth answers</li>
<li><strong>Training</strong>: Only adaptor weights updated; GNN and LLM weights frozen</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Model</th>
          <th>Parameters</th>
          <th>Status</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GNN Encoder</td>
          <td>Pre-trained GNN (Hu et al., 2020)</td>
          <td>Not specified</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>LLM</td>
          <td>Vicuna-13B</td>
          <td>~13B</td>
          <td>Frozen during training</td>
      </tr>
      <tr>
          <td>Adaptor</td>
          <td>Linear projection</td>
          <td>Not specified</td>
          <td>Trained</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>No quantitative evaluation metrics are reported. The paper provides only qualitative demonstrations on unseen compounds.</p>
<h3 id="hardware">Hardware</h3>
<p>No hardware specifications are reported for training or inference.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/UCSD-AI4H/drugchat">DrugChat Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation (repository returned 404 as of March 2026)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liang, Y., Zhang, R., Zhang, L., &amp; Xie, P. (2023). DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs. <em>arXiv preprint arXiv:2309.03907</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{liang2023drugchat,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liang, Youwei and Zhang, Ruiyi and Zhang, Li and Xie, Pengtao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.03907}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DrugAssist: Interactive LLM Molecule Optimization</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugassist-llm-molecule-optimization/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/drugassist-llm-molecule-optimization/</guid><description>DrugAssist fine-tunes Llama2-7B-Chat for interactive molecule optimization via natural language dialogue, releasing the MolOpt-Instructions dataset.</description><content:encoded><![CDATA[<h2 id="an-interactive-llm-for-molecule-optimization">An Interactive LLM for Molecule Optimization</h2>
<p>DrugAssist is a <strong>Method</strong> paper that proposes an interactive molecule optimization model built by fine-tuning Llama2-7B-Chat with LoRA on a newly constructed instruction dataset. The primary contribution is twofold: (1) the MolOpt-Instructions dataset containing over one million molecule pairs with six molecular properties and three optimization task categories, and (2) a dialogue-based molecule optimization system that allows domain experts to iteratively refine molecular modifications through multi-turn natural language conversations.</p>
<h2 id="why-interactive-molecule-optimization-matters">Why Interactive Molecule Optimization Matters</h2>
<p>Molecule optimization is a core step in the drug discovery pipeline, where lead compounds must be modified to improve specific pharmacological properties while maintaining structural similarity. Existing approaches fall into sequence-based methods (treating <a href="/notes/computational-chemistry/molecular-representations/">SMILES</a> optimization as machine translation) and graph-based methods (graph-to-graph translation), but they share a critical limitation: they are non-interactive. These models learn patterns from chemical structure data without incorporating expert feedback.</p>
<p>The drug discovery process is inherently iterative and requires integrating domain expertise. Medicinal chemists typically refine candidates through repeated cycles of suggestion, evaluation, and adjustment. Prior LLM-based approaches like <a href="/notes/computational-chemistry/llms-for-chemistry/chatdrug-conversational-drug-editing/">ChatDrug</a> relied on prompt engineering with general-purpose models (GPT-3.5-turbo) rather than fine-tuning, limiting their optimization accuracy. Additionally, most existing molecule optimization benchmarks focus on single-property optimization with vague objectives (e.g., &ldquo;maximize QED&rdquo;), while real-world drug design requires optimizing property values within specific ranges across multiple properties simultaneously.</p>
<h2 id="instruction-based-fine-tuning-with-molopt-instructions">Instruction-Based Fine-Tuning with MolOpt-Instructions</h2>
<p>The core innovation has two components: the MolOpt-Instructions dataset construction pipeline and the multi-task instruction tuning strategy.</p>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>MolOpt-Instructions is built from one million molecules randomly sampled from the <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC database</a>. The construction workflow uses mmpdb (an open-source Matched Molecular Pair platform) to generate structurally similar molecule pairs through <a href="https://en.wikipedia.org/wiki/Matched_molecular_pair_analysis">Matched Molecular Pair Analysis (MMPA)</a>. Pairs are filtered to satisfy two criteria: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> greater than 0.65 and <a href="https://en.wikipedia.org/wiki/Partition_coefficient">logP</a> difference greater than 2.5. Property values for six properties (Solubility, BBBP, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG</a> inhibition, QED, hydrogen bond donor count, and hydrogen bond acceptor count) are computed using Tencent&rsquo;s iDrug platform. The final dataset contains 1,029,949 unique pairs covering 1,595,839 unique molecules, with mean similarity of 0.69 and mean logP difference of 2.82.</p>
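<p>A toy sketch of the pair filter, with fingerprints represented as sets of on-bit indices. In practice the fingerprints and logP values come from cheminformatics tooling (e.g. RDKit) and the candidate pairs from mmpdb; this only illustrates the two cutoffs:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprints
    represented as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def keep_pair(fp_a, fp_b, logp_a, logp_b,
              sim_cutoff=0.65, logp_cutoff=2.5):
    """MolOpt-Instructions pair filter: keep pairs that are
    structurally similar (Tanimoto > 0.65) yet differ in logP
    by more than 2.5."""
    return (tanimoto(fp_a, fp_b) > sim_cutoff
            and abs(logp_a - logp_b) > logp_cutoff)
```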
<p>Three categories of optimization tasks are defined:</p>
<ul>
<li><strong>Loose</strong>: Increase or decrease a given property value (no threshold)</li>
<li><strong>Strict</strong>: Increase or decrease by at least a specified threshold</li>
<li><strong>Range</strong>: Optimize the property value to fall within a given interval</li>
</ul>
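<p>The three success criteria can be written as a small checker. This sketch covers the "increase" direction of a property; the function and argument names are illustrative, not taken from the released code:</p>

```python
def meets_objective(task, old_value, new_value,
                    threshold=None, low=None, high=None):
    """Success criterion for the three MolOpt-Instructions task
    categories, for an 'increase'-direction property."""
    if task == "loose":      # any increase counts
        return new_value > old_value
    if task == "strict":     # increase by at least `threshold`
        return new_value >= old_value + threshold
    if task == "range":      # land inside [low, high]
        return low <= new_value <= high
    raise ValueError(f"unknown task category: {task}")
```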
<p>Instruction templates are generated with ChatGPT assistance and manually refined. To ensure balance, source and target molecules are swapped for some pairs to maintain a roughly 1:1 ratio of property increases to decreases.</p>
<p>Murcko scaffold analysis confirms chemical diversity: the average number of molecules per scaffold is 2.95, and over 93.7% of scaffolds contain no more than five molecules.</p>
<h3 id="multi-task-instruction-tuning">Multi-Task Instruction Tuning</h3>
<p>The model is fine-tuned on Llama2-7B-Chat using LoRA (rank 64, alpha 128). To prevent catastrophic forgetting of general language capabilities, the training data combines MolOpt-Instructions with the Stanford Alpaca dataset (52k instruction-following examples, replicated 5x to balance the mixture). The training objective minimizes the negative log-likelihood over the response tokens:</p>
<p>$$L(R; \boldsymbol{\theta}) = -\sum_{u_i \in R} \log \Phi(u_i \mid u_{&lt;i}, I)$$</p>
<p>where $I$ is the instruction, $R$ is the response, and $\Phi$ is the model&rsquo;s conditional probability.</p>
<p>Training runs for 10 epochs with batch size 512, using AdamW ($\beta = (0.9, 0.999)$), learning rate 1e-4, 3% warm-up steps with cosine decay, and no weight decay. The data is split 90/5/5 for train/validation/test.</p>
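<p>The loss is computed only over response tokens; instruction tokens are masked out. A minimal sketch of this masked negative log-likelihood, assuming per-token target probabilities are already available:</p>

```python
import math

def response_nll(token_probs, response_mask):
    """L(R; theta) = -sum_{u_i in R} log Phi(u_i | u_<i, I).
    `token_probs[i]` is the model's probability of the i-th target
    token; `response_mask[i]` is True where that token belongs to the
    response R (instruction tokens I contribute no loss)."""
    return -sum(math.log(p)
                for p, in_response in zip(token_probs, response_mask)
                if in_response)
```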
<h2 id="experimental-setup-and-multi-property-optimization-results">Experimental Setup and Multi-Property Optimization Results</h2>
<h3 id="comparison-with-traditional-approaches">Comparison with Traditional Approaches</h3>
<p>DrugAssist is compared against Mol-Seq2Seq and Mol-Transformer (He et al., 2021) on simultaneous Solubility and BBBP optimization with range constraints. The evaluation prompt asks the model to generate an optimized molecule with solubility within a given range and BBBP category changed from one level to another.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Solubility</th>
          <th>BBBP</th>
          <th>Both</th>
          <th>Valid Rate</th>
          <th>Similarity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mol-Seq2Seq</td>
          <td>0.46</td>
          <td>0.55</td>
          <td>0.35</td>
          <td>0.76</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>Mol-Transformer</td>
          <td>0.70</td>
          <td>0.78</td>
          <td>0.59</td>
          <td>0.96</td>
          <td>0.70</td>
      </tr>
      <tr>
          <td>DrugAssist</td>
          <td>0.74</td>
          <td>0.80</td>
          <td>0.62</td>
          <td>0.98</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist achieves the highest success rates in both single-property and multi-property optimization while maintaining high validity (0.98) and comparable structural similarity (0.69).</p>
<h3 id="comparison-with-llms">Comparison with LLMs</h3>
<p>DrugAssist is compared against Llama2-7B-Chat, GPT-3.5-turbo (via ChatDrug), and BioMedGPT-LM-7B on 16 tasks covering all three optimization categories. These comparisons use multi-turn dialogues following the ChatDrug protocol: if the model&rsquo;s output fails to meet requirements, a database-retrieved molecule meeting the criteria and similar to the model&rsquo;s output is provided as a hint for iterative refinement.</p>
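<p>That refinement protocol can be sketched as a loop. All callables here are hypothetical stand-ins for the LLM call, the similarity search over the property database, and the property check — none are real APIs from the paper's code:</p>

```python
def optimize_with_hints(model, retrieve_hint, objective, molecule, max_turns=3):
    """ChatDrug-style multi-turn refinement: if the model's proposal
    fails the objective, retrieve a database molecule that satisfies
    the objective and resembles the failed proposal, then feed it back
    as a hint in the next turn."""
    hint = None
    for _ in range(max_turns):
        candidate = model(molecule, hint)
        if objective(candidate):
            return candidate
        hint = retrieve_hint(candidate)
    return None  # no compliant molecule within the turn budget
```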
<p>Selected results on single-property tasks (valid ratio / correct ratio, loose/strict):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QED+</td>
          <td>0.17 / 0.16</td>
          <td>0.15 / 0.15</td>
          <td>0.15 / 0.09</td>
          <td>0.76 / 0.63</td>
      </tr>
      <tr>
          <td>Acceptor+</td>
          <td>0.08 / 0.08</td>
          <td>0.04 / 0.06</td>
          <td>0.18 / 0.13</td>
          <td>0.71 / 0.67</td>
      </tr>
      <tr>
          <td>Donor+</td>
          <td>0.15 / 0.08</td>
          <td>0.10 / 0.04</td>
          <td>0.17 / 0.09</td>
          <td>0.72 / 0.76</td>
      </tr>
      <tr>
          <td>Solubility+</td>
          <td>0.36 / 0.20</td>
          <td>0.16 / 0.05</td>
          <td>0.18 / 0.09</td>
          <td>0.80 / 0.41</td>
      </tr>
      <tr>
          <td>BBBP+</td>
          <td>0.19 / 0.14</td>
          <td>0.10 / 0.10</td>
          <td>0.16 / 0.07</td>
          <td>0.82 / 0.61</td>
      </tr>
      <tr>
          <td>hERG-</td>
          <td>0.39 / 0.31</td>
          <td>0.13 / 0.15</td>
          <td>0.13 / 0.12</td>
          <td>0.71 / 0.67</td>
      </tr>
  </tbody>
</table>
<p>Multi-property tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Llama2-7B-Chat</th>
          <th>GPT-3.5-turbo</th>
          <th>BioMedGPT-LM</th>
          <th>DrugAssist</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sol+ &amp; Acc+</td>
          <td>0.15 / 0.04</td>
          <td>0.09 / 0.02</td>
          <td>0.10 / 0.07</td>
          <td>0.50 / 0.27</td>
      </tr>
      <tr>
          <td>QED+ &amp; BBBP+</td>
          <td>0.14 / 0.09</td>
          <td>0.09 / 0.06</td>
          <td>0.16 / 0.11</td>
          <td>0.65 / 0.41</td>
      </tr>
  </tbody>
</table>
<p>DrugAssist outperforms all baselines across every task. BioMedGPT-LM frequently misunderstands the task, generating guidance text rather than molecules. GPT-3.5-turbo achieves high validity but often outputs the input molecule unchanged.</p>
<h2 id="transferability-iterative-refinement-and-limitations">Transferability, Iterative Refinement, and Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Zero-shot transferability</strong>: Although DrugAssist trains on single-property optimization data, it successfully handles multi-property optimization requests at inference time. In a case study, the model simultaneously increased both BBBP and QED by at least 0.1 while maintaining structural similarity, without any multi-property training examples.</p>
<p><strong>Few-shot generalization</strong>: DrugAssist optimizes properties not seen during training (e.g., logP) when provided with a few in-context examples of successful optimizations, a capability that traditional sequence-based or graph-based models cannot achieve without retraining.</p>
<p><strong>Iterative optimization</strong>: When an initial optimization fails to meet requirements, DrugAssist can incorporate feedback (a database-retrieved hint molecule) and modify different functional groups in a second attempt to produce a compliant molecule.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge that DrugAssist's success rate drops on the hardest setting: solubility optimization succeeds at only 0.41 under strict criteria, versus 0.80 under loose criteria. The model also depends on iDrug to predict Solubility, BBBP, and hERG inhibition, so its optimization quality is bounded by the accuracy of those property predictors. The LLM comparisons use only 500 test molecules, a relatively small evaluation set, and the paper reports no statistical significance tests or confidence intervals.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors plan to improve multimodal data handling to reduce hallucination problems and to further enhance DrugAssist&rsquo;s interactive capabilities for better understanding of user needs and feedback.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>MolOpt-Instructions</td>
          <td>1,029,949 molecule pairs</td>
          <td>Sourced from ZINC via mmpdb; 6 properties</td>
      </tr>
      <tr>
          <td>Training (auxiliary)</td>
          <td>Stanford Alpaca</td>
          <td>52k instructions (5x replicated)</td>
          <td>Mitigates catastrophic forgetting</td>
      </tr>
      <tr>
          <td>Evaluation (traditional)</td>
          <td>From He et al. (2021)</td>
          <td>Not specified</td>
          <td>Multi-property optimization test</td>
      </tr>
      <tr>
          <td>Evaluation (LLM)</td>
          <td>ZINC subset</td>
          <td>500 molecules</td>
          <td>Randomly selected</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Base model</strong>: Llama2-7B-Chat</li>
<li><strong>Fine-tuning</strong>: LoRA with rank 64, alpha 128</li>
<li><strong>Optimizer</strong>: AdamW, $\beta = (0.9, 0.999)$, lr = 1e-4, no weight decay</li>
<li><strong>Schedule</strong>: 3% warm-up, cosine decay</li>
<li><strong>Epochs</strong>: 10</li>
<li><strong>Batch size</strong>: 512</li>
<li><strong>Property calculation</strong>: iDrug (Solubility, BBBP, hERG); RDKit (H-bond donors/acceptors, QED)</li>
<li><strong>Molecular pairs</strong>: mmpdb for Matched Molecular Pair Analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Fine-tuned Llama2-7B-Chat with LoRA adapters</li>
<li>No pre-trained weights released (code and data available)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success rate</td>
          <td>Fraction of molecules meeting optimization criteria</td>
      </tr>
      <tr>
          <td>Valid rate</td>
          <td>Fraction of generated SMILES that parse as valid molecules</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Tanimoto similarity between input and optimized molecules</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8 NVIDIA Tesla A100-SXM4-40GB GPUs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">DrugAssist Code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/blazerye/DrugAssist">MolOpt-Instructions</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>1M+ molecule pairs, 6 properties</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ye, G., Cai, X., Lai, H., Wang, X., Huang, J., Wang, L., Liu, W., &amp; Zeng, X. (2024). DrugAssist: A Large Language Model for Molecule Optimization. <em>Briefings in Bioinformatics</em>, 26(1), bbae693.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ye2024drugassist,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DrugAssist: A Large Language Model for Molecule Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Geyan and Cai, Xibao and Lai, Houtim and Wang, Xing and Huang, Junhong and Wang, Longyue and Liu, Wei and Zeng, Xiangxiang}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{26}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae693}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae693}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChemCrow: Augmenting LLMs with 18 Chemistry Tools</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chemcrow-augmenting-llms-chemistry-tools/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chemcrow-augmenting-llms-chemistry-tools/</guid><description>ChemCrow integrates 18 expert-designed chemistry tools with GPT-4 to enable autonomous synthesis planning, drug discovery, and materials design tasks.</description><content:encoded><![CDATA[<h2 id="an-llm-powered-chemistry-agent">An LLM-Powered Chemistry Agent</h2>
<p>This is a <strong>Method</strong> paper that introduces ChemCrow, an LLM chemistry agent that augments GPT-4 with 18 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. Rather than relying on the LLM&rsquo;s internal knowledge (which is often inaccurate for chemistry), ChemCrow uses the LLM as a reasoning engine that iteratively calls specialized tools to gather information, plan actions, and execute experiments. The system successfully planned and executed real-world chemical syntheses on a robotic platform, making it one of the first LLM-based chemistry agents to interact with the physical world.</p>
<h2 id="bridging-llm-reasoning-and-chemical-expertise">Bridging LLM Reasoning and Chemical Expertise</h2>
<p>Large language models have transformed many domains, but they struggle with chemistry-specific problems. GPT-4 cannot reliably perform basic operations like multiplying large numbers, converting <a href="https://en.wikipedia.org/wiki/IUPAC_nomenclature_of_chemistry">IUPAC names</a> to molecular structures, or predicting reaction outcomes. These limitations stem from the models&rsquo; token-prediction design, which does not encode chemical reasoning or factual chemical knowledge reliably.</p>
<p>Meanwhile, the chemistry community has developed numerous specialized computational tools for reaction prediction, <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/">retrosynthesis</a> planning, molecular property prediction, and de novo molecular generation. These tools exist in isolated environments with steep learning curves, making them difficult for experimental chemists to integrate and use together. The gap between LLM reasoning capabilities and specialized chemistry tools presents an opportunity: augmenting LLMs with these tools could compensate for the models&rsquo; chemical knowledge deficiencies while providing a natural language interface to specialized computational chemistry capabilities.</p>
<h2 id="tool-augmented-reasoning-via-react">Tool-Augmented Reasoning via ReAct</h2>
<p>ChemCrow builds on the ReAct (Reasoning and Acting) framework, where the LLM follows an iterative Thought &rarr; Action &rarr; Action Input &rarr; Observation loop. At each step, the model reasons about the current state of the task, selects an appropriate tool, provides its input, pauses while the tool executes, and then incorporates the observation before deciding on the next step. This continues until a final answer is reached.</p>
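<p>The loop can be sketched in a few lines. The stub LLM and tool registry below are illustrative stand-ins, not ChemCrow&rsquo;s actual code (the real system wires GPT-4 and its tools together via LangChain):</p>

```python
# Illustrative ReAct-style loop. `fake_llm` stands in for the GPT-4 call;
# a real agent would send the transcript to the model and parse its
# Thought / Action / Action Input reply.

def fake_llm(transcript):
    if "Observation: 180.16" in transcript:  # tool result already seen
        return {"thought": "I now know the weight.",
                "action": "Final Answer", "input": "180.16 g/mol"}
    return {"thought": "Look up the molecular weight.",
            "action": "SMILES2Weight", "input": "CC(=O)Oc1ccccc1C(=O)O"}

TOOLS = {"SMILES2Weight": lambda smiles: "180.16"}  # stub tool registry

def react_loop(task, llm, tools, max_steps=5):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += f"Thought: {step['thought']}\nAction: {step['action']}\n"
        if step["action"] == "Final Answer":
            return step["input"]
        # Run the chosen tool and feed its output back as an observation.
        observation = tools[step["action"]](step["input"])
        transcript += f"Action Input: {step['input']}\nObservation: {observation}\n"
    return None

print(react_loop("What is the molecular weight of aspirin?", fake_llm, TOOLS))
```

<p>The &ldquo;Final Answer&rdquo; branch terminates the loop; in ChemCrow, execution-type actions are additionally gated by the safety tools described below.</p>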
<p>The system integrates 18 tools organized into four categories:</p>
<p><strong>General tools</strong> include web search (via SerpAPI), literature search (using paper-qa with OpenAI embeddings and FAISS), a Python REPL for arbitrary code execution, and a human interaction interface.</p>
<p><strong>Molecule tools</strong> cover Name2SMILES (converting molecule names to <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> via Chem-Space, PubChem, and OPSIN), SMILES2Price (checking purchasability via molbloom and ZINC20), Name2CAS (CAS number lookup via PubChem), molecular Similarity (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> with ECFP2 fingerprints), ModifyMol (local chemical space exploration via SynSpace), PatentCheck (bloom filter patent lookup via molbloom), FuncGroups (functional group identification via SMARTS patterns), and SMILES2Weight (molecular weight calculation via RDKit).</p>
<p><strong>Safety tools</strong> include ControlledChemicalCheck (screening against chemical weapons lists from <a href="https://en.wikipedia.org/wiki/Organisation_for_the_Prohibition_of_Chemical_Weapons">OPCW</a> and the Australia Group), ExplosiveCheck (GHS explosive classification via PubChem), and SafetySummary (comprehensive safety overview from PubChem data).</p>
<p><strong>Chemical reaction tools</strong> include NameRXN (reaction classification via NextMove Software), ReactionPredict (product prediction via IBM&rsquo;s RXN4Chemistry API using the <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/molecular-transformer/">Molecular Transformer</a>), ReactionPlanner (multi-step synthesis planning via RXN4Chemistry), and ReactionExecute (direct synthesis execution on IBM&rsquo;s RoboRXN robotic platform).</p>
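<p>Several of the molecule tools above reduce to classical cheminformatics operations. For instance, Tanimoto similarity over fingerprint bit sets is just intersection over union; a dependency-free sketch (a real implementation would derive the on-bit sets from ECFP2/Morgan fingerprints with RDKit):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets.

    In practice the sets would hold the on-bit indices of ECFP2
    (Morgan radius-1) fingerprints computed with RDKit; plain Python
    sets stand in for them here.
    """
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Toy on-bit index sets for two hypothetical molecules:
print(tanimoto({1, 4, 7, 9}, {1, 4, 8}))  # 2 shared bits / 5 total = 0.4
```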
<p>A key design feature is that safety checks are automatically invoked before synthesis execution. If a molecule is flagged as a controlled chemical or precursor, execution stops immediately.</p>
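<p>That gating behavior amounts to a simple guard before execution; the function and screening set below are hypothetical placeholders, not ChemCrow&rsquo;s actual interfaces:</p>

```python
def execute_synthesis(smiles, controlled_set, run_synthesis):
    """Refuse to run a synthesis for molecules on a controlled list.

    `controlled_set` stands in for the OPCW / Australia Group screening
    data and `run_synthesis` for the downstream execution tool; both
    are illustrative placeholders.
    """
    if smiles in controlled_set:
        raise PermissionError(f"Refusing controlled chemical: {smiles}")
    return run_synthesis(smiles)
```

<p>Because the check runs before the execution tool is ever invoked, a flagged molecule stops the pipeline rather than producing a procedure that must be caught later.</p>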
<h2 id="experimental-validation-and-evaluation">Experimental Validation and Evaluation</h2>
<h3 id="autonomous-synthesis">Autonomous Synthesis</h3>
<p>ChemCrow autonomously planned and executed four real-world syntheses on the IBM RoboRXN cloud-connected robotic platform:</p>
<ul>
<li><strong><a href="https://en.wikipedia.org/wiki/DEET">DEET</a></strong> (insect repellent), from the prompt &ldquo;Plan and execute the synthesis of an insect repellent&rdquo;</li>
<li><strong>Three <a href="https://en.wikipedia.org/wiki/Thiourea">thiourea</a> <a href="https://en.wikipedia.org/wiki/Organocatalysis">organocatalysts</a></strong> (Schreiner&rsquo;s, Ricci&rsquo;s, and Takemoto&rsquo;s catalysts), from a prompt asking to find and synthesize a thiourea organocatalyst that accelerates the <a href="https://en.wikipedia.org/wiki/Diels%E2%80%93Alder_reaction">Diels-Alder reaction</a></li>
</ul>
<p>All four syntheses yielded the anticipated compounds. ChemCrow demonstrated the ability to autonomously adapt synthesis procedures when the RoboRXN platform flagged issues (such as insufficient solvent or invalid purification actions), iteratively modifying the procedure until it was valid.</p>
<h3 id="novel-chromophore-discovery">Novel Chromophore Discovery</h3>
<p>In a human-AI collaboration scenario, ChemCrow was instructed to train a machine learning model to screen candidate <a href="https://en.wikipedia.org/wiki/Chromophore">chromophores</a>. The system loaded and cleaned data from a chromophore database, trained and evaluated a random forest model, and suggested a molecule with a target absorption maximum of 369 nm. The proposed molecule was subsequently synthesized and characterized, revealing a measured absorption maximum of 336 nm, confirming the discovery of a new chromophore.</p>
<h3 id="expert-vs-llm-evaluation">Expert vs. LLM Evaluation</h3>
<p>The evaluation used 14 use cases spanning synthesis planning, molecular design, and chemical logic. Both ChemCrow and standalone GPT-4 (without tools) were evaluated by:</p>
<ol>
<li><strong>Expert human evaluators</strong> (n=4): Assessed correctness of chemistry, quality of reasoning, and degree of task completion</li>
<li><strong>EvaluatorGPT</strong>: An LLM evaluator prompted to assess responses</li>
</ol>
<p>Key findings from the evaluation:</p>
<table>
  <thead>
      <tr>
          <th>Evaluator</th>
          <th>Preferred System</th>
          <th>Reasoning</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Human experts</td>
          <td>ChemCrow</td>
          <td>Better chemical accuracy and task completeness, especially on complex tasks</td>
      </tr>
      <tr>
          <td>EvaluatorGPT</td>
          <td>GPT-4</td>
          <td>Favored fluent, complete-sounding responses despite factual errors</td>
      </tr>
  </tbody>
</table>
<p>Human experts preferred ChemCrow across most tasks, with the exception of very simple tasks where GPT-4 could answer from memorized training data (e.g., synthesis of well-known molecules like paracetamol). GPT-4 without tools consistently produced hallucinations that appeared convincing but were factually incorrect upon expert inspection.</p>
<p>An important finding is that LLM-based evaluation (EvaluatorGPT) cannot replace expert human assessment for scientific tasks. The LLM evaluator lacks the domain knowledge needed to distinguish fluent but incorrect answers from accurate ones, rendering it unsuitable for benchmarking factuality in chemistry.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>ChemCrow demonstrates that augmenting LLMs with expert-designed tools transforms them from &ldquo;hyperconfident, typically wrong information sources&rdquo; into reasoning engines that can gather and act on accurate chemical information. The system lowers the barrier for non-experts to access computational chemistry tools through natural language while serving as an assistant to expert chemists.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Tool dependency</strong>: ChemCrow&rsquo;s performance is bounded by the quality and coverage of its tools. Improved synthesis engines would directly improve synthesis planning capabilities.</li>
<li><strong>Reasoning failures</strong>: Tools become useless if the LLM&rsquo;s reasoning about when and how to use them is flawed, or if garbage inputs are provided.</li>
<li><strong>Reproducibility</strong>: The API-based approach to closed-source LLMs (GPT-4) limits reproducibility of individual results. The authors note that open-source models could address this, potentially at the cost of reasoning quality.</li>
<li><strong>Evaluation scope</strong>: The 14 evaluation tasks, while diverse, represent a limited test set. Standardized benchmarks for LLM-based chemistry tools did not exist at the time of publication.</li>
<li><strong>Safety considerations</strong>: While safety tools prevent execution of controlled chemical syntheses, risks remain from inaccurate reasoning or tool outputs leading to suboptimal conclusions.</li>
</ul>
<p>The authors emphasize that ChemCrow&rsquo;s modular design allows easy extension with new tools, and that future integration of image-processing tools, additional language-based tools, and other capabilities could substantially enhance the system.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Chromophore screening</td>
          <td>DB for chromophore (Joung et al.)</td>
          <td>Not specified</td>
          <td>Used for training random forest model</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>14 expert-designed tasks</td>
          <td>14 tasks</td>
          <td>Spanning synthesis, molecular design, and chemical logic</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>OPCW Schedules 1-3, Australia Group lists</td>
          <td>Not specified</td>
          <td>Used for controlled chemical screening</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>LLM</strong>: GPT-4 with temperature 0.1</li>
<li><strong>Framework</strong>: LangChain for tool integration</li>
<li><strong>Reasoning</strong>: ReAct (Reasoning + Acting) framework with chain-of-thought prompting</li>
<li><strong>Synthesis planning</strong>: IBM RXN4Chemistry API (Molecular Transformer-based)</li>
<li><strong>Molecule similarity</strong>: Tanimoto similarity with ECFP2 fingerprints via RDKit</li>
<li><strong>Chemical space exploration</strong>: SynSpace with 50 robust medicinal chemistry reactions</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT-4 (OpenAI, closed-source) for reasoning</li>
<li>Random forest for chromophore screening (trained on the fly)</li>
<li>Molecular Transformer via RXN4Chemistry API for reaction prediction and retrosynthesis</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Human evaluation</strong>: 4 expert chemists rated responses on chemistry correctness, reasoning quality, and task completion</li>
<li><strong>LLM evaluation</strong>: EvaluatorGPT assessed responses (found unreliable for factuality)</li>
<li><strong>Experimental validation</strong>: 4 syntheses on RoboRXN platform, 1 novel chromophore characterization</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The system relies primarily on API calls to GPT-4 and RXN4Chemistry, so local compute requirements are minimal.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-public">chemcrow-public</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source implementation with 12 of 18 tools</td>
      </tr>
      <tr>
          <td><a href="https://github.com/ur-whitelab/chemcrow-runs">chemcrow-runs</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>All experiment outputs and evaluation data</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884639">Zenodo release (code)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Archived release v0.3.24</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10884645">Zenodo release (runs)</a></td>
          <td>Data</td>
          <td>Not specified</td>
          <td>Archived experiment runs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A. D., &amp; Schwaller, P. (2024). Augmenting large language models with chemistry tools. <em>Nature Machine Intelligence</em>, 6(5), 525-535.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{bran2024augmenting,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmenting large language models with chemistry tools}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bran, Andres M. and Cox, Sam and Schilter, Oliver and Baldassari, Carlo and White, Andrew D. and Schwaller, Philippe}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{525--535}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00832-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ChatDrug: Conversational Drug Editing with ChatGPT</title><link>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chatdrug-conversational-drug-editing/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/llms-for-chemistry/chatdrug-conversational-drug-editing/</guid><description>ChatDrug uses ChatGPT with retrieval and domain feedback for drug editing across small molecules, peptides, and proteins on 39 tasks.</description><content:encoded><![CDATA[<h2 id="a-framework-for-conversational-drug-editing-with-llms">A Framework for Conversational Drug Editing with LLMs</h2>
<p>This is a <strong>Method</strong> paper that introduces ChatDrug, a parameter-free framework for drug editing using conversational large language models (specifically ChatGPT/GPT-3.5). The primary contribution is a three-module pipeline that combines prompt engineering, retrieval-augmented domain feedback, and iterative conversation to perform text-guided editing of small molecules, peptides, and proteins. The paper also establishes a benchmark of 39 drug editing tasks spanning these three drug types.</p>
<h2 id="bridging-conversational-ai-and-drug-discovery">Bridging Conversational AI and Drug Discovery</h2>
<p>Drug editing (also called <a href="https://en.wikipedia.org/wiki/Hit_to_lead">lead optimization</a> or protein design) is a critical step in the drug discovery pipeline where molecular substructures are modified to achieve desired properties. Traditional approaches rely on domain experts for manual editing, which can be subjective and biased. Recent multi-modal approaches like MoleculeSTM and ProteinDT have started exploring text-guided drug editing, but they are domain-specific (limited to one drug type) and lack conversational capabilities for iterative refinement.</p>
<p>The authors identify three properties of conversational LLMs that make them suitable for drug discovery: (1) pretraining on comprehensive knowledge bases covering drug-related concepts, (2) strong few-shot adaptation and generalization abilities, and (3) interactive communication enabling iterative feedback incorporation. However, directly applying LLMs to drug editing yields suboptimal results because the models do not fully utilize prior domain knowledge. ChatDrug addresses this gap through structured retrieval and feedback mechanisms.</p>
<h2 id="three-module-pipeline-pdds-redf-and-conversation">Three-Module Pipeline: PDDS, ReDF, and Conversation</h2>
<p>ChatDrug consists of three modules that operate sequentially without any parameter learning.</p>
<h3 id="pdds-module-prompt-design-for-domain-specific">PDDS Module (Prompt Design for Domain-Specific)</h3>
<p>The PDDS module constructs domain-specific prompts for ChatGPT. Given an input drug $\pmb{x}_{\text{in}}$ and a text prompt $\pmb{x}_t$ describing the desired property change, the goal is:</p>
<p>$$
\pmb{x}_{\text{out}} = \text{ChatDrug}(\pmb{x}_{\text{in}}, \pmb{x}_t)
$$</p>
<p>The prompts are designed around high-level property descriptions (e.g., &ldquo;more soluble in water&rdquo;) rather than exact substructure replacements. The authors argue that ChatDrug is better suited for &ldquo;fuzzy searching&rdquo; (property-based editing with non-deterministic answers) rather than &ldquo;exact searching&rdquo; (precise substructure replacement that experts can do directly).</p>
<h3 id="redf-module-retrieval-and-domain-feedback">ReDF Module (Retrieval and Domain Feedback)</h3>
<p>The ReDF module retrieves structurally similar examples from a domain-specific database and injects them into the conversation as demonstrations. For an input drug $\pmb{x}_{\text{in}}$, a candidate drug $\tilde{\pmb{x}}$ that failed the desired property change, and a retrieval database, ReDF returns:</p>
<p>$$
\pmb{x}_R = \text{ReDF}(\pmb{x}_{\text{in}}, \tilde{\pmb{x}}; \pmb{x}_t) = \underset{\pmb{x}'_R \in \text{RetrievalDB}}{\arg\max} \langle \tilde{\pmb{x}}, \pmb{x}'_R \rangle \wedge D(\pmb{x}_{\text{in}}, \pmb{x}'_R; \pmb{x}_t)
$$</p>
<p>where $D(\cdot, \cdot; \cdot) \in \{\text{True}, \text{False}\}$ is a domain feedback function checking whether the retrieved drug satisfies the desired property change, and $\langle \tilde{\pmb{x}}, \pmb{x}'_R \rangle$ is a similarity function (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> for small molecules, <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> for peptides and proteins).</p>
<p>The retrieved example $\pmb{x}_R$ is injected into the prompt as: &ldquo;Your provided sequence [$\tilde{\pmb{x}}$] is not correct. We find a sequence [$\pmb{x}_R$] which is correct and similar to the molecule you provided. Can you give me a new molecule?&rdquo;</p>
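<p>Operationally, the argmax-with-constraint amounts to filtering the database by the feedback function $D$ and then taking the most similar remaining entry. A minimal sketch with caller-supplied similarity and feedback callables (names are illustrative):</p>

```python
def redf(candidate, retrieval_db, similarity, feedback_ok):
    """Return the database entry most similar to the failed candidate,
    restricted to entries satisfying the domain feedback function D.

    `similarity` would be Tanimoto similarity for small molecules or a
    Levenshtein-based score for sequences; `feedback_ok` plays the role
    of D. Both are supplied by the caller in this sketch.
    """
    passing = [x for x in retrieval_db if feedback_ok(x)]
    if not passing:
        return None  # no retrievable demonstration for this prompt
    return max(passing, key=lambda x: similarity(candidate, x))

# Toy usage: retrieve an oxygen-containing molecule, using shared-character
# count as a stand-in similarity function.
db = ["CCO", "CCC", "CCN"]
print(redf("CCCl", db, lambda a, b: len(set(a) & set(b)), lambda s: "O" in s))  # CCO
```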
<h3 id="conversation-module">Conversation Module</h3>
<p>The conversation module enables iterative refinement over $C$ rounds. At each round $c$, if the edited drug $\pmb{x}_c$ does not satisfy the evaluation condition, ChatDrug retrieves a new example via ReDF using $\tilde{\pmb{x}} = \pmb{x}_c$ and continues the conversation. This aligns with the iterative nature of real drug discovery workflows.</p>
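<p>The three modules compose into a simple loop; the sketch below uses placeholder callables for the ChatGPT edit step, the evaluation oracle, and ReDF retrieval:</p>

```python
def chatdrug_loop(x_in, prompt, edit, satisfied, redf_retrieve, rounds=2):
    """Iterate edit -> check -> retrieve-and-re-prompt for up to `rounds`
    conversation rounds. `edit` stands in for the ChatGPT call,
    `satisfied` for the evaluation condition, and `redf_retrieve` for
    the ReDF module; all are placeholders in this sketch.
    """
    x_c = edit(x_in, prompt, example=None)      # zero-shot first attempt
    for _ in range(rounds):
        if satisfied(x_c):
            return x_c
        x_r = redf_retrieve(x_in, x_c)          # similar, property-satisfying demo
        x_c = edit(x_in, prompt, example=x_r)   # re-prompt with the demonstration
    return x_c if satisfied(x_c) else None
```

<p>With <code>rounds=2</code> this mirrors the $C = 2$ setting used for the paper&rsquo;s main results.</p>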
<h2 id="experiments-across-39-drug-editing-tasks">Experiments Across 39 Drug Editing Tasks</h2>
<h3 id="task-design">Task Design</h3>
<p>The benchmark includes 39 tasks across three drug types:</p>
<ul>
<li><strong>Small molecules</strong> (28 tasks): 16 single-objective (tasks 101-108, each with loose and strict thresholds) and 12 multi-objective tasks (tasks 201-206, each with two thresholds). Properties include solubility (<a href="https://en.wikipedia.org/wiki/Partition_coefficient">LogP</a>), drug-likeness (QED), permeability (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">tPSA</a>), <a href="https://en.wikipedia.org/wiki/Hydrogen_bond">hydrogen bond</a> acceptors/donors.</li>
<li><strong>Peptides</strong> (9 tasks): 6 single-objective and 3 multi-objective tasks for editing <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex">peptide-MHC binding</a> affinity across different <a href="https://en.wikipedia.org/wiki/Human_leukocyte_antigen">HLA allele</a> types.</li>
<li><strong>Proteins</strong> (2 tasks): Editing protein sequences to increase <a href="https://en.wikipedia.org/wiki/Alpha_helix">alpha-helix</a> or <a href="https://en.wikipedia.org/wiki/Beta_sheet">beta-strand</a> secondary structures.</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>For small molecules, baselines include Random, PCA, High-Variance, and GS-Mutate (all based on MegaMolBART), plus MoleculeSTM with <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and Graph representations. For peptides and proteins, random mutation baselines with 1-3 mutated positions are used.</p>
<h3 id="main-results">Main Results</h3>
<p>ChatDrug achieves the best performance on 33 out of 39 tasks. Key results for small molecule editing (hit ratio):</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Property</th>
          <th>ChatDrug (loose)</th>
          <th>Best Baseline (loose)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>101</td>
          <td>More soluble</td>
          <td>94.13</td>
          <td>67.86 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>102</td>
          <td>Less soluble</td>
          <td>96.86</td>
          <td>64.79 (MoleculeSTM-Graph)</td>
      </tr>
      <tr>
          <td>106</td>
          <td>Lower permeability</td>
          <td>77.35</td>
          <td>34.13 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>107</td>
          <td>More HBA</td>
          <td>95.35</td>
          <td>54.01 (MoleculeSTM-SMILES)</td>
      </tr>
      <tr>
          <td>108</td>
          <td>More HBD</td>
          <td>96.54</td>
          <td>60.97 (MoleculeSTM-Graph)</td>
      </tr>
  </tbody>
</table>
<p>ChatDrug underperforms on tasks 104 (less like a drug) and 105 (higher permeability) and most multi-objective tasks involving permeability (205), where MoleculeSTM variants perform better.</p>
<p>For peptide editing, ChatDrug achieves 41-69% hit ratios compared to 0.4-14.4% for random mutation baselines. For protein editing, ChatDrug reaches 34.79% and 51.38% hit ratios on helix and strand tasks respectively, compared to 26.90% and 21.44% for the best random mutation baseline.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Conversation rounds</strong>: Performance increases with more rounds, converging around $C = 2$. For example, on task 101 (loose threshold), zero-shot achieves 78.26%, $C = 1$ reaches 89.56%, and $C = 2$ reaches 93.37%.</p>
<p><strong>ReDF threshold</strong>: Using a stricter threshold in the domain feedback function $D$ (matching the evaluation threshold) yields substantially higher performance than using a loose threshold. For example, on task 107 with strict evaluation, the strict-threshold ReDF achieves 72.60% vs. 14.96% for the loose-threshold ReDF.</p>
<p><strong>Similarity analysis</strong>: Retrieved molecules $\pmb{x}_R$ tend to have lower similarity to input molecules than the intermediate outputs $\pmb{x}_1$, yet they have higher hit ratios. This suggests the ReDF module explores the chemical space effectively, and the conversation module balances similarity preservation with property optimization.</p>
<p><strong>Knowledge extraction</strong>: ChatDrug can articulate domain-specific reasoning for its edits (e.g., summarizing rules for increasing water solubility by introducing polar functional groups), though the extracted knowledge shows some redundancy.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>ChatDrug demonstrates that conversational LLMs can serve as useful tools for drug editing, achieving strong results across diverse drug types with a parameter-free approach. The framework exhibits open vocabulary and compositional properties, allowing it to handle novel drug concepts and multi-objective tasks through natural language.</p>
<p>The authors acknowledge two main limitations. First, ChatDrug struggles with understanding complex 3D drug geometries, which would require deeper geometric modeling. Second, the framework requires multiple conversation rounds to achieve strong performance, adding computational cost through repeated API calls. The authors suggest that knowledge summarization capabilities of LLMs could help reduce this cost.</p>
<p>The evaluation relies entirely on computational oracles (RDKit for small molecules, MHCflurry 2.0 for peptides, ProteinCLAP for proteins) rather than wet-lab validation. The hit ratio metric also excludes invalid outputs from the denominator, so the effective success rate over all attempted edits may be lower than reported.</p>
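<p>The denominator convention matters when comparing reported numbers; a minimal illustration of the hit-ratio computation, with placeholder validity and property oracles:</p>

```python
def hit_ratio(outputs, is_valid, hits_property):
    """Fraction of *valid* outputs that satisfy the property change.

    Invalid outputs (e.g., unparsable SMILES) are dropped from the
    denominator, matching the paper's convention; `is_valid` and
    `hits_property` are placeholder oracles here.
    """
    valid = [o for o in outputs if is_valid(o)]
    if not valid:
        return 0.0
    return sum(hits_property(o) for o in valid) / len(valid)

outs = ["CCO", "???", "CCN", "CCC"]      # one unparsable output
is_ok = lambda s: "?" not in s
hit = lambda s: s.endswith(("O", "N"))   # toy property check
print(hit_ratio(outs, is_ok, hit))       # 2 hits / 3 valid = 0.666...
```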
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Small molecule inputs</td>
          <td><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC</a></td>
          <td>200 molecules</td>
          <td>Sampled SMILES strings</td>
      </tr>
      <tr>
          <td>Small molecule retrieval DB</td>
          <td>ZINC</td>
          <td>10K molecules</td>
          <td>For ReDF similarity search</td>
      </tr>
      <tr>
          <td>Peptide inputs</td>
          <td>Peptide-MHC binding dataset</td>
          <td>500 peptides per task</td>
          <td>From 30 common MHC alleles</td>
      </tr>
      <tr>
          <td>Peptide retrieval DB</td>
          <td>Experimental binding data</td>
          <td>Varies by allele</td>
          <td>Target allele experimental data</td>
      </tr>
      <tr>
          <td>Protein inputs</td>
          <td>TAPE test set</td>
          <td>Varies</td>
          <td>Secondary structure prediction test data</td>
      </tr>
      <tr>
          <td>Protein retrieval DB</td>
          <td>TAPE training set</td>
          <td>Varies</td>
          <td>Secondary structure prediction training data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-3.5-turbo via OpenAI ChatCompletion API, temperature=0, frequency_penalty=0.2</li>
<li>System prompt: &ldquo;You are an expert in the field of molecular chemistry.&rdquo;</li>
<li>$C = 2$ conversation rounds for main results</li>
<li>5 random seeds (0-4) for small molecule main results, seed 0 for ablations</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>ChatGPT (GPT-3.5-turbo): used as-is, no fine-tuning</li>
<li>MHCflurry 2.0: pseudo-oracle for peptide binding affinity evaluation</li>
<li>ProteinCLAP-EBM-NCE from ProteinDT: protein secondary structure prediction</li>
<li>ESMFold: protein folding for visualization</li>
<li>RDKit: molecular property calculations for small molecules</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Hit Ratio</td>
          <td>Fraction of valid edits satisfying property requirements</td>
          <td>Invalid sequences excluded from denominator</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>All experiments were conducted on a single NVIDIA RTX A6000 GPU, which was needed only for peptide and protein evaluation. The total OpenAI API cost was less than $100.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/chao1224/ChatDrug">ChatDrug GitHub</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, S., Wang, J., Yang, Y., Wang, C., Liu, L., Guo, H., &amp; Xiao, C. (2024). Conversational Drug Editing Using Retrieval and Domain Feedback. <em>ICLR 2024</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2024chatdrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Conversational Drug Editing Using Retrieval and Domain Feedback}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Liu, Shengchao and Wang, Jiongxiao and Yang, Yijin and Wang, Chengpeng and Liu, Ling and Guo, Hongyu and Xiao, Chaowei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes four key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if a bond exists between } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a GNN variant, but one that can stack many layers (six in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
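<p>The visibility-matrix constraint above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors&rsquo; implementation; the [GLOBAL] supernode would simply appear as a fully connected row and column of the adjacency matrix:</p>

```python
import numpy as np

def local_attention(scores: np.ndarray, adjacency: np.ndarray) -> np.ndarray:
    """Mask scaled dot-product attention scores with the molecular
    adjacency (visibility) matrix so atoms attend only over bonds.

    scores:    (n_atoms, n_atoms) raw attention scores A_ij
    adjacency: (n_atoms, n_atoms) 1 where a bond (or self-loop) exists
    """
    masked = np.where(adjacency > 0, scores, -np.inf)
    # Row-wise softmax over the visible (bonded) positions only.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy 3-atom chain A-B-C with self-loops: A and C are not bonded,
# so each receives exactly zero attention weight from the other.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]])
w = local_attention(np.random.randn(3, 3), adj)
print(w[0, 2])  # 0.0 -- no bond between atoms 0 and 2
```

<p>Because masked positions hold $-\infty$ before the softmax, they contribute exactly zero weight, which is how the adjacency matrix turns global attention into bond-local message passing.</p>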
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
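<p>The selection-and-corruption step above can be sketched as follows (a hypothetical helper operating on a plain atom-token list, not the paper&rsquo;s code):</p>

```python
import random

ATOM_VOCAB = ["[H]", "[C]", "[N]", "[O]", "[F]", "[S]", "[Cl]", "[P]",
              "[Br]", "[B]", "[I]", "[Si]", "[Se]"]

def mask_atoms(atoms, mask_rate=0.15, seed=None):
    """BERT-style masking on an atom-token list: select ~15% of
    positions (at least one), then 80/10/10 replace with [MASK] /
    a random atom type / the unchanged token. Returns the corrupted
    token list and the positions where the loss is computed."""
    rng = random.Random(seed)
    n = max(1, round(len(atoms) * mask_rate))
    positions = rng.sample(range(len(atoms)), n)
    corrupted = list(atoms)
    for pos in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[pos] = "[MASK]"
        elif r < 0.9:
            corrupted[pos] = rng.choice(ATOM_VOCAB)
        # else: token kept unchanged, but the model must still predict it
    return corrupted, set(positions)

tokens, targets = mask_atoms(["[C]", "[C]", "[O]", "[H]", "[H]"], seed=0)
```

<p>For a five-atom molecule, 15% rounds to a single masked position, matching the at-least-one-atom rule above.</p>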
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
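<p>A stratified 8:1:1 split of this shape might look like the following sketch (the 10-character length-bin width is an assumption; the paper does not give bin edges):</p>

```python
import random

def stratified_split_by_length(smiles_list, seed=0):
    """8:1:1 train/val/test split, stratified by SMILES length bins so
    each split sees a similar length distribution."""
    rng = random.Random(seed)
    bins = {}
    for s in smiles_list:
        bins.setdefault(len(s) // 10, []).append(s)
    train, val, test = [], [], []
    for members in bins.values():
        rng.shuffle(members)
        n_test = len(members) // 10
        n_val = len(members) // 10
        test.extend(members[:n_test])
        val.extend(members[n_test:n_test + n_val])
        train.extend(members[n_test + n_val:])
    return train, val, test

# 100 dummy SMILES with lengths 5..44 -> roughly 82/9/9 split
compounds = ["C" * (5 + i % 40) for i in range(100)]
train, val, test = stratified_split_by_length(compounds)
```

<p>Stratifying by length prevents a split from being dominated by unusually large or small molecules, which matters when each experiment is repeated over 10 random splits.</p>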
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification tasks, 21.28% on regression). Improvements were statistically significant (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acylchloride, nitrosamide, azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Maxsmi: SMILES Augmentation for Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/property-prediction/maxsmi-smiles-augmentation-property-prediction/</guid><description>Maxsmi systematically evaluates five SMILES augmentation strategies with CNN and RNN models across solubility, lipophilicity, and bioactivity tasks.</description><content:encoded><![CDATA[<h2 id="systematic-benchmarking-of-smiles-data-augmentation">Systematic Benchmarking of SMILES Data Augmentation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically evaluates how SMILES augmentation affects deep learning molecular property prediction. The primary contribution is a comprehensive comparison of five augmentation strategies across three neural network architectures and four datasets, producing the &ldquo;Maxsmi&rdquo; models that maximize prediction performance. The study also demonstrates that test-time augmentation provides a practical confidence measure for predictions.</p>
<h2 id="the-data-scarcity-problem-in-qsar-modeling">The Data Scarcity Problem in QSAR Modeling</h2>
<p>Deep learning models require large training sets to perform well, but experimental physico-chemical and bioactivity datasets remain small, typically ranging from hundreds to a few thousand compounds. SMILES augmentation, where the non-unique <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES representation</a> of a molecule is exploited to generate multiple training examples per compound, has been shown to help in prior work by Bjerrum (2017), Kimber et al. (2018), and Li and Fourches (2020). However, no prior study had systematically compared different augmentation strategies, analyzed how much augmentation is needed, or examined the relationship between augmentation factor and prediction confidence. Most previous work chose augmentation numbers a priori without justification. Maxsmi fills this gap by providing a systematic analysis and practical guidelines.</p>
<h2 id="five-augmentation-strategies-and-test-time-ensemble-learning">Five Augmentation Strategies and Test-Time Ensemble Learning</h2>
<p>The core insight is twofold. First, the authors define five distinct strategies for generating augmented SMILES:</p>
<ol>
<li><strong>No augmentation</strong>: use only the canonical SMILES (baseline)</li>
<li><strong>Augmentation with duplication</strong>: generate $m$ random SMILES per compound, allowing duplicates; dataset grows to $N \times m$</li>
<li><strong>Augmentation without duplication</strong>: generate $m$ random SMILES and discard exact duplicates</li>
<li><strong>Augmentation with reduced duplication</strong>: keep only $f(m) = \sqrt{m}$ copies of each duplicate, a compromise between the above</li>
<li><strong>Augmentation with estimated maximum</strong>: sample random SMILES until the same string has been generated 10 times, attempting to cover most of the valid SMILES space</li>
</ol>
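<p>Strategies 2&ndash;4 differ only in how duplicates among the $m$ enumerated SMILES are handled. A minimal sketch, operating on pre-generated random SMILES (e.g. from RDKit&rsquo;s <code>Chem.MolToSmiles(mol, doRandom=True)</code>); the function and strategy names are illustrative, not the paper&rsquo;s API:</p>

```python
import math
from collections import Counter

def augment(random_smiles, strategy, m):
    """Apply one duplicate-handling strategy to m randomly enumerated
    SMILES of a single compound."""
    if strategy == "with_dup":          # keep all m samples as-is
        return list(random_smiles)
    if strategy == "without_dup":       # order-preserving dedup
        return list(dict.fromkeys(random_smiles))
    if strategy == "reduced_dup":       # at most sqrt(m) copies each
        cap = max(1, int(math.sqrt(m)))
        counts, kept = Counter(), []
        for s in random_smiles:
            if counts[s] < cap:
                counts[s] += 1
                kept.append(s)
        return kept
    raise ValueError(strategy)

samples = ["CCO", "OCC", "CCO", "C(C)O", "CCO"]   # 5 draws for ethanol
print(augment(samples, "without_dup", 5))   # ['CCO', 'OCC', 'C(C)O']
print(augment(samples, "reduced_dup", 5))   # at most 2 copies of 'CCO'
```

<p>With $m = 5$, the reduced-duplication cap is $\lfloor\sqrt{5}\rfloor = 2$, so the third occurrence of <code>CCO</code> is dropped while both distinct alternatives survive.</p>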
<p>Second, the authors formalize test-time augmentation as ensemble learning. Given a trained model $M_{\Theta}$, each test compound $C$ is represented by $k$ random SMILES $S_1(C), \ldots, S_k(C)$. The per-SMILES predictions are:</p>
<p>$$
\hat{y}_i(C) = M_{\Theta}(S_i(C))
$$</p>
<p>The compound-level prediction is an aggregation (mean) over these:</p>
<p>$$
\hat{y}(C) = A\big(\hat{y}_1(C), \ldots, \hat{y}_k(C)\big)
$$</p>
<p>The standard deviation of the per-SMILES predictions serves as a confidence measure: high variance indicates the model is uncertain about a compound.</p>
<h2 id="experimental-design-three-architectures-four-datasets">Experimental Design: Three Architectures, Four Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size (after preprocessing)</th>
          <th>Train / Test</th>
          <th>Task</th>
          <th>Provenance</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>1,128</td>
          <td>902 / 226</td>
          <td>Water solubility</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></td>
      </tr>
      <tr>
          <td>ESOL_small</td>
          <td>1,068</td>
          <td>854 / 214</td>
          <td>Solubility (max 25 heavy atoms)</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>642</td>
          <td>513 / 129</td>
          <td>Hydration free energy</td>
          <td>MoleculeNet</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Lipophilicity">Lipophilicity</a></td>
          <td>4,199</td>
          <td>3,359 / 840</td>
          <td>Octanol/water distribution</td>
          <td><a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a></td>
      </tr>
      <tr>
          <td>Affinity (EGFR)</td>
          <td>5,849</td>
          <td>4,679 / 1,170</td>
          <td><a href="https://en.wikipedia.org/wiki/IC50">pIC50</a> against <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> kinase</td>
          <td>Kinodata</td>
      </tr>
  </tbody>
</table>
<h3 id="architectures">Architectures</h3>
<p>Three shallow neural networks are compared:</p>
<ul>
<li><strong>CONV1D</strong>: 1D convolution (kernel size 10, stride 1) followed by two fully connected layers</li>
<li><strong>CONV2D</strong>: 2D convolution on the one-hot encoded SMILES matrix, followed by two fully connected layers</li>
<li><strong>RNN</strong>: LSTM layer followed by two fully connected layers (128 and 64 units)</li>
</ul>
<p>All models are trained for 250 epochs with batch size 16, MSE loss, SGD optimizer, and learning rate 0.001. A Random Forest baseline with Morgan fingerprints (radius 2, length 1024) is also included.</p>
<h3 id="augmentation-sweep">Augmentation sweep</h3>
<p>The augmentation number $m$ is varied from 1 to 20 (step 1) and from 20 to 100 (step 10) for three strategies (with, without, and reduced duplication). The estimated maximum strategy is tested on the smaller datasets. Both training and test sets receive the same augmentation.</p>
<h2 id="key-findings-augmentation-consistently-improves-rmse">Key Findings: Augmentation Consistently Improves RMSE</h2>
<h3 id="augmentation-always-helps">Augmentation always helps</h3>
<p>Across all datasets and architectures, SMILES augmentation reduces test RMSE compared to the no-augmentation baseline. Performance improves sharply in the low augmentation range (1 to 10) and reaches a plateau around 40 to 70, after which additional augmentation provides diminishing returns.</p>
<h3 id="best-models-maxsmi">Best models (Maxsmi)</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Model</th>
          <th>Augmentation Number</th>
          <th>Strategy</th>
          <th>Test RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>Reduced duplication</td>
          <td>0.569</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>CONV1D</td>
          <td>70</td>
          <td>With duplication</td>
          <td>1.032</td>
      </tr>
      <tr>
          <td>Lipophilicity</td>
          <td>CONV1D</td>
          <td>80</td>
          <td>Without duplication</td>
          <td>0.593</td>
      </tr>
  </tbody>
</table>
<p>The CONV1D architecture consistently outperforms RNN and CONV2D. For ESOL, the CONV1D model improves from 0.839 RMSE (no augmentation) to 0.569 RMSE (70x reduced duplication), a 32% reduction.</p>
<h3 id="no-single-best-augmentation-strategy">No single best augmentation strategy</h3>
<p>The three main augmentation strategies (with, without, and reduced duplication) perform similarly. Generating the estimated maximum number of unique SMILES does not yield the best results, suggesting a saturation point exists where additional SMILES diversity stops helping.</p>
<h3 id="canonical-smiles-outperform-single-random-smiles">Canonical SMILES outperform single random SMILES</h3>
<p>When augmentation is limited to a single representation ($m = 1$), the canonical SMILES consistently outperforms a single random SMILES. On ESOL with CONV1D, the canonical model achieves 0.839 RMSE versus 0.964 for a random SMILES. The authors attribute this to the simpler, more readable structure of canonical SMILES (fewer branches and brackets).</p>
<h3 id="comparison-to-prior-work">Comparison to prior work</h3>
<table>
  <thead>
      <tr>
          <th>Study</th>
          <th>ESOL</th>
          <th>FreeSolv</th>
          <th>Lipophilicity</th>
          <th>Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Maxsmi</td>
          <td>0.569</td>
          <td>1.032</td>
          <td>0.593</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td>MoleculeNet</td>
          <td>0.58 +/- 0.03</td>
          <td>1.15 +/- 0.12</td>
          <td>0.655 +/- 0.036</td>
          <td>GNN</td>
      </tr>
      <tr>
          <td>CNF</td>
          <td>0.62</td>
          <td>1.11</td>
          <td>0.67</td>
          <td>CNN</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/molpmofit-transfer-learning-qsar/">MolPMoFiT</a></td>
          <td>N/A</td>
          <td>1.197 +/- 0.127</td>
          <td>0.565 +/- 0.037</td>
          <td>RNN</td>
      </tr>
  </tbody>
</table>
<p>Maxsmi outperforms or matches MoleculeNet&rsquo;s graph neural networks and the CNF model on all three tasks. MolPMoFiT slightly outperforms Maxsmi on lipophilicity (0.565 vs 0.593) but performs worse on FreeSolv.</p>
<h3 id="confidence-estimation">Confidence estimation</h3>
<p>The standard deviation of per-SMILES predictions correlates with prediction error. Confidence curves show that sequentially removing compounds with the highest uncertainty leads to monotonically decreasing mean prediction error. For ESOL, keeping only the top 10% most confident predictions yields errors below 0.25.</p>
<h3 id="egfr-affinity-test-case">EGFR affinity test case</h3>
<p>Applying the Maxsmi approach (CONV1D, 70x augmentation, reduced duplication) to EGFR kinase affinity prediction yields test RMSE of 0.777 and R2 of 0.712, compared to 1.031 RMSE and 0.494 R2 for the canonical model (a 25% RMSE improvement). The Random Forest baseline (0.758 RMSE, 0.726 R2) performs comparably, a result the authors note but do not explain.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>All experiments use a single train/test split (80/20) without cross-validation, due to the computational cost of the full augmentation sweep. This means reported RMSE values lack uncertainty estimates for the Maxsmi models.</li>
<li>The study uses shallow networks only. Whether the same augmentation benefits apply to deeper architectures or pre-trained models is untested.</li>
<li>The EGFR test case shows the Random Forest baseline performing comparably to the Maxsmi model, raising questions about when SMILES augmentation provides a meaningful advantage over traditional fingerprint-based methods.</li>
<li>The comparison to prior work uses different splits, preprocessing, and evaluation protocols across studies, which the authors acknowledge limits direct comparability.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>MoleculeNet, water solubility</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>MoleculeNet, hydration free energy</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Lipophilicity</td>
          <td>4,199</td>
          <td>ChEMBL, logD</td>
      </tr>
      <tr>
          <td>Test case</td>
          <td>EGFR Affinity</td>
          <td>5,849</td>
          <td>Kinodata (ChEMBL v28), pIC50</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available through MoleculeNet/DeepChem and Kinodata.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>SMILES generation via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>&rsquo;s random SMILES enumeration</li>
<li>One-hot encoding of SMILES characters with padding to max length</li>
<li>Five augmentation strategies applied to both training and test sets</li>
<li>Mean aggregation for compound-level predictions</li>
</ul>
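<p>A minimal sketch of the one-hot encoding step (the vocabulary and padding character here are illustrative; the actual character set is derived from the training data):</p>

```python
def one_hot_encode(smiles, vocab, max_len, pad_char=" "):
    # Pad the SMILES string to a fixed length, then map each
    # character to a one-hot row over the vocabulary.
    padded = smiles.ljust(max_len, pad_char)
    index = {ch: i for i, ch in enumerate(vocab)}
    matrix = []
    for ch in padded:
        row = [0] * len(vocab)
        row[index[ch]] = 1
        matrix.append(row)
    return matrix

vocab = [" ", "C", "O", "(", ")", "=", "1"]  # padding character first
encoded = one_hot_encode("C(=O)C", vocab, max_len=8)
```

<p>Each augmented SMILES of a compound is encoded independently; the per-SMILES predictions are then averaged per compound as described above.</p>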
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CONV1D</td>
          <td>1D conv (kernel 10, stride 1) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>CONV2D</td>
          <td>2D conv (single channel) + 2 FC layers</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RNN</td>
          <td>LSTM + FC(128) + FC(64)</td>
          <td>Not specified</td>
      </tr>
      <tr>
          <td>RF Baseline</td>
          <td>Random Forest (default sklearn)</td>
          <td>Morgan FP, radius 2, length 1024</td>
      </tr>
  </tbody>
</table>
<p>Training: 250 epochs, batch size 16, MSE loss, SGD, lr=0.001.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE (ESOL)</td>
          <td>0.569</td>
          <td>1.102 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
      <tr>
          <td>RMSE (FreeSolv)</td>
          <td>1.032</td>
          <td>2.563 (RF)</td>
          <td>CONV1D, 70x with dup</td>
      </tr>
      <tr>
          <td>RMSE (Lipophilicity)</td>
          <td>0.593</td>
          <td>0.860 (RF)</td>
          <td>CONV1D, 80x without dup</td>
      </tr>
      <tr>
          <td>RMSE (EGFR)</td>
          <td>0.777</td>
          <td>0.758 (RF)</td>
          <td>CONV1D, 70x reduced dup</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on a GeForce GTX 1080 Ti GPU on the HPC cluster at Freie Universität Berlin. Training CONV1D on ESOL with 100x augmentation (keeping duplicates, 90,200 data points) takes approximately 3 hours; with 19x augmentation, the model reaches an RMSE of 0.605 in under 30 minutes.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/volkamerlab/maxsmi">volkamerlab/maxsmi</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full source code, trained models, CLI for prediction</td>
      </tr>
      <tr>
          <td><a href="https://maxsmi.readthedocs.io/en/latest/">Documentation</a></td>
          <td>Docs</td>
          <td>N/A</td>
          <td>Read the Docs documentation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/openkinome/kinodata">Kinodata</a></td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Curated kinase bioactivity data from ChEMBL v28</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code, data, trained models, and a command-line prediction tool are all publicly available under the MIT license.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kimber, T. B., Gagnebin, M., &amp; Volkamer, A. (2021). Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning. <em>Artificial Intelligence in the Life Sciences</em>, 1, 100014. <a href="https://doi.org/10.1016/j.ailsci.2021.100014">https://doi.org/10.1016/j.ailsci.2021.100014</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kimber2021maxsmi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Maxsmi: Maximizing molecular property prediction performance with confidence estimation using SMILES augmentation and deep learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kimber, Talia B. and Gagnebin, Maxime and Volkamer, Andrea}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Artificial Intelligence in the Life Sciences}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{100014}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ailsci.2021.100014}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Transformer CLMs for SMILES: Literature Review 2024</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/transformer-clms-smiles-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/transformer-clms-smiles-review/</guid><description>Review of transformer-based chemical language models for SMILES, covering encoder, decoder, and encoder-decoder architectures for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-transformer-based-chemical-language-models">A Systematization of Transformer-Based Chemical Language Models</h2>
<p>This paper is a <strong>Systematization</strong> (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.</p>
<h2 id="why-review-transformer-clms-for-smiles">Why Review Transformer CLMs for SMILES?</h2>
<p>The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.</p>
<p>Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings as a &ldquo;chemical language,&rdquo; these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.</p>
<p>The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.</p>
<h2 id="architectural-taxonomy-encoder-decoder-and-encoder-decoder-models">Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models</h2>
<p>The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.</p>
<h3 id="encoder-only-models-bert-family">Encoder-Only Models (BERT Family)</h3>
<p>These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:</p>
<ul>
<li><strong>BERT</strong> (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molbert-molecular-representations/">MOLBERT</a></strong> (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a></strong> (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> / <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a></strong> (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training</li>
<li><strong>GPT-MolBERTa</strong> (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a></strong> (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/selformer/">SELFormer</a></strong> (Yuksel et al., 2023): Operates on <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> representations rather than SMILES</li>
<li><strong>Mol-BERT / MolRoPE-BERT</strong> (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences</li>
<li><strong>BET</strong> (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules</li>
</ul>
<h3 id="decoder-only-models-gpt-family">Decoder-Only Models (GPT Family)</h3>
<p>These models excel at generative tasks, including de novo molecular design:</p>
<ul>
<li><strong>GPT-2-based model</strong> (Adilov, 2021): Generative pre-training from molecules</li>
<li><strong>MolXPT</strong> (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language</li>
<li><strong>BioGPT</strong> (Luo et al., 2022): Focuses on biomedical text generation and mining</li>
<li><strong>MolGPT</strong> (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design</li>
<li><strong>Mol-Instructions</strong> (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs</li>
</ul>
<h3 id="encoder-decoder-models">Encoder-Decoder Models</h3>
<p>These combine encoding and generation capabilities for sequence-to-sequence tasks:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a></strong> (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction</li>
<li><strong>MolT5</strong> (adapted T5): Unified text-to-text framework for molecular tasks</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-transformer/">SMILES Transformer</a></strong> (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/x-mol-pretraining-molecular-understanding/">X-MOL</a></strong> (Xue et al., 2020): Large-scale pre-training for molecular understanding</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/regression-transformer/">Regression Transformer</a></strong> (Born and Manica, 2023): Operates on <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, enabling concurrent regression and generation</li>
<li><strong>TransAntivirus</strong> (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature</li>
</ul>
<h2 id="tokenization-embedding-and-pre-training-strategies">Tokenization, Embedding, and Pre-Training Strategies</h2>
<h3 id="smiles-tokenization">SMILES Tokenization</h3>
<p>The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings contain no whitespace or sentence boundaries, and symbols such as parentheses encode chemical branching rather than serving as punctuation. The key approaches include:</p>
<table>
  <thead>
      <tr>
          <th>Strategy</th>
          <th>Source</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/computational-chemistry/molecular-representations/atom-in-smiles-tokenization/">Atom-in-SMILES (AIS)</a></td>
          <td>Ucak et al. (2023)</td>
          <td>Atom-level tokens preserving chemical identity</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/molecular-representations/smiles-pair-encoding/">SMILES Pair Encoding (SPE)</a></td>
          <td>Li and Fourches (2021)</td>
          <td>BPE-inspired substructure tokenization</td>
      </tr>
      <tr>
          <td>Byte-Pair Encoding (BPE)</td>
          <td>Chithrananda et al. (2020); Lee and Nam (2022)</td>
          <td>Standard subword tokenization adapted for SMILES</td>
      </tr>
      <tr>
          <td>SMILESTokenizer</td>
          <td>Chithrananda et al. (2020)</td>
          <td>Character-level tokenization with chemical adjustments</td>
      </tr>
  </tbody>
</table>
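<p>Atom-level tokenization is commonly implemented with a single regular expression; the pattern below follows the widely used one from the reaction-prediction literature (the exact pattern varies between papers):</p>

```python
import re

# Bracket atoms, two-letter elements (Cl, Br), ring-closure
# labels, and bond symbols each become a single token.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, "untokenizable input"
    return tokens

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)[O-]")  # aspirin anion
```

<p>Character-level tokenizers instead split on every symbol, which keeps the vocabulary small but breaks multi-character tokens such as <code>Cl</code> and <code>[O-]</code> apart.</p>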
<h3 id="positional-embeddings">Positional Embeddings</h3>
<p>The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.</p>
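<p>For reference, the sinusoidal variant (from the original transformer, shown here as a plain-Python sketch) assigns each position a vector of sines and cosines at geometrically spaced wavelengths:</p>

```python
import math

def sinusoidal_positional_encoding(max_len, d_model):
    # Even dimensions use sine, odd dimensions cosine, so each
    # position in the SMILES sequence gets a unique pattern.
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=8)
```

<p>Rotary embeddings (as in MolRoPE-BERT) instead rotate query/key vectors by position-dependent angles, which extrapolates better to sequences longer than those seen in training.</p>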
<h3 id="pre-training-and-fine-tuning-pipeline">Pre-Training and Fine-Tuning Pipeline</h3>
<p>The standard workflow follows two phases:</p>
<ol>
<li><strong>Pre-training</strong>: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings</li>
<li><strong>Fine-tuning</strong>: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)</li>
</ol>
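<p>The MLM objective in phase 1 can be sketched as follows (simplified: full BERT-style masking also leaves some selected tokens unchanged or swaps in random tokens):</p>

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    # Replace a random fraction of tokens with a mask symbol; the
    # model is trained to predict the original tokens at the
    # masked positions only.
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)   # target to recover
        else:
            masked.append(tok)
            labels.append(None)  # not scored in the loss
    return masked, labels

tokens = list("CC(=O)Oc1ccccc1")  # character-level for brevity
masked, labels = mask_tokens(tokens)
```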
<p>The self-attention mechanism, central to all transformer CLMs, is formulated as:</p>
<p>$$
Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V
$$</p>
<p>where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.</p>
<h2 id="benchmark-datasets-and-evaluation-landscape">Benchmark Datasets and Evaluation Landscape</h2>
<p>The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Datasets</th>
          <th>Task Type</th>
          <th>Example Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL, FreeSolv, Lipophilicity</td>
          <td>Regression</td>
          <td>642 to 4,200</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA, MUV, HIV, PDBbind, BACE</td>
          <td>Classification/Regression</td>
          <td>11,908 to 437,929</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox</td>
          <td>Classification</td>
          <td>1,427 to 8,575</td>
      </tr>
  </tbody>
</table>
<p>The authors also propose four new fine-tuning datasets targeting diseases: COVID-19 drug compounds, cocrystal formation, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data.</p>
<h2 id="challenges-limitations-and-future-directions">Challenges, Limitations, and Future Directions</h2>
<h3 id="current-challenges">Current Challenges</h3>
<p>The review identifies several persistent limitations:</p>
<ol>
<li><strong>Data efficiency</strong>: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce</li>
<li><strong>Interpretability</strong>: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions</li>
<li><strong>Computational cost</strong>: Training large-scale models demands significant GPU resources, limiting accessibility</li>
<li><strong>Handling rare molecules</strong>: Models struggle with molecular structures that deviate significantly from training data distributions</li>
<li><strong>SMILES limitations</strong>: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture</li>
</ol>
<h3 id="smiles-representation-issues">SMILES Representation Issues</h3>
<p>The authors highlight five specific problems with SMILES as an input representation:</p>
<ul>
<li>Non-canonical representations reduce string uniqueness for the same molecule</li>
<li>Many symbol combinations produce chemically invalid outputs</li>
<li>Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)</li>
<li>Spatial information is inadequately captured</li>
<li>Syntactic and semantic robustness is limited</li>
</ul>
<h3 id="future-research-directions">Future Research Directions</h3>
<p>The review proposes several directions:</p>
<ul>
<li><strong>Alternative molecular representations</strong>: Exploring <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, <a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, IUPAC, and InChI beyond SMILES</li>
<li><strong>Role of SMILES token types</strong>: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical</li>
<li><strong>Few-shot learning</strong>: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios</li>
<li><strong>Drug repurposing</strong>: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains</li>
<li><strong>Improved benchmarks</strong>: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation</li>
<li><strong>Ethical considerations</strong>: Addressing dual-use risks, data biases, and responsible open-source release of CLMs</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC20</td>
          <td>5.5B+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>100M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL</td>
          <td>2M+ compounds</td>
          <td>Publicly available</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>642 to 437,929</td>
          <td>Standard benchmark suite</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>COVID-19 drug compounds</td>
          <td>740</td>
          <td>From Harigua-Souiai et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cocrystal formation</td>
          <td>3,282</td>
          <td>From Mswahili et al. (2021)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Antimalarial drugs</td>
          <td>4,794</td>
          <td>From Mswahili et al. (2024)</td>
      </tr>
      <tr>
          <td>Proposed</td>
          <td>Cancer gene/drug response</td>
          <td>201 drugs, 734 cell lines</td>
          <td>From Kim et al. (2021)</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="http://dai.chungbuk.ac.kr/">DAI Lab website</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Authors&rsquo; research lab</td>
      </tr>
  </tbody>
</table>
<p>No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (literature review).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Mswahili, M. E., &amp; Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. <em>Heliyon</em>, 10(20), e39038. <a href="https://doi.org/10.1016/j.heliyon.2024.e39038">https://doi.org/10.1016/j.heliyon.2024.e39038</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{mswahili2024transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based models for chemical {SMILES} representation: A comprehensive literature review}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Mswahili, Medard Edmund and Jeong, Young-Seob}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Heliyon}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{e39038}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.heliyon.2024.e39038}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Systematic Review of Deep Learning CLMs (2020-2024)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/systematic-review-deep-learning-clms/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/systematic-review-deep-learning-clms/</guid><description>Systematic review of 72 deep learning molecular generation studies using MOSES and GuacaMol benchmarks across RNNs, transformers, VAEs, and GANs.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-molecular-generation">A Systematization of Chemical Language Models for Molecular Generation</h2>
<p>This paper is a <strong>Systematization</strong> that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.</p>
<h2 id="motivation-evaluating-four-years-of-generative-clm-progress">Motivation: Evaluating Four Years of Generative CLM Progress</h2>
<p>Deep learning molecular generation has expanded rapidly since 2018, when <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">Segler et al.</a> demonstrated that deep generative models could learn to produce novel molecules from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> and <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> had been introduced to enable standardized evaluation.</p>
<p>Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.</p>
<h2 id="prisma-based-systematic-review-methodology">PRISMA-Based Systematic Review Methodology</h2>
<p>The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like &ldquo;Molecule Generation,&rdquo; &ldquo;Chemical Language Models,&rdquo; &ldquo;Deep Learning,&rdquo; and specific architecture names. The search window covered January 2020 to June 2024.</p>
<h3 id="eligibility-criteria">Eligibility Criteria</h3>
<p>Papers were included if they:</p>
<ol>
<li>Were written in English</li>
<li>Explicitly presented at least two metrics of uniqueness, validity, or novelty</li>
<li>Defined these metrics consistent with MOSES or GuacaMol concepts</li>
<li>Used deep learning generative models for de novo molecule design</li>
<li>Used conventional (non-quantum) deep learning methods</li>
<li>Were published between January 2020 and June 2024</li>
</ol>
<p>This yielded 48 articles from query-based search and 25 from citation search, totaling 72 articles. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.</p>
<h3 id="data-collection">Data Collection</h3>
<p>For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>, <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, InChI, <a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The review focuses on three core MOSES metrics:</p>
<p>$$
\text{Validity}(V_m) = \frac{\text{Valid molecules}}{\text{Molecules produced}}
$$</p>
<p>$$
\text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|}
$$</p>
<p>$$
\text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|}
$$</p>
<p>where $V_m$ denotes valid molecules and $T_d$ the training dataset.</p>
<h2 id="architecture-distribution-and-performance-comparison">Architecture Distribution and Performance Comparison</h2>
<h3 id="architecture-trends-2020-2024">Architecture Trends (2020-2024)</h3>
<p>The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The breakdown across 62 CLM articles: 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based model. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.</p>
<p>The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.</p>
<h3 id="molecular-representations-and-databases">Molecular Representations and Databases</h3>
<p>SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>, <a href="/notes/computational-chemistry/molecular-representations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.</p>
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>Molecules (millions)</th>
          <th>Representation</th>
          <th>Articles</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChEMBL</td>
          <td>2.4</td>
          <td>SMILES, InChI</td>
          <td>27</td>
      </tr>
      <tr>
          <td>ZINC</td>
          <td>750</td>
          <td>SMILES</td>
          <td>27</td>
      </tr>
      <tr>
          <td>PubChem</td>
          <td>115.3</td>
          <td>SMILES, InChI</td>
          <td>4</td>
      </tr>
      <tr>
          <td>COCONUT</td>
          <td>0.695</td>
          <td>SMILES, InChI</td>
          <td>1</td>
      </tr>
      <tr>
          <td>DNA-Encoded Library</td>
          <td>1,040</td>
          <td>SMILES</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<h3 id="unbiased-model-performance">Unbiased Model Performance</h3>
<p><strong>Validity</strong>: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.</p>
<p><strong>Uniqueness</strong>: No significant differences in median uniqueness across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness due to overfit-driven redundancy.</p>
<p><strong>Validity-Novelty Trade-off</strong>: The authors propose a &ldquo;Valid/Sample&rdquo; metric (Validity × Novelty) and find an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, $p = 0.0618$). Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.</p>
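<p>The composite metric and the double-median filter are straightforward to reproduce. A sketch with made-up per-model scores (the review pools real values across the CLM articles):</p>

```python
from statistics import median

# Hypothetical (validity, novelty) fractions for four models -- illustrative only.
models = {
    "rnn_a": (0.98, 0.85),
    "tf_b":  (0.96, 0.97),
    "vae_c": (0.90, 0.99),
    "gan_d": (0.70, 0.95),
}

# Valid/Sample composite: Validity times Novelty.
valid_sample = {name: v * n for name, (v, n) in models.items()}

med_v = median(v for v, _ in models.values())
med_n = median(n for _, n in models.values())
# Models clearing BOTH medians -- only 17.9% of models did in the review.
both = sorted(name for name, (v, n) in models.items() if v > med_v and n > med_n)
```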
<h3 id="biased-model-performance">Biased Model Performance</h3>
<p>The review examines three biased generation strategies:</p>
<p><strong>Transfer Learning (TL)</strong>: The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>TL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training size</td>
          <td>1,128,920</td>
          <td>2,507</td>
          <td>&lt;0.0001</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>98.05%</td>
          <td>95.5%</td>
          <td>0.1602</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>97.9%</td>
          <td>90.2%</td>
          <td>0.0144</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.6%</td>
          <td>96.0%</td>
          <td>0.8438</td>
      </tr>
  </tbody>
</table>
<p><strong>Reinforcement Learning (RL)</strong>: Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>RL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>91.1%</td>
          <td>96.5%</td>
          <td>0.1289</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>89.7%</td>
          <td>0.0935</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>91.5%</td>
          <td>93.5%</td>
          <td>0.2500</td>
      </tr>
  </tbody>
</table>
<p><strong>Conditional Learning (CL)</strong>: Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Unbiased (median)</th>
          <th>CL Target (median)</th>
          <th>p-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>98.5%</td>
          <td>96.8%</td>
          <td>0.4648</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99.9%</td>
          <td>97.5%</td>
          <td>0.0753</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>89.3%</td>
          <td>99.6%</td>
          <td>0.2945</td>
      </tr>
  </tbody>
</table>
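<p>The constraint-token encoding used in conditional learning can be illustrated in a few lines; the token names below are hypothetical, not drawn from any specific paper:</p>

```python
def conditioned_sequence(smiles_tokens, properties):
    """Prepend one constraint token per target property to the SMILES tokens,
    a common conditional-learning encoding (token format is illustrative)."""
    control = [f"<{name}={value}>" for name, value in sorted(properties.items())]
    return control + smiles_tokens + ["<eos>"]
```

<p>During training the model sees property tokens alongside the molecule, so at generation time sampling can be steered by fixing those tokens first.</p>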
<h2 id="key-findings-and-directions-for-chemical-language-models">Key Findings and Directions for Chemical Language Models</h2>
<h3 id="main-conclusions">Main Conclusions</h3>
<ol>
<li>
<p><strong>Transformers are overtaking RNNs</strong> as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the vanishing-gradient issues of recurrent models.</p>
</li>
<li>
<p><strong>SMILES remains dominant</strong> (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.</p>
</li>
<li>
<p><strong>No architecture achieves both high validity and high novelty easily.</strong> Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.</p>
</li>
<li>
<p><strong>Transfer learning requires only ~2,500 molecules</strong> to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.</p>
</li>
<li>
<p><strong>Combining biased methods</strong> (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.</p>
</li>
<li>
<p><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/">S4 models</a></strong> were first applied to CLMs in 2023 and show competitive performance, owing to their dual formulation: convolutional processing during training and recurrent generation at inference.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Statistical comparisons used Mann-Whitney U tests. Spearman correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity × Novelty) metric with box plot analysis.</p>
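<p>For reference, the U statistic itself is a simple rank computation. A pure-Python sketch with average ranks for ties (real analyses would use <code>scipy.stats.mannwhitneyu</code>, which also supplies the p-value):</p>

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic (min of U_a, U_b); tied values get average ranks."""
    pooled = list(a) + list(b)
    order = sorted(range(len(pooled)), key=lambda k: pooled[k])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        for k in range(i, j + 1):          # tied values share the mean rank
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    r_a = sum(ranks[: len(a)])
    u_a = r_a - len(a) * (len(a) + 1) / 2
    return min(u_a, len(a) * len(b) - u_a)
```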
<h3 id="evaluation">Evaluation</h3>
<p>The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (systematic review, no model training performed).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Flores-Hernandez, H., &amp; Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. <em>Journal of Cheminformatics</em>, 16(1), 129. <a href="https://doi.org/10.1186/s13321-024-00916-y">https://doi.org/10.1186/s13321-024-00916-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{floreshernandez2024systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of deep learning chemical language models in recent era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Flores-Hernandez, Hector and Mart{\&#39;i}nez-Ledesma, Emmanuel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{129}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00916-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>S4 Structured State Space Models for De Novo Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/s4-chemical-language-modeling/</guid><description>S4 state space models are applied to chemical language modeling for de novo drug design, outperforming LSTMs and GPTs in bioactivity learning from SMILES.</description><content:encoded><![CDATA[<h2 id="structured-state-spaces-meet-chemical-language-modeling">Structured State Spaces Meet Chemical Language Modeling</h2>
<p>This is a <strong>Method</strong> paper that introduces structured state space sequence (S4) models to chemical language modeling (CLM) for de novo drug design. S4 models have a dual formulation: they process entire input sequences via convolution during training (like Transformers) and generate sequences element-by-element via recurrence during inference (like LSTMs). The authors benchmark S4 against LSTM and GPT architectures across multiple drug discovery tasks, including drug-like molecule generation, bioactivity learning, chemical space exploration, natural product design, and prospective kinase inhibitor design validated by molecular dynamics simulations.</p>
<h2 id="bridging-the-lstm-transformer-gap-in-molecular-generation">Bridging the LSTM-Transformer Gap in Molecular Generation</h2>
<p>Chemical language models (CLMs) generate molecules by learning the &ldquo;chemical language&rdquo; of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string representations. The two dominant architectures for CLMs are LSTMs and GPTs, each with complementary strengths and limitations:</p>
<ul>
<li><strong>LSTMs</strong> generate sequences recurrently (element-by-element), which enables efficient generation and good learning of local/short-range dependencies. However, their sequential information bottleneck limits learning of global sequence properties.</li>
<li><strong>GPTs</strong> (Transformer decoders) process the entire input at once, better capturing global properties like bioactivity. However, they become increasingly compute-intensive for longer SMILES strings and struggle with chemical space exploration at higher sampling temperatures.</li>
</ul>
<p>Complex molecular properties like bioactivity can emerge from separated portions of a SMILES string (e.g., distant functional groups in the linear notation). Neither architecture fully addresses the need to learn these long-range dependencies while maintaining efficient, robust generation. The chemical space, estimated at up to $10^{60}$ small molecules, demands models that can both capture complex property relationships and explore diverse scaffolds efficiently.</p>
<h2 id="the-dual-nature-of-s4-convolution-meets-recurrence">The Dual Nature of S4: Convolution Meets Recurrence</h2>
<p>S4 models are built on discrete <a href="https://en.wikipedia.org/wiki/State-space_model">state space models</a>, which map an input sequence $\mathbf{u}$ to an output sequence $\mathbf{y}$ through learnable parameters $\overline{\mathbf{A}} \in \mathbb{R}^{N \times N}$, $\overline{\mathbf{B}} \in \mathbb{R}^{N \times 1}$, $\overline{\mathbf{C}} \in \mathbb{R}^{1 \times N}$, and $\overline{\mathbf{D}} \in \mathbb{R}^{1 \times 1}$:</p>
<p>$$
x_{k} = \overline{\mathbf{A}} x_{k-1} + \overline{\mathbf{B}} u_{k}
$$</p>
<p>$$
y_{k} = \overline{\mathbf{C}} x_{k} + \overline{\mathbf{D}} u_{k}
$$</p>
<p>This linear recurrence can equivalently be &ldquo;unrolled&rdquo; into a global convolution:</p>
<p>$$
\mathbf{y} = \mathbf{u} * \overline{\mathbf{K}}
$$</p>
<p>where $\overline{\mathbf{K}}$ is a convolution filter parameterized by $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$, and $\overline{\mathbf{C}}$. This duality is the core innovation for CLMs:</p>
<ul>
<li><strong>Training</strong>: S4 uses the convolutional formulation to learn from entire SMILES sequences simultaneously, capturing global molecular properties.</li>
<li><strong>Generation</strong>: S4 switches to the recurrent formulation, producing SMILES tokens one at a time for efficient, robust chemical space exploration.</li>
</ul>
<p>S4 addresses the numerical instabilities of naive state space models through high-order polynomial projection operators (HiPPO) and reduction to the stable Cauchy kernel computation, enabling effective learning of long-range dependencies.</p>
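<p>The equivalence of the two formulations can be checked numerically on a toy system. A pure-Python sketch with an $N = 2$ state and arbitrary matrices (real S4 computes $\overline{\mathbf{K}}$ stably via the HiPPO/Cauchy machinery rather than by explicit matrix powers):</p>

```python
# Toy discrete SSM: x_k = A x_{k-1} + B u_k ; y_k = C x_k + D u_k
A = [[0.5, 0.1], [0.0, 0.3]]   # N x N state transition (arbitrary values)
B = [1.0, 0.5]                 # N x 1 input map
C = [0.2, 0.8]                 # 1 x N readout
D = 0.1                        # skip connection

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def run_recurrent(u):
    """Generation-mode pass: one element at a time."""
    x, y = [0.0, 0.0], []
    for u_k in u:
        x = [a + b * u_k for a, b in zip(matvec(A, x), B)]
        y.append(sum(c * xi for c, xi in zip(C, x)) + D * u_k)
    return y

def run_convolutional(u):
    """Training-mode pass: y = u * K with K_j = C A^j B, plus the D skip term."""
    L = len(u)
    K, AjB = [], B[:]
    for _ in range(L):
        K.append(sum(c * v for c, v in zip(C, AjB)))
        AjB = matvec(A, AjB)
    return [sum(K[j] * u[k - j] for j in range(k + 1)) + D * u[k] for k in range(L)]

u = [1.0, -0.5, 2.0, 0.3]
yr, yc = run_recurrent(u), run_convolutional(u)
```

<p>Both passes produce the same output sequence, which is exactly the property S4 exploits: train with the parallel convolution, generate with the cheap recurrence.</p>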
<p>For molecular ranking after fine-tuning, the log-likelihood score subtracts the pre-training likelihood to isolate target-specific information:</p>
<p>$$
\mathcal{L}_{\text{score}}(\mathbf{M}) = \mathcal{L}(\mathbf{M}_{\text{ft}}) - \mathcal{L}(\mathbf{M}_{\text{pt}})
$$</p>
<p>where $\mathcal{L}(\mathbf{M}_{\text{ft}})$ and $\mathcal{L}(\mathbf{M}_{\text{pt}})$ are the fine-tuned and pre-trained model log-likelihoods, respectively.</p>
<h2 id="benchmarking-s4-across-drug-discovery-tasks">Benchmarking S4 Across Drug Discovery Tasks</h2>
<h3 id="drug-like-molecule-generation">Drug-like molecule generation</h3>
<p>All three CLMs (S4, LSTM, GPT) were pre-trained on 1.9M canonical SMILES from ChEMBL v31 (molecules with fewer than 100 tokens). Each model generated 102,400 SMILES strings de novo.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Valid</th>
          <th>Unique</th>
          <th>Novel</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>S4</td>
          <td>99,268 (97%)</td>
          <td>98,712 (96%)</td>
          <td>95,552 (93%)</td>
      </tr>
      <tr>
          <td>LSTM</td>
          <td>97,151 (95%)</td>
          <td>96,618 (94%)</td>
          <td>82,988 (81%)</td>
      </tr>
      <tr>
          <td>GPT</td>
          <td>93,580 (91%)</td>
          <td>93,263 (91%)</td>
          <td>91,590 (89%)</td>
      </tr>
  </tbody>
</table>
<p>S4 produces the most valid, unique, and novel molecules. Error analysis reveals that each architecture shows different failure modes: LSTMs struggle most with branching errors, GPTs with ring and bond assignment errors, while S4 generates fewer branching and ring errors but more bond assignment errors than LSTM. This pattern supports the hypothesis that S4 captures long-range dependencies (branching, ring opening/closure) better while local dependencies (bond assignment) are handled better by recurrent processing.</p>
<h3 id="bioactivity-learning-via-transfer-learning">Bioactivity learning via transfer learning</h3>
<p>Five fine-tuning campaigns were conducted on targets from the LIT-PCBA dataset: PKM2, <a href="https://en.wikipedia.org/wiki/Mitogen-activated_protein_kinase_1">MAPK1</a>, GBA, mTORC1, and TP53. After fine-tuning, models ranked held-out test molecules by learned log-likelihoods to evaluate bioactive compound prioritization.</p>
<p>S4 outperformed both benchmarks across targets. Wilcoxon signed-rank tests on pooled scores confirmed statistically significant superiority:</p>
<ul>
<li>S4 vs. LSTM: $p = 8.41 \times 10^{-6}$ (top 10), $2.93 \times 10^{-7}$ (top 50), $1.45 \times 10^{-7}$ (top 100)</li>
<li>S4 vs. GPT: $p = 2.33 \times 10^{-3}$ (top 10), $3.72 \times 10^{-3}$ (top 50), $2.61 \times 10^{-2}$ (top 100)</li>
</ul>
<p>TP53 was the most challenging target, where no model consistently retrieved actives in the top 10, possibly due to <a href="/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/">activity cliffs</a> in the test set.</p>
<h3 id="chemical-space-exploration-with-temperature-sampling">Chemical space exploration with temperature sampling</h3>
<p>Models were evaluated across sampling temperatures from $T = 1.0$ to $T = 2.0$ on three metrics: SMILES validity, rediscovery rate of known actives, and scaffold diversity. Key findings:</p>
<ul>
<li><strong>Validity</strong>: S4 and LSTM maintain higher validity than GPT at elevated temperatures (GPT median validity drops below 40% at high T).</li>
<li><strong>Rediscovery</strong>: S4 outperforms LSTM in rediscovering bioactive molecules at all temperatures.</li>
<li><strong>Scaffold diversity</strong>: LSTM achieves the highest number of unique scaffold clusters (median 6,602 at $T = 1.75$), with S4 a close second (6,520 clusters).</li>
</ul>
<p>S4 provides the best balance between bioactivity capture and structural diversity.</p>
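<p>Temperature sampling rescales the model's next-token logits before sampling: $T &gt; 1$ flattens the distribution and encourages exploration. A minimal sketch with made-up logits:</p>

```python
import math
import random

def temperature_softmax(logits, T):
    """Softmax with temperature: divide logits by T before normalizing."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]                    # made-up next-token logits
p_base = temperature_softmax(logits, 1.0)   # sharper: favors the top token
p_hot = temperature_softmax(logits, 2.0)    # flatter: more exploration

# Sampling one token id from the tempered distribution:
token = random.choices(range(len(logits)), weights=p_hot)[0]
```

<p>Higher temperatures trade syntax reliability for diversity, which is why GPT's validity collapses at high $T$ while the recurrent generators degrade more gracefully.</p>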
<h3 id="natural-product-design">Natural product design</h3>
<p>Models were trained on 32,360 large natural product SMILES (length &gt; 100 tokens) from the COCONUT database and used to generate 102,400 designs each.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>S4</th>
          <th>LSTM</th>
          <th>GPT</th>
          <th>Training Set</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid</td>
          <td>82,633 (81%)</td>
          <td>76,264 (74%)</td>
          <td>70,117 (68%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Unique</td>
          <td>53,293 (52%)</td>
          <td>51,326 (50%)</td>
          <td>50,487 (49%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>Novel</td>
          <td>40,897 (40%)</td>
          <td>43,245 (42%)</td>
          <td>43,168 (42%)</td>
          <td>n.a.</td>
      </tr>
      <tr>
          <td>NP-likeness</td>
          <td>1.6 ± 0.7</td>
          <td>1.5 ± 0.7</td>
          <td>1.5 ± 0.7</td>
          <td>1.6 ± 0.7</td>
      </tr>
  </tbody>
</table>
<p>S4 designs the most valid molecules (roughly 6,400 more than LSTM and 12,500 more than GPT) and achieves significantly higher NP-likeness ($p = 1.41 \times 10^{-53}$ vs. LSTM, $p = 1.02 \times 10^{-82}$ vs. GPT). S4 also achieves the lowest Kolmogorov-Smirnov distances to the training/test distributions across multiple structural properties (sp3 carbons, aliphatic rings, spiro atoms, molecular weight, fused ring size, heavy atoms).</p>
<p>For computational efficiency, S4 trains as fast as GPT (both approximately 1.3x faster than LSTM) and generates fastest among all architectures.</p>
<h3 id="prospective-mapk1-inhibitor-design">Prospective MAPK1 inhibitor design</h3>
<p>The pre-trained S4 model was fine-tuned on 68 manually curated MAPK1 inhibitors ($K_i &lt; 1 \mu M$) from ChEMBL v33. The last five fine-tuning epochs generated 256K molecules across five temperature values. After ranking and filtering by log-likelihood score and scaffold similarity, the top 10 designs were evaluated via <a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a> <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> simulations.</p>
<p>Eight out of ten designs showed high predicted affinity, with $\Delta G$ values ranging from $-10.3 \pm 0.6$ to $-23 \pm 4$ kcal/mol. These affinities are comparable to or exceed those of the closest known active neighbors ($\Delta G = -9.1 \pm 0.8$ to $-13 \pm 2$ kcal/mol). The most potent predicted design (molecule 2, $\Delta G = -23 \pm 4$ kcal/mol) engages extensively with the MAPK1 binding pocket, though synthetic accessibility may be limited. Several designs incorporate halogen substitutions favorable for MAPK1 inhibition, consistent with known structure-activity relationships.</p>
<h2 id="s4-combines-the-best-of-lstms-and-gpts-for-molecular-design">S4 Combines the Best of LSTMs and GPTs for Molecular Design</h2>
<p>The main findings of this study are:</p>
<ol>
<li><strong>S4 outperforms both LSTM and GPT</strong> in learning complex molecular properties like bioactivity, while maintaining competitive or superior performance in syntax learning and chemical space exploration.</li>
<li><strong>The dual formulation is key</strong>: holistic training (convolution) enables better capture of global molecular properties, while recurrent generation preserves robust chemical syntax and diverse scaffold exploration.</li>
<li><strong>S4 is especially strong for longer sequences</strong>: natural product design (SMILES &gt; 100 tokens) shows the largest advantages over benchmarks in validity and property matching.</li>
<li><strong>Prospective validation</strong>: 8/10 S4-designed MAPK1 inhibitors are predicted as highly active by molecular dynamics, with affinities comparable to or exceeding known actives.</li>
</ol>
<p><strong>Limitations acknowledged by the authors</strong>:</p>
<ul>
<li>All evaluations are computational; no wet-lab experimental validation is reported.</li>
<li>Bioactivity evaluation relies on likelihood-based ranking, which is an indirect proxy.</li>
<li>The MD simulations, while more rigorous than simple docking, still represent in silico predictions.</li>
<li>SMILES augmentation and improved ranking protocols could further boost performance.</li>
</ul>
<p><strong>Future directions</strong> include application to macrocyclic peptides and protein sequences, organic reaction planning, structure-based drug design, and integration with wet-lab experimental validation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL v31</td>
          <td>1.9M SMILES</td>
          <td>Molecules with SMILES length &lt;= 100 tokens</td>
      </tr>
      <tr>
          <td>Fine-tuning (bioactivity)</td>
          <td>LIT-PCBA (5 targets)</td>
          <td>11-56 actives + ~10K inactives per target</td>
          <td>PKM2, MAPK1, GBA, mTORC1, TP53</td>
      </tr>
      <tr>
          <td>Natural product training</td>
          <td>COCONUT</td>
          <td>32,360 SMILES</td>
          <td>SMILES length &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>Prospective fine-tuning</td>
          <td>ChEMBL v33 (MAPK1)</td>
          <td>68 inhibitors</td>
          <td>$K_i &lt; 1 \mu M$, target ID CHEMBL4040</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Pre-training: next-token prediction on SMILES strings</li>
<li>Fine-tuning: transfer learning with early stopping (patience 5, tolerance $10^{-5}$)</li>
<li>Molecule ranking: log-likelihood scoring with pre-training bias subtraction (Eq. 5)</li>
<li>Temperature sampling: $T$ from 1.0 to 2.0 (step 0.25) for chemical space exploration</li>
</ul>
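<p>The early-stopping rule (patience 5, tolerance $10^{-5}$) can be sketched as follows; this is an illustrative reconstruction, not the authors' code:</p>

```python
def early_stop(losses, patience=5, tol=1e-5):
    """Return the epoch at which training stops, i.e. when the validation loss
    has not improved by more than `tol` for `patience` consecutive epochs;
    return None if training runs to the end of `losses`."""
    best, since = float("inf"), 0
    for epoch, loss in enumerate(losses):
        if loss < best - tol:
            best, since = loss, 0          # meaningful improvement: reset counter
        else:
            since += 1
            if since >= patience:
                return epoch
    return None
```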
<h3 id="models">Models</h3>
<ul>
<li><strong>S4</strong>: Structured state space sequence model with HiPPO initialization; hyperparameter search over 242 + 108 configurations</li>
<li><strong>LSTM</strong>: 40 configurations optimized via random search</li>
<li><strong>GPT</strong>: 35 configurations optimized via random search</li>
<li>All models share the same pre-training data and fine-tuning protocol for fair comparison</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity (ChEMBL)</td>
          <td>S4</td>
          <td>97%</td>
          <td>Out of 102,400 generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness (ChEMBL)</td>
          <td>S4</td>
          <td>96%</td>
          <td>Among valid designs</td>
      </tr>
      <tr>
          <td>Novelty (ChEMBL)</td>
          <td>S4</td>
          <td>93%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Bioactivity ranking (top 10)</td>
          <td>S4</td>
          <td>Significant (p = 8.41e-6 vs LSTM)</td>
          <td>Wilcoxon signed-rank test</td>
      </tr>
      <tr>
          <td>NP validity</td>
          <td>S4</td>
          <td>81%</td>
          <td>COCONUT, SMILES &gt; 100 tokens</td>
      </tr>
      <tr>
          <td>MAPK1 inhibitor success</td>
          <td>S4</td>
          <td>8/10 designs active</td>
          <td>Validated by MD (Umbrella Sampling)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Hyperparameter search: NVIDIA A100 40GB GPUs</li>
<li>LSTM/GPT search: 5 days on single A100</li>
<li>S4 search: 10 days on multiple A100 GPUs</li>
<li>MD simulations: Dutch supercomputer Snellius; 1.2-1.6 microseconds per ligand (<a href="/notes/computational-chemistry/molecular-dynamics/umbrella-sampling/">Umbrella Sampling</a>)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/s4-for-de-novo-drug-design">S4 for de novo drug design</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation with data and trained models</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.12666371">Zenodo archive</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Source data and molecule designs</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ozcelik, R., de Ruiter, S., Criscuolo, E., &amp; Grisoni, F. (2024). Chemical language modeling with structured state space sequence models. <em>Nature Communications</em>, 15, 6176.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ozcelik2024chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language modeling with structured state space sequence models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{\&#34;O{}z\c{c}elik, R{\i}za and de Ruiter, Sarah and Criscuolo, Emanuele and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{6176}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-50469-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review: Deep Learning for Molecular Design (2019)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/deep-learning-molecular-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/deep-learning-molecular-design-review/</guid><description>A 2019 review surveying deep generative models for molecular design, covering RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-deep-generative-models-for-molecular-design">A Systematization of Deep Generative Models for Molecular Design</h2>
<p>This is a <strong>Systematization</strong> paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.</p>
<h2 id="the-challenge-of-navigating-vast-chemical-space">The Challenge of Navigating Vast Chemical Space</h2>
<p>The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.</p>
<p>By 2016, <a href="/notes/machine-learning/generative-models/">deep generative models</a> had shown strong results in producing original images, music, and text. The &ldquo;molecular autoencoder&rdquo; of <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016/2018)</a> first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.</p>
<h2 id="molecular-representations-and-architecture-taxonomy">Molecular Representations and Architecture Taxonomy</h2>
<p>The review&rsquo;s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The review categorizes representations into 3D and 2D graph-based schemes:</p>
<p><strong>3D representations</strong> include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.</p>
<p><strong>2D graph representations</strong> include:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings</strong>: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.</li>
<li><strong>Canonical SMILES</strong>: Unique but potentially encode grammar rules rather than chemical structure.</li>
<li><strong>Context-free grammars (CFGs)</strong>: Decompose SMILES into grammar rules to improve validity rates, though not to 100%.</li>
<li><strong>Tensor representations</strong>: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.</li>
<li><strong>Graph operations</strong>: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.</li>
</ul>
<h3 id="deep-learning-architectures">Deep Learning Architectures</h3>
<p><strong>Recurrent Neural Networks (RNNs)</strong> generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:</p>
<p>$$
L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid S_{1:t-1})
$$</p>
<p>Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.</p>
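<p>As a minimal sketch of thermal rescaling in plain Python (the function name <code>thermal_softmax</code> and the example logits are illustrative, not from the paper), dividing the next-character logits by a temperature $T$ before the softmax controls the tradeoff:</p>

```python
import math

def thermal_softmax(logits, T=1.0):
    """Rescale next-character logits by temperature T before sampling.

    T < 1 sharpens the distribution (more valid, less diverse SMILES);
    T > 1 flattens it (more diverse, more invalid strings).
    """
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs_cold = thermal_softmax([2.0, 1.0, 0.1], T=0.5)  # sharper
probs_hot = thermal_softmax([2.0, 1.0, 0.1], T=2.0)   # flatter
```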
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$
\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x) \,\|\, p(z)]
$$</p>
<p>The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">grammar VAEs</a> (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.</p>
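<p>For a diagonal Gaussian posterior, the KL term in the ELBO has a closed form; the following sketch (function name hypothetical) evaluates it, assuming the prior is a standard Gaussian as above:</p>

```python
import math

def kl_diag_gaussian(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims.

    Closed form: 0.5 * sum(exp(logvar) + mu^2 - 1 - logvar).
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

# The KL penalty vanishes exactly when the posterior matches the prior:
zero_kl = kl_diag_gaussian([0.0, 0.0], [0.0, 0.0])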
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> train a generator against a discriminator using the minimax objective:</p>
<p>$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$</p>
<p>The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more &ldquo;balanced&rdquo; training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover&rsquo;s distance for more stable training:</p>
<p>$$
W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert
$$</p>
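<p>In one dimension the Earth mover's distance has a simple exact solution: the optimal coupling matches sorted samples. A small illustrative sketch (function name and data hypothetical):</p>

```python
def wasserstein_1d(xs, ys):
    """Earth mover's distance between two 1-D empirical distributions
    with equal sample counts: the optimal transport plan pairs sorted
    samples, so W = mean |x_(i) - y_(i)|."""
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Shifting a distribution by a constant c moves it exactly c units of mass:
w = wasserstein_1d([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])  # -> 3.0
```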
<p><strong>Reinforcement Learning</strong> recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:</p>
<p>$$
\nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right]
$$</p>
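<p>The REINFORCE gradient above simplifies to $G_t \nabla_{\theta} \log \pi_{\theta}$, which for a softmax policy over a finite alphabet is just (one-hot minus probabilities). A toy single-step sketch, assuming a hypothetical reward function and learning rate:</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward_fn, lr=0.5):
    """One REINFORCE update for a single-step categorical policy.
    grad of log pi(a) w.r.t. the logits is (onehot(a) - probs)."""
    probs = softmax(logits)
    a = random.choices(range(len(logits)), weights=probs)[0]
    G = reward_fn(a)  # return observed after taking action a
    return [z + lr * G * ((1.0 if i == a else 0.0) - p)
            for i, (z, p) in enumerate(zip(logits, probs))]

random.seed(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):  # only action 2 is rewarded
    logits = reinforce_step(logits, lambda a: 1.0 if a == 2 else 0.0)
best = max(range(3), key=lambda i: softmax(logits)[i])
```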
<p>To prevent RL fine-tuning from causing the generator to &ldquo;drift&rdquo; away from viable chemical structures, an augmented reward function incorporates the prior likelihood:</p>
<p>$$
R'(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2}
$$</p>
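<p>The augmented formulation above can be sketched directly (function name and the value $\sigma = 60$, borrowed from typical REINVENT settings, are illustrative):</p>

```python
def augmented_loss(log_p_prior, log_p_agent, reward, sigma=60.0):
    """Squared gap between the agent log-likelihood and the prior
    log-likelihood augmented by the scaled reward."""
    augmented = log_p_prior + sigma * reward
    return (augmented - log_p_agent) ** 2

# An agent whose likelihood matches the augmented target incurs zero loss:
loss = augmented_loss(log_p_prior=-40.0, log_p_agent=-10.0, reward=0.5)
```

<p>Because the prior likelihood enters the target, the agent is penalized for drifting toward sequences the pre-trained model considers implausible, regardless of reward.</p>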
<h2 id="cataloging-45-models-and-their-design-choices">Cataloging 45 Models and Their Design Choices</h2>
<p>Rather than running new experiments, the review&rsquo;s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model&rsquo;s architecture, representation, training dataset, and dataset size. Key patterns include:</p>
<ul>
<li><strong>RNN-based models</strong> (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.</li>
<li><strong>VAE variants</strong> (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.</li>
<li><strong>GAN models</strong> (7 entries): Include <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.</li>
<li><strong>Other approaches</strong> (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.</li>
</ul>
<p>The review also catalogs 13 publicly available datasets (Table 3), ranging from QM9 (133K molecules with quantum chemical properties) to <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a> (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).</p>
<h3 id="metrics-and-reward-function-design">Metrics and Reward Function Design</h3>
<p>A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:</p>
<p><strong>Diversity</strong> using Tanimoto similarity over fingerprints:</p>
<p>$$
r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|^{2}} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2})
$$</p>
<p><strong>Novelty</strong> measured as the fraction of generated molecules not appearing in a hold-out test set:</p>
<p>$$
r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{G}|}
$$</p>
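<p>Both metrics reduce to simple set arithmetic once molecules are mapped to fingerprint bit sets. A minimal sketch (function names and the tiny fingerprints are hypothetical; real pipelines would use RDKit ECFP fingerprints):</p>

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity between two fingerprint bit sets."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 1.0

def internal_diversity(fps):
    """1 minus the mean pairwise Tanimoto over all ordered pairs."""
    n = len(fps)
    total = sum(tanimoto(a, b) for a in fps for b in fps)
    return 1.0 - total / (n * n)

def novelty(generated, reference):
    """Fraction of generated fingerprints absent from the reference set."""
    return 1.0 - sum(1 for g in generated if g in reference) / len(generated)

fps = [frozenset({1, 2, 3}), frozenset({3, 4, 5})]
div = internal_diversity(fps)
nov = novelty(fps, {frozenset({1, 2, 3})})
```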
<p><strong>Synthesizability</strong> primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.</p>
<p>The review also discusses the <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>, <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, and DiversityNet.</p>
<h2 id="key-findings-and-future-directions">Key Findings and Future Directions</h2>
<p>The review identifies several major trends and conclusions:</p>
<p><strong>Shift from SMILES to graph-based representations.</strong> SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.</p>
<p><strong>Advantages of adversarial and RL training over MLE.</strong> The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.</p>
<p><strong>Genetic algorithms remain competitive.</strong> The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.</p>
<p><strong>Reward function design is underappreciated.</strong> Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.</p>
<p><strong>Need for standardized benchmarks.</strong> The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>977M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC15</td>
          <td>750M+</td>
          <td>Commercially available compounds</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>50M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>2M</td>
          <td>Curated bioactive molecules</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>QM9</td>
          <td>133,885</td>
          <td>Small organic molecules with DFT properties</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>PubChemQC</td>
          <td>3.98M</td>
          <td>PubChem compounds with DFT data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Key evaluation frameworks discussed:</p>
<ul>
<li><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (molecular analog of FID)</li>
<li><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> benchmarking platform</li>
<li><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmarking suite</li>
<li>Validity rate, uniqueness, novelty, and internal diversity metrics</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Elton, D. C., Boukouvalas, Z., Fuge, M. D., &amp; Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. <em>Molecular Systems Design &amp; Engineering</em>, 4(4), 828-849. <a href="https://doi.org/10.1039/C9ME00039A">https://doi.org/10.1039/C9ME00039A</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{elton2019deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep Learning for Molecular Design -- A Review of the State of the Art}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Systems Design \&amp; Engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{828--849}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C9ME00039A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
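<p>The property filter is a straightforward applicability-domain check. A sketch of the MW/LogP part (the function name and the statistics are illustrative, not the actual ZINC250k values):</p>

```python
def within_domain(mw, logp, stats, n_sigma=4.0):
    """Keep a molecule only if MW and LogP both fall within
    mu +/- n_sigma * sigma of the pre-training set statistics."""
    for value, (mu, sigma) in ((mw, stats["mw"]), (logp, stats["logp"])):
        if abs(value - mu) > n_sigma * sigma:
            return False
    return True

# Illustrative (mean, std) pairs, not actual ZINC250k statistics:
stats = {"mw": (330.0, 62.0), "logp": (2.5, 1.4)}
ok = within_domain(mw=350.0, logp=3.0, stats=stats)
too_big = within_domain(mw=900.0, logp=3.0, stats=stats)
```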
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
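<p>The diverse selection is a greedy filter over score-ranked molecules. A minimal sketch, assuming fingerprints are bit sets and using the 0.35 Tanimoto threshold from the paper (function name and toy data hypothetical):</p>

```python
def select_diverse_top_k(scored_fps, k=10, threshold=0.35):
    """Greedily pick the top-k scored molecules, skipping any whose
    Tanimoto similarity to an already-selected one exceeds threshold."""
    def tanimoto(a, b):
        union = len(a | b)
        return len(a & b) / union if union else 1.0

    selected = []
    for score, fp in sorted(scored_fps, key=lambda t: -t[0]):
        if all(tanimoto(fp, s) <= threshold for _, s in selected):
            selected.append((score, fp))
        if len(selected) == k:
            break
    return selected

pool = [(0.9, frozenset({1, 2, 3})),
        (0.8, frozenset({1, 2, 4})),   # Tanimoto 0.5 to the first: skipped
        (0.7, frozenset({7, 8, 9}))]
picked = select_diverse_top_k(pool, k=2)
```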
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: Uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys and $\sqrt{d_k}$ is the scaling factor. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
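<p>As a concrete illustration (a minimal NumPy sketch of the standard formulas above, not the paper's tensor2tensor implementation), scaled dot-product attention and the sinusoidal positional encoding can be written as:</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))  # assumes even d_model
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Toy example: 5 positions of width 8 (made-up sizes, not the paper's)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out = scaled_dot_product_attention(X, X, X)
pe = sinusoidal_positional_encoding(max_len=16, d_model=8)
```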
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
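<p>A minimal Needleman-Wunsch global alignment makes the similarity constraint concrete; this sketch uses illustrative scoring (+1 match, -1 mismatch, -1 gap), not the EMBOSS parameters used in the paper:</p>

```python
def needleman_wunsch_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b, then return percent identity over the alignment."""
    n, m = len(a), len(b)
    # Fill the DP score matrix
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back to count matched positions and total alignment length
    i, j, matches, length = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return 100.0 * matches / length

identity = needleman_wunsch_identity("HEAGAWGHEE", "HEAGAWGHE")
```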
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K epochs on a single GPU in Google Colaboratory</li>
<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
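<p>These metrics are straightforward to compute with RDKit (which the paper uses for validity checking and canonicalization); a sketch over a toy batch of generated strings, where uniqueness is measured after canonicalization:</p>

```python
from rdkit import Chem, RDLogger

RDLogger.DisableLog("rdApp.error")  # silence parse warnings for invalid SMILES

def validity_and_uniqueness(smiles_list):
    """Percent valid SMILES, and percent unique among the valid ones
    (uniqueness measured on canonical SMILES)."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    valid = [Chem.MolToSmiles(m) for m in mols if m is not None]
    pct_valid = 100.0 * len(valid) / len(smiles_list)
    pct_unique = 100.0 * len(set(valid)) / len(valid) if valid else 0.0
    return pct_valid, pct_unique

# Toy batch: three valid strings (two are the same molecule), one invalid
batch = ["c1ccccc1", "C1=CC=CC=C1", "CCO", "C1CC"]  # last one: unclosed ring
pct_valid, pct_unique = validity_and_uniqueness(batch)
```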
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
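<p>The table's criteria can be checked per molecule with RDKit (also used in the paper for property calculation). A sketch, using aspirin as a stand-in for a generated SMILES; SAS is omitted because it requires RDKit's separate <code>sascorer</code> contrib module:</p>

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def drug_likeness(smiles):
    """Compute the properties from the table above for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # invalid SMILES
    return {
        "logP": Descriptors.MolLogP(mol),
        "mol_wt": Descriptors.MolWt(mol),
        "h_donors": Descriptors.NumHDonors(mol),
        "h_acceptors": Descriptors.NumHAcceptors(mol),
        "rot_bonds": Descriptors.NumRotatableBonds(mol),
        "tpsa": Descriptors.TPSA(mol),
        "qed": QED.qed(mol),
    }

props = drug_likeness("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
passes_ro5 = (props["logP"] < 5 and props["mol_wt"] < 500
              and props["h_donors"] < 5 and props["h_acceptors"] < 10)
```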
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
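<p>The nearest-neighbor analysis reduces to Tanimoto similarity between fingerprints. A pure-Python sketch with toy bit sets standing in for the Morgan fingerprints the paper computes via RDKit:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints represented as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_nearest_neighbor_similarity(generated, training):
    """For each generated fingerprint, take its maximum similarity to any
    training fingerprint, then average: the novelty statistic reported above."""
    nearest = [max(tanimoto(g, t) for t in training) for g in generated]
    return sum(nearest) / len(nearest)

# Toy fingerprints: one "memorized" molecule and one genuinely novel one
training = [{1, 2, 3, 4}, {2, 3, 5, 8}]
generated = [{1, 2, 3, 4}, {6, 7, 9}]
score = mean_nearest_neighbor_similarity(generated, training)
```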
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K epochs</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x'_i, h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\mathbf{c}$ denotes the ordinary token context attended over and $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
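<p>The decomposition is exact and easy to verify numerically: attention over the concatenated sequence equals the $\lambda$-weighted mixture of the two terms, with $\lambda$ being the total softmax mass on the prefix positions. A NumPy sketch with random toy matrices (keys/values are assumed already projected, and the $\sqrt{d_k}$ scaling is omitted for brevity):</p>

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q, K, V):
    return softmax(q @ K.T) @ V

rng = np.random.default_rng(0)
d, n_c, l = 8, 3, 5                 # toy sizes, not the paper's
P = rng.normal(size=(n_c, d))       # prefix keys/values
C = rng.normal(size=(l, d))         # sequence keys/values
q = rng.normal(size=(1, d))         # one query position

# Full attention over the concatenated [PREFIX; x] sequence
full = attn(q, np.vstack([P, C]), np.vstack([P, C]))

# lambda = normalized attention mass on the prefix positions
w = softmax(q @ np.vstack([P, C]).T)
lam = w[:, :n_c].sum(axis=-1, keepdims=True)

# Weighted sum of self-attention and prefix attention reproduces it exactly
decomposed = (1 - lam) * attn(q, C, C) + lam * attn(q, P, P)
assert np.allclose(full, decomposed)
```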
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
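<p>Both losses are simple to write down. A NumPy sketch with made-up token probabilities and property vectors (the real model computes these with learned networks and propagates gradients only through $\hat{\mathbf{c}}$; here the triplet term is applied element-wise and summed):</p>

```python
import numpy as np

def autoregressive_nll(token_probs):
    """L_AT: negative sum of log-probabilities the model assigned to the
    generated tokens, conditioned on the prefix and preceding tokens."""
    return -np.sum(np.log(token_probs))

def triplet_property_loss(c, c_hat, c_dot):
    """L_Pred = max((c_hat - c)^2 - (c_hat - c_dot)^2, 0).
    c: input condition; c_hat: MLP-predicted; c_dot: RDKit-computed (detached)."""
    return np.maximum((c_hat - c) ** 2 - (c_hat - c_dot) ** 2, 0.0).sum()

probs = np.array([0.9, 0.8, 0.95])   # p(x_i | x_<i, prefix) per token (toy)
c     = np.array([0.7, 0.5])         # desired QED, SA (toy values)
c_hat = np.array([0.65, 0.55])       # predicted by the MLP head
c_dot = np.array([0.60, 0.52])       # computed by RDKit from the SMILES

loss = autoregressive_nll(probs) + triplet_property_loss(c, c_hat, c_dot)
```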
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
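<p>The relation-matrix construction itself is generic: perturb one condition at a time and accumulate the absolute change in the attention map. A finite-difference sketch on a toy stand-in (the real analysis differentiates the model's prefix attention; <code>toy_attention_map</code> is a made-up function for illustration only):</p>

```python
import numpy as np

def toy_attention_map(c):
    """Stand-in for the prefix self-attention map A as a function of conditions."""
    M = np.outer(c, c)
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def relation_matrix(f, c, delta=1.0):
    """R = sum_i |dA/dc_i|, approximated with forward differences
    (delta = 1, matching the paper's first-order approximation)."""
    A0 = f(c)
    R = np.zeros_like(A0)
    for i in range(len(c)):
        c_pert = c.copy()
        c_pert[i] += delta
        R += np.abs(f(c_pert) - A0) / delta
    return R

c = np.array([0.5, 1.0, -0.3, 0.8, 0.2])  # toy values for Vina, QED, SA, LogP, Lipinski
R = relation_matrix(toy_attention_map, c)
```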
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Link-INVENT: RL-Driven Molecular Linker Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/link-invent-generative-linker-design/</guid><description>Link-INVENT extends REINVENT for molecular linker design using RNN-based generation and reinforcement learning with flexible multi-parameter scoring.</description><content:encoded><![CDATA[<h2 id="a-method-for-generative-linker-design-with-reinforcement-learning">A Method for Generative Linker Design with Reinforcement Learning</h2>
<p>Link-INVENT is a <strong>Method</strong> paper that introduces a generative model for molecular linker design built on the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo design platform. The primary contribution is an encoder-decoder recurrent neural network (RNN) architecture that generates SMILES-based linkers connecting two molecular subunits, combined with a flexible multi-parameter optimization (MPO) scoring function and reinforcement learning (RL) to steer generation toward desired properties. Link-INVENT targets three practical drug discovery tasks: fragment linking, scaffold hopping, and <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">proteolysis targeting chimera</a> (PROTAC) design.</p>
<h2 id="why-linker-design-needs-flexible-multi-parameter-optimization">Why Linker Design Needs Flexible Multi-Parameter Optimization</h2>
<p>Generating suitable chemical linkers between molecular subunits is a central challenge in <a href="https://en.wikipedia.org/wiki/Fragment-based_lead_discovery">fragment-based drug discovery</a> (FBDD), scaffold hopping, and PROTAC design. Traditional computational approaches rely on database searches, which inherently restrict proposed linkers to a pre-defined collection. Recent deep learning methods (DeLinker, SyntaLinker, 3DLinker, DiffLinker) can generate novel linkers but offer limited support for optimizing specific physicochemical properties. Users can typically control only linker length and a few properties like hydrogen-bond donor count.</p>
<p>The key gaps that Link-INVENT addresses are:</p>
<ol>
<li><strong>Conditioning on both subunits</strong>: Prior RNN-based approaches (SAMOA) generate linkers conditioned only on the SMILES sequence seen so far, which may not account for the second molecular subunit. Link-INVENT conditions on both warheads simultaneously.</li>
<li><strong>Flexible scoring</strong>: Existing DL-based linker design tools lack the ability to define tailored MPO objectives. Link-INVENT inherits <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 4&rsquo;s</a> full scoring infrastructure and adds linker-specific properties.</li>
<li><strong>Generalizability</strong>: A single trained prior handles fragment linking, scaffold hopping, and PROTAC tasks without retraining.</li>
</ol>
<h2 id="core-innovation-conditional-linker-generation-with-augmented-likelihood-rl">Core Innovation: Conditional Linker Generation with Augmented Likelihood RL</h2>
<p>Link-INVENT&rsquo;s architecture is an encoder-decoder RNN adapted from the Lib-INVENT library design model. The encoder processes a pair of warheads (molecular subunits with defined exit vectors), and the decoder generates a linker token by token, yielding a connected molecule in SMILES format. The model uses three hidden layers of 512 LSTM cells with an embedding size of 256.</p>
<h3 id="training">Training</h3>
<p>The prior is trained on ChEMBL v27 data processed through reaction-based slicing to generate (linker, warheads pair, full molecule) tuples. <a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a> augments the training data at each epoch, improving chemical space generalizability. The prior is trained by maximizing the likelihood of generating a linker conditioned on the input warhead pair, with teacher forcing for stability.</p>
<h3 id="multi-parameter-optimization-via-rl">Multi-Parameter Optimization via RL</h3>
<p>The scoring function $S(x)$ is a weighted geometric mean of individual component scores:</p>
<p>$$
S(x) = \left(\prod_{i=1}^{n} C_{i}(x)^{w_{i}}\right)^{\frac{1}{\sum_{i=1}^{n} w_{i}}}
$$</p>
<p>where $x$ is a sampled linked molecule, $C_{i}(x)$ is the score for the $i$-th component, and $w_{i}$ is its weight.</p>
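<p>As a concrete illustration, the weighted geometric mean can be computed directly from component scores in $[0, 1]$ (a minimal sketch; the component values and weights below are illustrative, not taken from the paper):</p>

```python
import math

def weighted_geometric_mean(scores, weights):
    """Weighted geometric mean of component scores in [0, 1].

    A zero on any component zeroes the total score, which is the
    intended "all objectives must be satisfied" behaviour.
    """
    if any(s == 0.0 for s in scores):
        return 0.0
    total_weight = sum(weights)
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights))
    return math.exp(log_sum / total_weight)

# Three hypothetical components, e.g. a docking-derived score,
# a linker MW score, and a length-ratio score, weighted 2:1:1.
overall = weighted_geometric_mean([0.9, 0.5, 0.8], [2.0, 1.0, 1.0])
```

<p>The geometric (rather than arithmetic) mean penalizes molecules that fail any single objective, which is why a full diversity-filter bucket can veto a molecule by assigning it a zero.</p>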
<p>The agent (initialized as a copy of the prior) is updated via the Difference of Augmented and Posterior likelihoods (DAP) loss. The <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">augmented log likelihood</a> is:</p>
<p>$$
\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma \cdot S(x)
$$</p>
<p>where $\pi$ denotes a policy (token sampling probabilities conditioned on the sequence so far) and $\sigma$ is a scalar factor. The loss function is:</p>
<p>$$
J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2}
$$</p>
<p>Minimizing $J(\theta)$ steers the agent to generate molecules that satisfy the scoring function while remaining anchored to the prior&rsquo;s chemical space.</p>
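<p>A minimal sketch of the per-sequence DAP loss (in practice the log-likelihoods come from the prior and agent RNNs, the loss is averaged over a batch, and gradients are taken with respect to the agent&rsquo;s parameters; the default $\sigma$ here is illustrative, not the paper&rsquo;s value):</p>

```python
def dap_loss(logp_prior, logp_agent, score, sigma=120.0):
    """DAP loss for one sampled sequence.

    logp_prior / logp_agent: log-likelihoods of the sampled SMILES
    under the prior and agent policies; score: S(x) in [0, 1];
    sigma: scalar weighting the reward against the prior anchor.
    """
    logp_augmented = logp_prior + sigma * score
    return (logp_augmented - logp_agent) ** 2

# A perfect-scoring molecule whose agent likelihood already matches
# the augmented target contributes zero loss.
loss = dap_loss(logp_prior=-30.0, logp_agent=-20.0, score=1.0, sigma=10.0)  # == 0.0
```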
<h3 id="diversity-filters">Diversity Filters</h3>
<p>Link-INVENT uses Diversity Filters (DFs) to balance exploration and exploitation. Buckets of limited size track unique <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/memory-assisted-rl-diverse-molecular-design/">Bemis-Murcko scaffolds</a>. When a bucket is full, further sampling of that scaffold receives a score of zero, encouraging the agent to explore diverse chemical space regions.</p>
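<p>The bucket mechanism can be sketched as follows (the keys would in practice be canonical Bemis-Murcko scaffold SMILES computed with a cheminformatics toolkit such as RDKit; here they are plain strings, and the default bucket size of 25 matches the setting reported below):</p>

```python
from collections import defaultdict

class ScaffoldDiversityFilter:
    """Bucket-based diversity filter: once a scaffold's bucket is
    full, further molecules with that scaffold score zero."""

    def __init__(self, bucket_size=25):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(int)

    def filter_score(self, scaffold, score):
        if self.buckets[scaffold] >= self.bucket_size:
            return 0.0  # bucket full: penalize further sampling of this scaffold
        self.buckets[scaffold] += 1
        return score
```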
<h3 id="linker-specific-scoring-components">Linker-Specific Scoring Components</h3>
<p>New scoring components provide direct control over linker properties:</p>
<ul>
<li><strong>Linker effective length</strong>: number of bonds between attachment atoms</li>
<li><strong>Linker maximum graph length</strong>: bonds in the longest graph traversal path</li>
<li><strong>Linker length ratio</strong>: effective length divided by maximum graph length (controls branching)</li>
<li><strong>Linker ratio of rotatable bonds</strong>: rotatable bonds over total bonds (controls flexibility)</li>
<li><strong>Linker number of rings</strong>: controls linearity vs. cyclicity</li>
<li><strong>Linker number of HBDs</strong>: hydrogen-bond donors in the linker itself</li>
</ul>
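<p>The ratio-style components reduce to simple arithmetic on bond counts (a sketch; in practice the counts are derived from the linker substructure with a toolkit such as RDKit, and raw values are mapped into $[0, 1]$ by a score transform, approximated here by a hard in-range check):</p>

```python
def linker_length_ratio(effective_length, max_graph_length):
    """Effective length / maximum graph length, in percent.
    100% corresponds to a perfectly linear (unbranched) linker."""
    return 100.0 * effective_length / max_graph_length

def rotatable_bond_ratio(n_rotatable_bonds, n_bonds):
    """Fraction of linker bonds that are rotatable, in percent
    (a proxy for linker flexibility)."""
    return 100.0 * n_rotatable_bonds / n_bonds

def in_range_score(value, low, high):
    """Hard in-range transform mapping a raw property to {0, 1}."""
    return 1.0 if low <= value <= high else 0.0
```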
<h2 id="experimental-evaluation-across-three-drug-discovery-tasks">Experimental Evaluation Across Three Drug Discovery Tasks</h2>
<p>Link-INVENT was evaluated through four experiments across three drug discovery applications, all using the same pre-trained prior.</p>
<h3 id="illustrative-example-two-benzene-rings">Illustrative Example: Two Benzene Rings</h3>
<p>A simple experiment linked two benzene rings with the objectives of limiting HBDs and requiring exactly one ring in the linker. Over 20 epochs, the agent learned to satisfy both objectives, demonstrating the basic RL-guided generation process.</p>
<h3 id="experiment-1a-fragment-linking-ck2-alpha-inhibitors">Experiment 1a: Fragment Linking (CK2 alpha Inhibitors)</h3>
<p>Based on the <a href="https://en.wikipedia.org/wiki/Casein_kinase_2">casein kinase 2</a> (CK2 alpha) fragment linking campaign by Fusco and Brear et al., Link-INVENT was tasked with linking two fragment hits while retaining the Lys68 hydrogen-bond interaction via a DockStream docking constraint (Glide/LigPrep backend). The scoring function also enforced a linker length ratio &gt;= 70% and linker MW &lt;= 200 Da.</p>
<p>Over 100 epochs in triplicate, the agent generated molecules with gradually improving docking scores. Key results:</p>
<ul>
<li>Docking score distributions across triplicates were nearly identical, demonstrating reproducibility</li>
<li>Some generated molecules achieved more favorable docking scores than the reference ligand CAM4066 (-15.20 kcal/mol)</li>
<li>More than 5000 unique Bemis-Murcko scaffolds were generated, with minimal overlap across replicates</li>
<li>Binding pose analysis showed the generated linker closely resembled the ground-truth linker, retaining the Lys68 interaction</li>
</ul>
<h3 id="experiment-1b-comparison-fragment-linking-impdh-inhibitors">Experiment 1b: Comparison Fragment Linking (IMPDH Inhibitors)</h3>
<p>Using the IMPDH inhibitor fragment linking case study from Trapero et al., this experiment applied core constrained docking (fragment pose within 0.3 A of the reference) and compared results to DeLinker and SyntaLinker. The scoring function enforced a linker effective length in [3, 5], a length ratio &gt;= 70%, and linker MW &lt;= 150 Da.</p>
<p>Link-INVENT generated 8960 SMILES across 70 epochs (comparable to DeLinker&rsquo;s 9000 molecular graphs). Results:</p>
<ul>
<li>Link-INVENT generated molecules with more favorable docking scores than the reference ligand across triplicate runs</li>
<li>Of the 20 DeLinker and 3 SyntaLinker example molecules, none of the DeLinker molecules and only one SyntaLinker molecule (the recovered reference itself) docked at least as well as the reference</li>
<li>Approximately 3000 unique Bemis-Murcko scaffolds were generated from 5000 total molecules</li>
<li>Link-INVENT&rsquo;s advantage comes from including docking explicitly as a learning objective rather than applying it post hoc</li>
</ul>
<h3 id="experiment-2-scaffold-hopping-dlk-inhibitor-cns-optimization">Experiment 2: Scaffold Hopping (DLK Inhibitor CNS Optimization)</h3>
<p>Based on Patel et al.&rsquo;s <a href="https://en.wikipedia.org/wiki/Dual_leucine_zipper_kinase">dual leucine zipper kinase</a> (DLK) inhibitor campaign, Link-INVENT generated new scaffold ideas to improve CNS penetration while retaining potency. The scoring function included a Cys193 docking constraint plus CNS-compatible properties (HBDs &lt; 2, tPSA &lt;= 90 A², 3 &lt;= SlogP &lt;= 4, MW &lt;= 450 Da, 1-2 aromatic rings in the linker).</p>
<p>The solution space was significantly narrower than fragment linking. The agent still generated diverse scaffolds with favorable docking scores, though fewer exceeded the reference ligand&rsquo;s score. Binding pose analysis confirmed retained Cys193 interactions and predicted additional Gln195 hydrogen bonds.</p>
<h3 id="experiment-3-protac-design-bcl-2mcl-1-dual-degradation">Experiment 3: PROTAC Design (Bcl-2/Mcl-1 Dual Degradation)</h3>
<p>Three sub-experiments demonstrated linker-specific scoring components for PROTAC design based on Wang et al.&rsquo;s Bcl-2/Mcl-1 dual degradation strategy:</p>
<table>
  <thead>
      <tr>
          <th>Sub-Experiment</th>
          <th>Objective</th>
          <th>Key Finding</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sub-Exp 1: Linker length</td>
          <td>Generate linkers within specified length intervals [4,6], [7,9], [10,12], [13,15]</td>
          <td>Clear enrichment within target intervals vs. baseline broad distribution</td>
      </tr>
      <tr>
          <td>Sub-Exp 2: Linearity</td>
          <td>Control linear vs. cyclic linkers at fixed length [7,9]</td>
          <td>Baseline ratio ~1:2 linear:cyclic; enforcing linearity or cyclicity achieved strong enrichment</td>
      </tr>
      <tr>
          <td>Sub-Exp 3: Flexibility</td>
          <td>Generate linkers with Low [0,30], Moderate [40,60], or High [70,100] rotatable bond ratios</td>
          <td>Agent learned that rings and sp2 atoms yield rigidity; linear sp3 chains yield flexibility</td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-practical-implications-for-drug-discovery">Key Findings and Practical Implications for Drug Discovery</h2>
<p>Link-INVENT demonstrates several practical advantages for molecular linker design:</p>
<ol>
<li><strong>Single prior, multiple tasks</strong>: The same pre-trained model handles fragment linking, scaffold hopping, and PROTAC design without retraining.</li>
<li><strong>Docking as a learning signal</strong>: Including molecular docking explicitly in the scoring function (via DockStream) during RL yields molecules with more favorable docking scores than approaches that apply docking post hoc.</li>
<li><strong>Implicit 3D awareness</strong>: The docking constraint guides the agent toward 3D structural awareness without explicit 3D coordinate inputs, as demonstrated by the overlap between generated and reference binding poses.</li>
<li><strong>Diverse and reproducible output</strong>: Diversity filters ensure exploration of multiple chemical space regions, and triplicate experiments show consistent docking score distributions with minimal scaffold overlap.</li>
</ol>
<p>Limitations acknowledged by the authors include:</p>
<ul>
<li>The linker flexibility metric (ratio of rotatable bonds) is agnostic to intra-molecular hydrogen bonds and does not account for all rigidity factors</li>
<li>Molecular docking is an approximation that can be exploited (e.g., excessive HBDs achieving favorable scores at the expense of permeability)</li>
<li>Experiments 1a and 1b require a proprietary Schrödinger license for Glide/LigPrep docking</li>
<li>No direct experimental (wet-lab) validation was performed in this study</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL v27 (reaction-sliced)</td>
          <td>Not specified</td>
          <td>Filtered for drug-like compounds, then reaction-based slicing with SMIRKS</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Held-out Bemis-Murcko scaffolds</td>
          <td>287 scaffolds</td>
          <td>Held out from training set</td>
      </tr>
      <tr>
          <td>SMILES augmentation</td>
          <td>Randomized SMILES per epoch</td>
          <td>Same tuples, different representations</td>
          <td>Improves generalizability</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder RNN with 3 hidden layers of 512 LSTM cells, embedding size 256</li>
<li><strong>RL loss</strong>: DAP (Difference of Augmented and Posterior likelihoods)</li>
<li><strong>Batch size</strong>: 128 molecules per epoch</li>
<li><strong>Diversity filter</strong>: Bemis-Murcko scaffold buckets of size 25</li>
<li><strong>Score threshold</strong>: 0 (to store all molecules for analysis)</li>
<li><strong>Scoring function</strong>: Weighted geometric mean of component scores</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Single pre-trained prior used across all experiments</li>
<li>Agent initialized as copy of prior, updated via RL</li>
<li>Pre-trained prior available at GitHub repository</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecular docking via DockStream with Glide/LigPrep backend</li>
<li>Triplicate runs for all experiments</li>
<li>Metrics: docking scores, unique Bemis-Murcko scaffold counts, binding pose overlap</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT (Link-INVENT code)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Main codebase for Link-INVENT</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity">ReinventCommunity (data + tutorial)</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training/validation data, reaction SMIRKS, pre-trained prior, Jupyter tutorial</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Partially Reproducible. Code, training data, and pre-trained prior are publicly available. However, reproducing the docking-based experiments (1a, 1b, and 2) requires a proprietary Schrödinger license for Glide and LigPrep. The PROTAC experiments (Experiment 3) that use only physicochemical scoring are fully reproducible with the open-source code.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Knuth, F., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2023). Link-INVENT: generative linker design with reinforcement learning. <em>Digital Discovery</em>, 2, 392-408. <a href="https://doi.org/10.1039/D2DD00115B">https://doi.org/10.1039/D2DD00115B</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2023link,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Link-INVENT: generative linker design with reinforcement learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Knuth, Franziska and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Digital Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D2DD00115B}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
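<p>Converting predicted local spherical coordinates into a Cartesian position relative to the three reference atoms follows the standard internal-to-Cartesian (NeRF-style) construction, sketched below. The exact axis conventions used by Lingo3DMol are not specified in this summary, so treat this as an illustrative assumption:</p>

```python
import math

def place_atom(root1, root2, root3, r, theta, phi):
    """Place a new atom at bond length r from root1, with bond angle
    theta (new-root1-root2) and dihedral phi (new-root1-root2-root3).
    Angles in radians; roots are [x, y, z] lists; root2-root3 must
    not be collinear with root1-root2."""
    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    def unit(a):
        n = math.sqrt(sum(x * x for x in a))
        return [x / n for x in a]

    bc = unit(sub(root1, root2))            # axis root2 -> root1
    n = unit(cross(sub(root2, root3), bc))  # normal of the root plane
    m = cross(n, bc)                        # completes the orthonormal frame
    d = [-r * math.cos(theta),
         r * math.sin(theta) * math.cos(phi),
         r * math.sin(theta) * math.sin(phi)]
    return [root1[i] + d[0] * bc[i] + d[1] * m[i] + d[2] * n[i]
            for i in range(3)]
```

<p>Because $r$ and $\theta$ are nearly rigid in real molecules, predicting them locally is easier than predicting raw Cartesian coordinates, which motivates the fusion with the global 3D decoder described above.</p>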
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
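<p>A framework-free toy sketch of attention with an additive structural bias (single head on plain Python lists, with one matrix <code>B</code> standing in for $B_D + B_J$; shapes and values are illustrative):</p>

```python
import math

def biased_attention(Q, K, V, B):
    """Single-head attention with an additive bias matrix B:
    softmax(Q K^T / sqrt(d) + B) V, computed on nested lists."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        logits = [sum(q[t] * k[t] for t in range(d)) / math.sqrt(d) + B[i][j]
                  for j, k in enumerate(K)]
        peak = max(logits)                      # numerically stable softmax
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(weights[j] * V[j][t] for j in range(len(V)))
                    for t in range(len(V[0]))])
    return out
```

<p>A large positive bias entry (e.g. from a short inter-atomic distance) pulls attention toward that key, which is how the distance and edge-vector terms inject graph structure.</p>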
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
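<p>The drug-likeness filter itself is a simple predicate (a sketch; in practice the QED and SAS values would come from a toolkit such as RDKit&rsquo;s QED module and an SA score implementation, and the molecules below are hypothetical):</p>

```python
def is_drug_like(qed, sas, qed_min=0.3, sas_max=5.0):
    """The paper's filter: keep molecules with QED >= 0.3 and SAS <= 5."""
    return qed >= qed_min and sas <= sas_max

# Hypothetical (QED, SAS) values for three generated molecules.
candidates = [{"qed": 0.6, "sas": 3.1},
              {"qed": 0.2, "sas": 2.0},
              {"qed": 0.5, "sas": 6.5}]
kept = [m for m in candidates if is_drug_like(m["qed"], m["sas"])]
```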
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which 1.4M from public sources are shared). Diversity of generated drug-like molecules is slightly lower than baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Generative AI Survey for De Novo Molecule and Protein Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/generative-ai-drug-design-survey/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/surveys-and-reviews/generative-ai-drug-design-survey/</guid><description>Comprehensive survey of generative AI for de novo drug design covering molecule and protein generation with VAEs, GANs, diffusion, and flow models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-generative-ai-for-drug-design">A Systematization of Generative AI for Drug Design</h2>
<p>This is a <strong>Systematization</strong> paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.</p>
<p>The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.</p>
<h2 id="the-challenge-of-navigating-de-novo-drug-design">The Challenge of Navigating De Novo Drug Design</h2>
<p>The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery methods are computationally expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.</p>
<p>AI-driven generative methods have gained traction in recent years, with AI-focused biotech companies reporting over 150 small-molecule drugs in the discovery phase and 15 in clinical trials. The rate of AI-fueled drug design programs has expanded by almost 40% each year.</p>
<p>The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.</p>
<h2 id="unified-taxonomy-two-themes-seven-subtasks">Unified Taxonomy: Two Themes, Seven Subtasks</h2>
<p>The survey&rsquo;s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.</p>
<h3 id="generative-model-architectures">Generative Model Architectures</h3>
<p>The survey covers four main generative model families used across both molecule and protein generation:</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$</p>
<p>where the KL loss is:</p>
<p>$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log(\sigma_k^{(i)2}) - \mu_k^{(i)2} - \sigma_k^{(i)2}\right)$$</p>
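<p>A minimal sketch of this objective in plain Python (the squared-error reconstruction term is an assumption; the formula above leaves the reconstruction loss generic, and the encoder/decoder networks are omitted):</p>

```python
import math

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """L = L_recon + beta * L_KL over one sample, with
    L_KL = -1/2 * sum_k (1 + log sigma_k^2 - mu_k^2 - sigma_k^2).
    Inputs are plain lists; log_var holds log sigma_k^2 per latent dim."""
    recon = sum((a - b) ** 2 for a, b in zip(x_recon, x))  # assumed MSE term
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))
    return recon + beta * kl
```

With a perfect reconstruction and a standard-normal posterior (mu = 0, log_var = 0), both terms vanish.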
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:</p>
<p>$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$</p>
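<p>At a fixed discriminator, the value of this objective can be estimated from sampled discriminator outputs; a toy sketch (the batch values used below are illustrative, not from any trained model):</p>

```python
import math

def gan_value(d_real, d_fake):
    """Monte-Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]
    from discriminator outputs on a real batch and a generated batch."""
    term_real = sum(math.log(p) for p in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return term_real + term_fake
```

At the equilibrium where the discriminator outputs D = 0.5 everywhere, the value is -2 log 2.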
<p><strong>Flow-Based Models</strong> generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:</p>
<p>$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|$$</p>
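<p>For a one-dimensional affine flow, the change of variables can be checked directly; note that with $f: z \mapsto x$ the log-Jacobian term is subtracted:</p>

```python
import math

def affine_flow_logp(x, a=2.0, b=1.0):
    """Exact log-density under a 1-D affine flow x = f(z) = a*z + b
    with a standard-normal base distribution p0. Change of variables:
    log p(x) = log p0(z) - log |det df/dz|, and here df/dz = a."""
    z = (x - b) / a                                    # invert the flow
    log_p0 = -0.5 * (z * z + math.log(2 * math.pi))    # standard-normal log-density
    return log_p0 - math.log(abs(a))
```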
<p><strong>Diffusion Models</strong> gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:</p>
<p>$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$</p>
<p>The training loss minimizes the difference between the true noise and the predicted noise:</p>
<p>$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$</p>
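<p>A minimal sketch of the forward noising step and the per-sample denoising objective, with the noise-prediction network left abstract (function names here are illustrative):</p>

```python
import math
import random

def forward_step(x_t, beta_t, rng=random.Random(0)):
    """One forward noising step:
    x_{t+1} = sqrt(1 - beta_t) * x_t + sqrt(beta_t) * eps, eps ~ N(0, I).
    Returns both x_{t+1} and the sampled noise (the training target)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_t]
    x_next = [math.sqrt(1.0 - beta_t) * xi + math.sqrt(beta_t) * ei
              for xi, ei in zip(x_t, eps)]
    return x_next, eps

def denoising_loss(eps_true, eps_pred):
    """Squared-norm objective ||eps - eps_theta(x_t, t)||^2 for one sample;
    eps_pred would come from the denoising network."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))
```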
<p>Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods: diffusion and flow-based models often use GNNs to process 2D/3D molecular and protein inputs, while VAEs and GANs are more typically applied to 1D (string or sequence) representations.</p>
<h2 id="small-molecule-generation-tasks-datasets-and-models">Small Molecule Generation: Tasks, Datasets, and Models</h2>
<h3 id="target-agnostic-molecule-design">Target-Agnostic Molecule Design</h3>
<p>The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).</p>
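<p>The distribution-level metrics reduce to simple set arithmetic once a validity check is available. In the sketch below, <code>is_valid</code> stands in for a chemistry-toolkit parse (e.g. RDKit), which is an assumption; the metric definitions follow common usage in this literature.</p>

```python
def generation_metrics(samples, is_valid, train_set):
    """Standard target-agnostic generation metrics over SMILES strings:
    validity   = valid / generated
    uniqueness = unique valid / valid
    novelty    = unique valid not seen in training / unique valid"""
    valid = [s for s in samples if is_valid(s)]
    unique = set(valid)
    novel = unique - set(train_set)
    n = len(samples)
    return {
        "validity": len(valid) / n if n else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```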
<p><strong>Datasets</strong>: QM9 (small stable molecules from <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a>) and <a href="/notes/computational-chemistry/datasets/geom/">GEOM</a>-Drug (more complex, drug-like molecules).</p>
<p>The field has shifted from SMILES-based VAEs (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a>, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>EGNN, Diffusion</td>
          <td>99.8</td>
          <td>97.5</td>
          <td>97.9</td>
          <td>97.6</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>EGNN, VAE, Diffusion</td>
          <td>99.2</td>
          <td>89.6</td>
          <td>98.6</td>
          <td>94.6</td>
      </tr>
      <tr>
          <td>JODO</td>
          <td>EGNN, Diffusion</td>
          <td>99.2</td>
          <td>93.4</td>
          <td>99.0</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>VAE, Diffusion</td>
          <td>98.9</td>
          <td>89.4</td>
          <td>93.8</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>EGNN, Diffusion</td>
          <td>98.7</td>
          <td>82.0</td>
          <td>91.9</td>
          <td>90.7</td>
      </tr>
  </tbody>
</table>
<p>EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and Van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a &ldquo;relaxed&rdquo; EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.</p>
<p>On the larger GEOM-Drugs dataset, performance drops for most models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>99.8</td>
          <td>91.6</td>
          <td>77.8</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>&ndash;</td>
          <td>62.2</td>
          <td>99.5</td>
          <td>99.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>84.4</td>
          <td>&ndash;</td>
          <td>99.3</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>81.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<p>MiDi distinguishes itself by generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on the more complex GEOM-Drugs molecules.</p>
<h3 id="target-aware-molecule-design">Target-Aware Molecule Design</h3>
<p>Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.</p>
<p><strong>Datasets</strong>: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.</p>
<p><strong>Metrics</strong>: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).</p>
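<p>Diversity is typically computed as one minus the mean pairwise Tanimoto similarity over molecular fingerprints; a minimal sketch, representing each fingerprint as a set of on-bit indices (e.g. ECFP bits):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprint bit sets:
    |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def diversity(fps):
    """Mean pairwise (1 - Tanimoto) over a batch of generated molecules,
    the usual definition of the Diversity column in tables like this one."""
    pairs = [(i, j) for i in range(len(fps)) for j in range(i + 1, len(fps))]
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto(fps[i], fps[j]) for i, j in pairs) / len(pairs)
```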
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Vina</th>
          <th>Affinity (%)</th>
          <th>QED</th>
          <th>SA</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffSBDD</td>
          <td>EGNN, Diffusion</td>
          <td>-7.333</td>
          <td>&ndash;</td>
          <td>0.467</td>
          <td>0.554</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Luo et al.</td>
          <td>SchNet</td>
          <td>-6.344</td>
          <td>29.09</td>
          <td>0.525</td>
          <td>0.657</td>
          <td>0.720</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>EGNN, Diffusion</td>
          <td>-6.3</td>
          <td>58.1</td>
          <td>0.48</td>
          <td>0.58</td>
          <td>0.72</td>
      </tr>
      <tr>
          <td>LiGAN</td>
          <td>CNN, VAE</td>
          <td>-6.144</td>
          <td>21.1</td>
          <td>0.39</td>
          <td>0.59</td>
          <td>0.66</td>
      </tr>
      <tr>
          <td>Pocket2Mol</td>
          <td>EGNN, MLP</td>
          <td>-5.14</td>
          <td>48.4</td>
          <td>0.56</td>
          <td>0.74</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).</p>
<h3 id="molecular-conformation-generation">Molecular Conformation Generation</h3>
<p>Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations &ldquo;covered&rdquo; within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).</p>
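<p>Given a table of RMSDs between generated and reference conformers, both metrics are a few lines; the default 1.25 Å threshold below follows the GEOM-Drugs convention discussed in this section:</p>

```python
def cov_mat(rmsd, threshold=1.25):
    """COV and MAT from an RMSD table rmsd[i][j] between generated
    conformer i and ground-truth reference conformer j.
    COV: fraction of references matched by some generated conformer
    within `threshold`.  MAT: mean best RMSD over references."""
    n_ref = len(rmsd[0])
    # Closest generated conformer for each reference conformer.
    best = [min(row[j] for row in rmsd) for j in range(n_ref)]
    cov = sum(1 for b in best if b < threshold) / n_ref
    mat = sum(best) / n_ref
    return cov, mat
```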
<p><strong>Datasets</strong>: GEOM-QM9, GEOM-Drugs, ISO17.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>GEOM-QM9 COV (%)</th>
          <th>GEOM-QM9 MAT</th>
          <th>GEOM-Drugs COV (%)</th>
          <th>GEOM-Drugs MAT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Torsional Diff.</td>
          <td>Diffusion</td>
          <td>92.8</td>
          <td>0.178</td>
          <td>72.7*</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>DGSM</td>
          <td>MPNN, Diffusion</td>
          <td>91.49</td>
          <td>0.2139</td>
          <td>78.73</td>
          <td>1.0154</td>
      </tr>
      <tr>
          <td>GeoDiff</td>
          <td>GFN, Diffusion</td>
          <td>90.07</td>
          <td>0.209</td>
          <td>89.13</td>
          <td>0.8629</td>
      </tr>
      <tr>
          <td>ConfGF</td>
          <td>GIN, Diffusion</td>
          <td>88.49</td>
          <td>0.2673</td>
          <td>62.15</td>
          <td>1.1629</td>
      </tr>
      <tr>
          <td>GeoMol</td>
          <td>MPNN</td>
          <td>71.26</td>
          <td>0.3731</td>
          <td>67.16</td>
          <td>1.0875</td>
      </tr>
  </tbody>
</table>
<p>*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.</p>
<p>Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.</p>
<h2 id="protein-generation-from-sequence-to-structure">Protein Generation: From Sequence to Structure</h2>
<h3 id="protein-representation-learning">Protein Representation Learning</h3>
<p>Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman&rsquo;s $\rho$).</p>
<p>Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.</p>
<h3 id="protein-structure-prediction">Protein Structure Prediction</h3>
<p>Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.</p>
<p>AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>CAMEO RMSD</th>
          <th>CAMEO TMScore</th>
          <th>CAMEO GDT-TS</th>
          <th>CAMEO lDDT</th>
          <th>CASP14 TMScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AlphaFold2</td>
          <td>Transformer</td>
          <td>3.30</td>
          <td>0.87</td>
          <td>0.86</td>
          <td>0.90</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>ESMFold</td>
          <td>Transformer</td>
          <td>3.99</td>
          <td>0.85</td>
          <td>0.83</td>
          <td>0.87</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>RoseTTAFold</td>
          <td>Transformer</td>
          <td>5.72</td>
          <td>0.77</td>
          <td>0.71</td>
          <td>0.79</td>
          <td>0.37</td>
      </tr>
      <tr>
          <td>EigenFold</td>
          <td>Diffusion</td>
          <td>7.37</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.78</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="sequence-generation-inverse-folding">Sequence Generation (Inverse Folding)</h3>
<p>Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of valid sequences is between $10^{65}$ and $10^{130}$.</p>
<p>Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):</p>
<p>$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$</p>
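<p>As a sketch, perplexity is the exponentiated negative mean per-token log-probability, so lower is better:</p>

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/N) * sum_i log P(x_i | x_<i)) for one sequence,
    given the per-token log-probabilities assigned by the model."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)
```

A model that assigns probability 0.5 to every token has perplexity exactly 2.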
<p>ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>AAR (%)</th>
          <th>Div.</th>
          <th>RMSD</th>
          <th>Non.</th>
          <th>Time (s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ProteinMPNN</td>
          <td>MPNN</td>
          <td>48.7</td>
          <td>0.168</td>
          <td>1.019</td>
          <td>1.061</td>
          <td>112</td>
      </tr>
      <tr>
          <td>ESM-IF1</td>
          <td>Transformer</td>
          <td>47.7</td>
          <td>0.184</td>
          <td>1.265</td>
          <td>1.201</td>
          <td>1980</td>
      </tr>
      <tr>
          <td>GPD</td>
          <td>Transformer</td>
          <td>46.2</td>
          <td>0.219</td>
          <td>1.758</td>
          <td>1.333</td>
          <td>35</td>
      </tr>
      <tr>
          <td>ABACUS-R</td>
          <td>Transformer</td>
          <td>45.7</td>
          <td>0.124</td>
          <td>1.482</td>
          <td>0.968</td>
          <td>233280</td>
      </tr>
      <tr>
          <td>3D CNN</td>
          <td>CNN</td>
          <td>44.5</td>
          <td>0.272</td>
          <td>1.62</td>
          <td>1.027</td>
          <td>536544</td>
      </tr>
      <tr>
          <td>PiFold</td>
          <td>GNN</td>
          <td>42.8</td>
          <td>0.141</td>
          <td>1.592</td>
          <td>1.464</td>
          <td>221</td>
      </tr>
      <tr>
          <td>ProteinSolver</td>
          <td>GNN</td>
          <td>24.6</td>
          <td>0.186</td>
          <td>5.354</td>
          <td>1.389</td>
          <td>180</td>
      </tr>
  </tbody>
</table>
<p>Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.</p>
<h3 id="backbone-design">Backbone Design</h3>
<p>Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.</p>
<p>Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).</p>
<p>ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.</p>
<p>Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using &ldquo;self-conditioning&rdquo; on predicted structures. Protpardelle co-designs sequence and structure by creating a &ldquo;superposition&rdquo; over possible sidechain states and collapsing them during each iterative diffusion step.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>scTM (%)</th>
          <th>Design. (%)</th>
          <th>PPL</th>
          <th>AAR (%)</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RFDiffusion</td>
          <td>Diffusion</td>
          <td>&ndash;</td>
          <td>95.1</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Protpardelle</td>
          <td>Diffusion</td>
          <td>85</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FrameDiff</td>
          <td>Diffusion</td>
          <td>84</td>
          <td>48.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Genie</td>
          <td>Diffusion</td>
          <td>81.5</td>
          <td>79.0</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>LatentDiff</td>
          <td>EGNN, Diffusion</td>
          <td>31.6</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FoldingDiff</td>
          <td>Diffusion</td>
          <td>14.2</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>ProtDiff</td>
          <td>EGNN, Diffusion</td>
          <td>11.8</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>12.47*</td>
          <td>8.01*</td>
      </tr>
  </tbody>
</table>
<p>*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.</p>
<h3 id="antibody-design">Antibody Design</h3>
<p>The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.</p>
<p>For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. Informative multiple sequence alignments cannot be constructed for the highly variable antibody regions, which makes general models like AlphaFold2 less effective for antibody structure prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.</p>
<h3 id="peptide-design">Peptide Design</h3>
<p>The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).</p>
<h2 id="current-trends-challenges-and-future-directions">Current Trends, Challenges, and Future Directions</h2>
<h3 id="current-trends">Current Trends</h3>
<p>The survey identifies several parallel trends across molecule and protein generation:</p>
<ol>
<li>
<p><strong>Shift from sequence to structure</strong>: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.</p>
</li>
<li>
<p><strong>Dominance of E(3) equivariant architectures</strong>: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.</p>
</li>
<li>
<p><strong>Structure-based over ligand-based approaches</strong>: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.</p>
</li>
</ol>
<h3 id="challenges">Challenges</h3>
<p><strong>For small molecule generation:</strong></p>
<ul>
<li><strong>Complexity</strong>: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.</li>
<li><strong>Applicability</strong>: Generating molecules with high binding affinity to targets remains difficult.</li>
<li><strong>Explainability</strong>: Methods are black-box, offering no insight into why generated molecules have desired properties.</li>
</ul>
<p><strong>For protein generation:</strong></p>
<ul>
<li><strong>Benchmarking</strong>: Protein generative tasks lack a standard evaluation procedure; metrics and testing conditions vary from model to model.</li>
<li><strong>Performance</strong>: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.</li>
</ul>
<p>The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.</p>
<h3 id="data">Data</h3>
<p>The survey catalogs the following key datasets across subtasks:</p>
<table>
  <thead>
      <tr>
          <th>Subtask</th>
          <th>Datasets</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target-agnostic molecule</td>
          <td>QM9, <a href="/notes/computational-chemistry/datasets/geom/">GEOM</a>-Drug</td>
          <td>QM9 from <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB-17</a>; GEOM-Drug for complex molecules</td>
      </tr>
      <tr>
          <td>Target-aware molecule</td>
          <td>CrossDocked2020, ZINC20, Binding MOAD</td>
          <td>CrossDocked2020 most used (22.5M pairs)</td>
      </tr>
      <tr>
          <td>Conformation generation</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM</a>-QM9, GEOM-Drugs, ISO17</td>
          <td>Conformer sets for molecules</td>
      </tr>
      <tr>
          <td>Protein structure prediction</td>
          <td>PDB, CASP14, CAMEO</td>
          <td>CASP biennial blind evaluation</td>
      </tr>
      <tr>
          <td>Protein sequence generation</td>
          <td>PDB, UniRef, UniParc, CATH, TS500</td>
          <td>CATH for domain classification</td>
      </tr>
      <tr>
          <td>Backbone design</td>
          <td>PDB, AlphaFoldDB, SCOP, CATH</td>
          <td>AlphaFoldDB for expanded structural coverage</td>
      </tr>
      <tr>
          <td>Antibody structure</td>
          <td>SAbDab, RAB</td>
          <td>SAbDab: all antibody structures from PDB</td>
      </tr>
      <tr>
          <td>Antibody CDR generation</td>
          <td>SAbDab, RAB, SKEMPI</td>
          <td>SKEMPI for affinity optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gersteinlab/GenAI4Drug">GenAI4Drug</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Organized repository of all covered sources</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., &amp; Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. <em>Briefings in Bioinformatics</em>, 25(4), bbae338. <a href="https://doi.org/10.1093/bib/bbae338">https://doi.org/10.1093/bib/bbae338</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.08703">arXiv: 2402.08703</a></li>
<li><a href="https://github.com/gersteinlab/GenAI4Drug">GitHub: GenAI4Drug</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247410/">PMC: PMC11247410</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tang2024survey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae338}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Curriculum Learning for De Novo Drug Design (REINVENT)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/curriculum-learning-molecular-design/</guid><description>Curriculum learning applied to REINVENT accelerates convergence on complex multi-parameter drug design objectives compared to standard reinforcement learning.</description><content:encoded><![CDATA[<h2 id="curriculum-learning-as-a-method-for-molecular-generation">Curriculum Learning as a Method for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces curriculum learning (CL) into the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> de novo molecular design platform. The primary contribution is a training strategy that decomposes complex multi-parameter optimization (MPO) objectives into sequences of simpler tasks with increasing complexity. The agent learns each simpler task before progressing to the full production objective, accelerating convergence and improving the quality and diversity of generated molecules compared to standard policy-based reinforcement learning (RL).</p>
<h2 id="the-computational-cost-of-complex-reward-functions">The Computational Cost of Complex Reward Functions</h2>
<p>Policy-based RL for molecular design works by training a generative model (the agent) to produce molecules that maximize a reward function. In practice, drug design reward functions often include computationally expensive components such as molecular docking. When the reward landscape is complex and minima are difficult to find, the agent may spend many epochs sampling molecules far from the desired objective. The resulting small gradients cause minimal policy updates, leading to long periods of non-productivity. This is particularly wasteful when each reward evaluation involves expensive physics-based computations.</p>
<p>The core problem is that standard RL treats the full MPO objective as a monolithic task. If the agent cannot find any rewarding molecules early in training, it receives near-zero gradients and makes negligible progress. This creates a bootstrapping problem: the agent needs to already be sampling from favorable regions of chemical space to receive useful learning signals, but it has no guidance on how to get there.</p>
<p>Curriculum learning, originally proposed by Bengio et al. (2009), addresses this by arranging training tasks in order of increasing difficulty. When constituent tasks are correlated with the final objective, the gradients from simpler tasks provide more effective traversal of the optimization landscape.</p>
<h2 id="formalized-curriculum-strategy-for-reinvent">Formalized Curriculum Strategy for REINVENT</h2>
<p>The key innovation is a two-phase training protocol with formal definitions for curriculum progression.</p>
<p>A scoring function maps SMILES strings to desirability scores in $[0, 1]$ using a weighted geometric mean:</p>
<p>$$S(x) = \left(\prod_{i=1}^{n} c_{i}(x)^{w_{i}}\right)^{1 / \sum_{i=1}^{n} w_{i}}$$</p>
<p>where $x$ is a sampled compound in SMILES format, $c_{i}$ is the $i$-th scoring component, and $w_{i}$ is its weight.</p>
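<p>As an illustrative sketch (not the REINVENT implementation), the weighted geometric mean can be computed in log space for numerical stability; the function and floor value here are assumptions for demonstration:</p>

```python
import math

def aggregate_score(component_scores, weights):
    """Weighted geometric mean S(x) of per-component scores c_i(x) in [0, 1].

    Computed in log space; a small floor keeps a zero component from
    producing log(0) while still driving the aggregate toward zero.
    """
    total_weight = sum(weights)
    log_sum = sum(w * math.log(max(c, 1e-12))
                  for c, w in zip(component_scores, weights))
    return math.exp(log_sum / total_weight)
```

<p>With equal weights this reduces to the plain geometric mean, so a single low-scoring component pulls the aggregate down sharply, which is the intended behavior for multi-parameter desirability.</p>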
<p>A Curriculum $C$ consists of a sequence of Objectives $O = \{O_{C_1}, \ldots, O_{C_n}, O_{P}\}$, where subscripts $C$ and $P$ denote Curriculum and Production Objectives, respectively. Each Objective has a corresponding scoring function. Progression is controlled by Curriculum Progression Criteria $P = \{P_{1}, \ldots, P_{n}\}$, where each $P_{i}$ defines a score threshold the agent must achieve before advancing to the next objective.</p>
<p><strong>Curriculum Phase</strong>: The agent trains on sequential Curriculum Objectives with increasing complexity. A diversity filter is not applied during this phase, as it could be counterproductive to guiding the agent toward favorable chemical space. No computationally expensive components (e.g., docking) are used in Curriculum Objectives.</p>
<p><strong>Production Phase</strong>: Activated only when the final Curriculum Progression Criterion $P_{n}$ is satisfied. The agent now optimizes the full Production Objective, which may include expensive components like molecular docking. A new inception memory is initialized (clearing Curriculum Phase compounds), and a Bemis-Murcko scaffold diversity filter is applied to encourage exploration across multiple local minima.</p>
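<p>The two-phase protocol can be sketched as a simple training loop. The <code>agent</code> interface below is a hypothetical stand-in, not REINVENT&rsquo;s actual API; a toy stub agent is included to make the sketch self-contained:</p>

```python
def run_curriculum(agent, curriculum_objectives, thresholds,
                   production_objective, max_epochs=2000, production_epochs=300):
    """Two-phase training sketch: advance through cheap Curriculum Objectives
    gated by score thresholds, then switch to the full Production Objective.
    `agent` is a hypothetical interface, not REINVENT's actual API."""
    epoch = 0
    for objective, threshold in zip(curriculum_objectives, thresholds):
        mean_score = 0.0
        # Curriculum Phase: no diversity filter, no expensive components.
        while mean_score < threshold and epoch < max_epochs:
            mean_score = agent.train_epoch(objective)
            epoch += 1
    # Production Phase: clear the inception memory, apply the scaffold
    # diversity filter, and allow expensive components such as docking.
    agent.reset_inception_memory()
    agent.enable_scaffold_diversity_filter()
    for _ in range(production_epochs):
        agent.train_epoch(production_objective)


class StubAgent:
    """Toy agent whose batch score rises by 0.25 per epoch, for demonstration."""
    def __init__(self):
        self.score, self.log = 0.0, []
    def train_epoch(self, objective):
        self.score = min(1.0, self.score + 0.25)
        self.log.append(objective)
        return self.score
    def reset_inception_memory(self):
        pass
    def enable_scaffold_diversity_filter(self):
        pass
```

<p>The gating structure is the essential point: the expensive Production Objective is never evaluated until the agent has already been steered into favorable chemical space by the cheap Curriculum Objectives.</p>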
<p>The implementation builds on REINVENT&rsquo;s RNN architecture: three hidden layers of 512 LSTM cells with an embedding size of 256 and a linear layer with softmax activation, pretrained on ChEMBL to learn SMILES syntax.</p>
<h2 id="three-experiments-on-pdk1-inhibitor-design">Three Experiments on PDK1 Inhibitor Design</h2>
<p>The authors evaluate CL on three molecular design tasks of increasing complexity, all centered on designing <a href="https://en.wikipedia.org/wiki/PDPK1">3-phosphoinositide-dependent protein kinase-1</a> (PDK1) inhibitors.</p>
<h3 id="experiment-1-target-scaffold-construction">Experiment 1: Target Scaffold Construction</h3>
<p>The goal is to generate compounds possessing a dihydro-pyrazoloquinazoline scaffold with a phenyl substituent, a scaffold not present in the prior&rsquo;s training set. Standard RL fails entirely over 2000 epochs because the probability of randomly sampling a compound with this scaffold is negligibly small, producing binary rewards (1.0 if scaffold present, 0.5 otherwise) that never rise above 0.5.</p>
<p>CL decomposes the target scaffold into 5 progressively complex substructures. Each Curriculum Progression Criterion threshold is set to 0.8. The agent learns to generate compounds with each substructure before advancing. CL finds the target scaffold within 1750 epochs, while baseline RL cannot find it in the same timeframe.</p>
<h3 id="experiments-2-and-3-molecular-docking-constraints">Experiments 2 and 3: Molecular Docking Constraints</h3>
<p>These experiments use a Production Objective combining a molecular docking constraint (retaining two hydrogen-bonding interactions with Ala 162 of PDK1, PDB ID: 2XCH) and QED (Quantitative Estimate of Druglikeness). Both experiments limit computational cost by capping production epochs at 300.</p>
<p><strong>Experiment 2</strong> uses Tanimoto (2D) similarity to a reference ligand as the Curriculum Objective. Two scenarios are tested: &ldquo;Low&rdquo; (threshold 0.5) and &ldquo;High&rdquo; (threshold 0.8).</p>
<p><strong>Experiment 3</strong> uses ROCS (3D) shape-based similarity to the reference ligand as the Curriculum Objective, with &ldquo;Low&rdquo; (0.5) and &ldquo;High&rdquo; (0.75) thresholds.</p>
<p>All experiments are run in triplicate. The baseline includes both standard RL and RL with Tanimoto/ROCS components added directly to the scoring function (not sequentially), to control for the presence of these components.</p>
<p>Across all CL experiments, CL generates between 2,941 and 9,068 more compounds with docking scores better than the reference ligand (-10.907 kcal/mol) compared to baseline RL, corresponding to a 12.42-23.79% improvement in the fraction of compounds exceeding the reference. Within each Curriculum Objective, the &ldquo;High&rdquo; threshold scenario outperforms the &ldquo;Low&rdquo; scenario by 316-3,415 additional compounds (with percentage improvements ranging from -0.4% to 10.57%).</p>
<p>Baseline RL produces essentially no compounds satisfying the docking constraint for the first 100 epochs. CL agents achieve immediate productivity: in the &ldquo;High&rdquo; Tanimoto scenario, the initial docking score already exceeds the maximum score achieved by baseline RL over 300 epochs.</p>
<h3 id="scaffold-diversity-analysis">Scaffold Diversity Analysis</h3>
<p>CL generates more unique Bemis-Murcko scaffolds than baseline RL in all experiments. The &ldquo;High&rdquo; scenarios produce more unique scaffolds than the &ldquo;Low&rdquo; scenarios. CL also produces a higher fraction of &ldquo;favorable&rdquo; scaffolds (those with better docking scores than the reference ligand).</p>
<h2 id="accelerated-convergence-with-a-diversity-trade-off">Accelerated Convergence with a Diversity Trade-off</h2>
<p>The results demonstrate three consistent findings across all experiments:</p>
<ol>
<li>
<p><strong>Accelerated productivity</strong>: CL agents reach productive sampling of favorable compounds substantially faster than baseline RL. Even a single Curriculum Objective with a computationally inexpensive metric can guide the agent to regions of chemical space where expensive Production Objectives are readily satisfied.</p>
</li>
<li>
<p><strong>Improved output quality</strong>: CL generates more compounds with favorable docking scores, more unique scaffolds, and a higher fraction of scaffolds that outperform the reference ligand.</p>
</li>
<li>
<p><strong>Controllable trade-off</strong>: The Curriculum Progression Criterion threshold provides direct control over agent policy. Higher thresholds produce better Production Objective optimization but reduce intra-set diversity (higher cross-Tanimoto similarities among generated compounds). UMAP visualizations confirm that &ldquo;Low&rdquo; and &ldquo;High&rdquo; scenarios sample from nearby but distinct regions of chemical space.</p>
</li>
</ol>
<p>The authors note that even moderate optimization of similarity-based Curriculum Objectives (the &ldquo;Low&rdquo; scenarios) already substantially narrows the agent&rsquo;s perceived solution space. This suggests that CL inherently regularizes the agent policy, trading some diversity for convergence speed.</p>
<p><strong>Limitations</strong>: The authors acknowledge that data supporting the findings are available only upon request rather than as public deposits. The approach is demonstrated on a single target (PDK1), and the curriculum design requires domain expertise to decompose objectives appropriately. The inverse relationship between Curriculum Objective optimization and solution diversity means practitioners must carefully tune thresholds for their specific applications.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Prior training</td>
          <td>ChEMBL</td>
          <td>Not specified</td>
          <td>Used to pretrain the RNN on SMILES syntax</td>
      </tr>
      <tr>
          <td>Docking target</td>
          <td>PDB 2XCH</td>
          <td>1 structure</td>
          <td>PDK1 receptor crystal structure</td>
      </tr>
  </tbody>
</table>
<p>Raw data supporting the findings are available from the corresponding author upon request.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>REINVENT platform with LSTM-based RNN (3 hidden layers, 512 cells, embedding size 256)</li>
<li>Scoring function: weighted geometric mean of components</li>
<li>Curriculum Progression Criteria: score thresholds (0.5 or 0.75-0.8 depending on scenario)</li>
<li>Diversity filter: Identical Murcko Scaffold with bucket size 25 (Production Phase only)</li>
<li>Inception (experience replay) for both phases, reset at phase transition</li>
<li>Batch size: 128, learning rate: 0.0001, sigma: 128, Adam optimizer</li>
</ul>
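<p>The reported training settings can be collected in one place for reference. The field names below are illustrative and do not reflect REINVENT&rsquo;s actual configuration schema:</p>

```python
# Training settings reported in the paper, gathered into a plain dict.
# Key names are illustrative -- this is NOT REINVENT's real config format.
cl_config = {
    "batch_size": 128,
    "learning_rate": 1e-4,
    "sigma": 128,
    "optimizer": "Adam",
    "diversity_filter": {              # Production Phase only
        "name": "IdenticalMurckoScaffold",
        "bucket_size": 25,
    },
    "progression_thresholds": {        # Curriculum Progression Criteria
        "low_scenarios": 0.5,
        "high_scenarios": (0.75, 0.8), # high ROCS / high Tanimoto & scaffold
    },
}
```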
<h3 id="models">Models</h3>
<ul>
<li>Prior: RNN pretrained on ChEMBL SMILES</li>
<li>Agent: Initialized from prior, focused via RL/CL</li>
<li>No pretrained model weights are publicly released</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score (Glide SP)</td>
          <td>Predicted binding affinity (kcal/mol)</td>
          <td>Lower is better; reference ligand: -10.907</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimate of Druglikeness</td>
          <td>Range [0, 1]</td>
      </tr>
      <tr>
          <td>Unique Bemis-Murcko scaffolds</td>
          <td>Scaffold diversity measure</td>
          <td>Averaged over triplicates</td>
      </tr>
      <tr>
          <td>Cross-Tanimoto similarity</td>
          <td>Intra-set compound diversity</td>
          <td>Calculated on pooled triplicates</td>
      </tr>
      <tr>
          <td>Tanimoto/ROCS similarity</td>
          <td>Curriculum Objective metrics</td>
          <td>2D fingerprint and 3D shape similarity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>GPU: NVIDIA Tesla V100 (32 GB)</li>
<li>Docking: AWS p3.8xlarge instance</li>
<li>LigPrep parallelized over 8 CPU cores</li>
<li>Glide docking parallelized over 48 CPU cores via DockStream</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MolecularAI/Reinvent">REINVENT</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>De novo molecular design platform</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MolecularAI/ReinventCommunity/blob/master/notebooks/Automated_Curriculum_Learning_Demo.ipynb">CL Tutorial Notebook</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Jupyter notebook tutorial for curriculum learning</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Guo, J., Fialková, V., Arango, J. D., Margreitter, C., Janet, J. P., Papadopoulos, K., Engkvist, O., &amp; Patronov, A. (2022). Improving de novo molecular design with curriculum learning. <em>Nature Machine Intelligence</em>, 4, 555-563. <a href="https://doi.org/10.1038/s42256-022-00494-4">https://doi.org/10.1038/s42256-022-00494-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{guo2022curriculum,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Improving de novo molecular design with curriculum learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Guo, Jeff and Fialkov{\&#39;a}, Vendy and Arango, Juan Diego and Margreitter, Christian and Janet, Jon Paul and Papadopoulos, Kostas and Engkvist, Ola and Patronov, Atanas}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{555--563}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer Nature}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-022-00494-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD and takes over a decade, with a success rate below 10%. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
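<p>For this diagonal Gaussian encoder and a standard normal prior, the KL term in the objective above has a well-known closed form, sketched here over plain Python lists:</p>

```python
import math

def diag_gaussian_kl(mu, sigma_sq):
    """Closed-form KL( N(mu, diag(sigma_sq)) || N(0, I) ), the regularization
    term of the VAE objective, computed per-dimension and summed."""
    return 0.5 * sum(s + m * m - 1.0 - math.log(s)
                     for m, s in zip(mu, sigma_sq))
```

<p>The term vanishes exactly when the posterior matches the prior and grows as the encoder distribution drifts away, which is what anchors the latent space during training.</p>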
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes&rsquo; rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.</p>
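<p>A minimal sketch of this rejection-sampling scheme follows. The <code>sample_latent</code> and <code>attribute_models</code> callables are hypothetical stand-ins for the paper&rsquo;s fitted latent density $Q_\xi(\mathbf{z})$ and per-attribute classifiers; decoding accepted latents to SMILES via the VAE decoder is omitted:</p>

```python
import random

def class_rejection_sample(sample_latent, attribute_models, n_accept,
                           max_draws=100_000, rng=random.random):
    """Sketch of CLaSS: draw z from the latent density model and accept with
    probability equal to the product of per-attribute classifier scores.
    All names here are illustrative stand-ins, not the paper's code."""
    accepted = []
    for _ in range(max_draws):
        if len(accepted) == n_accept:
            break
        z = sample_latent()
        accept_prob = 1.0
        for score in attribute_models:
            accept_prob *= score(z)   # each score lies in [0, 1]
        if rng() < accept_prob:
            accepted.append(z)
    return accepted
```

<p>Because acceptance requires no gradient steps or surrogate training, adding a constraint is as cheap as multiplying in one more classifier score.</p>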
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
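<p>The selectivity score is a one-line computation once a binding-affinity predictor is available; <code>predict_affinity</code> below is a hypothetical stand-in for the paper&rsquo;s pIC50 regression model:</p>

```python
def selectivity(predict_affinity, molecule, target, off_targets):
    """Sel_{T,m}: predicted affinity for the intended target minus the mean
    predicted affinity over k off-targets. `predict_affinity(target, molecule)`
    is a hypothetical signature standing in for the pIC50 regressor."""
    off_mean = sum(predict_affinity(t, molecule)
                   for t in off_targets) / len(off_targets)
    return predict_affinity(target, molecule) - off_mean
```

<p>A positive score means the molecule is predicted to bind the intended target more strongly than the average off-target, which is the property CLaSS thresholds during sampling.</p>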
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds, as confirmed by a high Fréchet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were predicted toxic in at most one of the 13 endpoints, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
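<p>The selectivity criterion in the last bullet is simple to state numerically. A minimal sketch, assuming a hypothetical <code>affinity_fn(molecule, protein)</code> interface standing in for the paper&rsquo;s learned affinity regressor (the function name and toy values are illustrative, not from the paper):</p>

```python
import numpy as np

def selectivity_score(affinity_fn, molecule, target, off_targets, k=10, seed=0):
    """Excess predicted binding affinity of `molecule` for `target` over the
    mean affinity across k randomly sampled off-target proteins."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(off_targets), size=k, replace=False)
    off_mean = np.mean([affinity_fn(molecule, off_targets[i]) for i in idx])
    return affinity_fn(molecule, target) - off_mean

# Toy affinity function: higher predicted affinity for the intended target.
toy = lambda mol, prot: 9.0 if prot == "TARGET" else 5.0
offs = [f"OFF{i}" for i in range(50)]
score = selectivity_score(toy, "mol", "TARGET", offs, k=10)  # 9.0 - 5.0 = 4.0
```

A positive score indicates the molecule is predicted to bind the intended target more strongly than a random off-target panel.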
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CDDD: Learning Descriptors by Translating SMILES</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/</guid><description>CDDD learns continuous molecular descriptors by translating between SMILES and InChI representations, outperforming fingerprints in virtual screening.</description><content:encoded><![CDATA[<h2 id="a-translation-based-method-for-learned-molecular-descriptors">A Translation-Based Method for Learned Molecular Descriptors</h2>
<p>This is a <strong>Method</strong> paper that introduces Continuous and Data-Driven Descriptors (CDDD), a neural machine translation approach for learning fixed-size, continuous molecular representations. Rather than training an autoencoder to reconstruct <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, Winter et al. train an encoder-decoder model to translate between semantically equivalent but syntactically different molecular representations (e.g., randomized SMILES to canonical SMILES, or <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> to canonical SMILES). The bottleneck latent vector serves as a general-purpose molecular descriptor. Pretrained on approximately 72 million compounds from <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC15</a> and PubChem, CDDD produces 512-dimensional descriptors that achieve competitive QSAR performance and significantly outperform all tested molecular fingerprints in ligand-based virtual screening.</p>
<h2 id="why-translation-instead-of-reconstruction">Why Translation Instead of Reconstruction?</h2>
<p>Molecular descriptors are central to cheminformatics. Traditional approaches rely on human-engineered fingerprints like ECFPs, which encode structural features as fixed-length bit vectors. While effective, these representations are constrained by predefined feature extraction rules.</p>
<p>Recent work applied deep neural networks directly to molecular graphs or SMILES strings to learn task-specific representations. However, these end-to-end approaches must learn features from scratch for each new dataset, making them prone to overfitting on the small bioactivity datasets typical in drug discovery.</p>
<p>Unsupervised approaches based on autoencoders (notably <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.&rsquo;s VAE</a> and <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/seq2seq-fingerprint-molecular-embedding/">Xu et al.&rsquo;s seq2seq model</a>) offered a path toward general-purpose learned descriptors. These models reconstruct SMILES strings through an information bottleneck, forcing the latent space to capture molecular information. The concern with reconstruction, however, is that the model may focus on syntactic patterns of the string representation rather than the underlying chemical semantics. A model that memorizes SMILES syntax shortcuts can achieve low reconstruction error without truly encoding chemical meaning.</p>
<p>Winter et al. address this by drawing on the analogy to neural machine translation: a translator must understand the meaning of a sentence to produce a correct translation in another language. By training the model to translate between different molecular representations (which share chemical semantics but differ in syntax), the latent space is forced to capture the chemical information common to both representations, rather than representation-specific syntactic artifacts.</p>
<h2 id="translation-as-semantic-compression">Translation as Semantic Compression</h2>
<p>The core insight is that translating between two syntactically different but semantically equivalent representations forces the encoder to capture only the chemical meaning shared by both. The model architecture follows the standard encoder-decoder framework from neural machine translation.</p>
<p>The encoder reads a source molecular string (e.g., a randomized SMILES or InChI) and compresses it into a fixed-size latent vector. The decoder takes this latent vector and generates the target molecular string (canonical SMILES). The model is trained to minimize character-level cross-entropy between the decoder output and the target sequence.</p>
<p>Four translation tasks were evaluated:</p>
<ol>
<li><strong>Randomized SMILES to canonical SMILES</strong> (best performing)</li>
<li><strong>InChI to canonical SMILES</strong></li>
<li><strong>Canonical SMILES to canonical SMILES</strong> (autoencoding baseline)</li>
<li><strong>Canonical SMILES to InChI</strong> (failed to learn)</li>
</ol>
<p>The final model uses an RNN encoder with 3 stacked GRU layers (512, 1024, and 2048 units). The concatenated cell states pass through a fully connected layer with tanh activation to produce a 512-dimensional latent vector. The decoder mirrors this architecture, initializing its GRU states from the latent vector via separate fully connected layers. Teacher forcing is used during training, and left-to-right beam search is used at inference.</p>
<p>An auxiliary property prediction network takes the latent vector as input and predicts nine molecular properties (including logP, partial charges, valence electrons, H-bond donors/acceptors, Balaban&rsquo;s J, <a href="https://en.wikipedia.org/wiki/Molar_refractivity">molar refractivity</a>, and TPSA). This multi-task signal encourages the latent space to encode physically meaningful information. The full training objective combines the translation cross-entropy loss with the property prediction mean squared error:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{translation}} + \mathcal{L}_{\text{properties}}$$</p>
<p>To ensure invariance to input SMILES representation at inference time, the model uses randomized SMILES as input half the time and canonical SMILES the other half during training. Input dropout (15% at the character level) and Gaussian noise (standard deviation 0.05) are applied for regularization.</p>
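<p>The combined objective can be sketched with toy tensors. This is a numpy illustration of the equal-weighted sum of character-level translation cross-entropy and property MSE described above, not the authors&rsquo; TensorFlow implementation:</p>

```python
import numpy as np

def cddd_loss(decoder_logits, target_tokens, prop_pred, prop_true):
    """Translation cross-entropy plus auxiliary property MSE.
    Toy shapes: logits (T, V), targets (T,), properties (9,)."""
    # numerically stable log-softmax over the vocabulary axis
    z = decoder_logits - decoder_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(target_tokens)), target_tokens].mean()
    mse = np.mean((prop_pred - prop_true) ** 2)
    return ce + mse  # equal weighting, matching the stated objective
```

With uniform logits over a vocabulary of size V, the cross-entropy term reduces to log(V), which is a handy sanity check.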
<h2 id="qsar-benchmarks-virtual-screening-and-latent-space-exploration">QSAR Benchmarks, Virtual Screening, and Latent Space Exploration</h2>
<h3 id="pretraining">Pretraining</h3>
<p>The model was pretrained on approximately 72 million compounds from ZINC15 and PubChem (merged, deduplicated, filtered for organic molecules with MW 12-600, &gt;3 heavy atoms, logP between -7 and 5). All evaluation compounds were removed from the pretraining set.</p>
<h3 id="qsar-experiments">QSAR Experiments</h3>
<p>Ten QSAR datasets were used, spanning classification (<a href="https://en.wikipedia.org/wiki/Ames_test">Ames mutagenicity</a>, <a href="https://en.wikipedia.org/wiki/KCNH2">hERG inhibition</a>, <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB penetration</a>, BACE inhibition, bee toxicity) and regression (EGFR inhibition, <a href="https://en.wikipedia.org/wiki/Plasmodium_falciparum">Plasmodium falciparum</a> inhibition, lipophilicity, aqueous solubility, melting point). Two datasets (Ames and lipophilicity) served as validation for architecture selection; the remaining eight were held out for final evaluation.</p>
<p>CDDD descriptors with an SVM were benchmarked against:</p>
<ul>
<li>Nine circular fingerprint variants (Morgan fingerprints, radius 1-3, folded to 512/1024/2048 bits) with RF, SVM, and GB</li>
<li>Graph convolution models (<a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">DeepChem</a>)</li>
</ul>
<p>Both random-split and cluster-split (K-means on MACCS fingerprints, K=5) cross-validation were performed.</p>
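<p>The cluster-split protocol can be sketched as follows: cluster fingerprint vectors with K-means and treat each cluster as one CV fold, so test molecules are structurally dissimilar from training molecules. This minimal numpy implementation (rather than the MACCS/scikit-learn pipeline the authors likely used) illustrates the idea:</p>

```python
import numpy as np

def kmeans_cluster_split(fps, k=5, iters=20, seed=0):
    """Assign each fingerprint row to one of k clusters via Lloyd's
    algorithm; each cluster's indices form one cross-validation fold."""
    rng = np.random.default_rng(seed)
    fps = np.asarray(fps, float)
    centers = fps[rng.choice(len(fps), size=k, replace=False)].copy()
    labels = np.zeros(len(fps), dtype=int)
    for _ in range(iters):
        d = ((fps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = fps[labels == j].mean(axis=0)
    return [np.where(labels == j)[0] for j in range(k)]
```

The folds partition the dataset, but unlike random CV their sizes are uneven, reflecting the cluster structure of chemical space.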
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Split</th>
          <th>CDDD + SVM</th>
          <th>Best Fingerprint</th>
          <th>Graph Conv</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ames (ROC-AUC)</td>
          <td>Random</td>
          <td>0.89</td>
          <td>0.89 (ecfc2, RF)</td>
          <td>0.88</td>
      </tr>
      <tr>
          <td>hERG (ROC-AUC)</td>
          <td>Random</td>
          <td>0.86</td>
          <td>0.85 (ecfc4, RF)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>BBBP (ROC-AUC)</td>
          <td>Random</td>
          <td>0.93</td>
          <td>0.93 (ecfc2, RF)</td>
          <td>0.92</td>
      </tr>
      <tr>
          <td>BACE (ROC-AUC)</td>
          <td>Random</td>
          <td>0.90</td>
          <td>0.91 (ecfc2, RF)</td>
          <td>0.91</td>
      </tr>
      <tr>
          <td>Bee toxicity (ROC-AUC)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.91 (ecfc6, RF)</td>
          <td>0.89</td>
      </tr>
      <tr>
          <td>Lipophilicity ($r^2$)</td>
          <td>Random</td>
          <td>0.72</td>
          <td>0.69 (ecfc2, SVM)</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>ESOL ($r^2$)</td>
          <td>Random</td>
          <td>0.92</td>
          <td>0.58 (ecfc6, SVM)</td>
          <td>0.86</td>
      </tr>
      <tr>
          <td>Melting point ($r^2$)</td>
          <td>Random</td>
          <td>0.42</td>
          <td>0.38 (ecfc2, SVM)</td>
          <td>0.39</td>
      </tr>
  </tbody>
</table>
<p>CDDD descriptors showed competitive or better performance across all tasks. Notably, CDDD achieved substantially higher $r^2$ on aqueous solubility (0.92 vs. 0.58 for the best fingerprint). The authors emphasize that CDDD&rsquo;s feature extraction was fixed based on two validation tasks, while baseline methods selected the best fingerprint/model combination per task, making the comparison conservative for CDDD.</p>
<h3 id="virtual-screening">Virtual Screening</h3>
<p>Ligand-based virtual screening experiments followed the Riniker et al. benchmarking protocol on 40 DUD targets and 17 MUV targets. Five active compounds were randomly selected per target, and remaining compounds were ranked by similarity (cosine similarity for CDDD, <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto</a> for fingerprints). This process was repeated 50 times per target.</p>
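<p>The ranking step can be sketched in a few lines. This assumes a MAX fusion rule over the query actives, a common choice in this style of benchmark, though the exact fusion used here is a detail of the Riniker et al. protocol:</p>

```python
import numpy as np

def screen_by_similarity(query_desc, library_desc):
    """Rank library molecules by maximum cosine similarity to a small
    set of query actives (CDDD descriptors; fingerprints would use
    Tanimoto similarity instead). Returns indices, most similar first."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    lib = library_desc / np.linalg.norm(library_desc, axis=1, keepdims=True)
    best = (lib @ q.T).max(axis=1)  # fuse per-query similarities with MAX
    return np.argsort(-best)
```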
<table>
  <thead>
      <tr>
          <th>Database</th>
          <th>CDDD (ROC-AUC)</th>
          <th>Second Best</th>
          <th>p-value (Wilcoxon)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DUD</td>
          <td>0.949</td>
          <td>0.899 (laval)</td>
          <td>$5 \times 10^{-38}$</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>0.679</td>
          <td>0.677 (ap)</td>
          <td>0.04</td>
      </tr>
  </tbody>
</table>
<p>CDDD significantly outperformed all 14 baseline fingerprints on both databases. The DUD improvement was particularly large (+5.0 ROC-AUC points over the next best). On MUV, which is designed to be harder, the advantage was smaller but still statistically significant. Importantly, while the best baseline fingerprint varied between DUD and MUV (laval vs. ap), CDDD ranked first on both, demonstrating consistent performance.</p>
<h3 id="latent-space-exploration">Latent Space Exploration</h3>
<p>The continuous, reversible nature of CDDD enables chemical space navigation. Shifting a molecule&rsquo;s embedding along the first principal component of the pretraining data correlates with molecular size (Spearman $r = 0.947$, $p = 0.00048$), while the second principal component correlates with polarity/logP ($r = -0.916$, $p = 0.00015$).</p>
<p>When shifting 1000 compounds along 100 random directions, the model maintained high valid SMILES generation rates (&gt;97% for the top beam search output, &gt;99% when considering the top 3 outputs). Euclidean distance in the descriptor space correlated smoothly with Tanimoto distance in fingerprint space, confirming that the latent space supports meaningful interpolation.</p>
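<p>The principal-component shift is easy to reproduce on the embedding side. A sketch (the decoder that maps the shifted vector back to a SMILES string is not shown):</p>

```python
import numpy as np

def shift_along_pc(embeddings, molecule_emb, alpha):
    """Move a molecule's embedding by alpha along the first principal
    component of a reference embedding set (e.g., the pretraining data)."""
    centered = embeddings - embeddings.mean(axis=0)
    # first right-singular vector = first principal component direction
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return molecule_emb + alpha * vt[0]
```

Because <code>vt[0]</code> is unit-norm, the Euclidean step size in descriptor space equals <code>|alpha|</code>, which is what makes the reported descriptor-distance vs. Tanimoto-distance correlation interpretable.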
<h2 id="consistent-learned-descriptors-for-chemistry">Consistent Learned Descriptors for Chemistry</h2>
<p>CDDD demonstrated that translation between molecular representations produces more informative latent spaces than autoencoder reconstruction. The key findings are:</p>
<ol>
<li><strong>Translation outperforms reconstruction</strong>: Models trained on translating between different representations consistently produced better downstream descriptors than autoencoding models, despite autoencoding being an easier task.</li>
<li><strong>Auxiliary property prediction helps</strong>: The additional property regression task improved descriptor quality, particularly for physicochemical endpoints correlated with the predicted properties.</li>
<li><strong>Consistent performance</strong>: Unlike baseline methods where the best fingerprint varies by task, CDDD showed consistent performance across all QSAR and VS experiments.</li>
<li><strong>Smooth latent space</strong>: The continuous descriptor space supports meaningful interpolation and chemical space exploration with high valid SMILES rates.</li>
</ol>
<p>The authors acknowledge several limitations. The InChI-to-SMILES translation worked but produced inferior descriptors compared to SMILES-to-SMILES, and SMILES-to-InChI translation failed entirely, likely due to InChI&rsquo;s complex syntax (counting, arithmetic). The approach was only tested with string-based representations; translation between conceptually different representations (e.g., 3D structures) remains future work. The QSAR evaluation, while extensive, used relatively standard datasets, and the method&rsquo;s advantage over graph convolution models was modest on tasks where end-to-end learning had sufficient data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15 + PubChem (merged)</td>
          <td>~72M compounds</td>
          <td>Filtered: organic, MW 12-600, &gt;3 heavy atoms, logP -7 to 5</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Ames mutagenicity</td>
          <td>6,130</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>Lipophilicity</td>
          <td>3,817</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>hERG, BBBP, BACE, bee toxicity</td>
          <td>188-3,440</td>
          <td>Classification</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>EGFR, Plasmodium, ESOL, melting point</td>
          <td>184-4,451</td>
          <td>Regression</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>DUD</td>
          <td>40 targets</td>
          <td>Ligand-based virtual screening</td>
      </tr>
      <tr>
          <td>VS</td>
          <td>MUV</td>
          <td>17 targets</td>
          <td>Maximum unbiased validation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 3 stacked GRU layers (512, 1024, 2048 units) with tanh bottleneck to 512-dim latent space</li>
<li>Decoder: Matching 3 stacked GRU layers, initialized from latent space</li>
<li>Auxiliary property network: 3 FC layers (512, 128, 9) regressing nine molecular properties</li>
<li>Optimizer: Adam, initial LR $5 \times 10^{-4}$, decayed by 0.9 every 50,000 steps</li>
<li>Batch size: 64 with bucketing by sequence length</li>
<li>Input regularization: 15% character dropout + Gaussian noise (std 0.05)</li>
<li>Beam search for decoding at inference</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jrwnter/cddd">CDDD (GitHub)</a></td>
          <td>Code + Model</td>
          <td>MIT</td>
          <td>Pretrained model and extraction code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>QSAR: 5-fold random CV and 5-fold cluster CV (K-means on MACCS, K=5)</li>
<li>Classification metric: ROC-AUC</li>
<li>Regression metric: $r^2$</li>
<li>VS: ROC-AUC averaged over 50 random active set selections per target</li>
<li>Statistical test: <a href="https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test">Wilcoxon signed-rank test</a> for VS comparisons</li>
</ul>
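<p>ROC-AUC, the metric used for both the classification and VS experiments, can be computed directly from the rank-sum identity: it is the probability that a randomly chosen active scores above a randomly chosen inactive, with ties counted half. A self-contained sketch:</p>

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC-AUC via the Mann-Whitney U identity (O(n*m) pairwise form,
    fine for illustration; rank-based forms scale better)."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```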
<h3 id="hardware">Hardware</h3>
<ul>
<li>Framework: TensorFlow 1.4.1</li>
<li>CDDD descriptor extraction on GPU is comparable in speed to RDKit fingerprint calculation on CPU</li>
<li>SVM training on 512-dim CDDD descriptors takes seconds (vs. minutes for 2048-dim fingerprints)</li>
<li>Graph convolution training: ~30 minutes per task on GPU</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Montanari, F., Noe, F., &amp; Clevert, D.-A. (2019). Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. <em>Chemical Science</em>, 10(6), 1692-1701. <a href="https://doi.org/10.1039/C8SC04175J">https://doi.org/10.1039/C8SC04175J</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{winter2019learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and Montanari, Floriane and No{\&#39;e}, Frank and Clevert, Djork-Arn{\&#39;e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{10}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1692--1701}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C8SC04175J}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds for 1000 valid molecules for EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
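<p>The tokenization scheme can be illustrated concretely. This sketch follows the description above (character-level SMILES, an <code>&lt;XYZ&gt;</code> separator, then 6 tokens per atom for the integer and fractional parts of x, y, z); the token strings and the fixed-precision encoding are illustrative assumptions, since the paper does not publish its exact vocabulary:</p>

```python
def tokenize_ligand(smiles, coords, precision=2):
    """Sketch of a BindGPT-style SMILES+XYZ token sequence.
    `coords` holds one (x, y, z) triple per heavy atom, in SMILES order."""
    tokens = ["<LIGAND>"] + list(smiles) + ["<XYZ>"]
    for x, y, z in coords:
        for c in (x, y, z):
            sign = "-" if c < 0 else ""
            whole, frac = divmod(round(abs(c) * 10**precision), 10**precision)
            tokens += [f"{sign}{whole}", f".{frac:0{precision}d}"]
    return tokens

# methanol graph "CO" with two toy atom positions
toks = tokenize_ligand("CO", [(0.0, 0.0, 0.0), (1.43, 0.0, 0.0)])
```

Note that atom symbols appear only in the SMILES section; the coordinate section relies on the shared atom ordering, exactly as described above.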
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
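<p>The second augmentation is a shared rigid rotation, which preserves the pocket-ligand pose while decorrelating absolute coordinates. A numpy sketch (SMILES randomization, the first augmentation, would use a cheminformatics toolkit such as RDKit&rsquo;s <code>MolToSmiles(..., doRandom=True)</code>; this rotation construction is a standard recipe, not necessarily the authors&rsquo; exact code):</p>

```python
import numpy as np

def random_rigid_rotation(pocket_xyz, ligand_xyz, seed=0):
    """Apply one shared random 3D rotation to pocket and ligand coordinates."""
    rng = np.random.default_rng(seed)
    # random orthogonal matrix via QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))  # fix column signs for a uniform distribution
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1            # ensure a proper rotation (det = +1)
    return pocket_xyz @ q.T, ligand_xyz @ q.T
```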
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
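<p>One common way to implement this objective (used widely in RLHF-style fine-tuning, and consistent with the loss above, though the paper&rsquo;s exact estimator may differ) is to fold the KL term into a per-sample shaped reward before the REINFORCE update:</p>

```python
import numpy as np

def kl_shaped_advantages(rewards, logp_theta, logp_sft, beta=0.1):
    """Shape docking rewards with the KL penalty and subtract a mean
    baseline; each sample's grad log pi_theta would be scaled by the
    result (score-function / REINFORCE estimator)."""
    shaped = rewards - beta * (logp_theta - logp_sft)
    return shaped - shaped.mean()  # baseline for variance reduction
```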
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
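<p>The Jensen-Shannon divergences above compare histograms of geometric features (bond lengths, angles, dihedrals, bond-type frequencies) between generated and reference molecules. A minimal sketch of the metric; the bin grid and random stand-in samples are illustrative assumptions, not the paper&rsquo;s exact protocol:</p>

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    between two discrete distributions given as histogram counts."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Histogram bond lengths (in angstroms) from generated vs. reference
# molecules on a shared bin grid, then compare the two histograms.
rng = np.random.default_rng(0)
gen_lengths = rng.normal(1.50, 0.05, 10_000)  # stand-in samples
ref_lengths = rng.normal(1.50, 0.06, 10_000)
bins = np.linspace(1.0, 2.0, 101)
p, _ = np.histogram(gen_lengths, bins=bins)
q, _ = np.histogram(ref_lengths, bins=bins)
print(js_divergence(p, q))
```

<p>Identical distributions give 0 and fully disjoint ones give 1, so values like BindGPT&rsquo;s 0.029 for bond lengths indicate near-identical geometry statistics.</p>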
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters); the authors find this sufficient for the current tasks but do not explore larger scales. The RL optimization uses only the Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. Finally, although BindGPT is the first model to explicitly generate hydrogens at scale, validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
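<p>The RL objective can be sketched numerically. The exact BindGPT formulation is not reproduced here; this is an illustrative REINFORCE-with-KL-penalty loss under assumed shapes (per-sequence summed log-probabilities, a mean-reward baseline, and a Monte Carlo per-sequence KL estimate), not the authors&rsquo; implementation:</p>

```python
import numpy as np

def reinforce_kl_loss(logp_agent, logp_ref, rewards, kl_coef=0.1):
    """REINFORCE loss with a KL penalty toward the SFT reference policy (sketch).

    logp_agent / logp_ref: summed token log-probs per sampled ligand, shape (B,)
    rewards: scalar reward per ligand, e.g. the negated QVINA docking score
    """
    advantages = rewards - rewards.mean()   # mean-reward baseline
    kl = logp_agent - logp_ref              # Monte Carlo per-sequence KL estimate
    shaped = advantages - kl_coef * kl      # penalize drift from the SFT policy
    return float(-(shaped * logp_agent).mean())
```

<p>In a real training loop the agent log-probabilities carry gradients from the language model; here the function only evaluates the scalar loss.</p>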
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
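<p>The paper states that each 3D position occupies six tokens but the exact split is not spelled out here; one plausible scheme (a pure assumption for illustration) gives each axis a signed integer-part token and a two-decimal fractional token:</p>

```python
def coord_to_tokens(x, y, z):
    """Hypothetical 6-token encoding of one 3D position (2 tokens per axis):
    a signed integer-part token and a two-decimal fractional-part token."""
    tokens = []
    for v in (x, y, z):
        sign = "-" if v < 0 else "+"
        whole = int(abs(v))
        frac = round((abs(v) - whole) * 100)  # two decimal places
        tokens.append(f"{sign}{whole}")
        tokens.append(f".{frac:02d}")
    return tokens

print(coord_to_tokens(1.53, -0.25, 12.07))  # six tokens for one atom
```
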
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
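<p>RMSD-Coverage is just the empirical CDF of per-conformer best-match RMSDs; a small sketch (the threshold grid is an arbitrary choice here):</p>

```python
import numpy as np

def rmsd_coverage(best_rmsds, thresholds):
    """Fraction of reference conformers matched within each RMSD threshold,
    i.e. the CDF of per-conformer best-match RMSDs."""
    best_rmsds = np.asarray(best_rmsds)
    return np.array([(best_rmsds <= t).mean() for t in thresholds])
```
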
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
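<p>Computing the MAD curve on a held-out set is direct; a sketch (array names are assumptions):</p>

```python
import numpy as np

def mad_curve(s_opt, s_ctrl, thresholds):
    """MAD(x): mean |S_opt - S_ctrl| over held-out molecules with S_opt >= x."""
    s_opt, s_ctrl = np.asarray(s_opt), np.asarray(s_ctrl)
    out = []
    for x in thresholds:
        mask = s_opt >= x
        out.append(np.abs(s_opt[mask] - s_ctrl[mask]).mean() if mask.any() else np.nan)
    return np.array(out)
```

<p>A MAD that grows with the threshold, as observed on DRD2, EGFR, and JAK2, signals that the classifiers already disagree on exactly the molecules the optimizer will be rewarded for finding.</p>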
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}[S_{dc}] = \int P[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
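<p>The tolerance-interval construction can be sketched as a binned resampling procedure. The 25 bins and 10 draws follow the paper&rsquo;s setup; everything else (function and argument names, the quantile-based interval) is an illustrative assumption:</p>

```python
import numpy as np

def control_score_interval(s_opt_holdout, s_dc_holdout, s_opt_generated,
                           n_bins=25, n_samples=10, alpha=0.05, seed=0):
    """Tolerance interval for the expected data-control score, estimated by
    drawing S_dc values from the held-out conditional P[S_dc | S_opt]."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_of = lambda s: np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)
    holdout_bins = bin_of(np.asarray(s_opt_holdout))
    s_dc_holdout = np.asarray(s_dc_holdout)
    means = []
    for _ in range(n_samples):
        draws = []
        for s in np.asarray(s_opt_generated):
            pool = s_dc_holdout[holdout_bins == bin_of(s)]
            if pool.size:                      # skip empty bins
                draws.append(rng.choice(pool))
        means.append(np.mean(draws))
    return np.quantile(means, alpha / 2), np.quantile(means, 1 - alpha / 2)
```

<p>If the observed control-score trajectory stays inside the interval, the divergence is fully explained by the held-out conditional distribution, which is the paper&rsquo;s central argument.</p>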
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
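<p>The scoring-model setup translates to a few lines of scikit-learn. Random bits stand in for fingerprints so the sketch is self-contained; in practice the 1024-bit vectors would come from RDKit&rsquo;s Morgan fingerprint (radius 2), and the molecules and labels from the ChEMBL extracts:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for RDKit ECFP fingerprints (radius 2, 1024 bits); in practice:
# AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))
y = rng.integers(0, 2, size=200)

# Corrected JAK2 hyperparameters: 200 trees, min 3 samples per leaf
# (the original tasks used scikit-learn defaults: 100 trees, min 1 per leaf).
clf = RandomForestClassifier(n_estimators=200, min_samples_leaf=3, random_state=0)
clf.fit(X, y)
s_opt = clf.predict_proba(X)[:, 1]  # confidence score used as S_opt
```
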
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Augmented Hill-Climb for RL-Based Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/</guid><description>Augmented Hill-Climb combines REINVENT and Hill-Climb RL strategies to improve sample efficiency ~45-fold for SMILES-based de novo molecule generation.</description><content:encoded><![CDATA[<h2 id="a-hybrid-rl-strategy-for-de-novo-molecule-generation">A Hybrid RL Strategy for De Novo Molecule Generation</h2>
<p>This is a <strong>Method</strong> paper that proposes Augmented Hill-Climb (AHC), a reinforcement learning strategy for conditioning SMILES-based language models during de novo molecule generation. The primary contribution is a simple hybrid between the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and Hill-Climb (HC) RL strategies that computes the REINVENT loss function only on the top-k highest-scoring molecules per batch (as in HC), thereby removing the counterproductive regularization effect of low-scoring molecules. The authors demonstrate that AHC improves optimization ability ~1.5-fold and sample efficiency ~45-fold compared to REINVENT across docking tasks against four <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">GPCR</a> targets, and that the approach generalizes to transformer architectures.</p>
<h2 id="sample-efficiency-bottleneck-in-rl-guided-molecular-generation">Sample Efficiency Bottleneck in RL-Guided Molecular Generation</h2>
<p>Recurrent neural networks trained on SMILES have become a standard approach for de novo molecule generation, with RL strategies like REINVENT and Hill-Climb achieving top performance on benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>. However, RL-guided generation can be highly <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">sample-inefficient</a>, often requiring $10^5$ or more molecules to optimize complex objectives. This is acceptable for cheap scoring functions (e.g., QSAR models, property calculators) but becomes a practical bottleneck when using computationally expensive scoring functions like molecular docking or computer-aided synthesis planning.</p>
<p>The REINVENT strategy regularizes the agent by computing a loss based on the difference between the agent&rsquo;s policy and an &ldquo;augmented likelihood&rdquo; that combines the prior policy with a scaled reward. When low-scoring molecules are sampled ($R_T \approx 0$), the augmented likelihood reduces to the prior likelihood, causing the agent to drift back toward the prior policy. This undoes useful learning, especially early in training or when the objective is difficult. Meanwhile, Hill-Climb simply fine-tunes the RNN on the top-k molecules per batch, which is sample-efficient but lacks explicit regularization, leading to mode collapse and generation of invalid SMILES.</p>
<p>Previous work by Neil et al. compared RL strategies but did not clearly quantify sample-efficiency differences, and modifications to the REINVENT loss function by Fialkova et al. showed no significant improvement. The best agent reminder (BAR) mechanism offered modest gains but was originally tested on graph-based models.</p>
<h2 id="core-innovation-filtering-low-scoring-molecules-from-the-reinvent-loss">Core Innovation: Filtering Low-Scoring Molecules from the REINVENT Loss</h2>
<p>Augmented Hill-Climb combines the loss formulation of REINVENT with the top-k selection mechanism of Hill-Climb. The agent samples a batch of molecules, ranks them by reward, and computes the REINVENT loss only on the top-k molecules. This removes the counterproductive regularization caused by low-scoring molecules while retaining the prior-based regularization for high-scoring molecules.</p>
<p>The REINVENT loss defines an augmented likelihood:</p>
<p>$$
\log P_{\mathbb{U}}(A) = \log P_{prior}(A) + \sigma R_T
$$</p>
<p>where $\sigma$ is a scaling coefficient controlling the reward contribution. The agent loss is the squared difference between the augmented likelihood and the agent&rsquo;s log-likelihood:</p>
<p>$$
L(\theta) = \left[\log P_{\mathbb{U}}(A) - \log P_{agent}(A)\right]^2
$$</p>
<p>In standard REINVENT, this loss is computed over all molecules in the batch. When $R_T \approx 0$, the augmented likelihood collapses to the prior likelihood, pushing the agent back toward the prior. AHC avoids this by computing the loss only on the top-k molecules ranked by reward, exactly as Hill-Climb selects molecules for fine-tuning.</p>
<p>The key insight is that high-scoring molecules are still regularized by the prior component of the augmented likelihood ($\log P_{prior}(A)$), preventing catastrophic forgetting. Low-scoring molecules, which would otherwise pull the agent back toward the prior, are simply excluded from the loss computation.</p>
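<p>Under these definitions, the AHC update can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors&rsquo; implementation; the function name and the choice of keeping the top half of the batch are assumptions for this example:</p>

```python
import numpy as np

def ahc_loss(log_p_prior, log_p_agent, rewards, sigma=60.0, topk_frac=0.5):
    """Augmented Hill-Climb: REINVENT's squared-difference loss, but
    computed only on the top-k molecules of the batch ranked by reward."""
    k = max(1, int(topk_frac * len(rewards)))
    top = np.argsort(rewards)[::-1][:k]          # indices of the top-k rewards
    # Augmented likelihood: prior log-likelihood plus sigma-scaled reward.
    augmented = log_p_prior[top] + sigma * rewards[top]
    return float(np.mean((augmented - log_p_agent[top]) ** 2))
```

<p>Molecules outside the top-k contribute nothing to the gradient, which is exactly how AHC discards the pull-back-to-prior effect of near-zero rewards.</p>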
<h3 id="diversity-filters-to-prevent-mode-collapse">Diversity Filters to Prevent Mode Collapse</h3>
<p>AHC is more susceptible to mode collapse than REINVENT because it focuses learning on high-scoring molecules. The authors address this with diversity filters (DFs) that penalize the reward of molecules similar to previously generated ones. Through a hyperparameter search over 825 configurations on three GuacaMol tasks, they identify an optimal DF configuration (DF2) with:</p>
<ul>
<li>Minimum score threshold of 0.5 (lower than DF1&rsquo;s 0.8)</li>
<li>Linear penalization output mode (softer than binary)</li>
<li>Bin size of 50 (larger than DF1&rsquo;s 25)</li>
<li>Scaffold similarity based on ECFP4 fingerprints</li>
</ul>
<p>The authors find that stricter DFs (lower thresholds, smaller bins) better prevent mode collapse but reduce optimization performance, while more lenient DFs enable better learning of chemotype-reward associations. DF2 represents a compromise.</p>
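<p>A DF of this kind can be sketched as a scaffold-keyed counter with linear reward penalization. This is an illustrative reconstruction, not the paper&rsquo;s code: the class name is invented, the linear penalty schedule is an assumption, and scaffold extraction (e.g. Murcko scaffolds via RDKit) is assumed to happen upstream:</p>

```python
from collections import defaultdict

class DiversityFilter:
    """DF2-style filter sketch: count high-scoring visits per scaffold
    and linearly scale down rewards as a scaffold's bin fills up."""
    def __init__(self, min_score=0.5, bin_size=50):
        self.min_score = min_score   # only molecules above this fill the bin
        self.bin_size = bin_size     # bin capacity per scaffold
        self.counts = defaultdict(int)

    def penalize(self, scaffold, reward):
        if reward >= self.min_score:
            self.counts[scaffold] += 1
        n = self.counts[scaffold]
        # Linear output mode: softer than zeroing the reward outright.
        return reward * max(0.0, 1.0 - n / self.bin_size)
```
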
<h2 id="experimental-setup-docking-tasks-and-benchmark-comparisons">Experimental Setup: Docking Tasks and Benchmark Comparisons</h2>
<p>The evaluation spans five experiments:</p>
<p><strong>Experiment 1</strong>: AHC vs. REINVENT on DRD2 docking over 100 RL updates (6,400 samples), varying $\sigma$ from 30 to 240. RNN trained on the MOSESn dataset (MOSES with neutralized charges, 2.45M molecules).</p>
<p><strong>Experiment 2</strong>: AHC + DF2 vs. REINVENT on four GPCR targets (DRD2, OPRM1, AGTR1, OX1R) over 500 RL updates. Docking performed with Glide-SP after ligand preparation with LigPrep.</p>
<p><strong>Experiment 3</strong>: Diversity filter hyperparameter search (825 configurations) on three GuacaMol tasks (<a href="https://en.wikipedia.org/wiki/Aripiprazole">Aripiprazole</a> similarity, C11H24 isomers, <a href="https://en.wikipedia.org/wiki/Osimertinib">Osimertinib</a> MPO) using the GuacaMol training set (1.27M molecules from ChEMBL24).</p>
<p><strong>Experiment 4</strong>: Benchmark of AHC against REINFORCE, REINVENT (v1 and v2), BAR, and Hill-Climb (with and without KL regularization) on six tasks of varying difficulty:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Difficulty</th>
          <th>Objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heavy atoms</td>
          <td>Easy</td>
          <td>Maximize number of heavy atoms</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Risperidone">Risperidone</a> similarity</td>
          <td>Easy</td>
          <td>Maximize Tanimoto similarity to Risperidone</td>
      </tr>
      <tr>
          <td>DRD2 activity</td>
          <td>Medium</td>
          <td>Maximize QSAR-predicted DRD2 activity</td>
      </tr>
      <tr>
          <td>DRD2 docking</td>
          <td>Medium</td>
          <td>Minimize Glide-SP docking score</td>
      </tr>
      <tr>
          <td>DRD2-DRD3 dual</td>
          <td>Hard</td>
          <td>Maximize predicted activity against both targets</td>
      </tr>
      <tr>
          <td>DRD2/DRD3 selective</td>
          <td>Hard</td>
          <td>Maximize selective DRD2 activity over DRD3</td>
      </tr>
  </tbody>
</table>
<p><strong>Experiment 5</strong>: AHC vs. REINVENT on transformer (Tr) and gated transformer (GTr) architectures on the same six benchmark tasks. The GTr implements a GRU-style gate in place of residual connections to stabilize RL training.</p>
<h3 id="rnn-and-transformer-architectures">RNN and Transformer Architectures</h3>
<p>Three RNN configurations were used: (1) a 128-dimensional embedding with 3 GRU layers of 512 units (REINVENT v1), (2) a 256-dimensional embedding with 3 LSTM layers of 512 units (REINVENT 2.0), and (3) 3 LSTM layers of 512 units with dropout 0.2 (GuacaMol). Transformers used 4 encoder layers with hidden dimension 512, 8 attention heads, and feed-forward dimension 1024.</p>
<p>QSAR models for DRD2 and DRD3 activity were random forest classifiers trained on ExCAPE-DB data with GHOST threshold identification for handling class imbalance.</p>
<h2 id="key-findings-45-fold-sample-efficiency-improvement">Key Findings: 45-Fold Sample Efficiency Improvement</h2>
<h3 id="experiment-1-ahc-consistently-outperforms-reinvent">Experiment 1: AHC Consistently Outperforms REINVENT</h3>
<p>AHC improved optimization ability by 1.39-fold over REINVENT averaged across all $\sigma$ values, with maximum optimization of 205% at $\sigma = 240$ (compared to 128% for REINVENT). AHC required ~80 fewer RL steps to match REINVENT&rsquo;s mean docking score at 100 steps. With DF1 applied, the improvement was 1.45-fold.</p>
<p>AHC showed greater sensitivity to $\sigma$, giving practitioners more control over the regularization-optimization trade-off. At $\sigma = 60$ (heavily regularized), AHC still improved 1.47-fold over REINVENT while staying within the property space defined by the MOSESn training set. At higher $\sigma$ values, AHC extrapolated further outside the training distribution, which can be favorable (novel chemical space) or unfavorable (scoring-function exploitation, e.g., larger molecules receiving better docking scores due to the additive nature of scoring functions).</p>
<h3 id="experiment-2-improvement-across-four-gpcr-targets">Experiment 2: Improvement Across Four GPCR Targets</h3>
<p>Across DRD2, OPRM1, AGTR1, and OX1R, AHC + DF2 required on average 7.4-fold fewer training steps and 45.5-fold fewer samples to reach optimization thresholds. The improvement was largest early in training: 19.8-fold fewer steps to reach 120% optimization, and 71.8-fold fewer samples to first produce a molecule exceeding 160% optimization.</p>
<p>AHC + DF2 surpassed the 80% retrospective precision threshold within 100 RL updates for all targets except the challenging OX1R. DF2 successfully stabilized learning, avoiding the convergence-to-threshold failure mode observed with DF1.</p>
<p>Scaffold analysis showed AHC generates similar chemistry to REINVENT. The top 500 scaffolds produced by REINVENT were also generated by AHC, but typically much sooner.</p>
<h3 id="experiment-4-benchmark-against-all-rl-strategies">Experiment 4: Benchmark Against All RL Strategies</h3>
<p>AHC outperformed all other RL strategies on all six benchmark tasks except maximizing heavy atoms (an extrapolation task of limited practical relevance). AHC was particularly superior during early-stage optimization and for harder objectives (dual activity, selective activity).</p>
<p>Hill-Climb with a smaller batch size (HC*) showed improved early-stage sample efficiency similar to AHC, but rapidly underwent mode collapse. KL regularization did not rescue mode collapse in any case and sometimes worsened performance. BAR performed poorly in most tasks, possibly because the best-agent memory acts as a second regularizer that inhibits learning.</p>
<p>In terms of wall time for the DRD2 docking task, AHC reached 140% optimization in 16 CPU hours vs. 202 CPU hours for <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent4-generative-molecule-design/">REINVENT 2.0</a>. AHC was the only strategy to reach 200% optimization within the allotted time (216 CPU hours). Parallelized over 10 CPUs, this corresponds to ~21.6 hours, making docking-guided generation feasible on local machines.</p>
<h3 id="experiment-5-generalization-to-transformers">Experiment 5: Generalization to Transformers</h3>
<p>AHC outperformed REINVENT on both the standard transformer and the gated transformer architectures. The standard transformer was unstable under RL, readily undergoing mode collapse. The gated transformer (with GRU-style gating replacing residual connections) stabilized RL training. AHC&rsquo;s efficiency gains generalized to both architectures.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Chemistry quality evaluation is complicated by the interaction between RL strategy and scoring function suitability. Greater optimization may lead to unreasonable chemistry due to scoring function exploitation rather than the RL strategy itself.</li>
<li>The diversity filter hyperparameter search was conducted on GuacaMol toy tasks, which may not fully transfer to docking-based objectives.</li>
<li>The docking scoring function was system-dependent: DRD2 and OPRM1 were optimized effectively, while AGTR1 and OX1R proved more challenging (especially AGTR1, where the docking algorithm targeted the wrong sub-pocket).</li>
<li>KL regularization proved ineffective for HC and REINFORCE, suggesting it is not a sufficient regularization method in this context.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN pretraining</td>
          <td>MOSESn (MOSES neutralized)</td>
          <td>2,454,087 molecules</td>
          <td>ZINC15 clean leads with neutralized charges</td>
      </tr>
      <tr>
          <td>RNN pretraining</td>
          <td>GuacaMol train</td>
          <td>1,273,104 molecules</td>
          <td>ChEMBL24 with property filters</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD2)</td>
          <td>4,609 actives / 343,026 inactives</td>
          <td>Random forest with GHOST thresholds</td>
      </tr>
      <tr>
          <td>QSAR training</td>
          <td>ExCAPE-DB (DRD3)</td>
          <td>2,758 actives / 402,524 inactives</td>
          <td>Unique subsets for dual/selective tasks</td>
      </tr>
      <tr>
          <td>DF parameter search</td>
          <td>GuacaMol benchmark tasks</td>
          <td>3 tasks</td>
          <td>825 configurations tested</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>AHC</strong>: REINVENT loss computed on top-k molecules per batch, ranked by reward</li>
<li><strong>Baselines</strong>: REINFORCE, REINVENT (v1, v2), BAR, Hill-Climb, Hill-Climb + KL regularization</li>
<li><strong>Hyperparameters</strong>: Default values from each original publication (listed in Supplementary Table S3)</li>
<li><strong>Docking</strong>: Glide-SP with Schrodinger Protein Preparation Wizard, LigPrep for ligand preparation</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>RNNs</strong>: 3 configurations (GRU/LSTM, 512 hidden units, trained 5-10 epochs)</li>
<li><strong>Transformer</strong>: 4 encoder layers, 512 hidden dim, 8 heads, 1024 FFN dim</li>
<li><strong>Gated Transformer</strong>: Same architecture with GRU-style gating replacing residual connections</li>
<li><strong>QSAR</strong>: Random forest classifiers (100 estimators, max depth 15, min leaf 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AHC + DF2</th>
          <th>REINVENT</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization fold-improvement</td>
          <td>1.45x</td>
          <td>baseline</td>
          <td>DRD2 docking, averaged across sigma values</td>
      </tr>
      <tr>
          <td>Sample efficiency</td>
          <td>45.5x fewer samples</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>Step efficiency</td>
          <td>7.4x fewer steps</td>
          <td>baseline</td>
          <td>Averaged across 4 GPCR targets</td>
      </tr>
      <tr>
          <td>CPU hours to 140% (DRD2 docking)</td>
          <td>16h</td>
          <td>202h (REINVENT 2.0)</td>
          <td>AMD Threadripper 1920 + RTX 2060 Super</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>AMD Threadripper 1920 CPU</li>
<li>Nvidia GeForce RTX 2060 Super GPU</li>
<li>DRD2 docking benchmark: 216 CPU hours for AHC to reach 200% optimization (~21.6h parallelized over 10 CPUs)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/SMILES-RNN">SMILES-RNN</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>RNN and transformer generative model code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/">Scoring function platform</a></td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.6084/m9.figshare.19591024.v1">Figshare datasets</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>Supporting data (published under same license as paper)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. <em>Journal of Cheminformatics</em>, 14, 68.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2022augmented,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{68}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00646-z}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
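<p>The $f_{ca}$ building block above is standard scaled dot-product attention, with queries from the ligand decoder and keys/values from the skipped encoder layer. A minimal single-head NumPy sketch, without the learned projections (illustrative only):</p>

```python
import numpy as np

def cross_attention(Q_m, K_S, V_S):
    """Scaled dot-product cross-attention: ligand queries attend to
    protein keys/values passed through a skip connection."""
    d_k = K_S.shape[-1]
    logits = Q_m @ K_S.T / np.sqrt(d_k)
    # Numerically stable softmax over the protein positions.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_S
```

<p>In the LT, one such block sits in each decoder layer, fed by the encoder layer at the same depth rather than by the encoder&rsquo;s top layer alone.</p>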
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward and $U(\tilde{C}, a) = c_{puct} \cdot P(a | \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus based on the LT&rsquo;s predicted probability.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
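<p>The Select phase reduces to a PUCT argmax over child nodes. A small illustrative sketch, assuming each child stores its normalized mean reward, its LT prior probability, and its visit count (the field names are invented for this example):</p>

```python
import math

def puct_select(children, n_total, c_puct=1.5):
    """Pick the child maximizing Q + U, as in the Select phase.

    Each child is a dict with normalized mean reward "q", prior "p"
    (the LT's next-symbol probability), and visit count "n".
    """
    def score(ch):
        # Exploration bonus: high for probable, rarely visited symbols.
        u = c_puct * ch["p"] * math.sqrt(n_total) / (1 + ch["n"])
        return ch["q"] + u
    return max(children, key=score)
```
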
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
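<p>For a single protein-ligand pair, the inner sums of $J(\Theta)$ amount to a standard sequence negative log-likelihood over the target SMILES symbols. A minimal NumPy sketch (illustrative; the real model emits logits rather than ready-made probabilities):</p>

```python
import numpy as np

def sequence_nll(probs, targets):
    """Sum of -log P(a_tau | context) over a target symbol sequence.

    probs:   (L, |A|) array of next-symbol distributions from the model
    targets: length-L array of ground-truth symbol indices
    """
    rows = np.arange(len(targets))
    return -np.sum(np.log(probs[rows, targets]))
```
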
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
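<p>The lookup table itself is just memoization over (protein, molecule) pairs. A minimal sketch with a stand-in <code>dock_fn</code> in place of the actual SMINA call (class and field names are invented for this example):</p>

```python
class DockingCache:
    """Cache docking scores so repeated (protein, SMILES) queries during
    MCTS rollouts trigger only one expensive docking run."""
    def __init__(self, dock_fn):
        self.dock_fn = dock_fn   # the expensive docking call, e.g. SMINA
        self.table = {}
        self.calls = 0

    def score(self, protein, smiles):
        key = (protein, smiles)
        if key not in self.table:
            self.calls += 1      # an actual docking run happens here
            self.table[key] = self.dock_fn(protein, smiles)
        return self.table[key]
```

<p>Because rollouts from nearby tree nodes often regenerate identical molecules, a cache of this shape is what yields the reported 81-86% reduction in docking calls.</p>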
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
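<p>A minimal sketch of this length-normalized objective, assuming per-token log-probabilities have already been computed by the decoder (padding is ignored for brevity):</p>

```python
import numpy as np

def pretraining_loss(log_probs, targets):
    """Length-normalized next-token NLL, matching the objective above.

    log_probs: list of (M_y, vocab) arrays of log P(token | prefix)
    targets:   list of length-M_y integer arrays (SMILES token ids)
    Each sequence's NLL is averaged over its length M_y, then summed
    over the dataset. Illustrative sketch only.
    """
    total = 0.0
    for lp, y in zip(log_probs, targets):
        token_nll = -lp[np.arange(len(y)), y]  # -log P(y_i | y_{<i})
        total += token_nll.mean()              # (1 / M_y) * sum_i
    return total
```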
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
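<p>The centering and random roto-translation $\rho$ can be sketched as below; TamGen's exact augmentation may differ:</p>

```python
import numpy as np

def augment_coords(r, rng):
    """Center coordinates and apply a random rotation, mirroring the
    roto-translation augmentation rho in the input embedding above.

    r: (N, 3) residue coordinates. Sketch only; the published
    implementation may sample rotations differently.
    """
    r = r - r.mean(axis=0)                 # center to the origin
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:               # ensure a proper rotation
        q[:, 0] *= -1
    return r @ q.T
```

Because only rigid motions are applied, all pairwise distances are preserved, so the distance-aware attention described next sees the same geometry under every augmentation.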
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{|r_i - r_j|^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
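<p>A single query position of this attention can be sketched in NumPy; this is a minimal, non-batched illustration, not the TamGen implementation:</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau, i):
    """One query position of the distance-aware attention above.

    h: (N, d) residue features, r: (N, 3) coordinates,
    W, Wv: (d, d) learnable matrices, tau: temperature.
    Scores are damped by exp(-|r_i - r_j|^2 / tau) before the softmax,
    so spatially distant residues contribute less.
    """
    d2 = np.sum((r - r[i]) ** 2, axis=1)           # |r_i - r_j|^2
    scores = np.exp(-d2 / tau) * (h[i] @ W @ h.T)  # hat{alpha}_j
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax over j
    return alpha @ (h @ Wv.T)                      # sum_j alpha_j (Wv h_j)
```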
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder infers the mean $\mu$ and standard deviation $\sigma$ of a latent distribution for any (compound, protein) pair; a latent vector $z$ sampled from this distribution conditions the decoder. During training, the model learns to recover the input compound. During application, encoding a seed compound enables compound refinement. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \, \mathcal{D}_{\text{KL}}\left(q(z \mid \mathbf{x}, \mathbf{y}) \,\|\, p(z)\right)
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
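<p>For a diagonal Gaussian posterior and standard Gaussian prior $p(z)$, the KL term has a closed form, sketched here:</p>

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ), the
    regularizer that beta scales against the reconstruction loss."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```

The term vanishes exactly when $\mu = 0$ and $\sigma = 1$, i.e. when the posterior collapses onto the prior.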
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
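<p>The diversity metric, mean pairwise Tanimoto distance, can be sketched as follows; fingerprints are represented here as plain sets of on-bit indices rather than RDKit Morgan bit vectors:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0

def diversity(fps):
    """Mean pairwise (1 - Tanimoto) over a batch of generated compounds.

    fps: list of on-bit index sets. In practice these would come from
    RDKit Morgan fingerprints; this is an illustrative sketch.
    """
    pairs = [(a, b) for i, a in enumerate(fps) for b in fps[i + 1:]]
    return sum(1 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```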
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when removing the geometric attention term from Eq. 2.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
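<p>The pocket definition in the last bullet can be sketched as below; this is a hypothetical helper, and the reference implementation may select residues by atom-level rather than ligand-center distances:</p>

```python
import numpy as np

def extract_pocket(residue_coords, ligand_coords, cutoff=10.0):
    """Select binding-pocket residues within `cutoff` angstroms of the
    ligand center, per the pocket definition above.

    residue_coords: (N, 3) one representative coordinate per residue
    ligand_coords:  (M, 3) ligand atom coordinates
    Returns indices of the retained residues. Illustrative sketch.
    """
    center = np.asarray(ligand_coords).mean(axis=0)
    dists = np.linalg.norm(np.asarray(residue_coords) - center, axis=1)
    return np.flatnonzero(dists <= cutoff)
```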
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review of Molecular Representation Learning Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/molecular-representation-learning-foundation-models-review/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/molecular-representations/molecular-representation-learning-foundation-models-review/</guid><description>A systematic review of molecular representation learning foundation models for drug discovery, covering five modalities and four pretraining strategies.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-molecular-representation-foundation-models">A Systematization of Molecular Representation Foundation Models</h2>
<p>This paper is a <strong>Systematization</strong> that provides the first comprehensive review of foundation models for molecular representation learning (MRL). The authors classify existing models by their input modality (unimodal vs. multimodal), analyze four mainstream pretraining strategies, survey five downstream application domains, and propose practical guidelines for model selection. The review covers over 35 representative models published between 2020 and 2024, with parameter counts ranging from 2 million to over 1 trillion.</p>
<h2 id="why-a-systematic-review-of-mrl-foundation-models-is-needed">Why a Systematic Review of MRL Foundation Models Is Needed</h2>
<p>Molecular representation learning transforms molecular structures and properties into numerical vectors that serve as inputs for machine learning models. The field has evolved rapidly from molecular fingerprints through SMILES-based sequence models to graph neural networks and 3D geometry-aware architectures. Foundation models, characterized by large-scale pretraining on unlabeled molecular data followed by fine-tuning on downstream tasks, have introduced new opportunities for generalizability and transfer learning in drug discovery.</p>
<p>Despite this rapid progress, the authors identify a gap: no prior work has systematically reviewed MRL foundation models across all input modalities and pretraining paradigms. Existing surveys tend to focus on specific representations (e.g., graph-based methods) or specific applications (e.g., property prediction) without providing the cross-cutting perspective needed to guide model selection. This review fills that gap by offering a unified taxonomy and practical guidelines.</p>
<h2 id="taxonomy-of-molecular-descriptors-and-model-architectures">Taxonomy of Molecular Descriptors and Model Architectures</h2>
<p>The core organizational framework classifies models along two axes: the molecular descriptor used as input and the backbone architecture.</p>
<h3 id="molecular-descriptors">Molecular Descriptors</h3>
<p>The review identifies five primary descriptor types:</p>
<ol>
<li><strong>Molecular fingerprints</strong>: Binary vectors encoding structural features (e.g., Morgan fingerprints). Rarely used in foundation models due to information loss and high-dimensional sparsity.</li>
<li><strong>1D sequences</strong>: <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> and <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> string representations. SMILES is compact and widely used but can produce invalid molecules. SELFIES guarantees valid molecular strings by construction.</li>
<li><strong>2D topological graphs</strong>: Atoms as nodes, bonds as edges. Can be derived from SMILES via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>, making graph datasets effectively interchangeable with SMILES datasets.</li>
<li><strong>3D geometry</strong>: Spatial coordinates capturing conformational information, energy states, and stereochemistry. Experimentally expensive to obtain, limiting dataset availability.</li>
<li><strong>Multimodal</strong>: Combinations of the above with text, IUPAC names, knowledge graphs, and molecular images.</li>
</ol>
<p>The paper also discusses mathematically abstract molecular representations. For example, the <a href="https://en.wikipedia.org/wiki/Wiener_index">Wiener index</a> quantifies structural complexity:</p>
<p>$$
W = \sum_{i &lt; j} d_{ij}
$$</p>
<p>where $d_{ij}$ is the topological distance (shortest bonding path length) between atoms $i$ and $j$.</p>
<p>Degree centrality captures local connectivity:</p>
<p>$$
C_{D}(v_{i}) = \sum_{j=1}^{n} A_{ij}
$$</p>
<p>where $A \in \mathbb{R}^{n \times n}$ is the molecular graph adjacency matrix.</p>
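<p>Both quantities can be computed directly from the adjacency matrix; the sketch below recovers pairwise topological distances with Floyd-Warshall:</p>

```python
import numpy as np

def wiener_index(A):
    """Wiener index from an adjacency matrix A: the sum of shortest-path
    distances over unordered atom pairs, via Floyd-Warshall."""
    n = len(A)
    d = np.where(np.asarray(A) > 0, 1.0, np.inf)
    np.fill_diagonal(d, 0.0)
    for k in range(n):  # relax paths through intermediate atom k
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d.sum() / 2.0  # each unordered pair counted once

def degree_centrality(A):
    """C_D(v_i) = sum_j A_ij -- the row sums of the adjacency matrix."""
    return np.asarray(A).sum(axis=1)
```

For the path graph of a three-atom chain (e.g. the propane carbon skeleton), the distances are 1, 1, and 2, giving a Wiener index of 4 and degrees of 1, 2, 1.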
<h3 id="model-architectures">Model Architectures</h3>
<p>Models are classified into two primary categories:</p>
<p><strong>Unimodal-based models:</strong></p>
<ul>
<li><strong>Sequence-based</strong>: Transformer models operating on SMILES/SELFIES (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MoLFormer</a>, MolGEN, <a href="/notes/computational-chemistry/llms-for-chemistry/llamsmol-instruction-tuning-chemistry/">LlaSMol</a>). These capture syntactic patterns but miss spatial and topological features.</li>
<li><strong>Topological graph-based</strong>: GNN variants (GIN, GCN, GAT) and Transformer-based graph models (Graphormer). GNNs capture local topology through message passing; Transformers overcome locality limitations through global self-attention.</li>
<li><strong>3D geometry-based</strong>: Models like Uni-Mol and 3D PGT that incorporate spatial coordinates. Uni-Mol uses distance-aware self-attention with an SE(3)-equivariant coordinate head for rotation/translation invariance.</li>
<li><strong>Image-based</strong>: CNN-based models (ImageMol) that process 2D molecular images using visual representation learning.</li>
</ul>
<p><strong>Multimodal-based models:</strong></p>
<ul>
<li><strong>Sequence + Graph</strong>: <a href="/notes/computational-chemistry/chemical-language-models/multimodal-molecular/dual-view-molecule-pretraining/">DVMP</a>, PanGu Drug Model. Combines the strengths of string and topological representations.</li>
<li><strong>Graph + 3D Geometry</strong>: GraphMVP, Transformer-M. Enriches topological features with spatial information.</li>
<li><strong>Text + Molecular Structure</strong>: KV-PLM, MolT5, MoleculeSTM, MolReGPT, Y-mol. Aligns molecular structural information with biomedical text through cross-modal learning.</li>
</ul>
<h2 id="four-pretraining-paradigms-for-mrl">Four Pretraining Paradigms for MRL</h2>
<p>The review systematically categorizes pretraining strategies into four paradigms:</p>
<h3 id="masked-language-modeling-mlm">Masked Language Modeling (MLM)</h3>
<p>The cornerstone strategy for sequence-based models. Randomly masks tokens in molecular sequences and trains the model to predict them. ChemBERTa pretrained on 77 million SMILES sequences from PubChem achieves 5-10% improvement in AUC-ROC on property prediction tasks compared to task-specific models. MLM captures local dependencies and global sequence patterns but cannot model spatial or topological features, making it best suited for unimodal sequence inputs.</p>
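<p>The corruption step at the heart of MLM can be sketched as follows; this is a simplified version, as BERT-style pipelines such as ChemBERTa's also substitute random tokens for a fraction of the masked positions:</p>

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Randomly mask SMILES tokens for MLM pretraining.

    Returns the corrupted sequence and the indices the model must
    predict. A minimal sketch of the corruption step only.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            corrupted[i] = mask_token
            targets.append(i)
    return corrupted, targets
```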
<h3 id="contrastive-learning-cl">Contrastive Learning (CL)</h3>
<p>The dominant strategy for multimodal models. Constructs positive-negative sample pairs to align features across modalities or views. In unimodal settings, CL generates negative samples by perturbing molecular graphs. In multimodal settings, it aligns features from different modalities. GraphMVP, which contrasts 2D topological features with 3D spatial features, reduces RMSE by 15% on QM9 energy prediction compared to unimodal models. Performance depends heavily on the quality of positive sample construction.</p>
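<p>A generic InfoNCE-style contrastive objective over two aligned views can be sketched as below; this is not the GraphMVP implementation, just the standard form such methods build on:</p>

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss between two views.

    z1, z2: (batch, dim) L2-normalized embeddings of matched pairs
    (e.g. a 2D-graph view and a 3D-geometry view of the same molecule).
    Row i of z1 should score highest against row i of z2; all other
    rows act as in-batch negatives. Generic sketch.
    """
    logits = (z1 @ z2.T) / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on the diagonal
```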
<h3 id="reconstruction-based-pretraining-rbp">Reconstruction-Based Pretraining (RBP)</h3>
<p>Learns global molecular features by reconstructing original data from corrupted inputs. Tasks include node feature reconstruction, graph structure reconstruction, and coordinate/energy reconstruction. MGMAE masks more than 50% of nodes and edges in molecular graphs and trains the model to reconstruct them, achieving 94.2% AUC-ROC on BBBP. RBP captures global structural patterns but requires high model complexity and training cost.</p>
<h3 id="multimodal-alignment-pretraining-map">Multimodal Alignment Pretraining (MAP)</h3>
<p>Designed for multimodal inputs, aligning and fusing features from different modalities through cross-modal tasks. KV-PLM uses SMILES-to-text matching to align molecular structure and functional information. MAP fuses structural information (SMILES, graphs) with semantic information (text) but requires large-scale cross-modal labeled data, posing significant data acquisition challenges.</p>
<h2 id="downstream-applications-and-performance-benchmarks">Downstream Applications and Performance Benchmarks</h2>
<p>The review evaluates MRL foundation models across five application domains.</p>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The most common benchmark for MRL models. The review provides comprehensive ROC-AUC comparisons across eight <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification datasets:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>BBBP</th>
          <th>BACE</th>
          <th>ClinTox</th>
          <th>Tox21</th>
          <th>SIDER</th>
          <th>HIV</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MGMAE</td>
          <td>Graph</td>
          <td>94.2</td>
          <td>92.7</td>
          <td>96.7</td>
          <td>86.0</td>
          <td>66.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MPG</td>
          <td>Graph</td>
          <td>92.2</td>
          <td>92.0</td>
          <td>96.3</td>
          <td>83.7</td>
          <td>66.1</td>
          <td>-</td>
      </tr>
      <tr>
          <td>GROVER</td>
          <td>Graph+Trans.</td>
          <td>94.0</td>
          <td>89.4</td>
          <td>94.4</td>
          <td>83.1</td>
          <td>65.8</td>
          <td>-</td>
      </tr>
      <tr>
          <td>MoLFormer</td>
          <td>Sequence</td>
          <td>93.7</td>
          <td>88.2</td>
          <td>94.8</td>
          <td>84.7</td>
          <td>69.0</td>
          <td>82.2</td>
      </tr>
      <tr>
          <td>MM-Deacon</td>
          <td>Seq.+IUPAC</td>
          <td>78.5</td>
          <td>-</td>
          <td>99.5</td>
          <td>-</td>
          <td>69.3</td>
          <td>80.1</td>
      </tr>
      <tr>
          <td>Uni-Mol</td>
          <td>3D</td>
          <td>72.9</td>
          <td>85.7</td>
          <td>91.9</td>
          <td>79.6</td>
          <td>65.9</td>
          <td>80.8</td>
      </tr>
      <tr>
          <td>DVMP</td>
          <td>Seq.+Graph</td>
          <td>77.8</td>
          <td>89.4</td>
          <td>95.6</td>
          <td>79.1</td>
          <td>69.8</td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>TxD-T-LLM</td>
          <td>Seq.+Text</td>
          <td>-</td>
          <td>-</td>
          <td>86.3</td>
          <td>88.2</td>
          <td>-</td>
          <td>73.2</td>
      </tr>
  </tbody>
</table>
<p>The table shows that no single architecture dominates across all datasets. Transformer- and GIN-based architectures with graph inputs generally perform well. The review notes that model effectiveness depends heavily on the dataset, with Mole-BERT encountering negative transfer due to a small and unbalanced atomic vocabulary.</p>
<h3 id="molecular-generation">Molecular Generation</h3>
<p>MolGEN (SELFIES-based, 8B parameters) achieves 100% validity on synthetic molecules. MolT5 excels at text-to-molecule generation. Uni-Mol generates 3D conformations with 97.95% coverage on QM9.</p>
<h3 id="drug-drug-interaction-prediction"><a href="https://en.wikipedia.org/wiki/Drug_interaction">Drug-Drug Interaction</a> Prediction</h3>
<p>MPG achieves 96.6% AUC-ROC on BIOSNAP by combining unsupervised pretraining with supervised fine-tuning and multi-task learning.</p>
<h3 id="retrosynthesis-prediction"><a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">Retrosynthesis</a> Prediction</h3>
<p>DVMP achieves 66.5% top-1 accuracy on USPTO-50K when reaction types are provided as priors (54.2% without).</p>
<h3 id="drug-synergy-prediction">Drug Synergy Prediction</h3>
<p>SynerGPT (GPT-based) achieves 77.7% AUC-ROC in few-shot settings for novel drug combinations, outperforming baselines through in-context learning.</p>
<h2 id="guidelines-limitations-and-future-directions">Guidelines, Limitations, and Future Directions</h2>
<h3 id="model-selection-guidelines">Model Selection Guidelines</h3>
<p>The authors provide structured guidelines for choosing MRL foundation models based on:</p>
<ol>
<li><strong>Task objective</strong>: Property prediction favors GNNs or large pretrained frameworks (ChemBERTa-2, Uni-Mol). Generation tasks favor GPT-style autoregressive models (MolGEN). Retrosynthesis benefits from multimodal architectures.</li>
<li><strong>Data characteristics</strong>: SMILES/graph representations suit generation tasks. Knowledge graph-enhanced models benefit interaction and synergy prediction. Transfer learning helps data-limited scenarios.</li>
<li><strong>Interpretability needs</strong>: Transformer architectures are preferred when interpretability is required, as attention matrices enable visualization of learned molecular features.</li>
<li><strong>Computational budget</strong>: GIN-based models have $\mathcal{O}(|V| + |E|)$ complexity, while Transformer-based models scale as $\mathcal{O}(n^2 \cdot d)$.</li>
</ol>
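<p>The complexity gap in the fourth guideline can be made concrete with a back-of-envelope sketch (illustrative only: token count is assumed equal to atom count and constant factors are ignored):</p>

```python
def per_layer_cost(n_atoms: int, n_bonds: int, d: int = 256) -> tuple[int, int]:
    """Rough per-layer operation counts, orders of magnitude only.

    A GIN-style message-passing layer touches each node and edge once,
    giving O(|V| + |E|); self-attention compares every token pair across
    d hidden channels, giving O(n^2 * d) with n tokens (~ atoms).
    """
    gin = n_atoms + n_bonds          # O(|V| + |E|)
    attention = n_atoms ** 2 * d     # O(n^2 * d)
    return gin, attention

# e.g. a 40-atom drug-like molecule with 43 bonds:
# message passing scales with ~83 node/edge visits per layer,
# while full attention scales with 40^2 * 256 = 409,600 pair-channel ops.
gin_ops, attn_ops = per_layer_cost(40, 43)
```

For small molecules the quadratic term is still modest, which is one reason Transformer MRL models remain practical despite the asymptotic gap.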
<h3 id="limitations-and-future-directions">Limitations and Future Directions</h3>
<p>The review identifies five key challenges:</p>
<ol>
<li><strong>Multimodal data integration</strong>: Each representation paradigm has distinct limitations (1D neglects spatial configuration, 2D omits conformational details, 3D faces rotational invariance challenges). The authors propose incorporating <a href="/notes/computational-chemistry/molecular-dynamics/">molecular dynamics</a> trajectories as a dynamic modality and using cross-modal data augmentation.</li>
<li><strong>Data scarcity</strong>: Semi-supervised learning can achieve more than 90% of fully supervised performance using only 10% labeled data on QM9. Cross-modal augmentation (e.g., 3D InfoMax) can generate plausible 3D conformers from 2D graphs.</li>
<li><strong>Interpretability</strong>: Current methods rely primarily on attention-based visualization, which is insufficient for multimodal models. The authors suggest assessing decision consistency across modalities and incorporating chemical knowledge graphs.</li>
<li><strong>Training efficiency</strong>: Large parameter counts demand distributed parallel training techniques, with data parallelism being the most common approach.</li>
<li><strong>Robustness and generalization</strong>: Strategies include data augmentation (multiple SMILES representations, 3D conformer generation), meta-learning for rapid adaptation, and sparse attention mechanisms to reduce sensitivity to irrelevant long-range interactions.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>This is a review paper, so standard reproducibility criteria for experimental papers do not directly apply. The review compiles results from the original publications of each surveyed model.</p>
<h3 id="data">Data</h3>
<p>The review catalogs 28 representative molecular datasets used by the surveyed foundation models:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Size</th>
          <th>Descriptor</th>
          <th>Primary Use</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>~118M</td>
          <td>SMILES, 3D, Image, IUPAC</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ZINC15</td>
          <td>~980M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>~2.4M</td>
          <td>SMILES</td>
          <td>Pretraining</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>133,884</td>
          <td>SMILES</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM</a></td>
          <td>450,000</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>950,000</td>
          <td>SMILES</td>
          <td>Reaction prediction</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>4M</td>
          <td>3D coordinates</td>
          <td>Property prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Z-dot-max/MRL_Foundation_Review">Review Materials (GitHub)</a></td>
          <td>Code/Data</td>
          <td>Not specified</td>
          <td>Code and data tables for figures</td>
      </tr>
      <tr>
          <td><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12784970/">Paper (PMC)</a></td>
          <td>Paper</td>
          <td>CC-BY</td>
          <td>Open access via PubMed Central</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>All performance metrics reported in the review are directly cited from the original studies. The evaluation protocols follow each model&rsquo;s original setup. The review covers:</p>
<ul>
<li>ROC-AUC for classification tasks (property prediction, DDI, synergy)</li>
<li>RMSE/MAE for regression tasks</li>
<li>Validity and novelty for molecular generation</li>
<li>Top-k accuracy for retrosynthesis</li>
<li>COV and MAT for conformation generation</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Song, B., Zhang, J., Liu, Y., Liu, Y., Jiang, J., Yuan, S., Zhen, X., &amp; Liu, Y. (2025). A systematic review of molecular representation learning foundation models. <em>Briefings in Bioinformatics</em>, 27(1), bbaf703. <a href="https://doi.org/10.1093/bib/bbaf703">https://doi.org/10.1093/bib/bbaf703</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{song2025systematic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A systematic review of molecular representation learning foundation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Song, Bosheng and Zhang, Jiayi and Liu, Ying and Liu, Yuansheng and Jiang, Jing and Yuan, Sisi and Zhen, Xia and Liu, Yiping}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{27}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbaf703}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbaf703}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PMO: Benchmarking Sample-Efficient Molecular Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/</guid><description>PMO benchmarks 25 molecular optimization algorithms across 23 tasks under a 10K oracle budget, finding older methods like REINVENT still lead.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-optimization">A Standardized Benchmark for Molecular Optimization</h2>
<p>This is a <strong>Resource</strong> paper that introduces PMO (Practical Molecular Optimization), an open-source benchmark for evaluating molecular optimization algorithms with a focus on sample efficiency. The primary contribution is not a new algorithm but a comprehensive evaluation framework that exposes blind spots in how the field measures progress. By benchmarking 25 methods across 23 oracle functions under a fixed budget of 10,000 oracle calls, the authors provide a standardized protocol for transparent and reproducible comparison of molecular design methods.</p>
<h2 id="the-missing-dimension-oracle-budget-in-molecular-design">The Missing Dimension: Oracle Budget in Molecular Design</h2>
<p>Molecular optimization is central to drug and materials discovery, and the field has seen rapid growth in computational methods. Despite this progress, the authors identify three persistent problems with how methods are evaluated:</p>
<ol>
<li>
<p><strong>Lack of oracle budget control</strong>: Most papers do not report how many candidate molecules were evaluated by the oracle to achieve their results, despite this number spanning orders of magnitude. In practice, the most valuable oracles (wet-lab experiments, high-accuracy simulations) are expensive, making sample efficiency critical.</p>
</li>
<li>
<p><strong>Trivial or self-designed oracles</strong>: Many papers only report on easy objectives like QED or penalized LogP, or introduce custom tasks that make cross-method comparison impossible.</p>
</li>
<li>
<p><strong>Insufficient handling of randomness</strong>: Many algorithms are stochastic, yet existing benchmarks examined no more than five methods and rarely reported variance across independent runs.</p>
</li>
</ol>
<p>Prior benchmarks such as <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, Therapeutics Data Commons (TDC), and Tripp et al.&rsquo;s analysis each suffer from at least one of these issues. PMO addresses all three simultaneously.</p>
<h2 id="the-pmo-benchmark-design">The PMO Benchmark Design</h2>
<p>The core innovation of PMO is its evaluation protocol rather than any single algorithmic contribution. The benchmark enforces three design principles:</p>
<p><strong>Oracle budget constraint</strong>: All methods are limited to 10,000 oracle calls. This is deliberately much smaller than the unconstrained budgets typical in the literature, reflecting the practical reality that experimental evaluations are costly.</p>
<p><strong>AUC-based metric</strong>: Instead of reporting only the final top-K score, PMO uses the area under the curve (AUC) of top-K average property value versus oracle calls:</p>
<p>$$
\text{AUC Top-}K = \int_{0}^{N} \bar{f}_{K}(n)\, dn
$$</p>
<p>where $\bar{f}_{K}(n)$ is the average property value of the top $K$ molecules found after $n$ oracle calls, and $N = 10{,}000$. The paper uses $K = 10$. This metric rewards methods that reach high property values quickly, not just those that eventually converge given enough budget. All AUC values are min-max scaled to [0, 1].</p>
<p><strong>Standardized data</strong>: All methods use only the ZINC 250K dataset (approximately 250,000 molecules) whenever a database is required, ensuring a level playing field.</p>
<p>The benchmark includes 23 oracle functions: QED, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a>, <a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>-beta, <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a>, and 19 oracles from <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> covering multi-property objectives (MPOs) based on similarity, molecular weight, CLogP, and other pharmaceutically relevant criteria. All oracle scores are normalized to [0, 1].</p>
<h2 id="25-methods-across-nine-algorithm-families">25 Methods Across Nine Algorithm Families</h2>
<p>The benchmark evaluates 25 molecular optimization methods organized along two dimensions: molecular assembly strategy (SMILES, SELFIES, atom-level graphs, fragment-level graphs, synthesis-based) and optimization algorithm (GA, MCTS, BO, VAE, GAN, score-based modeling, hill climbing, RL, gradient ascent). Each method was hyperparameter-tuned on two held-out tasks (zaleplon_mpo and perindopril_mpo) and then evaluated across all 23 oracles for 5 independent runs.</p>
<p>The following table summarizes the top 10 methods by sum of mean AUC Top-10 across all 23 tasks:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Method</th>
          <th>Assembly</th>
          <th>Sum AUC Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>SMILES</td>
          <td>14.196</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Graph GA</td>
          <td>Fragments</td>
          <td>13.751</td>
      </tr>
      <tr>
          <td>3</td>
          <td>SELFIES-REINVENT</td>
          <td>SELFIES</td>
          <td>13.471</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GP BO</td>
          <td>Fragments</td>
          <td>13.156</td>
      </tr>
      <tr>
          <td>5</td>
          <td><a href="/notes/computational-chemistry/benchmark-problems/stoned-selfies-chemical-space-exploration/">STONED</a></td>
          <td>SELFIES</td>
          <td>13.024</td>
      </tr>
      <tr>
          <td>6</td>
          <td>LSTM HC</td>
          <td>SMILES</td>
          <td>12.223</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SMILES GA</td>
          <td>SMILES</td>
          <td>12.054</td>
      </tr>
      <tr>
          <td>8</td>
          <td>SynNet</td>
          <td>Synthesis</td>
          <td>11.498</td>
      </tr>
      <tr>
          <td>9</td>
          <td>DoG-Gen</td>
          <td>Synthesis</td>
          <td>11.456</td>
      </tr>
      <tr>
          <td>10</td>
          <td>DST</td>
          <td>Fragments</td>
          <td>10.989</td>
      </tr>
  </tbody>
</table>
<p>The bottom five methods by overall ranking were GFlowNet-AL, Pasithea, JT-VAE, Graph MCTS, and MolDQN.</p>
<p>REINVENT is ranked first across all six metrics considered (AUC Top-1, AUC Top-10, AUC Top-100, Top-1, Top-10, Top-100). Graph GA is consistently second. Both methods were released several years before many of the methods they outperform, yet they are rarely used as baselines in newer work.</p>
<h2 id="key-findings-older-methods-win-and-selfies-offers-limited-advantage">Key Findings: Older Methods Win and SELFIES Offers Limited Advantage</h2>
<p>The benchmark yields several findings with practical implications:</p>
<p><strong>No method solves optimization within realistic budgets.</strong> None of the 25 methods can optimize the included objectives within hundreds of oracle calls (the scale at which experimental evaluations would be feasible), except for trivially easy oracles like QED, DRD2, and osimertinib_mpo.</p>
<p><strong>Older algorithms remain competitive.</strong> REINVENT (2017) and Graph GA (2019) outperform all newer methods tested, including those published at top AI conferences. The absence of standardized benchmarking had obscured this fact.</p>
<p><strong>SMILES versus SELFIES.</strong> <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> was designed to guarantee syntactically valid molecular strings, but head-to-head comparisons show that SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> counterparts. Modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical issue. The one exception is genetic algorithms, where SELFIES-based GAs (<a href="/notes/computational-chemistry/benchmark-problems/stoned-selfies-chemical-space-exploration/">STONED</a>) outperform SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations.</p>
<p><strong>Model-based methods need careful design.</strong> Model-based variants (GP BO relative to Graph GA, GFlowNet-AL relative to GFlowNet) do not consistently outperform their model-free counterparts. GP BO beat Graph GA in 12 of 23 tasks but trailed it on summed AUC across all tasks, and GFlowNet-AL underperformed GFlowNet in nearly every task. The bottleneck is the quality of the predictive surrogate model, and naive surrogate integration can actually hurt performance.</p>
<p><strong>Oracle landscape determines method suitability.</strong> Clustering analysis of relative AUC Top-10 scores reveals clear patterns. String-based GAs excel on isomer-type oracles (which are sums of atomic contributions), while RL-based and fragment-based methods perform better on similarity-based MPOs. This suggests there is no single best algorithm, and method selection should be informed by the optimization landscape.</p>
<p><strong>Hyperparameter tuning and multiple runs are essential.</strong> Optimal hyperparameters differed substantially from default values in original papers. For example, REINVENT&rsquo;s performance is highly sensitive to its sigma parameter, and the best value under the constrained-budget setting is much larger than originally suggested. Methods like Graph GA and GP BO also show high variance across runs, underscoring the importance of reporting distributional outcomes rather than single-run results.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations: they cannot exhaustively tune every hyperparameter or include every variant of each method; the conclusion may be biased toward similarity-based oracles (which dominate the 23 tasks); important quantities like synthesizability and diversity are not thoroughly evaluated; and oracle calls from pre-training data in model-based methods are counted against the budget, which may disadvantage methods that could leverage prior data collection. For a follow-up study that adds property filters and diversity requirements to the PMO evaluation, see <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/evaluation/sample-efficiency-de-novo-generation/">Re-evaluating Sample Efficiency</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecule library</td>
          <td>ZINC 250K</td>
          <td>~250,000 molecules</td>
          <td>Used for screening, pre-training generative models, and fragment extraction</td>
      </tr>
      <tr>
          <td>Oracle functions</td>
          <td>TDC / GuacaMol</td>
          <td>23 tasks</td>
          <td>All scores normalized to [0, 1]</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>25 molecular optimization methods spanning 9 algorithm families and 5 molecular assembly strategies. Each method was hyperparameter-tuned on 2 held-out tasks (zaleplon_mpo, perindopril_mpo) using 3 independent runs, then evaluated on all 23 tasks with 5 independent runs each.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-K</td>
          <td>Area under curve of top-K average vs. oracle calls</td>
          <td>Primary metric; K=10; min-max scaled to [0, 1]</td>
      </tr>
      <tr>
          <td>Top-K</td>
          <td>Final top-K average property value at 10K calls</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>Sum rank</td>
          <td>Sum of AUC Top-10 across all 23 tasks</td>
          <td>Used for overall ranking</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper states hardware details are in Appendix C.2. The benchmark runs on standard compute infrastructure and does not require GPUs for most methods. Specific compute requirements vary by method.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">mol_opt</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full benchmark implementation with all 25 methods</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark results</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>All experimental results from the paper</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai">TDC</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Oracle functions and evaluation infrastructure</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{gao2022sample,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Wenhao and Fu, Tianfan and Sun, Jimeng and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{21342--21357}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, W., Fu, T., Sun, J., &amp; Coley, C. W. (2022). Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. <em>Advances in Neural Information Processing Systems</em>, 35, 21342-21357. <a href="https://arxiv.org/abs/2206.12411">https://arxiv.org/abs/2206.12411</a></p>
<p><strong>Publication</strong>: NeurIPS 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark Code (GitHub)</a></li>
<li><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark Results (Figshare)</a></li>
<li><a href="https://tdcommons.ai">Therapeutics Data Commons</a></li>
</ul>
]]></content:encoded></item><item><title>MolScore: Scoring and Benchmarking for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molscore-scoring-benchmarking-framework/</guid><description>MolScore provides a unified, open-source Python framework for scoring, evaluating, and benchmarking generative models applied to de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-unified-resource-for-generative-molecular-design">A Unified Resource for Generative Molecular Design</h2>
<p>MolScore is a <strong>Resource</strong> paper that introduces an open-source Python framework for scoring, evaluating, and benchmarking generative models in de novo drug design. The primary contribution is the software itself: a modular, configurable platform that consolidates functionality previously scattered across multiple tools (GuacaMol, MOSES, MolOpt, REINVENT, TDC) into a single package. MolScore provides scoring functions for molecular optimization, evaluation metrics for assessing the quality of generated molecules, and a benchmark mode for standardized comparison of generative models.</p>
<h2 id="the-fragmented-landscape-of-generative-model-evaluation">The Fragmented Landscape of Generative Model Evaluation</h2>
<p>Generative models for molecular design have proliferated rapidly, but evaluating and comparing them remains difficult. Existing benchmarks each address only part of the problem:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> provides 20 fixed optimization objectives but cannot separate top-performing models on most tasks, and custom objectives require code modification.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></strong> focuses on distribution-learning metrics but does not support molecular optimization.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">MolOpt</a></strong> extends benchmark evaluation to 25 generative approaches but lacks evaluation of the quality of generated chemistry.</li>
<li><strong>Docking benchmarks</strong> (<a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">smina-docking-benchmark</a>, <a href="/notes/computational-chemistry/benchmark-problems/dockstring-docking-benchmarks-ligand-design/">DOCKSTRING</a>, TDC) test structure-based scoring but often lack proper ligand preparation, leading generative models to exploit non-holistic objectives by generating large or greasy molecules.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong> provides configurable scoring functions but is tightly coupled to its own generative model architecture.</li>
</ul>
<p>No single tool offered configurable objectives, comprehensive evaluation metrics, generative-model-agnostic design, and graphical user interfaces together. This fragmentation forces practitioners to write custom glue code and makes reproducible comparison across methods difficult.</p>
<h2 id="modular-architecture-for-scoring-evaluation-and-benchmarking">Modular Architecture for Scoring, Evaluation, and Benchmarking</h2>
<p>MolScore is split into two sub-packages:</p>
<h3 id="molscore-molecule-scoring">molscore: Molecule Scoring</h3>
<p>The <code>molscore</code> sub-package handles iterative scoring of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> generated by any generative model. The workflow for each iteration:</p>
<ol>
<li>Parse and validate SMILES via RDKit, canonicalize, and check intra-batch uniqueness.</li>
<li>Cross-reference against previously generated molecules to reuse cached scores (saving compute for expensive scoring functions like docking).</li>
<li>Run user-specified scoring functions on valid, unique molecules (invalid molecules receive a score of 0).</li>
<li>Transform each score to a 0-1 range using configurable transformation functions (normalize, linear threshold, Gaussian threshold, step threshold).</li>
<li>Aggregate transformed scores into a single desirability score using configurable aggregation (weighted sum, product, geometric mean, arithmetic mean, <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto front</a>, or auto-weighted variants).</li>
<li>Optionally apply diversity filters to penalize non-diverse molecules, or use any scoring function as a multiplicative filter.</li>
</ol>
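<p>Steps 3&ndash;5 can be sketched as follows, with hypothetical transforms and weights (MolScore itself reads these from the JSON configuration and supports more transformation and aggregation options than shown here):</p>

```python
import math

def norm(x: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw score to [0, 1], clipped at the ends."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def gaussian(x: float, mu: float, sigma: float) -> float:
    """Gaussian threshold: 1 at the target value, decaying away from it."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def desirability(raw, transforms, weights) -> float:
    """Weighted geometric mean of transformed scores (one of several
    aggregation options); invalid molecules are scored 0 upstream."""
    total_w = sum(weights)
    logs = 0.0
    for x, t, w in zip(raw, transforms, weights):
        logs += w * math.log(max(t(x), 1e-9))  # clamp to avoid log(0)
    return math.exp(logs / total_w)

# e.g. combine a docking score (more negative is better, so flip its sign
# before normalizing) with molecular weight targeted near 350 Da:
score = desirability(
    raw=[-9.2, 360.0],
    transforms=[lambda x: norm(-x, 5.0, 12.0),
                lambda x: gaussian(x, 350.0, 50.0)],
    weights=[2.0, 1.0],
)
```

The geometric mean penalizes any single near-zero objective harshly, which is why MolScore also offers additive and Pareto-based aggregations for objectives that should trade off more gracefully.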
<p>The full objective is specified in a single JSON configuration file, with a Streamlit GUI provided for interactive configuration writing. The available scoring functions span:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Examples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>RDKit descriptors, linker descriptors, penalized logP</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Fingerprint similarity, ROCS, Open3DAlign, substructure matching</td>
      </tr>
      <tr>
          <td>Predictive models</td>
          <td>Scikit-learn models, PIDGINv5 (2,337 ChEMBL31 targets), ChemProp, ADMET-AI</td>
      </tr>
      <tr>
          <td>Docking</td>
          <td>Glide, PLANTS, GOLD, OEDock, Smina, Gnina, Vina, rDock</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>SA score, RA Score, AiZynthFinder, reaction filters</td>
      </tr>
  </tbody>
</table>
<p>Most scoring functions support multiprocessing, and computationally expensive functions (docking, ligand preparation) can be distributed across compute clusters via Dask.</p>
<h3 id="moleval-molecule-evaluation">moleval: Molecule Evaluation</h3>
<p>The <code>moleval</code> sub-package computes performance metrics on generated molecules relative to reference datasets. It extends the MOSES metric suite with additional intrinsic metrics (sphere exclusion diversity, scaffold uniqueness, functional group and ring system diversity, ZINC20 purchasability via molbloom) and extrinsic metrics (analogue similarity/coverage, functional group and ring system similarity, outlier bits or &ldquo;Silliness&rdquo;).</p>
<h3 id="benchmark-mode">Benchmark Mode</h3>
<p>A <code>MolScoreBenchmark</code> class iterates over a list of JSON configuration files, providing standardized comparison. Pre-built presets reimplement GuacaMol and MolOpt benchmarks, and users can define custom benchmark suites without writing code.</p>
<h2 id="case-studies-5-ht2a-ligand-design-and-fine-tuning-evaluation">Case Studies: 5-HT2A Ligand Design and Fine-Tuning Evaluation</h2>
<p>The authors demonstrate MolScore with a SMILES-based RNN generative model using <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb</a> for optimization, designing serotonin <a href="https://en.wikipedia.org/wiki/5-HT2A_receptor">5-HT2A</a> receptor ligands across three objective sets of increasing complexity.</p>
<h3 id="first-objective-set-basic-drug-properties">First Objective Set: Basic Drug Properties</h3>
<p>Four objectives combine predicted 5-HT2A activity (via PIDGINv5 random forest models at 1 uM) with synthesizability (RAscore) and/or <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a> permeability property ranges (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">TPSA</a> &lt; 70, HBD &lt; 2, logP 2-4, MW &lt; 400). All objectives were optimized successfully, with diversity filters preventing mode collapse. The single-property objective (5-HT2A activity alone) proved hardest, primarily because the diversity filter penalized similar molecules more heavily on this relatively easy task.</p>
<h3 id="second-objective-set-selectivity">Second Objective Set: Selectivity</h3>
<p>Six objectives incorporate selectivity proxies using PIDGINv5 models for off-target prediction against <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">Class A GPCR</a> membrane receptors (266 models), the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">D2 dopamine receptor</a>, dopamine receptor family, serotonin receptor subtypes, and combinations. These proved substantially harder: selectivity against dopamine and serotonin receptor families combined was barely improved during optimization. Even with imperfect predictive models, the PIDGINv5 ensemble correctly identified 95 of 126 known selective 5-HT2A ligands. Nearest-neighbor analysis of de novo molecules (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> 0.3-0.6) showed they tended to be structurally simpler versions of known selective ligands.</p>
<h3 id="third-objective-set-structure-based-docking">Third Objective Set: Structure-Based Docking</h3>
<p>Two objectives use molecular docking via GlideSP into 5-HT2A (PDB: 6A93) and D2 (PDB: 6CM4) crystal structures with full ligand preparation (LigPrep for stereoisomer/tautomer/protonation state enumeration). Multi-parameter optimization includes docking score, D155 polar interaction constraint, formal charge, and consecutive rotatable bond limits. Single-target docking scores reached the mean of known ligands within 200 steps, but optimizing for divergent 5-HT2A vs D2 docking scores was much harder due to binding pocket similarity. Protein-ligand interaction fingerprint analysis (ProLIF) revealed that molecules optimized for selectivity avoided specific binding pocket regions shared between the two receptors.</p>
<h3 id="evaluation-case-study-fine-tuning-epochs">Evaluation Case Study: Fine-Tuning Epochs</h3>
<p>The moleval sub-package was used to track metrics across fine-tuning epochs of a SMILES RNN on A2A receptor ligands, showing that just one or two epochs sufficed to increase similarity to the fine-tuning set, while further epochs reduced novelty and diversity.</p>
<h2 id="configurable-benchmarking-with-practical-drug-design-relevance">Configurable Benchmarking with Practical Drug Design Relevance</h2>
<p>MolScore provides a more comprehensive platform than any single existing tool. Compared to prior work:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>GuacaMol</th>
          <th>MOSES</th>
          <th>MolOpt</th>
          <th>TDC</th>
          <th>REINVENT</th>
          <th>MolScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Configurable objectives</td>
          <td>No</td>
          <td>N/A</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Optimization objectives</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Evaluation metrics</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Model-agnostic</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>GUI</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>The framework integrates into any Python-based generative model in three lines of code. Dependency conflicts between scoring function libraries are handled by running conflicting components as local servers from isolated conda environments.</p>
<p>Key limitations acknowledged by the authors include: the assumption of conda for environment management, the inherent difficulty of designing non-exploitable objectives, and the fact that ligand-based predictive models may have limited applicability domains for out-of-distribution de novo molecules.</p>
<p>Future directions include accepting 3D molecular conformations as inputs, structure interaction fingerprint rescoring, and dynamic configuration files for curriculum learning.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL compounds</td>
          <td>Not specified</td>
          <td>Standard ChEMBL training set for SMILES RNN</td>
      </tr>
      <tr>
          <td>Evaluation reference</td>
          <td>5-HT2A ligands from ChEMBL31</td>
          <td>3,771 compounds</td>
          <td>Extracted for score distribution comparison</td>
      </tr>
      <tr>
          <td>Activity models</td>
          <td>PIDGINv5 on ChEMBL31</td>
          <td>2,337 target models</td>
          <td>Random forest classifiers at various concentration thresholds</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>A2A receptor ligands</td>
          <td>Not specified</td>
          <td>Used for moleval case study</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The generative model used in case studies is a SMILES-based RNN with Augmented Hill-Climb reinforcement learning. Diversity filters penalize non-diverse molecules during optimization. Score transformation functions (normalize, linear threshold, Gaussian threshold, step threshold) map raw scores to 0-1 range. Aggregation functions (arithmetic mean, weighted sum, product, geometric mean, Pareto front) combine multi-parameter objectives.</p>
<h3 id="models">Models</h3>
<p>PIDGINv5 provides 2,337 pre-trained random forest classifiers on ChEMBL31 targets. RAscore provides pre-trained synthesizability prediction. ADMET-AI and ChemProp models are supported via isolated environments. Docking uses GlideSP with LigPrep for ligand preparation in the structure-based case study.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Intrinsic metrics: validity, uniqueness, scaffold uniqueness, internal diversity, sphere exclusion diversity, Solow-Polasky diversity, scaffold diversity, functional group diversity, ring system diversity, MCF and <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> filters, ZINC20 purchasability.</p>
<p>Extrinsic metrics: novelty, <a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">FCD</a>, analogue similarity/coverage, functional group similarity, ring system similarity, SNN similarity, fragment similarity, scaffold similarity, outlier bits, Wasserstein distance on LogP/SA/NP/QED/MW.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Docking-based objectives can be distributed across compute clusters via Dask.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Main framework, installable via pip</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore_examples">MolScore Examples</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration examples with SMILES-RNN, CReM, GraphGA</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2024). MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design. <em>Journal of Cheminformatics</em>, 16(1), 64. <a href="https://doi.org/10.1186/s13321-024-00861-w">https://doi.org/10.1186/s13321-024-00861-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2024molscore,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00861-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenBench: Benchmarking Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molgenbench-molecular-generative-models/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/molgenbench-molecular-generative-models/</guid><description>MolGenBench benchmarks 17 molecular generative models across 120 protein targets using novel metrics for target awareness, hit rates, and lead optimization.</description><content:encoded><![CDATA[<h2 id="a-comprehensive-benchmark-for-structure-based-molecular-generation">A Comprehensive Benchmark for Structure-Based Molecular Generation</h2>
<p>MolGenBench is a <strong>Resource</strong> paper that provides a large-scale, application-oriented benchmark for evaluating molecular generative models in the context of structure-based drug design (SBDD). The primary contribution is a dataset of 220,005 experimentally validated active molecules across 120 protein targets, organized into 5,433 chemical series, along with a suite of novel evaluation metrics. The benchmark addresses both <a href="https://en.wikipedia.org/wiki/De_novo_drug_design">de novo molecular design</a> and hit-to-lead (H2L) optimization, a critical drug discovery stage that existing benchmarks largely ignore.</p>
<h2 id="gaps-in-existing-molecular-generation-benchmarks">Gaps in Existing Molecular Generation Benchmarks</h2>
<p>Despite rapid progress in deep generative models for drug discovery, the evaluation landscape has not kept pace. The authors identify four categories of limitations in existing benchmarks:</p>
<ol>
<li>
<p><strong>Dataset construction</strong>: Existing benchmarks use overly stringent activity cutoffs and too few protein targets. The widely used CrossDocked2020 dataset contains very few reference ligands per target, making it difficult to evaluate whether a model can rediscover the full distribution of active compounds.</p>
</li>
<li>
<p><strong>Model selection</strong>: Prior benchmark studies evaluate a narrow range of architectures and do not systematically examine the effects of training data composition, prior knowledge integration, or architectural paradigm.</p>
</li>
<li>
<p><strong>Evaluation scenarios</strong>: Existing benchmarks focus exclusively on de novo generation. Hit-to-lead optimization, where a hit compound is refined through R-group modifications, remains unstandardized.</p>
</li>
<li>
<p><strong>Evaluation metrics</strong>: Standard metrics (QED, Vina score, SA score) correlate strongly with atom count and fail to assess target-specific generation capacity. The AddCarbon model illustrates this: simply adding random carbon atoms to training molecules achieves near-perfect scores on standard metrics while producing nonsensical chemistry.</p>
</li>
</ol>
<h2 id="novel-metrics-for-evaluating-molecular-generation">Novel Metrics for Evaluating Molecular Generation</h2>
<p>MolGenBench introduces three key metrics designed to capture aspects of model performance that existing metrics miss.</p>
<h3 id="target-aware-score-tascore">Target-Aware Score (TAScore)</h3>
<p>The TAScore measures whether a model generates target-specific molecules rather than generic structures. It compares the ratio of active molecule or scaffold recovery on a specific target to the background recovery across all targets:</p>
<p>$$
\text{TAScore}_{\text{label}, i} = \frac{S_{i} / S_{\text{all}}}{R_{i} / R_{\text{all}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>For target $i$: $R_{\text{all}}$ is the total number of distinct molecules generated across all 120 targets; $R_{i}$ is the subset matching known actives for target $i$ (without conditioning on target $i$); $S_{\text{all}}$ is the total generated when conditioned on target $i$; and $S_{i}$ is the subset matching known actives for target $i$. A TAScore above 1 indicates the model uses target-specific information effectively.</p>
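<p>The definition translates directly into code. The counts below are hypothetical, chosen so that conditioning on the target doubles the recovery rate:</p>

```python
def ta_score(s_i, s_all, r_i, r_all):
    """TAScore for target i: recovery rate when conditioned on the target,
    divided by the background recovery rate across all targets."""
    conditioned = s_i / s_all   # S_i / S_all
    background = r_i / r_all    # R_i / R_all
    return conditioned / background

# Hypothetical counts: 20 of 10,000 conditioned samples match known actives
# for target i, versus 1,200 of 1,200,000 unconditioned samples.
example = ta_score(s_i=20, s_all=10_000, r_i=1_200, r_all=1_200_000)  # 2.0
```

<p>A model that ignores the target entirely produces the same recovery rate conditioned or not, giving a TAScore of exactly 1.</p>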
<h3 id="hit-rate">Hit Rate</h3>
<p>The hit rate quantifies the efficiency of active compound discovery:</p>
<p>$$
\text{HitRate}_{\text{label}} = \frac{\mathcal{M}_{\text{active}}}{\mathcal{M}_{\text{sampled}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>where $\mathcal{M}_{\text{active}}$ is the number of unique active molecules or scaffolds found, and $\mathcal{M}_{\text{sampled}}$ is the total number of generated molecules.</p>
<h3 id="mean-normalized-affinity-mna-score">Mean Normalized Affinity (MNA) Score</h3>
<p>For H2L optimization, the MNA Score measures whether models generate compounds with improved potency relative to the known activity range within each chemical series:</p>
<p>$$
\text{NA}_{g} = \frac{\text{Affinity}_{g}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}{\text{Affinity}_{\max}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}
$$</p>
<p>$$
\text{MNAScore} = \frac{1}{G} \sum_{g=1}^{G} \text{NA}_{g}
$$</p>
<p>This normalizes affinities to [0, 1] within each series, enabling cross-series comparison.</p>
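<p>The two-step computation is a min-max normalization followed by a mean. The affinity values and units in this sketch are invented for illustration:</p>

```python
def normalized_affinity(affinity, series_min, series_max):
    """Min-max normalize a generated molecule's affinity within its chemical series."""
    return (affinity - series_min) / (series_max - series_min)

def mna_score(per_molecule_na):
    """Mean of normalized affinities over all G generated molecules."""
    return sum(per_molecule_na) / len(per_molecule_na)

# Toy series whose known actives span affinities 4.0 to 9.0.
na = [normalized_affinity(a, 4.0, 9.0) for a in (6.5, 8.0, 5.0)]
score = mna_score(na)  # mean of 0.5, 0.8, 0.2
```
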
<h2 id="systematic-evaluation-of-17-generative-models-across-two-drug-discovery-scenarios">Systematic Evaluation of 17 Generative Models Across Two Drug Discovery Scenarios</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The MolGenBench dataset was built from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL v33</a>. Ligands failing RDKit validation were discarded, along with entries where binding affinity exceeded 10 uM. The 120 protein targets were selected based on minimum thresholds: at least 50 active molecules, at least 50 unique Bemis-Murcko scaffolds, and at least 20 distinct chemical series per target. For H2L optimization, maximum common substructures (MCS) were identified per series, with dual thresholds requiring the MCS to appear in over 80% of molecules and cover more than one-third of each molecule&rsquo;s atoms. The top 5 series per target (ranked by dockable ligands) formed the H2L test set: 600 compound series across 120 targets.</p>
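<p>The dual-threshold series check can be sketched as below. This is one reading of the thresholds, with the MCS matching abstracted to per-molecule atom counts (a real implementation would compute the MCS with a tool like RDKit); how the paper treats molecules lacking the MCS is an assumption here.</p>

```python
def passes_series_thresholds(mcs_atom_counts, mol_atom_counts,
                             coverage_frac=0.80, atom_frac=1 / 3):
    """Dual-threshold check for an MCS-defined chemical series.

    mcs_atom_counts[i] is the number of atoms of molecule i matched by the
    series MCS (0 if the MCS is absent); mol_atom_counts[i] is molecule i's
    total atom count.
    """
    n = len(mol_atom_counts)
    # Threshold 1: the MCS must appear in over 80% of the series' molecules.
    hits = sum(1 for c in mcs_atom_counts if c > 0)
    if hits / n <= coverage_frac:
        return False
    # Threshold 2: the MCS must cover more than one-third of each matched
    # molecule's atoms.
    return all(c / total > atom_frac
               for c, total in zip(mcs_atom_counts, mol_atom_counts) if c > 0)
```
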
<h3 id="evaluated-models">Evaluated Models</h3>
<p><strong>De novo models (10)</strong>: Pocket2Mol, TargetDiff, FLAG, DecompDiff, SurfGen, PocketFlow, MolCraft, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/">TamGen</a>, DiffSBDD-M (trained on BindingMOAD), DiffSBDD-C (trained on CrossDock). These span autoregressive, diffusion, and Bayesian flow network architectures.</p>
<p><strong>H2L models (7)</strong>: Fragment-based (DiffSBDD-M/C inpainting, Delete, DiffDec) and ligand-based (ShEPhERD, ShapeMol, PGMG). These use pharmacophore, surface, or shape priors.</p>
<p>Models were further stratified by whether test proteins appeared in their CrossDock training set (&ldquo;Proteins in CrossDock&rdquo; vs. &ldquo;Proteins Not in CrossDock&rdquo;), enabling direct measurement of generalization.</p>
<h3 id="evaluation-dimensions">Evaluation Dimensions</h3>
<p>The benchmark evaluates six dimensions:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Key Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Basic molecular properties</td>
          <td>Validity, QED, SA score, uniqueness, diversity, JSD alignment</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>Industry-standard filter pass rates (Eli Lilly, Novartis, ChEMBL rules)</td>
      </tr>
      <tr>
          <td>Conformational quality</td>
          <td>PoseBusters pass rate, strain energy, steric clash frequency</td>
      </tr>
      <tr>
          <td>Active compound recovery</td>
          <td>Hit rate, hit fraction, active molecule and scaffold recovery counts</td>
      </tr>
      <tr>
          <td>Target awareness</td>
          <td>TAScore at molecule and scaffold levels</td>
      </tr>
      <tr>
          <td>Lead optimization</td>
          <td>MNA Score, number of series with hits</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-basic-properties-and-chemical-safety">Key Results: Basic Properties and Chemical Safety</h3>
<p>Most models generate drug-like molecules with reasonable QED (0.4-0.6) and SA scores (0.5-0.8). However, two models (FLAG, SurfGen) showed validity below 0.4. TamGen exhibited low uniqueness (~27%), suggesting overreliance on pretrained patterns.</p>
<p>Chemical filter pass rates revealed a more concerning picture: only TamGen and PGMG exceeded 50% of molecules passing all industry-standard filters. Most models fell below 40%, and some (FLAG, SurfGen) below 5%. Nearly 70% of reference active molecules passed the same filters, indicating models frequently generate high-risk compounds.</p>
<h3 id="key-results-conformational-quality">Key Results: Conformational Quality</h3>
<p>MolCraft achieved the highest PoseBusters validity (0.783 PB-valid score among valid molecules). PocketFlow, despite perfect SMILES validity, had fewer than half of its valid molecules pass conformational checks. Most models produced conformations with higher <a href="https://en.wikipedia.org/wiki/Strain_(chemistry)">strain energy</a> than those from <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>. Some models (MolCraft for de novo, DiffDec for H2L) surpassed Vina in minimizing steric clashes, suggesting advanced architectures can exceed the patterns in their training data.</p>
<h3 id="key-results-active-compound-recovery-and-hit-rates">Key Results: Active Compound Recovery and Hit Rates</h3>
<p>De novo models exhibited very low hit rates. The highest molecular hit rate among de novo models was 0.124% on proteins in CrossDock, dropping to 0.024% on unseen proteins. Scaffold-level hit rates were 10-fold higher, showing that generating pharmacologically plausible scaffolds is considerably easier than generating fully active molecules.</p>
<p>After removing molecules overlapping with the CrossDock training set, TamGen&rsquo;s recovery dropped substantially (from 30.3 to 18.7 targets), confirming significant memorization effects. On proteins not in CrossDock, half of the de novo models failed to recover any active molecules at all.</p>
<p>Fragment-based H2L models substantially outperformed both de novo models and ligand-based H2L approaches. Delete recovered active molecules in 44.3 series (out of 600), and DiffDec in 34.7 series.</p>
<h3 id="key-results-target-awareness">Key Results: Target Awareness</h3>
<p>Most de novo models failed the TAScore evaluation. PocketFlow showed the strongest target awareness at the scaffold level, with only 27% of targets showing TAScore &lt; 1 (indicating no target specificity). At the molecular level, results were even weaker: TamGen achieved TAScore &gt; 1 for only 30.6% of CrossDock-seen targets and just 4 out of 35 unseen targets. Most models generated structurally similar molecules regardless of which target they were conditioned on.</p>
<h3 id="key-results-h2l-optimization-mna-score">Key Results: H2L Optimization (MNA Score)</h3>
<p>DiffDec achieved the highest total active hits (121.7) and the best MNA Score (0.523), followed by Delete (104.7 hits, MNA Score 0.482). Ligand-based models (ShEPhERD, PGMG) recovered fewer hits but showed higher MNA Scores per hit, suggesting pharmacophore-based priors help prioritize more potent molecules when actives are found. The most successful model (Delete) achieved a hit in only 9.6% of series (57/600), indicating substantial room for improvement.</p>
<h2 id="critical-findings-and-limitations-of-current-molecular-generative-models">Critical Findings and Limitations of Current Molecular Generative Models</h2>
<p>The benchmark reveals several consistent limitations:</p>
<ol>
<li>
<p><strong>Low screening efficiency</strong>: De novo models achieve molecular hit rates below 0.13%, far from practical utility. Scaffold recovery is more feasible but still limited.</p>
</li>
<li>
<p><strong>Weak target awareness</strong>: Most SBDD models fail to use protein structural information effectively, generating similar molecules across different targets. This raises concerns about off-target effects.</p>
</li>
<li>
<p><strong>Conformational prediction remains difficult</strong>: Most models produce conformations with higher strain energy than classical docking, and only a small fraction (typically below 23%) of generated poses match redocked conformations within 2 Angstrom RMSD.</p>
</li>
<li>
<p><strong>Generalization gap</strong>: Performance consistently drops on proteins not in the training set, and prior benchmarks that do not stratify by training data exposure overestimate real-world utility.</p>
</li>
<li>
<p><strong>Inference-time scaling does not solve the problem</strong>: Sampling up to 100,000 molecules increased the absolute number of active discoveries but with diminishing efficiency. Without better scoring functions, scaling sampling offers limited practical value.</p>
</li>
<li>
<p><strong>Chemical safety</strong>: Most models produce a majority of molecules that fail industry-standard reactivity and promiscuity filters.</p>
</li>
</ol>
<p>The authors acknowledge that the benchmark&rsquo;s 220,005 active molecules represent a biased subset of bioactive chemical space. A model&rsquo;s failure to rediscover known actives for a given target may therefore reflect sampling limitations rather than evidence that its generated molecules are inactive.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Active compounds</td>
          <td>ChEMBL v33</td>
          <td>220,005 molecules, 120 targets</td>
          <td>Filtered at 10 uM affinity threshold</td>
      </tr>
      <tr>
          <td>H2L series</td>
          <td>ChEMBL v33 + PDB</td>
          <td>5,433 series (600 used for H2L test)</td>
          <td>MCS-based series construction</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a></td>
          <td>120 targets</td>
          <td>One PDB entry per target</td>
      </tr>
      <tr>
          <td>Training (most models)</td>
          <td>CrossDocked2020</td>
          <td>Varies</td>
          <td>Standard SBDD training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>De novo models sampled 1,000 molecules per target; H2L models sampled 200 per series</li>
<li>All experiments repeated three times with different random seeds</li>
<li>Docking performed with AutoDock Vina using standard parameters</li>
<li>Chemical filters applied via the medchem library</li>
<li>Conformational quality assessed with PoseBusters and PoseCheck</li>
<li>Interaction scores computed via ProLIF with frequency-weighted normalization</li>
</ul>
<h3 id="models">Models</h3>
<p>All 17 models were obtained from their official GitHub repositories and run with default configurations. The benchmark does not introduce new model architectures.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Summary of key metrics across the best-performing models in each category:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best De Novo</th>
          <th>Value</th>
          <th>Best H2L</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PB-valid score</td>
          <td>MolCraft</td>
          <td>0.783</td>
          <td>DiffSBDD-M</td>
          <td>0.597</td>
      </tr>
      <tr>
          <td>Molecular hit rate (in CrossDock)</td>
          <td>TamGen</td>
          <td>0.124%</td>
          <td>DiffDec</td>
          <td>Higher than de novo</td>
      </tr>
      <tr>
          <td>Scaffold hit rate (in CrossDock)</td>
          <td>PocketFlow</td>
          <td>&gt;10%</td>
          <td>Delete</td>
          <td>Lower than PocketFlow</td>
      </tr>
      <tr>
          <td>TAScore scaffold (% targets &gt;1)</td>
          <td>PocketFlow</td>
          <td>73%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>MNA Score</td>
          <td>N/A</td>
          <td>N/A</td>
          <td>DiffDec</td>
          <td>0.523</td>
      </tr>
      <tr>
          <td>Filter pass rate</td>
          <td>TamGen</td>
          <td>&gt;50%</td>
          <td>PGMG</td>
          <td>&gt;50%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements are not detailed in the paper. Models were run using their default configurations from official repositories.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CAODH/MolGenBench">MolGenBench</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark evaluation framework</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/17572553">Zenodo dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND 4.0</td>
          <td>Processed data and source data for all results</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, D., Fan, Z., Yu, J., Chen, M., Jiang, X., Sheng, X., Wang, X., Zeng, C., Luo, X., Teng, D., &amp; Zheng, M. (2025). Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench. <em>bioRxiv</em>. <a href="https://doi.org/10.1101/2025.11.03.686215">https://doi.org/10.1101/2025.11.03.686215</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cao2025molgenbench,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cao, Duanhua and Fan, Zhehuan and Yu, Jie and Chen, Mingan and Jiang, Xinyu and Sheng, Xia and Wang, Xingyou and Zeng, Chuanlong and Luo, Xiaomin and Teng, Dan and Zheng, Mingyue}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2025.11.03.686215}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MoleculeNet: Benchmarking Molecular Machine Learning</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/</guid><description>MoleculeNet curates 17 datasets across quantum mechanics, physical chemistry, biophysics, and physiology with standardized splits and metrics for molecular ML.</description><content:encoded><![CDATA[<h2 id="a-resource-paper-for-molecular-machine-learning-benchmarking">A Resource Paper for Molecular Machine Learning Benchmarking</h2>
<p>This is a <strong>Resource</strong> paper. MoleculeNet provides a standardized benchmark suite for evaluating molecular machine learning methods. Its primary contribution is the curation of 17 public datasets spanning four categories of molecular properties, together with standardized evaluation metrics, multiple dataset splitting strategies, and open-source implementations of featurization and learning algorithms via the DeepChem library.</p>
<h2 id="why-molecular-ml-needed-a-unified-benchmark">Why Molecular ML Needed a Unified Benchmark</h2>
<p>Prior to MoleculeNet, algorithmic progress in molecular machine learning was difficult to measure. Individual papers benchmarked proposed methods on different datasets with different metrics, making cross-method comparison unreliable. Several factors make molecular ML particularly challenging:</p>
<ol>
<li><strong>Data scarcity</strong>: Molecular datasets are much smaller than those available for computer vision or NLP, since obtaining accurate chemical property measurements requires specialized instruments and expert supervision.</li>
<li><strong>Heterogeneous outputs</strong>: Properties of interest range from quantum mechanical characteristics to macroscopic physiological effects on the human body.</li>
<li><strong>Variable input structures</strong>: Molecules have arbitrary size, variable connectivity, and many possible 3D conformers, all of which must be encoded into fixed-length representations for conventional ML algorithms.</li>
<li><strong>No standard evaluation protocol</strong>: Without prescribed metrics, splits, or data subsets, two methods using the same underlying database (e.g., PubChem) could be entirely incomparable.</li>
</ol>
<p>Existing databases like PubChem, ChEMBL, and the Quantum Machine collections provided raw data but did not define evaluation protocols suitable for machine learning development. MoleculeNet bridges this gap, following the precedent set by ImageNet in computer vision and WordNet in NLP.</p>
<h2 id="core-design-datasets-splits-metrics-and-featurizations">Core Design: Datasets, Splits, Metrics, and Featurizations</h2>
<p>MoleculeNet is organized around four components: curated datasets, splitting methods, evaluation metrics, and molecular featurizations.</p>
<h3 id="datasets-across-four-property-categories">Datasets Across Four Property Categories</h3>
<p>The benchmark includes 17 datasets covering over 700,000 compounds and more than 800 tasks. These are organized into four categories reflecting different levels of molecular properties:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Dataset</th>
          <th>Tasks</th>
          <th>Compounds</th>
          <th>Task Type</th>
          <th>Rec. Split</th>
          <th>Rec. Metric</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Quantum Mechanics</td>
          <td>QM7</td>
          <td>1</td>
          <td>7,165</td>
          <td>Regression</td>
          <td>Stratified</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM7b</td>
          <td>14</td>
          <td>7,211</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM8</td>
          <td>12</td>
          <td>21,786</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td></td>
          <td>QM9</td>
          <td>12</td>
          <td>133,885</td>
          <td>Regression</td>
          <td>Random</td>
          <td>MAE</td>
      </tr>
      <tr>
          <td>Physical Chemistry</td>
          <td>ESOL</td>
          <td>1</td>
          <td>1,128</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>FreeSolv</td>
          <td>1</td>
          <td>643</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>Lipophilicity</td>
          <td>1</td>
          <td>4,200</td>
          <td>Regression</td>
          <td>Random</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA</td>
          <td>128</td>
          <td>439,863</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>MUV</td>
          <td>17</td>
          <td>93,127</td>
          <td>Classification</td>
          <td>Random</td>
          <td>PRC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>HIV</td>
          <td>1</td>
          <td>41,913</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>PDBbind</td>
          <td>1</td>
          <td>11,908</td>
          <td>Regression</td>
          <td>Time</td>
          <td>RMSE</td>
      </tr>
      <tr>
          <td></td>
          <td>BACE</td>
          <td>1</td>
          <td>1,522</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP</td>
          <td>1</td>
          <td>2,053</td>
          <td>Classification</td>
          <td>Scaffold</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>Tox21</td>
          <td>12</td>
          <td>8,014</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ToxCast</td>
          <td>617</td>
          <td>8,615</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>SIDER</td>
          <td>27</td>
          <td>1,427</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
      <tr>
          <td></td>
          <td>ClinTox</td>
          <td>2</td>
          <td>1,491</td>
          <td>Classification</td>
          <td>Random</td>
          <td>ROC-AUC</td>
      </tr>
  </tbody>
</table>
<p><strong>Quantum mechanics</strong> datasets (QM7, QM7b, QM8, QM9) contain DFT-computed electronic properties for subsets of the <a href="/notes/computational-chemistry/datasets/gdb-17/">GDB</a> database. <strong>Physical chemistry</strong> datasets cover solubility (ESOL), hydration free energy (FreeSolv), and lipophilicity. <strong>Biophysics</strong> datasets include high-throughput screening results (PCBA, MUV), HIV inhibition activity, protein-ligand binding affinity (PDBbind), and BACE-1 inhibition. <strong>Physiology</strong> datasets cover blood-brain barrier penetration (BBBP), toxicity (Tox21, ToxCast), side effects (SIDER), and clinical trial toxicity (ClinTox).</p>
<h3 id="data-splitting-strategies">Data Splitting Strategies</h3>
<p>MoleculeNet implements four splitting methods, all using an 80/10/10 train/validation/test ratio:</p>
<ul>
<li><strong>Random splitting</strong>: Standard random assignment to subsets.</li>
<li><strong>Scaffold splitting</strong>: Separates molecules by their 2D structural frameworks (Bemis-Murcko scaffolds), providing a harder generalization test since structurally different molecules appear in different subsets.</li>
<li><strong>Stratified splitting</strong>: Ensures each subset contains the full range of label values (used for QM7).</li>
<li><strong>Time splitting</strong>: Trains on older data and tests on newer data to mimic real-world development (used for PDBbind).</li>
</ul>
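<p>The key property of scaffold splitting is that an entire scaffold class is assigned to exactly one subset, so test molecules are structurally unlike anything seen in training. A minimal sketch of that group-assignment logic is below; in practice DeepChem derives the scaffold key from Bemis-Murcko frameworks via RDKit, whereas here <code>scaffold_of</code> is a deliberately toy placeholder (the first SMILES character) so the example stays dependency-free.</p>

```python
from collections import defaultdict

def scaffold_split(mols, scaffold_of, frac_train=0.8, frac_valid=0.1):
    """Assign whole scaffold groups to train/valid/test so no scaffold spans splits."""
    groups = defaultdict(list)
    for mol in mols:
        groups[scaffold_of(mol)].append(mol)
    # Larger scaffold classes are placed first, mirroring DeepChem's ordering.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test

# Toy example: the "scaffold" is just the first character of the SMILES string.
mols = ["C1CC1O", "C1CC1N", "c1ccccc1", "c1ccncc1",
        "CCO", "CCN", "CC=O", "NCC", "OCC"]
train, valid, test = scaffold_split(mols, scaffold_of=lambda s: s[0])
```

<p>With a real Bemis-Murcko key the same loop produces the harder generalization test described above: held-out molecules share no framework with the training set.</p>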
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>Regression tasks use MAE or RMSE depending on the dataset. Classification tasks use either ROC-AUC or PRC-AUC. The choice between ROC-AUC and PRC-AUC depends on class imbalance: PRC-AUC is recommended for datasets with positive rates below 2% (PCBA, MUV), since precision-recall curves better capture performance under extreme imbalance.</p>
<p>The false positive rate and precision are defined as:</p>
<p>$$
\text{FPR} = \frac{\text{false positive}}{\text{false positive} + \text{true negative}}
$$</p>
<p>$$
\text{precision} = \frac{\text{true positive}}{\text{false positive} + \text{true positive}}
$$</p>
<p>When positive samples form a small fraction of the data, false positives influence precision much more than FPR, making PRC-AUC more informative than ROC-AUC.</p>
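<p>The effect is easy to verify numerically. The sketch below uses illustrative counts (a MUV-like 0.2% positive rate, invented recall and FPR values, not figures from the paper) to show how a classifier with a seemingly excellent FPR can still have low precision:</p>

```python
def fpr(fp, tn):
    """False positive rate: FP / (FP + TN)."""
    return fp / (fp + tn)

def precision(tp, fp):
    """Precision: TP / (TP + FP)."""
    return tp / (tp + fp)

# MUV-like imbalance: 200 actives among 100,000 compounds (0.2% positive rate).
pos, neg = 200, 99_800
# Suppose a classifier recalls 80% of actives at a 1% false positive rate:
tp = int(0.8 * pos)    # 160 true positives
fp = int(0.01 * neg)   # 998 false positives
tn = neg - fp

print(f"FPR = {fpr(fp, tn):.3f}")              # 0.010 -- looks excellent
print(f"precision = {precision(tp, fp):.3f}")  # 0.138 -- most hits are false
```

<p>ROC-AUC, built from FPR, barely registers the 998 false positives against 98,802 negatives, while precision collapses, which is why PRC-AUC is the recommended metric for PCBA and MUV.</p>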
<h3 id="featurization-methods">Featurization Methods</h3>
<p>MoleculeNet implements six molecular featurization approaches:</p>
<ol>
<li><strong>ECFP (Extended-Connectivity Fingerprints)</strong>: Fixed-length binary fingerprints capturing topological substructures via hashing.</li>
<li><strong><a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb Matrix</a></strong>: Encodes nuclear charges and 3D coordinates through atomic self-energies and Coulomb repulsion:</li>
</ol>
<p>$$
M_{IJ} = \begin{cases} 0.5 Z_{I}^{2.4} &amp; \text{for } I = J \\ \frac{Z_{I} Z_{J}}{|\mathbf{R}_{I} - \mathbf{R}_{J}|} &amp; \text{for } I \neq J \end{cases}
$$</p>
<ol start="3">
<li><strong>Grid Featurizer</strong>: Designed for PDBbind, incorporating both ligand and protein structural information including salt bridges, hydrogen bonds, and SPLIF fingerprints.</li>
<li><strong>Symmetry Functions</strong>: Preserve rotational and permutation symmetry through radial and angular functions between atom pairs and triplets.</li>
<li><strong>Graph Convolutions</strong>: Compute initial atom feature vectors and neighbor lists from molecular graphs.</li>
<li><strong>Weave</strong>: Similar to graph convolutions but also computes pairwise atom features encoding bond properties, graph distance, and ring information.</li>
</ol>
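<p>The Coulomb matrix definition above translates directly into code. The sketch below applies it to an H<sub>2</sub> molecule with an illustrative geometry (a ~1.4 Bohr bond length); charges and coordinates are assumed to be in atomic units, as in the standard formulation:</p>

```python
import math

def coulomb_matrix(charges, coords):
    """Coulomb matrix M_IJ: atomic self-energies on the diagonal,
    Coulomb repulsion off-diagonal (atomic units assumed)."""
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * charges[i] ** 2.4
            else:
                r = math.dist(coords[i], coords[j])
                M[i][j] = charges[i] * charges[j] / r
    return M

# H2 with an ~1.4 Bohr bond length (illustrative geometry):
M = coulomb_matrix([1.0, 1.0], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])
# Diagonal entries: 0.5 * 1**2.4 = 0.5; off-diagonal: 1 * 1 / 1.4
```

<p>Because the matrix depends only on charges and pairwise distances, it is invariant to rotation and translation, though not to atom reordering, which is why sorted or randomized variants are used in practice.</p>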
<h2 id="benchmarked-models-and-experimental-setup">Benchmarked Models and Experimental Setup</h2>
<p>MoleculeNet benchmarks 12 learning algorithms divided into conventional methods and graph-based methods.</p>
<h3 id="conventional-methods">Conventional Methods</h3>
<ul>
<li><strong>Logistic Regression</strong> (classification only)</li>
<li><strong>Kernel SVM</strong> with radial basis function kernel</li>
<li><strong>Kernel Ridge Regression (KRR)</strong></li>
<li><strong>Random Forests</strong></li>
<li><strong>Gradient Boosting</strong> (XGBoost)</li>
<li><strong>Singletask/Multitask Networks</strong>: Fully connected networks with shared layers across tasks</li>
<li><strong>Bypass Networks</strong>: Multitask networks augmented with per-task &ldquo;bypass&rdquo; layers that directly connect inputs to outputs</li>
<li><strong>Influence Relevance Voting (IRV)</strong>: Refined K-nearest neighbor classifiers using Jaccard-Tanimoto similarity:</li>
</ul>
<p>$$
S(\vec{A}, \vec{B}) = \frac{|A \cap B|}{|A \cup B|}
$$</p>
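<p>On binary fingerprints this similarity reduces to set operations over the on-bit positions. A minimal sketch (with made-up bit positions for illustration):</p>

```python
def tanimoto(a, b):
    """Jaccard-Tanimoto similarity between two sets of fingerprint on-bits."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(a & b) / len(a | b)

# Two hypothetical fingerprints sharing 2 of 4 distinct on-bits:
s = tanimoto({1, 5, 9}, {1, 9, 12})
# |{1, 9}| = 2, |{1, 5, 9, 12}| = 4, so s = 0.5
```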
<h3 id="graph-based-methods">Graph-Based Methods</h3>
<ul>
<li><strong>Graph Convolutional Models (GC)</strong>: Extend circular fingerprints with learnable convolutions over molecular graphs.</li>
<li><strong>Weave Models</strong>: Update atom features using information from all other atoms and their pairwise features.</li>
<li><strong>Directed Acyclic Graph (DAG) Models</strong>: Define directed bonds toward a central atom and propagate features through the directed graph.</li>
<li><strong>Deep Tensor Neural Networks (DTNN)</strong>: Use nuclear charges and distance matrices directly, updating atom embeddings based on pairwise physical distances.</li>
<li><strong>ANI-1</strong>: Learns transferable potentials using symmetry function features with atom-type-specific neural networks.</li>
<li><strong>Message Passing Neural Networks (MPNN)</strong>: Generalized framework with edge-dependent message functions and set2set readout.</li>
</ul>
<h3 id="experimental-protocol">Experimental Protocol</h3>
<p>Gaussian process hyperparameter optimization was applied to each dataset-model combination, followed by three independent runs with different random seeds. All results are reported as means with standard deviations. Variable training-size experiments were conducted on Tox21, FreeSolv, and QM7 to study data efficiency.</p>
<h2 id="key-findings-across-property-categories">Key Findings Across Property Categories</h2>
<h3 id="biophysics-and-physiology">Biophysics and Physiology</h3>
<p>Graph convolutional and weave models showed strong performance on larger datasets with less overfitting than conventional methods. In the variable-training-size experiments on Tox21, graph-based models trained on 30% of the data outperformed multitask networks trained on 90%. However, for smaller single-task datasets (under 3,000 samples), kernel SVM and ensemble tree methods were more robust. On highly imbalanced datasets like MUV (0.20% positive rate), graph-based models struggled to control false positives.</p>
<p>Multitask training had a regularizing effect, reducing the gap between train and test scores compared to single-task models. Bypass networks consistently matched or exceeded vanilla multitask networks, confirming that per-task layers add explanatory power.</p>
<h3 id="physical-chemistry">Physical Chemistry</h3>
<p>Graph-based methods (GC, DAG, MPNN, Weave) provided significant improvements over single-task networks for predicting solubility, solvation energy, and lipophilicity. The best models achieved accuracy comparable to ab initio predictions (within 0.5 RMSE for ESOL, within 1.5 kcal/mol for FreeSolv). On FreeSolv, a weave model trained on approximately 200 samples matched the accuracy of alchemical free energy calculations.</p>
<h3 id="quantum-mechanics">Quantum Mechanics</h3>
<p>Models incorporating 3D distance information (DTNN, MPNN, KRR with Coulomb matrix) substantially outperformed models using only topological features. DTNN or MPNN was the best-performing model on 28 of 39 tasks across the QM datasets. The choice of physics-aware featurization proved more important than the choice of learning algorithm for these tasks.</p>
<h3 id="summary-of-best-performances">Summary of Best Performances</h3>
<p>Graph-based models outperformed conventional methods on 11 of 17 datasets. Key results on the test set:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>Best Conventional</th>
          <th>Best Graph-Based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM7</td>
          <td>MAE</td>
          <td>KRR (CM): 10.22</td>
          <td>DTNN: 8.75</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>MAE</td>
          <td>Multitask (CM): 4.35</td>
          <td>DTNN: 2.35</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>XGBoost: 0.99</td>
          <td>MPNN: 0.58</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>XGBoost: 1.74</td>
          <td>MPNN: 1.15</td>
      </tr>
      <tr>
          <td>PCBA</td>
          <td>PRC-AUC</td>
          <td>Logreg: 0.129</td>
          <td>GC: 0.136</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.822</td>
          <td>GC: 0.829</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>ROC-AUC</td>
          <td>KernelSVM: 0.792</td>
          <td>GC: 0.763</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>ROC-AUC</td>
          <td>RF: 0.867</td>
          <td>Weave: 0.806</td>
      </tr>
  </tbody>
</table>
<p>Conventional methods (KernelSVM, RF) still won on several smaller or scaffold-split datasets (HIV, BACE, MUV, PDBbind, BBBP, SIDER), highlighting that graph-based models are not universally superior, particularly under data scarcity or challenging splits.</p>
<h2 id="conclusions-and-limitations">Conclusions and Limitations</h2>
<p>MoleculeNet demonstrated that learnable representations broadly offer the best performance for molecular machine learning. However, the authors identify several important caveats:</p>
<ol>
<li><strong>Data scarcity</strong>: Graph-based methods are not robust enough on complex tasks with limited training data.</li>
<li><strong>Class imbalance</strong>: On heavily imbalanced classification datasets, conventional methods such as kernel SVM outperform learnable featurizations with respect to recall of positives.</li>
<li><strong>Task-specific featurizations</strong>: For quantum mechanical and biophysical datasets, incorporating physics-aware features (<a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrix</a>, 3D coordinates) is more important than the choice of learning algorithm.</li>
<li><strong>Data-driven physical chemistry</strong>: On FreeSolv, data-driven methods outperformed ab initio calculations with moderate data, suggesting data-driven approaches will become increasingly important as methods and datasets mature.</li>
</ol>
<p>The authors express hope that MoleculeNet will stimulate algorithmic development similar to how ImageNet catalyzed breakthroughs in computer vision. Future directions include extending coverage to 3D protein structure prediction, DNA topological modeling, and other areas of molecular science.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All 17 datasets are publicly available and integrated into the DeepChem Python package. Users can load any dataset with a single library call.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QM benchmark</td>
          <td>QM7/QM7b/QM8/QM9</td>
          <td>7K-134K compounds</td>
          <td>DFT-computed properties from GDB subsets</td>
      </tr>
      <tr>
          <td>Physical chemistry</td>
          <td>ESOL/FreeSolv/Lipophilicity</td>
          <td>643-4,200 compounds</td>
          <td>Experimental measurements</td>
      </tr>
      <tr>
          <td>Biophysics</td>
          <td>PCBA/MUV/HIV/PDBbind/BACE</td>
          <td>1.5K-440K compounds</td>
          <td>Bioassay and binding data</td>
      </tr>
      <tr>
          <td>Physiology</td>
          <td>BBBP/Tox21/ToxCast/SIDER/ClinTox</td>
          <td>1.4K-8.6K compounds</td>
          <td>Toxicity and drug safety data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All splitting methods (random, scaffold, stratified, time) and featurizations (ECFP, Coulomb matrix, grid, symmetry functions, graph convolutions, weave) are implemented in DeepChem. Hyperparameters were tuned via Gaussian process optimization. Three random seeds were used per experiment.</p>
<h3 id="models">Models</h3>
<p>All 12 models are implemented in DeepChem, built on Scikit-Learn and TensorFlow. No pretrained weights are provided; models are trained from scratch on each dataset.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics include MAE, RMSE, ROC-AUC, and PRC-AUC as specified per dataset. Multi-task datasets report mean metric values across all tasks.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors used Stanford&rsquo;s Sherlock and Xstream GPU nodes. Specific GPU types and training times per model are provided in Table S1 of the supplementary material.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/deepchem/deepchem">DeepChem</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Open-source library with all datasets, featurizations, and models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., &amp; Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. <em>Chemical Science</em>, 9(2), 513-530. <a href="https://doi.org/10.1039/c7sc02664a">https://doi.org/10.1039/c7sc02664a</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2018moleculenet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MoleculeNet: a benchmark for molecular machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N. and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S. and Leswing, Karl and Pande, Vijay}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{513--530}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/c7sc02664a}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GuacaMol: Benchmarking Models for De Novo Molecular Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/</guid><description>GuacaMol introduces a standardized benchmark suite for evaluating de novo molecular design models across distribution learning and goal-directed optimization.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-design">A Standardized Benchmark for Molecular Design</h2>
<p>GuacaMol is a <strong>Resource</strong> paper. Its primary contribution is a standardized, open-source benchmarking framework for evaluating models for de novo molecular design. The framework defines 5 distribution-learning benchmarks and 20 goal-directed optimization benchmarks, implemented as a Python package. The authors also provide baseline results for several classical and neural generative models, establishing reference performance levels for future comparisons.</p>
<h2 id="the-need-for-consistent-evaluation-in-generative-chemistry">The Need for Consistent Evaluation in Generative Chemistry</h2>
<p>By 2018, deep generative models for molecular design (<a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a>, RNNs, <a href="/posts/what-is-a-gan/">GANs</a>) had shown promising results, but the field lacked consistent evaluation standards. Different papers used different tasks, different datasets, and different metrics, making it difficult to compare models or assess real progress. Comparative studies between neural approaches and well-established techniques such as genetic algorithms were rare.</p>
<p>In other areas of machine learning, standardized benchmarks (ImageNet for vision, GLUE for NLP) had driven rapid progress by enabling fair comparisons. The de novo design community lacked an equivalent. Additionally, many existing evaluations focused on easily optimizable properties (logP, QED) that could not differentiate between models, since even simple baselines achieved near-perfect scores on those tasks.</p>
<h2 id="benchmark-design-distribution-learning-and-goal-directed-optimization">Benchmark Design: Distribution Learning and Goal-Directed Optimization</h2>
<p>GuacaMol separates evaluation into two independent dimensions, reflecting the two main use cases of generative models.</p>
<h3 id="distribution-learning-benchmarks">Distribution-Learning Benchmarks</h3>
<p>These five benchmarks assess how well a model learns to generate molecules similar to a training set (a standardized subset of ChEMBL 24):</p>
<ol>
<li><strong>Validity</strong>: Fraction of generated molecules that are chemically valid (parseable by RDKit), measured over 10,000 generated samples.</li>
<li><strong>Uniqueness</strong>: Fraction of unique canonical <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> among 10,000 valid generated molecules.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set, measured over 10,000 unique samples.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures distributional similarity between generated and reference molecules using hidden representations from ChemNet (trained on biological activity prediction). The FCD score is transformed as:</li>
</ol>
<p>$$S = \exp(-0.2 \cdot \text{FCD})$$</p>
<ol start="5">
<li><strong>KL Divergence</strong>: Compares distributions of nine physicochemical descriptors (BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, NumAromaticRings) plus maximum nearest-neighbor ECFP4 similarity. The final score aggregates per-descriptor KL divergences:</li>
</ol>
<p>$$S = \frac{1}{k} \sum_{i}^{k} \exp(-D_{\text{KL}, i})$$</p>
<p>where $k$ is the number of compared distributions (the nine physicochemical descriptors plus the nearest-neighbor similarity distribution).</p>
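<p>Both distributional scores map a nonnegative distance onto [0, 1], with a perfect match scoring 1. A minimal sketch of the two transforms as stated above:</p>

```python
import math

def fcd_score(fcd):
    """Transform a raw Fréchet ChemNet Distance into a [0, 1] benchmark score."""
    return math.exp(-0.2 * fcd)

def kl_score(kl_divergences):
    """Average of exp(-D_KL) over the compared descriptor distributions."""
    return sum(math.exp(-d) for d in kl_divergences) / len(kl_divergences)

# A perfect match (FCD = 0, all KL divergences = 0) scores 1.0 on both:
assert fcd_score(0.0) == 1.0
assert kl_score([0.0] * 10) == 1.0
# Larger distances decay smoothly toward 0:
print(round(fcd_score(5.0), 3))  # 0.368
```

<p>The exponential decay means small distributional mismatches are penalized gently while large ones drive the score toward zero, which is visible in the baseline results below (e.g., ORGAN's FCD score of 0.000).</p>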
<h3 id="goal-directed-benchmarks">Goal-Directed Benchmarks</h3>
<p>The 20 goal-directed benchmarks evaluate a model&rsquo;s ability to generate molecules that maximize a given scoring function. These span several categories:</p>
<ul>
<li><strong>Rediscovery</strong> (3 tasks): Regenerate a specific target molecule (Celecoxib, Troglitazone, Thiothixene) using Tanimoto similarity on ECFP4 fingerprints.</li>
<li><strong>Similarity</strong> (3 tasks): Generate many molecules similar to a target (Aripiprazole, Albuterol, Mestranol) above a threshold of 0.75.</li>
<li><strong>Isomers</strong> (2 tasks): Generate molecules matching a target molecular formula ($\text{C}_{11}\text{H}_{24}$ and $\text{C}_9\text{H}_{10}\text{N}_2\text{O}_2\text{PF}_2\text{Cl}$).</li>
<li><strong>Median molecules</strong> (2 tasks): Maximize similarity to two reference molecules simultaneously (camphor/menthol and tadalafil/sildenafil).</li>
<li><strong>Multi-property optimization</strong> (7 tasks): Optimize combinations of similarity, physicochemical properties, and structural features for drug-relevant molecules (Osimertinib, Fexofenadine, Ranolazine, Perindopril, Amlodipine, Sitagliptin, Zaleplon).</li>
<li><strong>SMARTS-based</strong> (1 task): Target molecules containing specific substructure patterns with constrained physicochemical properties (Valsartan SMARTS).</li>
<li><strong>Scaffold/decorator hop</strong> (2 tasks): Modify molecular scaffolds while preserving substituent patterns, or vice versa.</li>
</ul>
<p>The benchmark score for most goal-directed tasks combines top-1, top-10, and top-100 molecule scores:</p>
<p>$$S = \frac{1}{3}\left(s_1 + \frac{1}{10}\sum_{i=1}^{10} s_i + \frac{1}{100}\sum_{i=1}^{100} s_i\right)$$</p>
<p>where $s_i$ are molecule scores sorted in decreasing order.</p>
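<p>This aggregation rewards both a single excellent molecule and breadth across the top 100. A direct implementation of the formula above:</p>

```python
def benchmark_score(scores):
    """Average of the top-1 score, mean of top-10, and mean of top-100."""
    s = sorted(scores, reverse=True)
    top1 = s[0]
    top10 = sum(s[:10]) / 10
    top100 = sum(s[:100]) / 100
    return (top1 + top10 + top100) / 3

# If all 100 returned molecules score 1.0, the benchmark score is 1.0:
assert benchmark_score([1.0] * 100) == 1.0
# A single perfect molecule among 99 zeros earns only partial credit:
print(round(benchmark_score([1.0] + [0.0] * 99), 4))  # 0.37
```

<p>The second case shows why mode-collapsed generators score poorly: one good molecule contributes fully to the top-1 term but only 1/10 and 1/100 of the other two terms.</p>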
<h3 id="score-modifiers">Score Modifiers</h3>
<p>Raw molecular properties are transformed via modifier functions to restrict scores to [0, 1]:</p>
<ul>
<li><strong>Gaussian($\mu$, $\sigma$)</strong>: Targets a specific property value</li>
<li><strong>MinGaussian($\mu$, $\sigma$)</strong>: Full score below $\mu$, decreasing above</li>
<li><strong>MaxGaussian($\mu$, $\sigma$)</strong>: Full score above $\mu$, decreasing below</li>
<li><strong>Thresholded($t$)</strong>: Full score above threshold $t$, linear decrease below</li>
</ul>
<p>Multi-property objectives use either arithmetic or geometric means to combine individual scores.</p>
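<p>A sketch of these modifiers is below. The exact functional forms (e.g., the normalization of the Gaussian and the linear ramp of the thresholded modifier) are assumptions consistent with the descriptions above, not a copy of GuacaMol's implementation, and the TPSA/logP targets in the usage example are invented for illustration:</p>

```python
import math

def gaussian(x, mu, sigma):
    """Full score at x == mu, Gaussian decay on both sides."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)

def min_gaussian(x, mu, sigma):
    """Full score below mu, Gaussian decay above."""
    return 1.0 if x <= mu else gaussian(x, mu, sigma)

def max_gaussian(x, mu, sigma):
    """Full score above mu, Gaussian decay below."""
    return 1.0 if x >= mu else gaussian(x, mu, sigma)

def thresholded(x, t):
    """Full score above threshold t, linear decrease toward 0 below it."""
    return 1.0 if x >= t else max(0.0, x / t)

# Hypothetical multi-property objective: TPSA near 100, logP capped at 4.
tpsa_score = gaussian(95.0, mu=100.0, sigma=20.0)
logp_score = min_gaussian(3.5, mu=4.0, sigma=1.0)
combined = (tpsa_score * logp_score) ** 0.5  # geometric mean of two scores
```

<p>Using a geometric rather than arithmetic mean makes the combined objective strict: a molecule failing any one property pulls the whole score toward zero.</p>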
<h2 id="baseline-models-and-experimental-setup">Baseline Models and Experimental Setup</h2>
<p>The authors evaluate six baseline models spanning different paradigms:</p>
<p><strong>Distribution-learning baselines:</strong></p>
<ul>
<li><strong>Random sampler</strong>: Samples molecules directly from the dataset (provides upper/lower bounds).</li>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM (hidden size 1024) trained to predict next SMILES characters.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search building molecules atom-by-atom.</li>
<li><strong>VAE</strong>: Variational autoencoder on SMILES representations.</li>
<li><strong>AAE</strong>: Adversarial autoencoder.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></strong>: Objective-reinforced generative adversarial network.</li>
</ul>
<p><strong>Goal-directed baselines:</strong></p>
<ul>
<li><strong>Best of dataset</strong>: Scores all training molecules and returns the best (virtual screening baseline).</li>
<li><strong>SMILES LSTM</strong>: Same model with 20 iterations of hill-climbing (8192 samples per iteration, top 1024 for fine-tuning).</li>
<li><strong>SMILES GA</strong>: Genetic algorithm operating on SMILES strings with grammar-based mutations.</li>
<li><strong>Graph GA</strong>: Genetic algorithm operating on molecular graphs with crossover and mutation.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search with 40 simulations per molecule.</li>
</ul>
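<p>The SMILES LSTM baseline's hill-climbing is a sample-score-select-refit loop. The sketch below mirrors that loop in miniature: a 1-D Gaussian "generator" stands in for the LSTM, "fine-tuning" is reduced to shifting the generator toward the elite samples, and the population sizes are scaled down from the paper's 8192/1024; none of this is the authors' actual implementation.</p>

```python
import random

random.seed(0)

def hill_climb(score, n_iters=20, n_samples=512, top_k=64):
    """Sample-score-select loop in the spirit of the SMILES LSTM baseline.

    A real run samples SMILES from the LSTM and fine-tunes on the top-scoring
    molecules; here a 1-D Gaussian 'generator' stands in for the model.
    """
    mu, sigma = 0.0, 1.0
    for _ in range(n_iters):
        samples = [random.gauss(mu, sigma) for _ in range(n_samples)]
        elite = sorted(samples, key=score, reverse=True)[:top_k]
        # "Fine-tuning": move the generator toward the elite samples.
        mu = sum(elite) / len(elite)
    return mu

# Maximize closeness to a target value of 5.0:
best = hill_climb(lambda x: -abs(x - 5.0))
```

<p>Even this toy version shows the mechanism that makes hill-climbing effective: each round of selection biases the generator's distribution toward high-scoring regions, which is exactly what iterative fine-tuning does for the LSTM over SMILES.</p>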
<p>The training dataset is ChEMBL 24 after filtering: salt removal, charge neutralization, SMILES length cap of 100, element restrictions, and removal of molecules similar (ECFP4 &gt; 0.323) to 10 held-out drug molecules used in benchmarks.</p>
<h3 id="distribution-learning-results">Distribution-Learning Results</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Random</th>
          <th>SMILES LSTM</th>
          <th>Graph MCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
      </tr>
  </tbody>
</table>
<h3 id="goal-directed-results-selected">Goal-Directed Results (Selected)</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th>SMILES LSTM</th>
          <th>SMILES GA</th>
          <th>Graph GA</th>
          <th>Graph MCTS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.732</td>
          <td>1.000</td>
          <td>0.355</td>
      </tr>
      <tr>
          <td>Osimertinib MPO</td>
          <td>0.839</td>
          <td>0.907</td>
          <td>0.886</td>
          <td>0.953</td>
          <td>0.784</td>
      </tr>
      <tr>
          <td>Sitagliptin MPO</td>
          <td>0.509</td>
          <td>0.545</td>
          <td>0.689</td>
          <td>0.891</td>
          <td>0.458</td>
      </tr>
      <tr>
          <td>Scaffold Hop</td>
          <td>0.738</td>
          <td>0.998</td>
          <td>0.885</td>
          <td>1.000</td>
          <td>0.478</td>
      </tr>
      <tr>
          <td><strong>Total (20 tasks)</strong></td>
          <td><strong>12.144</strong></td>
          <td><strong>17.340</strong></td>
          <td><strong>14.396</strong></td>
          <td><strong>17.983</strong></td>
          <td><strong>9.009</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-findings">Main Findings</h3>
<p>The Graph GA achieves the highest total score across goal-directed benchmarks (17.983), followed closely by the SMILES LSTM (17.340). This result is notable because genetic algorithms are well-established methods, and the LSTM-based neural approach nearly matches their optimization performance.</p>
<p>However, compound quality tells a different story. When examining the top 100 molecules per task through chemical quality filters (SureChEMBL, Glaxo, PAINS rules), 77% of LSTM-generated molecules pass, matching the Best of ChEMBL baseline. In contrast, Graph GA produces only 40% passing molecules, and Graph MCTS only 22%. This suggests that neural models benefit from pre-training on real molecular distributions, which encodes implicit knowledge about what constitutes a &ldquo;reasonable&rdquo; molecule.</p>
<p><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> performs poorly across all distribution-learning tasks, with more than half its generated molecules being invalid. This is consistent with mode collapse, a known problem in GAN training.</p>
<p>Simpler generative models (LSTM, VAE) outperform more complex architectures (ORGAN, AAE) on distribution learning. Graph MCTS struggles with both distribution learning and goal-directed optimization, suggesting that single-molecule search trees are less effective than population-based approaches.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors explicitly identify several issues:</p>
<ul>
<li><strong>Compound quality is hard to quantify</strong>: The rule-based filters used are acknowledged as &ldquo;high precision, low recall&rdquo; surrogates. They catch some problematic molecules but cannot encode the full breadth of medicinal chemistry expertise.</li>
<li><strong>Some benchmarks are too easy</strong>: The trivially optimizable tasks (logP, QED, CNS MPO) cannot differentiate between models. All baselines achieve near-perfect scores on these.</li>
<li><strong>Sample efficiency and runtime are not benchmarked</strong>: The framework does not penalize models for requiring excessive scoring function calls.</li>
<li><strong>Synthesis accessibility is not addressed</strong>: Generated molecules may be valid but impractical to synthesize.</li>
</ul>
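<p>The sample-efficiency limitation can be made concrete with a call-budget wrapper around the scoring function. The sketch below is illustrative only (the class name and behavior are hypothetical, not part of GuacaMol): a benchmark that imposed such a budget would penalize methods that rely on excessive oracle calls.</p>

```python
class BudgetedOracle:
    """Wrap a scoring function with a hard evaluation budget.

    Hypothetical sketch of how a benchmark could penalize excessive
    scoring-function calls; not part of the GuacaMol framework itself.
    """

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0

    def __call__(self, smiles):
        if self.calls >= self.budget:
            raise RuntimeError("evaluation budget exhausted")
        self.calls += 1
        return self.score_fn(smiles)


# Toy scoring function (SMILES length) just to exercise the wrapper.
oracle = BudgetedOracle(lambda s: len(s), budget=3)
scores = [oracle("CCO"), oracle("c1ccccc1"), oracle("CC(=O)O")]
# A fourth call would raise RuntimeError.
```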
<h3 id="future-directions">Future Directions</h3>
<p>The authors call for harder benchmark tasks, better compound quality metrics, attention to sample efficiency and runtime constraints, and further development of graph-based neural generative models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL 24 (post-processed)</td>
          <td>~1.6M molecules</td>
          <td>Salt removal, neutralization, SMILES length cap, element restrictions</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>10 held-out drug molecules</td>
          <td>10</td>
          <td>Removed from training set via ECFP4 similarity threshold</td>
      </tr>
      <tr>
          <td>Quality filters</td>
          <td>SureChEMBL, Glaxo, PAINS, in-house rules</td>
          <td>N/A</td>
          <td>Applied via rd_filters</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM, hidden size 1024; hill-climbing with 20 iterations, 8192 samples per iteration, top 1024 for fine-tuning</li>
<li><strong>Graph GA</strong>: Population of 100, mating pool of 200, crossover + mutation (probability 0.5), 1000 epochs max</li>
<li><strong>SMILES GA</strong>: Population of 300, offspring of 600, SMILES grammar-based mutations, 1000 epochs max</li>
<li><strong>Graph MCTS</strong>: 40 simulations per molecule, 25 children per step, rollout to 60 atoms, starting from CC</li>
</ul>
<h3 id="models">Models</h3>
<p>All baseline implementations are released as open-source code. VAE, AAE, and ORGAN implementations are from the <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a> repository.</p>
<h3 id="evaluation">Evaluation</h3>
<p>All distribution-learning benchmarks sample 10,000 molecules. Goal-directed benchmarks use combinations of top-1, top-10, and top-100 scores. Compound quality is assessed via the percentage of top-100 molecules passing chemical filters.</p>
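<p>The top-k aggregation can be sketched in a few lines of plain Python. This is a simplified reading of the scheme described above (averaging the top-1, top-10, and top-100 means); the framework's own implementation should be treated as authoritative.</p>

```python
def top_k_mean(scores, k):
    """Mean of the k best scores (higher is better)."""
    return sum(sorted(scores, reverse=True)[:k]) / k

def combined_goal_directed_score(scores):
    """Average of the top-1, top-10, and top-100 means -- a sketch of
    the GuacaMol-style combined goal-directed score."""
    return sum(top_k_mean(scores, k) for k in (1, 10, 100)) / 3

# 1,000 generated molecules with scores evenly spread over [0, 1].
scores = [i / 999 for i in range(1000)]
combined = combined_goal_directed_score(scores)  # close to, but below, 1.0
```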
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol">GuacaMol</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmarking framework and scoring functions</td>
      </tr>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol Baselines</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline model implementations</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/projects/GuacaMol/56639">ChEMBL dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Post-processed ChEMBL 24 for benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD package</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Fréchet ChemNet Distance implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Brown, N., Fiscato, M., Segler, M. H. S., &amp; Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1096-1108. <a href="https://doi.org/10.1021/acs.jcim.8b00839">https://doi.org/10.1021/acs.jcim.8b00839</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BenevolentAI/guacamol">GuacaMol Python package</a></li>
<li><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol baselines</a></li>
<li><a href="https://figshare.com/projects/GuacaMol/56639">Post-processed ChEMBL datasets</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{brown2019guacamol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GuacaMol: Benchmarking Models for de Novo Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Brown, Nathan and Fiscato, Marco and Segler, Marwin H. S. and Vaucher, Alain C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1096--1108}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00839}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DOCKSTRING: Docking-Based Benchmarks for Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/dockstring-docking-benchmarks-ligand-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/dockstring-docking-benchmarks-ligand-design/</guid><description>DOCKSTRING provides an open-source Python docking package, 15M+ score dataset across 58 targets, and benchmark tasks for ML-driven drug design.</description><content:encoded><![CDATA[<h2 id="a-three-part-resource-for-docking-based-ml-benchmarks">A Three-Part Resource for Docking-Based ML Benchmarks</h2>
<p>DOCKSTRING is a <strong>Resource</strong> paper that delivers three integrated components for benchmarking machine learning models in drug discovery using molecular docking. The primary contributions are: (1) an open-source Python package wrapping <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a> for deterministic docking from <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, (2) a dataset of over 15 million docking scores and poses covering 260,000+ molecules docked against 58 medically relevant protein targets, and (3) a suite of benchmark tasks spanning regression, <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, and de novo molecular design. The paper additionally provides baseline results across classical and deep learning methods.</p>
<h2 id="why-existing-molecular-benchmarks-fall-short">Why Existing Molecular Benchmarks Fall Short</h2>
<p>ML methods for drug discovery are frequently evaluated using simple physicochemical properties such as penalized logP or QED (quantitative estimate of druglikeness). These properties are computationally cheap and easy to optimize, but they do not depend on the interaction between a candidate compound and a protein target. As a result, strong performance on logP or QED benchmarks does not necessarily translate to strong performance on real drug design tasks.</p>
<p><a href="https://en.wikipedia.org/wiki/Docking_(molecular)">Molecular docking</a> offers a more realistic evaluation objective because docking scores depend on the 3D structure of the ligand-target complex. Docking is routinely used by medicinal chemists to estimate binding affinities during hit discovery and lead optimization. Several prior efforts attempted to bring docking into ML benchmarking, but each had limitations:</p>
<ul>
<li><strong>VirtualFlow and DockStream</strong> require manually prepared target files and domain expertise.</li>
<li><strong>TDC and Cieplinski et al.</strong> provide SMILES-to-score wrappers but lack proper ligand protonation and randomness control, and cover very few targets (one and four, respectively).</li>
<li><strong>DUD-E</strong> is easily overfit by ML models that memorize actives vs. decoys.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a></strong> rely on physicochemical properties or similarity functions that miss 3D structural subtleties.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> compiles experimental datasets but does not support on-the-fly label computation needed for transfer learning or de novo design.</li>
</ul>
<p>DOCKSTRING addresses all of these gaps: it standardizes the docking procedure, automates ligand and target preparation, controls randomness for reproducibility, and provides a large, diverse target set.</p>
<h2 id="core-innovation-standardized-end-to-end-docking-pipeline">Core Innovation: Standardized End-to-End Docking Pipeline</h2>
<p>The key innovation is a fully automated, deterministic docking pipeline that produces reproducible scores from a SMILES string in four lines of Python code. The pipeline consists of three stages:</p>
<p><strong>Target Preparation.</strong> 57 of the 58 protein targets originate from the Directory of Useful Decoys Enhanced (DUD-E). PDB files are standardized with <a href="https://en.wikipedia.org/wiki/Open_Babel">Open Babel</a>, polar hydrogens are added, and conversion to PDBQT format is performed with AutoDock Tools. Search boxes are derived from crystallographic ligands with 12.5 Å padding and a minimum side length of 30 Å. The 58th target (DRD2, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a>) was prepared separately following the same protocol.</p>
<p><strong>Ligand Preparation.</strong> Ligands are protonated at pH 7.4 with Open Babel, embedded into 3D conformations using the ETKDG algorithm in RDKit, refined with the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94 force field</a>, and assigned Gasteiger partial charges. Stereochemistry of determined stereocenters is maintained, while undetermined stereocenters are assigned randomly but consistently across runs.</p>
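<p>Most of the ligand-preparation stage can be sketched with RDKit (assuming RDKit is installed). This mirrors the steps described above but is not the DOCKSTRING implementation itself; in particular, the Open Babel protonation step at pH 7.4 is omitted here.</p>

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def prepare_ligand(smiles, seed=42):
    """Embed a 3D conformer (ETKDG), refine with MMFF94, and assign
    Gasteiger charges -- a sketch of the steps described above."""
    mol = Chem.MolFromSmiles(smiles)
    mol = Chem.AddHs(mol)
    params = AllChem.ETKDGv3()
    params.randomSeed = seed  # fixed seed keeps the embedding reproducible
    AllChem.EmbedMolecule(mol, params)
    AllChem.MMFFOptimizeMolecule(mol)
    AllChem.ComputeGasteigerCharges(mol)
    return mol

mol = prepare_ligand("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example input
```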
<p><strong>Docking.</strong> AutoDock Vina runs with default exhaustiveness (8), up to 9 binding modes, and an energy range of 3 kcal/mol. The authors verified that fixing the random seed yields docking score variance of less than 0.1 kcal/mol across runs, making the pipeline fully deterministic.</p>
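<p>The Vina settings above correspond to a configuration file along these lines. The search-box values are placeholders (DOCKSTRING derives the real boxes per target from crystallographic ligands, as described in the target-preparation step), and the seed value is illustrative:</p>

```text
# AutoDock Vina configuration matching the defaults described above
exhaustiveness = 8     # default search effort
num_modes = 9          # up to 9 binding modes
energy_range = 3       # kcal/mol window around the best mode
seed = 42              # fixed seed for reproducibility (placeholder value)

# Search box: placeholder values; real boxes are derived per target
center_x = 0.0
center_y = 0.0
center_z = 0.0
size_x = 30.0
size_y = 30.0
size_z = 30.0
```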
<p>The three de novo design objective functions incorporate a QED penalty to enforce druglikeness:</p>
<p>$$
f_{\text{F2}}(l) = s(l, \text{F2}) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{PPAR}}(l) = \max_{t \in \text{PPAR}} s(l, t) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{JAK2}}(l) = s(l, \text{JAK2}) - \min(s(l, \text{LCK}), -8.1) + 10(1 - \text{QED}(l))
$$</p>
<p>The F2 task optimizes binding to a single protease. The Promiscuous <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> task requires strong binding to three nuclear receptors simultaneously. The Selective <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> task is adversarial, requiring strong JAK2 binding while avoiding <a href="https://en.wikipedia.org/wiki/Tyrosin-protein_kinase_Lck">LCK</a> binding (two kinases with a score correlation of 0.80).</p>
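<p>Given precomputed docking scores and QED, the three objectives are straightforward to express in code. A sketch in plain Python, treating the score lookup as an input (the PPAR subtype names are illustrative placeholders; lower objective values are better, since docking scores are negative for strong binders):</p>

```python
def f_f2(s, qed):
    """Single-target F2 objective: docking score plus QED penalty."""
    return s["F2"] + 10 * (1 - qed)

def f_ppar(s, qed):
    """Promiscuous PPAR objective: worst (highest) score across the
    three PPAR subtypes plus QED penalty."""
    return max(s[t] for t in ("PPARA", "PPARD", "PPARG")) + 10 * (1 - qed)

def f_jak2(s, qed):
    """Selective JAK2 objective: reward JAK2 binding while penalizing
    LCK binding, with the LCK score capped at -8.1 kcal/mol."""
    return s["JAK2"] - min(s["LCK"], -8.1) + 10 * (1 - qed)

# Illustrative scores (kcal/mol) and QED for a hypothetical ligand.
scores = {"F2": -9.0, "PPARA": -8.5, "PPARD": -7.9, "PPARG": -8.2,
          "JAK2": -9.5, "LCK": -7.0}
qed = 0.7
objectives = (f_f2(scores, qed), f_ppar(scores, qed), f_jak2(scores, qed))
```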
<h2 id="experimental-setup-regression-virtual-screening-and-de-novo-design">Experimental Setup: Regression, Virtual Screening, and De Novo Design</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The dataset draws its molecules from ExCAPE-DB (which curates PubChem and ChEMBL bioactivity assays). The authors selected all molecules with active labels against targets having at least 1,000 experimental actives, plus 150,000 inactive-only molecules. After discarding 1.8% of molecules that failed ligand preparation, the final dataset contains 260,155 compounds docked against 58 targets, producing over 15 million docking scores and poses. The dataset required over 500,000 CPU hours to generate.</p>
<p>Cluster analysis using <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> threshold of 0.25 on RDKit fingerprints) found 52,000 clusters, and Bemis-Murcko scaffold decomposition identified 102,000 scaffolds, confirming high molecular diversity. Train/test splitting follows cluster labels to prevent data leakage.</p>
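<p>The cluster-based splitting strategy can be sketched in plain Python: whole clusters are assigned to either train or test, so near-duplicate molecules never straddle the split. The function below is an illustrative sketch, not the authors' code.</p>

```python
import random

def cluster_split(cluster_labels, test_fraction=0.2, seed=0):
    """Assign entire clusters to train or test so similar molecules
    never appear on both sides of the split (sketch)."""
    clusters = sorted(set(cluster_labels))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, int(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train_idx = [i for i, c in enumerate(cluster_labels) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_labels) if c in test_clusters]
    return train_idx, test_idx

# Ten molecules in five clusters of two near-duplicates each.
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = cluster_split(labels)
```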
<h3 id="regression-baselines">Regression Baselines</h3>
<p>Five targets of varying difficulty were selected: <a href="https://en.wikipedia.org/wiki/Poly_(ADP-ribose)_polymerase">PARP1</a> (easy), F2 (easy-medium), KIT (medium), ESR2 (hard), and PGR (hard). Baselines include Ridge, Lasso, XGBoost, exact GP, sparse GP, MPNN, and Attentive FP.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Ridge</th>
          <th>Lasso</th>
          <th>XGBoost</th>
          <th>GP (exact)</th>
          <th>GP (sparse)</th>
          <th>MPNN</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>0.640</td>
          <td>0.640</td>
          <td>0.734</td>
          <td>0.707</td>
          <td>0.716</td>
          <td>0.953</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.519</td>
          <td>0.483</td>
          <td>0.660</td>
          <td>0.640</td>
          <td>0.598</td>
          <td>0.901</td>
          <td>0.981</td>
      </tr>
      <tr>
          <td>ESR2</td>
          <td>0.421</td>
          <td>0.416</td>
          <td>0.497</td>
          <td>0.441</td>
          <td>0.508</td>
          <td>0.506</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>F2</td>
          <td>0.672</td>
          <td>0.663</td>
          <td>0.688</td>
          <td>0.705</td>
          <td>0.744</td>
          <td>0.798</td>
          <td>0.880</td>
      </tr>
      <tr>
          <td>KIT</td>
          <td>0.604</td>
          <td>0.594</td>
          <td>0.674</td>
          <td>0.637</td>
          <td>0.684</td>
          <td>0.755</td>
          <td>0.806</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>0.706</td>
          <td>0.700</td>
          <td>0.723</td>
          <td>0.743</td>
          <td>0.772</td>
          <td>0.815</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>0.242</td>
          <td>0.245</td>
          <td>0.345</td>
          <td>0.291</td>
          <td>0.387</td>
          <td>0.324</td>
          <td>0.678</td>
      </tr>
  </tbody>
</table>
<p>Values are mean $R^2$ over three runs. Attentive FP achieves the best performance on every target but remains well below perfect prediction on the harder targets, confirming that docking score regression is a meaningful benchmark.</p>
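<p>For reference, the reported metric is the standard coefficient of determination, which can be computed directly:</p>

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Perfect prediction of (hypothetical) docking scores gives R^2 = 1.
y_true = [-9.1, -8.4, -10.2, -7.8]
perfect = r_squared(y_true, y_true)
```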
<h3 id="virtual-screening-baselines">Virtual Screening Baselines</h3>
<p>Models trained on PARP1, KIT, and PGR docking scores rank all molecules in <a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC20</a> (~1 billion compounds). The top 5,000 predictions are docked, and the enrichment factor (EF) is computed relative to a 0.1 percentile activity threshold.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Threshold</th>
          <th>FSS</th>
          <th>Ridge</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>KIT</td>
          <td>-10.7</td>
          <td>239.2</td>
          <td>451.6</td>
          <td>766.5</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>-12.1</td>
          <td>313.1</td>
          <td>325.9</td>
          <td>472.2</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>-10.1</td>
          <td>161.4</td>
          <td>120.5</td>
          <td>461.3</td>
      </tr>
  </tbody>
</table>
<p>The maximum possible EF is 1,000. Attentive FP substantially outperforms fingerprint similarity search (FSS) and Ridge regression across all targets.</p>
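<p>The enrichment factor is the hit rate among selected molecules divided by the base rate in the full library; with a 0.1th-percentile activity threshold the base rate is 0.001, which is why the maximum EF is 1,000. A minimal sketch (the hit count below is hypothetical):</p>

```python
def enrichment_factor(selected_hits, n_selected, active_fraction):
    """EF = hit rate in the selected set / base rate in the library."""
    return (selected_hits / n_selected) / active_fraction

# If 2,361 of the top 5,000 predictions beat the threshold (hypothetical
# count), the enrichment factor over a 0.001 base rate is 472.2.
ef = enrichment_factor(2361, 5000, 0.001)
```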
<h3 id="de-novo-design-baselines">De Novo Design Baselines</h3>
<p>Four optimization methods were tested: <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> GA, <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, GP-BO with UCB acquisition ($\beta = 10$), and GP-BO with expected improvement (EI), each with a budget of 5,000 objective function evaluations. Without QED penalties, all methods easily surpass the best training set molecules but produce large, lipophilic, non-druglike compounds. With QED penalties, the tasks become substantially harder: GP-BO with EI is the only method that finds 25 molecules better than the training set across all three tasks.</p>
<p>The Selective JAK2 task proved hardest due to the high correlation between JAK2 and LCK scores. Pose analysis of the top de novo molecule revealed a dual binding mode: type V inhibitor behavior in JAK2 (binding distant N- and C-terminal lobe regions) and type I behavior in LCK (hinge-binding), suggesting a plausible selectivity mechanism.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ol>
<li>Docking scores are substantially harder to predict than logP or QED, making them more suitable for benchmarking high-performing ML models. Graph neural networks (Attentive FP) achieve near-perfect $R^2$ on logP but only 0.63-0.91 on docking targets.</li>
<li>In-distribution regression difficulty does not necessarily predict out-of-distribution virtual screening difficulty. PARP1 is easiest for regression, but KIT is easiest for virtual screening.</li>
<li>Adding a QED penalty to de novo design objectives transforms trivially solvable tasks into meaningful benchmarks. The adversarial Selective JAK2 objective, which exploits correlated docking scores, may be an effective way to avoid docking score biases toward large and lipophilic molecules.</li>
<li>Docking scores from related protein targets are highly correlated, supporting the biological meaningfulness of the dataset and enabling multiobjective and transfer learning tasks.</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>Docking scores are approximate heuristics. They use static binding sites and force fields with limited calibration for certain metal ions. DOCKSTRING benchmarks should not substitute for rational drug design and experimental validation.</li>
<li>The pipeline relies on AutoDock Vina specifically; other docking programs may produce different rankings.</li>
<li>Top de novo molecules for F2 and Promiscuous PPAR contain conjugated ring structures uncommon in successful drugs.</li>
<li>Platform support is primarily Linux, with noted scoring inconsistencies on macOS.</li>
</ul>
<p><strong>Future directions</strong> mentioned include multiobjective tasks (transfer learning, few-shot learning), improved objective functions for better pharmacokinetic properties and synthetic feasibility, and multifidelity optimization tasks combining docking with more expensive computational methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ligand source</td>
          <td>ExCAPE-DB (PubChem + ChEMBL)</td>
          <td>260,155 molecules</td>
          <td>Actives against 58 targets + 150K inactive-only</td>
      </tr>
      <tr>
          <td>Docking scores</td>
          <td>DOCKSTRING dataset</td>
          <td>15M+ scores and poses</td>
          <td>Full matrix across all molecule-target pairs</td>
      </tr>
      <tr>
          <td>Virtual screening library</td>
          <td>ZINC20</td>
          <td>~1 billion molecules</td>
          <td>Used for out-of-distribution evaluation</td>
      </tr>
      <tr>
          <td>Target structures</td>
          <td>DUD-E + PDB 6CM4 (DRD2)</td>
          <td>58 targets</td>
          <td>Kinases (22), enzymes (12), nuclear receptors (9), proteases (7), GPCRs (5), cytochromes (2), chaperone (1)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Docking engine</strong>: AutoDock Vina with default exhaustiveness (8), up to 9 binding modes, energy range of 3 kcal/mol</li>
<li><strong>Ligand preparation</strong>: Open Babel (protonation at pH 7.4), RDKit ETKDG (3D embedding), MMFF94 (force field refinement), Gasteiger charges</li>
<li><strong>Regression models</strong>: Ridge, Lasso, XGBoost (hyperparameters via 20-configuration random search with 5-fold CV), exact GP and sparse GP (Tanimoto kernel on fingerprints), MPNN, Attentive FP (DeepChem defaults, 10 epochs)</li>
<li><strong>Optimization</strong>: Graph GA (population 250, offspring 25, mutation rate 0.01), SELFIES GA (same population/offspring settings), GP-BO with UCB ($\beta = 10$) or EI (batch size 5, 1000 offspring, 25 generations per iteration)</li>
</ul>
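<p>The UCB acquisition used by the GP-BO baseline can be sketched as follows. The exact form of the exploration term varies between implementations; mean + &radic;&beta;&middot;&sigma; is a common convention, and the maximization sign convention here is an assumption rather than a detail taken from the paper.</p>

```python
import math

def ucb(mean, std, beta=10.0):
    """Upper confidence bound: posterior mean plus an exploration bonus.
    The paper sets beta = 10; the sqrt(beta) scaling is one common
    convention, not necessarily the paper's exact form."""
    return mean + math.sqrt(beta) * std

def select_batch(candidates, posterior, beta=10.0, batch_size=5):
    """Rank candidates by UCB of their (mean, std) posterior and return
    the top batch (batch size 5 in the GP-BO setup above)."""
    ranked = sorted(candidates, key=lambda c: ucb(*posterior[c], beta),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical posteriors: "B" is uncertain enough to win under UCB.
posterior = {"A": (0.5, 0.1), "B": (0.3, 0.5), "C": (0.6, 0.0)}
best = select_batch(list(posterior), posterior, batch_size=1)
```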
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Setting</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2$ (coefficient of determination)</td>
          <td>Regression</td>
          <td>Cluster-split train/test</td>
      </tr>
      <tr>
          <td>EF (enrichment factor)</td>
          <td>Virtual screening</td>
          <td>Top 5,000 from ZINC20, 0.1 percentile threshold</td>
      </tr>
      <tr>
          <td>Objective value trajectory</td>
          <td>De novo design</td>
          <td>5,000 function evaluation budget</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The dataset required over 500,000 CPU hours to compute, using the University of Cambridge Research Computing Service (EPSRC and DiRAC funded). Per-target docking takes approximately 15 seconds on 8 CPUs.</p>
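<p>The quoted figures are mutually consistent: about 15 seconds per docking on 8 CPUs, over the full molecule-target matrix, works out to roughly 500,000 CPU hours.</p>

```python
n_molecules = 260_155
n_targets = 58
seconds_per_docking = 15   # approximate wall time per docking
cpus_per_docking = 8

cpu_hours = (n_molecules * n_targets * seconds_per_docking
             * cpus_per_docking) / 3600
# Roughly 500,000 CPU hours, matching the paper's reported total.
```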
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">DOCKSTRING Python package</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Wraps AutoDock Vina; available via conda-forge and PyPI</td>
      </tr>
      <tr>
          <td><a href="https://dockstring.github.io">DOCKSTRING dataset</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
<td>15M+ docking scores and poses for 260K molecules &times; 58 targets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">Benchmark baselines</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Regression, virtual screening, and de novo design baseline implementations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: García-Ortegón, M., Simm, G. N. C., Tripp, A. J., Hernández-Lobato, J. M., Bender, A., &amp; Bacallado, S. (2022). DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. <em>Journal of Chemical Information and Modeling</em>, 62(15), 3486-3502. <a href="https://doi.org/10.1021/acs.jcim.1c01334">https://doi.org/10.1021/acs.jcim.1c01334</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://dockstring.github.io">DOCKSTRING Project Page</a></li>
<li><a href="https://github.com/dockstring/dockstring">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{garciaortegon2022dockstring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{DOCKSTRING}: Easy Molecular Docking Yields Better Benchmarks for Ligand Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Garc{\&#39;\i}a-Orteg{\&#39;o}n, Miguel and Simm, Gregor N. C. and Tripp, Austin J. and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and Bender, Andreas and Bacallado, Sergio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3486--3502}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.1c01334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tartarus: Realistic Inverse Molecular Design Benchmarks</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/tartarus-inverse-molecular-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/tartarus-inverse-molecular-design/</guid><description>Tartarus provides physics-based benchmark tasks for inverse molecular design spanning materials, drugs, and reactions with algorithm-domain dependencies.</description><content:encoded><![CDATA[<h2 id="a-resource-for-realistic-molecular-design-evaluation">A Resource for Realistic Molecular Design Evaluation</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is Tartarus, a modular benchmarking platform for inverse molecular design that provides physically grounded evaluation tasks across four application domains: organic photovoltaics, organic emitters, protein ligands, and chemical reaction substrates. Each task pairs a curated reference dataset with a computational simulation workflow that evaluates proposed molecular structures using established methods from computational chemistry (<a href="https://en.wikipedia.org/wiki/Force_field_(chemistry)">force fields</a>, semi-empirical quantum chemistry, <a href="https://en.wikipedia.org/wiki/Density_functional_theory">density functional theory</a>, and <a href="https://en.wikipedia.org/wiki/Docking_(molecular)">molecular docking</a>).</p>
<h2 id="the-problem-with-existing-molecular-design-benchmarks">The Problem with Existing Molecular Design Benchmarks</h2>
<p>Inverse molecular design, the challenge of crafting molecules with specific optimal properties, is central to drug, catalyst, and materials discovery. Many algorithms have been proposed for this task, but the benchmarks used to evaluate them have significant limitations:</p>
<ul>
<li><strong>Penalized logP</strong>, one of the most common benchmarks, depends heavily on molecule size and chain composition, limiting its informativeness.</li>
<li><strong>QED maximization</strong> has reached saturation, with numerous models achieving near-perfect scores.</li>
<li><strong><a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> often yields near-perfect scores across models, obscuring meaningful performance differences. <a href="/notes/computational-chemistry/benchmark-problems/pmo-sample-efficient-molecular-optimization/">Gao et al. (2022)</a> traced this to unlimited property evaluations, with imposed limits revealing much larger disparities.</li>
<li><strong>MOSES</strong> evaluates distribution-matching ability, but the emergence of <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> and simple algorithms has made these tasks relatively straightforward.</li>
<li><strong>Molecular docking</strong> benchmarks are gaining popularity, but tend to favor reactive or unstable molecules and typically cover only drug design.</li>
</ul>
<p>These benchmarks share a common weakness: they rely on cheap, approximate property estimators (often QSAR models or simple heuristics) rather than physics-based simulations. This makes them poor proxies for real molecular design campaigns, where properties must be validated through computational or experimental workflows. Tartarus addresses this by providing benchmark tasks grounded in established simulation methods.</p>
<h2 id="physics-based-simulation-workflows-as-benchmark-oracles">Physics-Based Simulation Workflows as Benchmark Oracles</h2>
<p>The core innovation in Tartarus is the use of computational chemistry simulation pipelines as objective functions for benchmarking. Rather than relying on learned property predictors, each benchmark task runs a full simulation workflow to evaluate proposed molecules:</p>
<ol>
<li><strong>Organic Photovoltaics (OPV)</strong>: Starting from a <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> string, the workflow generates 3D coordinates with Open Babel, performs conformer search with CREST at the GFN-FF level, optimizes geometry at GFN2-xTB, and computes <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO/LUMO</a> energies. Power conversion efficiency (PCE) is estimated via the Scharber model for single-junction <a href="https://en.wikipedia.org/wiki/Organic_solar_cell">organic solar cells</a>. HOMO and LUMO energies are calibrated against DFT results from the Harvard Clean Energy Project Database using <a href="https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator">Theil-Sen regression</a>:</li>
</ol>
<p>$$
E_{\text{HOMO, calibrated}} = E_{\text{HOMO, GFN2-xTB}} \cdot 0.8051 + 2.5377 \text{ eV}
$$</p>
<p>$$
E_{\text{LUMO, calibrated}} = E_{\text{LUMO, GFN2-xTB}} \cdot 0.8788 + 3.7913 \text{ eV}
$$</p>
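<p>The calibration is a simple affine map and can be applied directly to raw GFN2-xTB orbital energies; a minimal sketch (the function name is ours, constants are the Theil-Sen coefficients quoted above):</p>

```python
def calibrate_frontier_orbitals(e_homo_xtb: float, e_lumo_xtb: float):
    """Map GFN2-xTB frontier-orbital energies (eV) onto the DFT reference
    scale of the Harvard Clean Energy Project, using the Theil-Sen
    regression coefficients reported for the OPV benchmark."""
    e_homo = e_homo_xtb * 0.8051 + 2.5377
    e_lumo = e_lumo_xtb * 0.8788 + 3.7913
    return e_homo, e_lumo
```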
<ol start="2">
<li>
<p><strong>Organic Emitters (OLED)</strong>: The workflow uses conformer search via CREST, geometry optimization at GFN0-xTB, and TD-DFT single-point calculations at the B3LYP/6-31G* level with PySCF to extract singlet-triplet gaps, <a href="https://en.wikipedia.org/wiki/Oscillator_strength">oscillator strengths</a>, and vertical excitation energies.</p>
</li>
<li>
<p><strong>Protein Ligands</strong>: The workflow generates 3D coordinates, applies structural filters (<a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a>, reactive moiety checks), and performs molecular docking using QuickVina2 with re-scoring via smina against three protein targets: 1SYH (ionotropic glutamate receptor), 6Y2F (<a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 main protease</a>), and 4LDE (beta-2 adrenoceptor).</p>
</li>
<li>
<p><strong>Chemical Reaction Substrates</strong>: The workflow models the intramolecular double hydrogen transfer in syn-sesquinorbornenes using the SEAM force field approach at the GFN-FF/GFN2-xTB level to compute activation and reaction energies.</p>
</li>
</ol>
<p>Each benchmark also includes a curated reference dataset for training generative models and a standardized evaluation protocol: train on 80% of the dataset, use 20% for hyperparameter optimization, then optimize structures starting from the best reference molecule with a constrained budget of 5,000 proposed compounds, a 24-hour runtime cap, and five independent repetitions.</p>
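<p>The constrained-budget part of the protocol can be sketched as a wrapper around any property oracle (class and function names are ours; <code>evaluate_fn</code> stands in for a real simulation workflow):</p>

```python
import time


class BudgetedOracle:
    """Enforce the Tartarus evaluation constraints on a property oracle:
    at most 5,000 proposed compounds and a 24-hour wall-clock cap."""

    def __init__(self, evaluate_fn, max_evals=5000, max_seconds=24 * 3600):
        self.evaluate_fn = evaluate_fn
        self.max_evals = max_evals
        self.deadline = time.monotonic() + max_seconds
        self.num_evals = 0

    def __call__(self, smiles: str):
        # Refuse further evaluations once either budget is spent.
        if self.num_evals >= self.max_evals or time.monotonic() > self.deadline:
            raise RuntimeError("evaluation budget exhausted")
        self.num_evals += 1
        return self.evaluate_fn(smiles)
```

An optimizer would call the wrapped oracle in its inner loop and stop when it raises; the five independent repetitions are then just five fresh wrappers.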
<h2 id="benchmark-tasks-datasets-and-model-comparisons">Benchmark Tasks, Datasets, and Model Comparisons</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight generative models spanning major algorithm families were tested:</p>
<ul>
<li><strong>VAEs</strong>: SMILES-VAE and SELFIES-VAE</li>
<li><strong>Flow models</strong>: MoFlow</li>
<li><strong>Reinforcement learning</strong>: <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></li>
<li><strong>LSTM-based hill climbing</strong>: SMILES-LSTM-HC and SELFIES-LSTM-HC</li>
<li><strong>Genetic algorithms</strong>: <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">GB-GA</a> and JANUS</li>
</ul>
<h3 id="organic-photovoltaics-results">Organic Photovoltaics Results</h3>
<p>The reference dataset (CEP_SUB) contains approximately 25,000 molecules from the Harvard Clean Energy Project Database. Two objectives combine PCE with synthetic accessibility (SAscore):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>PCE_PCBM - SAscore</th>
          <th>PCE_PCDTBT - SAscore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>7.57</td>
          <td>31.71</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>7.44 +/- 0.28</td>
          <td>10.23 +/- 11.14</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>7.05 +/- 0.66</td>
          <td>29.24 +/- 0.65</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>7.08 +/- 0.31</td>
          <td>29.81 +/- 0.37</td>
      </tr>
      <tr>
          <td>SMILES-LSTM-HC</td>
          <td>6.69 +/- 0.40</td>
          <td>31.79 +/- 0.15</td>
      </tr>
      <tr>
          <td>SELFIES-LSTM-HC</td>
          <td>7.40 +/- 0.41</td>
          <td>30.71 +/- 1.20</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>7.48 +/- 0.11</td>
          <td>30.47 +/- 0.44</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>7.78 +/- 0.02</td>
          <td>30.24 +/- 0.80</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>7.59 +/- 0.14</td>
          <td>31.34 +/- 0.74</td>
      </tr>
  </tbody>
</table>
<p>GB-GA achieves the best score on the first task (7.78), while SMILES-LSTM-HC leads on the second (31.79). Most models marginally improve PCE on its own but struggle to improve it while also lowering SAscore.</p>
<h3 id="organic-emitters-results">Organic Emitters Results</h3>
<p>The reference dataset (GDB-13_SUB) contains approximately 380,000 molecules filtered for conjugated pi-systems from <a href="/notes/computational-chemistry/datasets/gdb-13/">GDB-13</a>. Three objectives target singlet-triplet gap minimization, oscillator strength maximization, and a combined multi-objective:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(S1-T1)</th>
          <th>f12</th>
          <th>Multi-objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>0.020</td>
          <td>2.97</td>
          <td>-0.04</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>0.071 +/- 0.003</td>
          <td>0.50 +/- 0.27</td>
          <td>-0.57 +/- 0.33</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>0.016 +/- 0.001</td>
          <td>0.36 +/- 0.31</td>
          <td>0.17 +/- 0.10</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>0.013 +/- 0.001</td>
          <td>0.81 +/- 0.11</td>
          <td>-0.04 +/- 0.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>0.012 +/- 0.002</td>
          <td>2.14 +/- 0.45</td>
          <td>0.07 +/- 0.03</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>0.008 +/- 0.001</td>
          <td>2.07 +/- 0.16</td>
          <td>0.02 +/- 0.05</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS, GB-GA, and SELFIES-VAE generate compounds comparable to or improving upon the best training molecules. JANUS achieves the lowest singlet-triplet gap (0.008 eV), while SELFIES-VAE achieves the highest multi-objective fitness (0.17). Some proposed structures contain reactive moieties, likely because stability is not explicitly penalized in the objective functions.</p>
<h3 id="protein-ligand-results">Protein Ligand Results</h3>
<p>The reference dataset contains approximately 152,000 molecules from the DTP Open Compound Collection, filtered for drug-likeness. Docking is performed against three protein targets using both QuickVina2 and smina re-scoring:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1SYH (smina)</th>
          <th>6Y2F (smina)</th>
          <th>4LDE (smina)</th>
          <th>SR (1SYH)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>-10.2</td>
          <td>-8.2</td>
          <td>-13.1</td>
          <td>100.0%</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>-10.4 +/- 0.6</td>
          <td>-8.9 +/- 0.8</td>
          <td>-11.1 +/- 0.4</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>-10.9 +/- 0.3</td>
          <td>-10.1 +/- 0.4</td>
          <td>-11.9 +/- 0.2</td>
          <td>34.8%</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>-12.1 +/- 0.2</td>
          <td>-11.4 +/- 0.3</td>
          <td>-13.7 +/- 0.5</td>
          <td>77.8%</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>-12.0 +/- 0.2</td>
          <td>-11.0 +/- 0.2</td>
          <td>-13.8 +/- 0.4</td>
          <td>72.6%</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>-11.9 +/- 0.2</td>
          <td>-11.9 +/- 0.4</td>
          <td>-13.6 +/- 0.5</td>
          <td>68.4%</td>
      </tr>
  </tbody>
</table>
<p>No single model consistently achieves the best docking score across all three targets. REINVENT leads on 1SYH, JANUS on 6Y2F, and GB-GA on 4LDE. Both VAE models show low success rates for structural filter compliance (12-39%), while REINVENT, GAs, and LSTMs achieve 68-78%.</p>
<h3 id="chemical-reaction-substrates-results">Chemical Reaction Substrates Results</h3>
<p>The reference dataset (SNB-60K) contains approximately 60,000 syn-sesquinorbornene derivatives generated via <a href="/notes/computational-chemistry/benchmark-problems/stoned-selfies-chemical-space-exploration/">STONED-SELFIES</a> mutations. Four objectives target activation energy, reaction energy, and two combined metrics:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(activation)</th>
          <th>Delta E(reaction)</th>
          <th>Delta E(act) + Delta E(rxn)</th>
          <th>-Delta E(act) + Delta E(rxn)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>64.94</td>
          <td>-34.39</td>
          <td>56.48</td>
          <td>-95.25</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>76.81 +/- 0.25</td>
          <td>-10.96 +/- 0.71</td>
          <td>71.01 +/- 0.62</td>
          <td>-90.94 +/- 1.04</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>70.12 +/- 2.13</td>
          <td>-20.21 +/- 4.13</td>
          <td>63.21 +/- 0.69</td>
          <td>-92.82 +/- 3.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>56.04 +/- 3.07</td>
          <td>-41.39 +/- 5.76</td>
          <td>45.20 +/- 6.78</td>
          <td>-100.07 +/- 1.35</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>47.56 +/- 2.19</td>
          <td>-45.37 +/- 7.90</td>
          <td>39.22 +/- 3.99</td>
          <td>-97.14 +/- 1.13</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS and GB-GA consistently outperform the best reference compounds. Both VAE models fail to surpass the dataset baseline on any objective. JANUS achieves the best single-objective scores for activation energy (47.56) and reaction energy (-45.37), and the best combined score (39.22).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="central-finding-algorithm-performance-is-domain-dependent">Central Finding: Algorithm Performance is Domain-Dependent</h3>
<p>The most important result from Tartarus is that no single generative model consistently outperforms the others across all benchmark domains. This has several implications:</p>
<ul>
<li><strong>Genetic algorithms (GB-GA and JANUS) show the most consistently strong performance</strong> across benchmarks, despite being among the simplest approaches and requiring minimal pre-conditioning time (seconds vs. hours for deep models).</li>
<li><strong>VAE-based models (SMILES-VAE and SELFIES-VAE) show the weakest overall performance</strong>, often failing to surpass the best molecules in the reference datasets. Their reliance on the available training data appears to limit their effectiveness.</li>
<li><strong>REINVENT performs competitively on protein ligand tasks</strong> but shows weaker performance on other benchmarks.</li>
<li><strong>Representation matters</strong>: SELFIES-based models generally outperform their SMILES-based counterparts (e.g., SELFIES-VAE vs. SMILES-VAE), consistent with SELFIES providing 100% validity guarantees.</li>
</ul>
<h3 id="timing-analysis">Timing Analysis</h3>
<p>Training time varies dramatically across models. Both VAEs require over 9 hours of GPU training, with estimated CPU-only training times of approximately 25 days. REINVENT and MoFlow train in under 1 hour. Both GAs complete pre-conditioning in seconds and require no GPU.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ul>
<li>Benchmark domains covered are not comprehensive and need expansion.</li>
<li>3D generative models are not well supported, as proposed conformers are ignored in favor of simulation-derived geometries.</li>
<li>The chemical reaction substrate benchmark requires specialized geometries (reactant, product, transition state) that most 3D generative models cannot produce.</li>
<li>Results depend heavily on both model hyperparameters and benchmark settings (compute budget, number of evaluations).</li>
<li>Objective functions may need revision when undesired structures are promoted.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPV Training</td>
          <td>CEP_SUB (Harvard Clean Energy Project subset)</td>
          <td>~25,000 molecules</td>
          <td>From HIPS/neural-fingerprint repository</td>
      </tr>
      <tr>
          <td>Emitter Training</td>
          <td>GDB-13_SUB (filtered GDB-13)</td>
          <td>~380,000 molecules</td>
          <td>Conjugated pi-system filter applied</td>
      </tr>
      <tr>
          <td>Ligand Training</td>
          <td>DTP Open Compound Collection (filtered)</td>
          <td>~152,000 molecules</td>
          <td>Drug-likeness and structural filters applied</td>
      </tr>
      <tr>
          <td>Reaction Training</td>
          <td>SNB-60K (STONED-SELFIES mutations)</td>
          <td>~60,000 molecules</td>
          <td>Generated from syn-sesquinorbornene core</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All eight algorithms are implemented in the Tartarus repository with configuration files and installation instructions. The evaluation protocol specifies: 80/20 train/validation split, population size of 5,000, 24-hour runtime cap, five independent runs per model.</p>
<h3 id="models">Models</h3>
<p>Pre-trained model checkpoints are not provided. Training must be performed from scratch using the provided reference datasets and hyperparameter configurations documented in the Supporting Information.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Properties are evaluated through physics-based simulation workflows (not learned surrogates). Each workflow accepts a SMILES string and returns computed properties. Key software dependencies include: Open Babel, CREST, xTB, PySCF, QuickVina2, smina, and RDKit.</p>
<h3 id="hardware">Hardware</h3>
<p>Training and sampling benchmarks were conducted using 24 CPU cores (AMD Rome 7532 @ 2.40 GHz) and a single Tesla A100 GPU. Simulations were run on the Beluga, Narval, Niagara, Cedar, and Sherlock supercomputing clusters.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Benchmark tasks, simulation workflows, model configs</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Archive</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Reference datasets for all four benchmark domains</td>
      </tr>
      <tr>
          <td><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Discussion and collaboration channel</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A., Pollice, R., Tom, G., Jorner, K., Willes, J., Thiede, L. A., Kundaje, A., &amp; Aspuru-Guzik, A. (2023). Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design. <em>Advances in Neural Information Processing Systems 36</em>, 3263-3306.</p>
<p><strong>Publication</strong>: NeurIPS 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub Repository</a></li>
<li><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Dataset Archive</a></li>
<li><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nigam2023tartarus,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Tom, Gary and Jorner, Kjell and Willes, John and Thiede, Luca A. and Kundaje, Anshul and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3263--3306}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMINA Docking Benchmark for De Novo Drug Design Models</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/</guid><description>A docking-based benchmark for evaluating de novo drug design generative models, using SMINA scoring across eight protein targets from ChEMBL.</description><content:encoded><![CDATA[<h2 id="a-docking-based-benchmark-for-de-novo-drug-design">A Docking-Based Benchmark for De Novo Drug Design</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is a standardized benchmark for evaluating generative models in de novo drug design. Rather than introducing a new generative method, the paper provides a reusable evaluation framework built around molecular docking, a widely used computational proxy for predicting protein-ligand binding. The benchmark uses SMINA (a fork of <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>) to score generated molecules against eight protein targets, offering a more realistic evaluation than commonly used proxy metrics like logP or QED.</p>
<h2 id="why-existing-benchmarks-fall-short">Why Existing Benchmarks Fall Short</h2>
<p>De novo drug design methods are typically evaluated using simple proxy tasks that do not reflect the complexity of real drug discovery. The octanol-water partition coefficient (logP) can be trivially optimized by producing unrealistic molecules. The QED drug-likeness score suffers from the same issue. Neural network-based bioactivity predictors are similarly exploitable.</p>
<p>As Coley et al. (2020) note: &ldquo;The current evaluations for generative models do not reflect the complexity of real discovery problems.&rdquo;</p>
<p>More realistic evaluation approaches exist in adjacent domains (photovoltaics, excitation energies), where physical calculations are used to both train and evaluate models. Yet de novo drug design has largely relied on the same simplistic proxies. This gap between proxy task performance and real-world utility motivates the development of a docking-based benchmark that, while still a proxy, captures more of the structural complexity involved in protein-ligand interactions.</p>
<h2 id="benchmark-design-smina-docking-with-the-vinardo-scoring-function">Benchmark Design: SMINA Docking with the Vinardo Scoring Function</h2>
<p>The benchmark is defined by three components: (1) docking software that computes a ligand&rsquo;s pose in the binding site, (2) a scoring function that evaluates the pose, and (3) a training set of compounds with precomputed docking scores.</p>
<p>The concrete instantiation uses SMINA v. 2017.11.9 with the Vinardo scoring function:</p>
<p>$$S = -0.045 \cdot G + 0.8 \cdot R - 0.035 \cdot H - 0.6 \cdot B$$</p>
<p>where $S$ is the docking score, $G$ is the gauss term, $R$ is repulsion, $H$ is the hydrophobic term, and $B$ is the non-directional hydrogen bond term. The gauss and repulsion terms measure steric interactions between the ligand and the protein, while the hydrophobic and hydrogen bond terms capture favorable non-covalent contacts.</p>
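<p>Given the four summed interaction terms for a pose, the Vinardo score is just the weighted combination above; a one-line sketch (function name is ours, inputs are assumed to be the per-pose sums over atom pairs):</p>

```python
def vinardo_score(gauss: float, repulsion: float,
                  hydrophobic: float, hbond: float) -> float:
    """Vinardo docking score S from its four summed interaction terms,
    using the weights quoted above (lower is better)."""
    return -0.045 * gauss + 0.8 * repulsion - 0.035 * hydrophobic - 0.6 * hbond
```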
<p>The benchmark includes three task variants:</p>
<ol>
<li><strong>Docking Score Function</strong>: Optimize the full Vinardo docking score (lower is better).</li>
<li><strong>Repulsion</strong>: Minimize only the repulsion component, defined as:</li>
</ol>
<p>$$
R(a_1, a_2) = \begin{cases}
d(a_1, a_2)^2 &amp; d(a_1, a_2) &lt; 0 \\
0 &amp; \text{otherwise}
\end{cases}
$$</p>
<p>where $d(a_1, a_2)$ is the inter-atomic distance minus the sum of <a href="https://en.wikipedia.org/wiki/Van_der_Waals_radius">van der Waals radii</a>.</p>
<ol start="3">
<li><strong>Hydrogen Bonding</strong>: Maximize the hydrogen bond term:</li>
</ol>
<p>$$
B(a_1, a_2) = \begin{cases}
0 &amp; (a_1, a_2) \text{ do not form H-bond} \\
1 &amp; d(a_1, a_2) &lt; -0.6 \\
0 &amp; d(a_1, a_2) \geq 0 \\
\frac{d(a_1, a_2)}{-0.6} &amp; \text{otherwise}
\end{cases}
$$</p>
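<p>Both piecewise terms operate on the surface distance $d$ (inter-atomic distance minus the sum of van der Waals radii); a minimal sketch of the two definitions (function names are ours, and whether a pair can form a hydrogen bond is taken as a precomputed flag):</p>

```python
def repulsion_term(d: float) -> float:
    """Per-atom-pair repulsion: quadratic penalty for overlapping atoms
    (negative surface distance d), zero otherwise."""
    return d * d if d < 0 else 0.0


def hbond_term(d: float, forms_hbond: bool) -> float:
    """Per-atom-pair non-directional hydrogen-bond term: full credit for
    d < -0.6, zero for d >= 0, linear interpolation in between."""
    if not forms_hbond or d >= 0:
        return 0.0
    if d < -0.6:
        return 1.0
    return d / -0.6
```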
<p>Scores are averaged over the top 5 binding poses for stability. Generated compounds are filtered by <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and a minimum molecular weight of 100. Each model must generate 250 unique molecules per target.</p>
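<p>The top-5 pose averaging and the structural filter can be sketched as follows (function names are ours; the Lipinski descriptors are assumed precomputed, e.g. with RDKit):</p>

```python
def pose_score(pose_scores):
    """Benchmark convention: average the 5 best (most negative) pose
    scores for stability."""
    best = sorted(pose_scores)[:5]
    return sum(best) / len(best)


def passes_filters(mw: float, logp: float, hbd: int, hba: int) -> bool:
    """Lipinski's Rule of Five plus the benchmark's molecular weight
    floor of 100."""
    return 100 <= mw <= 500 and logp <= 5 and hbd <= 5 and hba <= 10
```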
<p>Training data comes from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, covering eight drug targets: 5-HT1B, 5-HT2B, ACM2, CYP2D6, ADRB1, MOR, A2A, and D2. Dataset sizes range from 1,082 (ADRB1) to 10,225 (MOR) molecules.</p>
<h2 id="experimental-evaluation-of-three-generative-models">Experimental Evaluation of Three Generative Models</h2>
<h3 id="models-tested">Models Tested</h3>
<p>Three popular generative models were evaluated:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">CVAE</a></strong> (Chemical Variational Autoencoder): A VAE operating on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">GVAE</a></strong> (Grammar Variational Autoencoder): Extends CVAE by enforcing grammatical correctness of generated SMILES.</li>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong>: A recurrent neural network trained first on ChEMBL in a supervised manner, then fine-tuned with reinforcement learning using docking scores as rewards.</li>
</ul>
<p>For CVAE and GVAE, molecules are generated by sampling from the latent space and taking 50 gradient steps in latent space to optimize the docking score predicted by an MLP. For REINVENT, a random forest model predicts docking scores from ECFP fingerprints, and the reward combines this prediction with the QED score.</p>
<h3 id="baselines">Baselines</h3>
<p>Two baselines provide context:</p>
<ul>
<li><strong>Training set</strong>: The top 50%, 10%, and 1% of docking scores from the ChEMBL training set.</li>
<li><strong><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC</a> subset</strong>: A random sample of ~9.2 million drug-like molecules from ZINC, with the same percentile breakdowns.</li>
</ul>
<p>Diversity is measured as the mean <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a> (using 1024-bit ECFP with radius 2) between all pairs of generated molecules.</p>
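<p>The diversity metric reduces to a mean over all unordered pairs; a minimal sketch operating on fingerprint on-bit sets (fingerprinting itself would use RDKit's 1024-bit ECFP with radius 2 and is omitted here; function names are ours):</p>

```python
from itertools import combinations


def tanimoto_distance(fp_a: set, fp_b: set) -> float:
    """1 - Tanimoto similarity between two fingerprints, represented as
    sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return 1.0 - (len(fp_a & fp_b) / union if union else 1.0)


def mean_pairwise_diversity(fingerprints) -> float:
    """Benchmark diversity: mean Tanimoto distance over all unordered
    pairs of generated molecules (higher = more diverse)."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)
```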
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>5-HT1B Score</th>
          <th>5-HT1B Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking Score</td>
          <td>CVAE</td>
          <td>-4.647</td>
          <td>0.907</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>GVAE</td>
          <td>-4.955</td>
          <td>0.901</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>REINVENT</td>
          <td>-9.774</td>
          <td>0.506</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (10%)</td>
          <td>-9.894</td>
          <td>0.862</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (1%)</td>
          <td>-10.496</td>
          <td>0.861</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>Train (10%)</td>
          <td>-10.837</td>
          <td>0.749</td>
      </tr>
  </tbody>
</table>
<p>On the full docking score task, CVAE and GVAE fail to match even the mean ZINC docking score. REINVENT performs substantially better (e.g., -9.774 on 5-HT1B) but still falls short of the top 10% ZINC scores (-9.894) in most cases. The exception is ACM2, where REINVENT&rsquo;s score (-9.775) exceeds the ZINC 10% threshold (-8.282).</p>
<p>On the repulsion task, all three models fail to outperform the top 10% ZINC scores. On the hydrogen bonding task (the easiest), GVAE and REINVENT nearly match the top 1% ZINC scores, suggesting that optimizing individual scoring components is more tractable than the full docking score.</p>
<p>A consistent finding across all experiments is that REINVENT generates substantially less diverse molecules than the training set (e.g., 0.506 vs. 0.787 mean Tanimoto distance on 5-HT1B). The t-SNE visualizations show generated molecules clustering in a single dense region, separate from the training data, regardless of optimization target.</p>
<p>The paper also notes a moderately strong correlation between docking scores and molecular weight or the number of rotatable bonds. Generated compounds achieve better docking scores at the same molecular weight after optimization, suggesting the models learn some structural preferences rather than simply exploiting molecular size.</p>
<h2 id="limitations-of-current-generative-models-for-drug-design">Limitations of Current Generative Models for Drug Design</h2>
<p>The main finding is negative: popular generative models for de novo drug design struggle to generate molecules that dock well when trained on realistically sized datasets (1,000 to 10,000 compounds). Even the best-performing model (REINVENT) generally cannot outperform the top 10% of a random ZINC subset on the full docking score task.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Docking is itself a proxy</strong>: The SMINA docking score is only an approximation of true binding affinity. The fact that even this simpler proxy is challenging should raise concerns about these models&rsquo; readiness for real drug discovery pipelines.</li>
<li><strong>Limited model selection</strong>: Only three models were tested (CVAE, GVAE, REINVENT). The authors note that CVAE and GVAE were not designed for small training sets, and REINVENT may not represent the state of the art in all respects.</li>
<li><strong>ML-based scoring surrogate</strong>: All models use an ML model (MLP or random forest) to predict docking scores during generation, rather than running SMINA directly. This introduces an additional approximation layer.</li>
<li><strong>No similarity constraints</strong>: The benchmark does not constrain how similar generated molecules may be to the training set, so a trivial baseline could simply return the training molecules themselves.</li>
</ul>
<p>On a more positive note, the tested models perform well on the simplest subtask (hydrogen bonding), suggesting that optimizing docking scores from limited data is attainable but challenging. The benchmark has already been adopted by other groups, notably Nigam et al. (2021) for evaluating their JANUS genetic algorithm.</p>
<p>Future directions include adding similarity constraints, extending to additional protein targets, and using the benchmark to evaluate newer structure-based generative models that employ equivariant neural networks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ChEMBL (8 targets)</td>
          <td>1,082-10,225 molecules per target</td>
          <td>90/10 train/test split</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>ZINC 15 subset</td>
          <td>~9.2M drug-like molecules</td>
          <td>In-stock, standard reactivity, drug-like</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a></td>
          <td>8 structures</td>
          <td>Cleaned with Schrodinger modeling package</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>CVAE/GVAE: Fine-tuned 5 epochs on target data, then 50 gradient steps in latent space to optimize MLP-predicted score</li>
<li>REINVENT: Pretrained on ChEMBL, fine-tuned with RL; reward = random forest prediction * QED score</li>
<li>All docking performed with SMINA v. 2017.11.9 using Vinardo scoring function in score_only mode</li>
<li>Scores averaged over top 5 binding poses</li>
<li>Filtering: Lipinski Rule of Five, minimum molecular weight 100</li>
</ul>
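<p>The per-molecule aggregation (&ldquo;averaged over top 5 binding poses&rdquo;) can be sketched as below; <code>mean_top_poses</code> is a hypothetical helper, and the convention that lower (more negative) SMINA/Vinardo scores indicate better poses is assumed:</p>

```python
def mean_top_poses(scores, k=5):
    """Average the k best (lowest) docking scores for one molecule.

    SMINA/Vinardo scores are energy-like: lower (more negative) means a
    better predicted pose, so the "top" poses are the k lowest values.
    """
    best = sorted(scores)[:k]
    return sum(best) / len(best)
```

<p>For example, with six pose scores only the five lowest contribute to the molecule&rsquo;s reported score.</p>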
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean docking score</td>
          <td>Average over 250 generated molecules</td>
          <td>Lower (more negative) is better for both the docking score and the repulsion term</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Mean Tanimoto distance (ECFP, r=2)</td>
          <td>Higher is more diverse</td>
      </tr>
      <tr>
          <td>ZINC percentile baselines</td>
          <td>Top 50%, 10%, 1% from random ZINC subset</td>
          <td>Task considered &ldquo;solved&rdquo; if the mean generated score beats the top-1% ZINC threshold</td>
      </tr>
  </tbody>
</table>
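<p>The diversity metric reduces to the Jaccard distance over each fingerprint&rsquo;s &ldquo;on&rdquo; bits. A minimal sketch, representing fingerprints as Python sets of bit indices (a simplification of real 1024-bit ECFPs):</p>

```python
from itertools import combinations

def tanimoto(a, b):
    # Tanimoto (Jaccard) similarity between two sets of "on" bits
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def internal_diversity(fingerprints):
    # Mean pairwise Tanimoto *distance* (1 - similarity) over all molecule pairs
    pairs = list(combinations(fingerprints, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```

<p>Two identical fingerprints contribute a distance of 0; fully disjoint ones contribute 1, so higher mean values indicate a more diverse generated set.</p>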
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">smina-docking-benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark code, data, evaluation notebooks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cieplinski, T., Danel, T., Podlewska, S., &amp; Jastrzebski, S. (2023). Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark. <em>Journal of Chemical Information and Modeling</em>, 63(11), 3238-3247. <a href="https://doi.org/10.1021/acs.jcim.2c01355">https://doi.org/10.1021/acs.jcim.2c01355</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cieplinski2023generative,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cieplinski, Tobiasz and Danel, Tomasz and Podlewska, Sabina and Jastrzebski, Stanislaw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3238--3247}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01355}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Exposing Limitations of Molecular ML with Activity Cliffs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/</link><pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/benchmark-problems/activity-cliffs-benchmark/</guid><description>A benchmark of 24 ML methods on activity cliff compounds across 30 drug targets, showing descriptor-based models outperform deep learning.</description><content:encoded><![CDATA[<h2 id="a-benchmark-for-activity-cliff-prediction">A Benchmark for Activity Cliff Prediction</h2>
<p>This is a <strong>Systematization</strong> paper ($\Psi_{\text{Systematization}}$) with a significant <strong>Resource</strong> component ($\Psi_{\text{Resource}}$).</p>
<p>The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for <a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">molecular property prediction</a> in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.</p>
<h2 id="activity-cliffs-as-a-blind-spot-in-molecular-ml">Activity Cliffs as a Blind Spot in Molecular ML</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Chemical_similarity">similarity principle</a> underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a 10x difference in $K_i$).</p>
<p>Despite their importance for <a href="https://en.wikipedia.org/wiki/Hit_to_lead">hit-to-lead optimization</a>, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.</p>
<p>The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.</p>
<h2 id="defining-and-detecting-activity-cliffs">Defining and Detecting Activity Cliffs</h2>
<p>The authors use three complementary similarity metrics to identify activity cliffs:</p>
<ol>
<li><strong>Substructure similarity</strong>: <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto coefficient</a> on extended connectivity fingerprints (ECFPs), capturing shared radial substructures</li>
<li><strong>Scaffold similarity</strong>: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences</li>
<li><strong>SMILES similarity</strong>: <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> on canonical <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, capturing character-level insertions, deletions, and translocations</li>
</ol>
<p>Pairs with $\geq 90\%$ similarity on <strong>any one</strong> of the three metrics and $&gt; 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.</p>
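<p>A minimal sketch of this union rule, assuming the three pairwise similarities have already been computed (fingerprint generation and SMILES canonicalization are omitted):</p>

```python
def is_activity_cliff(similarities, potency_a, potency_b,
                      sim_threshold=0.9, fold_change=10.0):
    """Classify a molecule pair as an activity cliff.

    similarities: (substructure, scaffold, SMILES) similarity values in [0, 1]
    potency_a/b:  bioactivity values (e.g., Ki or EC50) on a linear scale
    """
    # Union rule: any ONE metric crossing the threshold suffices
    structurally_similar = any(s >= sim_threshold for s in similarities)
    # > 10-fold difference in bioactivity
    ratio = max(potency_a, potency_b) / min(potency_a, potency_b)
    return structurally_similar and ratio > fold_change
```

<p>A pair that is near-identical on only the substructure metric but differs 24-fold in $K_i$ qualifies; the same pair at a 4-fold gap does not.</p>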
<h2 id="24-methods-across-30-drug-targets">24 Methods Across 30 Drug Targets</h2>
<p>The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a> v29 datasets (48,707 total molecules).</p>
<p><strong>Traditional ML algorithms</strong>: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.</p>
<p><strong>Deep learning methods</strong>: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a> (SMILES-based), and an MLP on ECFPs.</p>
<p>Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:</p>
<p>$$
\text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}}
$$</p>
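<p>Both metrics can be computed over a shared test set in a few lines; here <code>is_cliff</code> is an assumed boolean mask marking the activity cliff compounds:</p>

```python
import math

def rmse(preds, targets):
    # Standard root-mean-square error over all test molecules
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds))

def rmse_cliff(preds, targets, is_cliff):
    # Restrict the RMSE to activity cliff compounds only
    cliff = [(p, t) for p, t, c in zip(preds, targets, is_cliff) if c]
    return rmse([p for p, _ in cliff], [t for _, t in cliff])
```

<p>The point of the separate metric is visible even in a toy case: a model can have low overall RMSE while its error on the cliff subset is much larger.</p>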
<p>Key results:</p>
<ul>
<li><strong>Molecular descriptors matter more than algorithms</strong>: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p &lt; 0.05$, <a href="https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test">Wilcoxon rank-sum test</a> with <a href="https://en.wikipedia.org/wiki/False_discovery_rate">Benjamini-Hochberg correction</a>).</li>
<li><strong>SVM + ECFPs wins on average</strong>: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.</li>
<li><strong>Deep learning underperforms</strong>: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.</li>
<li><strong>Large case-by-case variation</strong>: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.</li>
</ul>
<h2 id="simple-descriptors-beat-complex-architectures-on-cliffs">Simple Descriptors Beat Complex Architectures on Cliffs</h2>
<p>The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.</p>
<p>Key observations:</p>
<ul>
<li><strong>RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average)</strong>, so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.</li>
<li><strong>Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation</strong>: Datasets with $&gt; 1000$ training molecules show $r &gt; 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.</li>
<li><strong>No relationship between % cliff compounds and model performance</strong>, and no target-family-specific effects were found.</li>
<li><strong>Transfer learning helped SMILES models (LSTM) but not graph models</strong>: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.</li>
</ul>
<p>The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Source</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Benchmarking</td>
          <td>ChEMBL v29</td>
          <td>48,707 molecules (35,632 unique) across 30 targets</td>
          <td>Curated for duplicates, salts, outliers</td>
      </tr>
      <tr>
          <td>Smallest dataset</td>
          <td>JAK1</td>
          <td>615 molecules</td>
          <td>7% activity cliffs</td>
      </tr>
      <tr>
          <td>Largest dataset</td>
          <td>DRD3</td>
          <td>3,657 molecules</td>
          <td>39% activity cliffs</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Activity cliff detection</strong>: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $&gt; 10\times$ potency difference</li>
<li><strong>Splitting</strong>: <a href="https://en.wikipedia.org/wiki/Spectral_clustering">Spectral clustering</a> on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion</li>
<li><strong>Hyperparameter optimization</strong>: <a href="https://en.wikipedia.org/wiki/Bayesian_optimization">Bayesian optimization</a> with Gaussian process, max 50 combinations, 5-fold cross-validation</li>
<li><strong>SMILES augmentation</strong>: 10-fold for all SMILES-based methods</li>
<li><strong>Transfer learning</strong>: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> compounds</li>
</ul>
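<p>The SMILES-based cliff criterion relies on Levenshtein distance; a standard dynamic-programming implementation follows. The conversion to a $[0, 1]$ similarity by normalizing against the longer string is an assumption for illustration, not necessarily the paper&rsquo;s exact formula:</p>

```python
def levenshtein(s, t):
    # Classic DP over one rolling row: prev[j] holds the edit
    # distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def smiles_similarity(a, b):
    # Hypothetical normalization: 1 at identity, 0 at maximal distance
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)
```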
<h3 id="models">Models</h3>
<ul>
<li><strong>Traditional ML</strong>: KNN, RF, GBM, SVM (scikit-learn v1.0.2)</li>
<li><strong>Descriptors</strong>: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)</li>
<li><strong>GNNs</strong>: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling</li>
<li><strong>SMILES models</strong>: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer</li>
<li><strong>Total models trained</strong>: 720 (24 methods $\times$ 30 targets)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Details</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RMSE</td>
          <td>All test molecules</td>
          <td>Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$</td>
      </tr>
      <tr>
          <td>$\text{RMSE}_{\text{cliff}}$</td>
          <td>Activity cliff compounds only</td>
          <td>RMSE restricted to cliff molecules in test set</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE">MoleculeACE</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Benchmark platform with all 30 curated datasets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/molML/MoleculeACE/tree/main/MoleculeACE/Data/benchmark_data">Curated datasets</a></td>
          <td>Data</td>
          <td>MIT</td>
          <td>Processed ChEMBL bioactivity data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: van Tilborg, D., Alenicheva, A., &amp; Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. <em>Journal of Chemical Information and Modeling</em>, 62(23), 5938-5951. <a href="https://doi.org/10.1021/acs.jcim.2c01073">https://doi.org/10.1021/acs.jcim.2c01073</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/molML/MoleculeACE">MoleculeACE GitHub Repository</a></li>
<li><a href="https://chemrxiv.org/engage/chemrxiv/article-details/630cc44058843b8403a19810">ChemRxiv Preprint</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{vantilborg2022activity,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exposing the Limitations of Molecular Machine Learning with Activity Cliffs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{23}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5938--5951}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01073}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>ZINC-22: A Multi-Billion Scale Database for Ligand Discovery</title><link>https://hunterheidenreich.com/notes/computational-chemistry/datasets/zinc-22/</link><pubDate>Sat, 27 Sep 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/datasets/zinc-22/</guid><description>The ZINC-22 dataset provides over 37 billion make-on-demand molecules enabling virtual screening and modern drug discovery.</description><content:encoded><![CDATA[<h2 id="key-contribution-scaling-make-on-demand-libraries">Key Contribution: Scaling Make-on-Demand Libraries</h2>
<p>ZINC-22 addresses the critical infrastructure challenges of managing multi-billion-scale libraries of make-on-demand chemical compounds through a federated database architecture, the CartBlanche web interface, and cloud distribution systems that enable modern virtual screening.</p>
<h2 id="overview">Overview</h2>
<p>ZINC-22 is a multi-billion scale public database of commercially available chemical compounds designed for virtual screening. It contains over 37 billion make-on-demand molecules, federated across many independent databases to stay within per-instance indexing limits. For structural biology pipelines, it provides 4.5 billion ready-to-dock 3D conformations alongside pre-calculated pH-specific protonation states, tautomers, and AMSOL partial charges.</p>
<h2 id="dataset-examples">Dataset Examples</h2>
<figure class="post-figure center ">
    <img src="/img/zinc-22-sample.webp"
         alt="ZINC-22&#39;s 2D Tranche Browser"
         title="ZINC-22&#39;s 2D Tranche Browser"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">ZINC-22&rsquo;s 2D Tranche Browser showing the organization of 37.2 billion molecules by physicochemical properties</figcaption>
    
</figure>

<h2 id="dataset-subsets">Dataset Subsets</h2>
<table>
  <thead>
      <tr>
          <th>Subset</th>
          <th>Count</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>2D Database</strong></td>
          <td>37B+</td>
          <td>Complete 2D chemical structures from make-on-demand catalogs (Enamine REAL, Enamine REAL Space, WuXi GalaXi, Mcule Ultimate)</td>
      </tr>
      <tr>
          <td><strong>3D Database</strong></td>
          <td>4.5B+</td>
          <td>Ready-to-dock 3D conformations with pre-calculated charges and solvation energies</td>
      </tr>
      <tr>
          <td><strong>Custom Tranches</strong></td>
          <td>Variable</td>
          <td>User-selected molecular subsets via Tranche Browser (e.g., lead-like, fragment-like)</td>
      </tr>
  </tbody>
</table>
<h2 id="use-cases">Use Cases</h2>
<p>ZINC-22 is designed for ultra-large virtual screening (ULVS), analog searching, and molecular docking campaigns. The Tranche Browser enables targeted subset selection (e.g., lead-like, fragment-like) for screening, and the CartBlanche interface supports both interactive and programmatic access to the database. The authors note that as the database grows, docking can identify better-fitting molecules.</p>
<h2 id="related-datasets">Related Datasets</h2>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Relationship</th>
          <th>Link</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>ZINC-20</strong></td>
          <td>Predecessor</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>Enamine REAL</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
      <tr>
          <td><strong>WuXi GalaXi</strong></td>
          <td>Source catalog</td>
          <td></td>
      </tr>
  </tbody>
</table>
<h2 id="strengths">Strengths</h2>
<ul>
<li><strong>Massive scale</strong>: 37+ billion purchasable compounds from major vendors (Enamine, WuXi, Mcule)</li>
<li><strong>Federated architecture</strong>: Supports asynchronous building and horizontal scaling to trillion-molecule growth</li>
<li><strong>Platform access</strong>: CartBlanche GUI provides a shopping cart metaphor for compound acquisition</li>
<li><strong>Privacy protection</strong>: Dual public/private server clusters protect patentability of undisclosed catalogs</li>
<li><strong>Chemical diversity</strong>: Linear growth (1 new scaffold per 10 molecules added), with 96.3M+ unique Bemis-Murcko scaffolds</li>
<li><strong>Ready-to-dock</strong>: 3D models include pre-calculated charges, protonation states, and solvation energies</li>
<li><strong>Cloud distribution</strong>: Available via AWS Open Data, Oracle OCI, and UCSF servers</li>
<li><strong>Scale-aware search</strong>: SmallWorld (similarity) and Arthor (substructure) tools partitioned to address specific constraints of billion-scale queries</li>
<li><strong>Organized access</strong>: Tranche system enables targeted selection of chemical space</li>
<li><strong>Open access</strong>: Entire database freely available to academic and commercial users</li>
</ul>
<h2 id="limitations">Limitations</h2>
<ul>
<li><strong>Data Transfer Bottlenecks</strong>: Distributing 4.5 billion 3D conformations in a standard rigid format (like db2 flexibase) requires roughly 1 petabyte of storage. Transferring this takes months over standard gigabit connections, effectively mandating cloud-based compilation and rendering local copies impractical.</li>
<li><strong>Search Result Caps</strong>: Interactive Arthor searches are capped at 20,000 molecules to maintain a reliable public service. Users needing more results can use the asynchronous Arthor search tool via TLDR, which sends results by email.</li>
<li><strong>Enumeration Ceiling</strong>: Scaling relies entirely on PostgreSQL sharding. To continue using rigid docking tools, the database must fully enumerate structural states. The authors acknowledge that hardware limitations will likely cap full database enumeration well before the 10-trillion molecule mark, forcing future pipelines to accommodate unenumerated combinatorial fragment spaces.</li>
<li><strong>Download Workflow</strong>: Individual 3D molecule downloads are unavailable directly; researchers must rebuild them via the TLDR tool.</li>
<li><strong>Vendor Updates</strong>: There is difficulty removing discontinued vendor molecules due to the federated structure.</li>
</ul>
<h2 id="technical-notes">Technical Notes</h2>
<h3 id="hardware--software">Hardware &amp; Software</h3>
<p><strong>Compute infrastructure</strong>:</p>
<ul>
<li>1,700 cores across 14 computers for parallel processing</li>
<li>174 independent PostgreSQL 12.0 databases (110 &lsquo;Sn&rsquo; for ZINC-ID, 64 &lsquo;Sb&rsquo; for Supplier Codes)</li>
<li>Distributed across Amazon AWS, Oracle OCI, and UCSF servers</li>
</ul>
<p><strong>Software stack</strong>:</p>
<ul>
<li>PostgreSQL 12.2</li>
<li>Python 3.6.8</li>
<li>RDKit 2020.03</li>
<li>Celery task queue with Redis for background processing</li>
<li>All code available on GitHub: docking-org/zinc22-2d, zinc22-3d</li>
</ul>
<h3 id="data-organization--access">Data Organization &amp; Access</h3>
<p><strong>Tranche system</strong>: Molecules organized into &ldquo;Tranches&rdquo; based on 4 dimensions:</p>
<ol>
<li>Heavy Atom Count</li>
<li>Lipophilicity (LogP)</li>
<li>Charge</li>
<li>File Format</li>
</ol>
<p>This enables downloading specific chemical neighborhoods (e.g., neutral lead-like molecules) without accessing the entire database.</p>
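<p>As an illustration of the idea (the bin edges and key format here are hypothetical, not ZINC-22&rsquo;s actual tranche naming scheme), a molecule can be routed to a tranche by discretizing these properties:</p>

```python
def tranche_key(heavy_atoms, logp, charge):
    """Map a molecule's properties to a coarse tranche identifier.

    Bin widths are illustrative: one bin per heavy-atom count (capped)
    and 1-log-unit logP bins; charge is kept as-is.
    """
    hac_bin = min(heavy_atoms, 40)
    logp_bin = int(logp // 1.0)
    return (hac_bin, logp_bin, charge)
```

<p>A neutral molecule with 24 heavy atoms and logP 2.7 lands in tranche <code>(24, 2, 0)</code>, so a download can target just that chemical neighborhood.</p>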
<p><strong>Search infrastructure</strong>:
Searching at the billion-molecule scale exceeds what fits in rapid-access memory, so ZINC-22 splits retrieval between two distinct algorithms:</p>
<ul>
<li>
<p><strong>SmallWorld</strong>: Handles whole-molecule similarity using Graph Edit Distance (GED). GED defines the minimum cost of operations (node/edge insertions, deletions, or substitutions) required to transform graph $G_1$ into graph $G_2$:</p>
<p>$$
\text{GED}(G_1, G_2) = \min_{(e_1, \ldots, e_k) \in \mathcal{P}(G_1, G_2)} \sum_{i=1}^k c(e_i)
$$</p>
<p>Because SmallWorld searches pre-calculated anonymous graphs, it evaluates close neighbors in near $\mathcal{O}(1)$ time and scales sub-linearly, though it struggles with highly distant structural matches.</p>
</li>
<li>
<p><strong>Arthor</strong>: Provides exact substructure and pattern matching. It scales linearly $\mathcal{O}(N)$ with database size and successfully finds distant hits (e.g., PAINS filters), but performance heavily degrades if the index exceeds available RAM.</p>
</li>
<li>
<p><strong>CartBlanche</strong>: Web interface wrapping these search tools with shopping cart functionality.</p>
</li>
</ul>
<h3 id="3d-generation-pipeline">3D Generation Pipeline</h3>
<p>The 3D database construction pipeline involves multiple specialized tools:</p>
<ol>
<li><strong>ChemAxon JChem</strong>: Protonation state and tautomer generation at physiological pH</li>
<li><strong>Corina</strong>: Initial 3D structure generation</li>
<li><strong>Omega</strong>: Conformation sampling</li>
<li><strong>AMSOL 7.1</strong>: Calculation of atomic partial charges and desolvation energies</li>
<li><strong>Strain calculation</strong>: Relative energies of conformations</li>
</ol>
<p>At sustained throughput, the pipeline builds approximately 11 million molecules per day, each with hundreds of pre-calculated conformations.</p>
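<p>Back-of-the-envelope, that throughput implies roughly $4.5 \times 10^9 / 1.1 \times 10^7 \approx 409$ days of sustained compute to build the current 3D subset; a sketch:</p>

```python
def build_days(n_molecules, per_day=11_000_000):
    # Days of sustained pipeline throughput needed for n_molecules,
    # at the ~11M molecules/day rate reported in the paper
    return n_molecules / per_day
```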
<h3 id="chemical-diversity-analysis">Chemical Diversity Analysis</h3>
<p>A core debate in billion-scale library generation involves whether continuous enumeration merely yields repetitive derivatives. Analysis of Bemis-Murcko (BM) scaffolds demonstrates that chemical diversity in ZINC-22 continues to grow, but scales sub-linearly based on a power law. Specifically, the authors observe a $\log$ increase in BM scaffolds for every two $\log$ increase in database size:</p>
<p>$$
\log(\text{Scaffolds}_{BM}) \propto 0.5 \log(\text{Molecules})
$$</p>
<p>This suggests that while diversity does not saturate, it grows proportionally to the square root of the library size ($\mathcal{O}(\sqrt{N})$). The majority of this scaffold novelty stems from compounds with the highest heavy atom counts (HAC 24-25), which contribute roughly twice as many unique core structures as the combined HAC 06-23 subset.</p>
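<p>Under this power law, growing the library 100-fold yields only about a 10-fold increase in unique scaffolds; a minimal sketch of the scaling relationship:</p>

```python
import math

def scaffold_growth(growth_factor):
    # If scaffolds ~ sqrt(molecules), a growth_factor increase in
    # library size multiplies the scaffold count by sqrt(growth_factor)
    return math.sqrt(growth_factor)
```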
<h3 id="vendor-integration">Vendor Integration</h3>
<p>ZINC-22 is built from five source catalogs with the following approximate sizes:</p>
<ul>
<li><strong>Enamine REAL Database</strong>: 5 billion compounds</li>
<li><strong>Enamine REAL Space</strong>: 29 billion compounds</li>
<li><strong>WuXi GalaXi</strong>: 2.5 billion compounds</li>
<li><strong>Mcule Ultimate</strong>: 128 million compounds</li>
<li><strong>ZINC20 in-stock</strong>: 4 million compounds (incorporated as layer &ldquo;g&rdquo;)</li>
</ul>
<p>This focus on purchasable, make-on-demand molecules distinguishes ZINC-22 from theoretical chemical space databases. ZINC20 continues to be maintained separately for smaller catalogs and in-stock compounds.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://cartblanche22.docking.org/">CartBlanche web interface</a></td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Web GUI for searching and downloading ZINC-22</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></td>
          <td>Code</td>
          <td>BSD-3-Clause</td>
          <td>2D curation and loading pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>3D building pipeline</td>
      </tr>
      <tr>
          <td><a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>CartBlanche22 web application</td>
      </tr>
      <tr>
          <td>AWS Open Data / Oracle OCI</td>
          <td>Dataset</td>
          <td>Free access</td>
          <td>Cloud-hosted 3D database mirrors</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Data Availability</strong>: The compiled database is openly accessible and searchable through the <a href="https://cartblanche22.docking.org/">CartBlanche web interface</a>. Subsets can be downloaded, and programmatic access is provided via curl, wget, and Globus.</li>
<li><strong>Code &amp; Algorithms</strong>: The source code for database construction, parallel processing, and querying is open-source.
<ul>
<li>2D Pipeline: <a href="https://github.com/docking-org/zinc22-2d">docking-org/zinc22-2d</a></li>
<li>3D Pipeline: <a href="https://github.com/docking-org/zinc22-3d">docking-org/zinc22-3d</a></li>
<li>CartBlanche: <a href="https://github.com/docking-org/cartblanche22">docking-org/cartblanche22</a></li>
<li>TLDR modules: docking-org/TLDR and docking-org/tldr-modules (repositories no longer available)</li>
</ul>
</li>
<li><strong>Software Dependencies</strong>: While the orchestration code is public, the 3D structure generation relies on commercial software that requires separate licenses (CORINA, OpenEye OMEGA, ChemAxon JChem). This limits end-to-end reproducibility for researchers without access to these tools.</li>
<li><strong>Hardware Limitations</strong>: Recreating the entire 37+ billion molecule database from raw vendor catalogs requires approximately 1,700 CPU cores and petabytes of data transfer, restricting full recreation to large institutional clusters or substantial cloud compute budgets.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p>Tingle, B. I., Tang, K. G., Castanon, M., Gutierrez, J. J., Khurelbaatar, M., Dandarchuluun, C., Moroz, Y. S., and Irwin, J. J. (2023). ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. <em>Journal of Chemical Information and Modeling</em>, 63(4), 1166&ndash;1176. <a href="https://doi.org/10.1021/acs.jcim.2c01253">https://doi.org/10.1021/acs.jcim.2c01253</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{Tingle_2023,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{ZINC-22: A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">ISSN</span>=<span style="color:#e6db74">{1549-960X}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{http://dx.doi.org/10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">DOI</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01253}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society (ACS)}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tingle, Benjamin I. and Tang, Khanh G. and Castanon, Mar and Gutierrez, John J. and Khurelbaatar, Munkhzul and Dandarchuluun, Chinzorig and Moroz, Yurii S. and Irwin, John J.}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">month</span>=<span style="color:#e6db74">{Feb}</span>,
</span></span><span style="display:flex;"><span>    <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1166--1176}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>