<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Evaluation, Benchmarks &amp; Surveys on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/</link><description>Recent content in Evaluation, Benchmarks &amp; Surveys on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/index.xml" rel="self" type="application/rss+xml"/><item><title>RNNs vs Transformers for Molecular Generation Tasks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-language-models-rnns-or-transformer/</guid><description>Empirical comparison of RNN and Transformer architectures for molecular generation using SMILES and SELFIES across three generative tasks.</description><content:encoded><![CDATA[<h2 id="an-empirical-comparison-of-sequence-architectures-for-molecular-generation">An Empirical Comparison of Sequence Architectures for Molecular Generation</h2>
<p>This is an <strong>Empirical</strong> paper that systematically compares two dominant sequence modeling architectures, recurrent neural networks (RNNs) and the Transformer, for chemical language modeling. The primary contribution is a controlled experimental comparison across three generative tasks of increasing complexity, combined with an evaluation of two molecular string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>). The paper does not propose a new method; instead, it provides practical guidance on when each architecture is more appropriate for molecular generation.</p>
<h2 id="why-compare-rnns-and-transformers-for-molecular-design">Why Compare RNNs and Transformers for Molecular Design?</h2>
<p>Exploring unknown molecular space and designing molecules with target properties is a central goal in computational drug design. Language models trained on molecular string representations (SMILES, SELFIES) have shown the capacity to learn complex molecular distributions. RNN-based models, including LSTM and GRU variants, were the first widely adopted architectures for this task. Models like <a href="/notes/chemistry/molecular-design/generation/autoregressive/lstm-drug-like-molecule-generation/">CharRNN</a>, ReLeaSE, and conditional RNNs demonstrated success in generating focused molecular libraries. More recently, self-attention-based Transformer models (Mol-GPT, LigGPT) have gained popularity due to their parallelizability and ability to capture long-range dependencies.</p>
<p>Despite the widespread adoption of Transformers across NLP, it was not clear whether they uniformly outperform RNNs for molecular generation. Prior work by Dollar et al. showed that RNN-based models achieved higher validity than Transformer-based models in some settings. Flam-Shepherd et al. demonstrated that RNN language models could learn complex molecular distributions across challenging generative tasks. This paper extends that comparison by adding the Transformer architecture to the same set of challenging tasks and evaluating both SMILES and SELFIES representations.</p>
<h2 id="experimental-design-three-tasks-two-architectures-two-representations">Experimental Design: Three Tasks, Two Architectures, Two Representations</h2>
<p>The core experimental design uses a 2x2 setup: two architectures (RNN and Transformer) crossed with two molecular representations (SMILES and SELFIES), yielding four model variants: SM-RNN, SF-RNN, SM-Transformer, and SF-Transformer.</p>
<h3 id="three-generative-tasks">Three generative tasks</h3>
<p>The three tasks, drawn from <a href="/notes/chemistry/molecular-design/property-prediction/lm-complex-molecular-distributions/">Flam-Shepherd et al.</a>, are designed with increasing complexity:</p>
<ol>
<li>
<p><strong>Penalized LogP task</strong>: Generate molecules with high penalized LogP scores (LogP minus synthetic accessibility and long-cycle penalties). The dataset is built from ZINC15 molecules with penalized LogP &gt; 4.0. Molecule sequences are relatively short (50-75 tokens).</p>
</li>
<li>
<p><strong>Multidistribution task</strong>: Learn a multimodal molecular weight distribution constructed from four distinct subsets: GDB13 (MW &lt;= 185), ZINC (185 &lt;= MW &lt;= 425), Harvard Clean Energy Project (460 &lt;= MW &lt;= 600), and POLYMERS (MW &gt; 600). This tests the ability to capture multiple modes simultaneously.</p>
</li>
<li>
<p><strong>Large-scale task</strong>: Generate large molecules from PubChem with more than 100 heavy atoms and MW ranging from 1250 to 5000. This tests long-sequence generation capability.</p>
</li>
</ol>
<h3 id="model-configuration">Model configuration</h3>
<p>Models are compared at matched parameter counts (roughly 5.2M to 36.4M parameters). Hyperparameters are tuned by random search over learning rate [0.0001, 0.001], hidden units (500-1000 for RNNs, 376-776 for Transformers), number of layers [3, 5], and dropout [0.0, 0.5]. A regex-based tokenizer replaces character-by-character tokenization, reducing sequence lengths for the largest molecules from roughly 10,000 tokens to under 3,000.</p>
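<p>The exact pattern the authors use is not reproduced in this summary, but regex-based SMILES tokenizers commonly follow the shape below; the pattern here is an illustrative approximation, not the paper&rsquo;s tokenizer:</p>

```python
import re

# Illustrative regex-based SMILES tokenizer (an approximation; the paper's
# exact pattern is not given in this summary).
SMILES_TOKENS = re.compile(
    r"\[[^\]]+\]"            # bracket atoms, e.g. [C@H], [nH], [O-]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # ring-closure labels above 9
    r"|[BCNOSPFI]"           # aliphatic organic-subset atoms
    r"|[bcnosp]"             # aromatic atoms
    r"|[()=#\-+/\\.:~*\d]"   # bonds, branches, ring closures, dots
)

def tokenize(smiles: str) -> list[str]:
    tokens = SMILES_TOKENS.findall(smiles)
    # A token-level vocabulary must cover the string exactly.
    assert "".join(tokens) == smiles, f"unrecognized characters in {smiles!r}"
    return tokens

print(tokenize("C[C@H](N)C(=O)O"))  # alanine: 11 tokens for a 15-character string
```

<p>The length reduction comes from multi-character units, such as bracket atoms and two-letter elements, collapsing to single tokens.</p>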
<h3 id="evaluation-metrics">Evaluation metrics</h3>
<p>The evaluation covers multiple dimensions:</p>
<ul>
<li><strong>Standard metrics</strong>: validity, uniqueness, novelty</li>
<li><strong>Molecular properties</strong>: <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, LogP, SA, QED, Bertz complexity (BCT), natural product likeness (NP), molecular weight (MW)</li>
<li><strong>Wasserstein distance</strong>: measures distributional similarity between generated and training molecules for each property</li>
<li><strong>Tanimoto similarity</strong>: structural and scaffold similarity between generated and training molecules</li>
<li><strong>Token length (TL)</strong>: comparison of generated vs. training sequence lengths</li>
</ul>
<p>For each task, 10,000 molecules are generated and evaluated.</p>
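<p>For intuition, the per-property Wasserstein metric can be sketched in a few lines; the molecular-weight samples below are synthetic stand-ins, not the paper&rsquo;s data:</p>

```python
import numpy as np

def wasserstein_1d(u, v) -> float:
    """1-D Wasserstein (earth mover's) distance between two equal-size
    empirical samples: the mean absolute difference of sorted values."""
    u, v = np.sort(np.asarray(u)), np.sort(np.asarray(v))
    assert len(u) == len(v), "equal sample sizes assumed in this sketch"
    return float(np.mean(np.abs(u - v)))

rng = np.random.default_rng(0)
# Stand-in molecular weights; in practice these would be RDKit descriptors
# computed on the 10K generated molecules and on the training set.
train_mw = rng.normal(350.0, 40.0, size=10_000)
gen_mw = rng.normal(360.0, 45.0, size=10_000)

w = wasserstein_1d(gen_mw, train_mw)
print(f"W1(MW_gen, MW_train) = {w:.2f} Da")
```

<p>Because the distance is expressed in each property&rsquo;s own units, values are comparable across models for the same property but not across properties.</p>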
<h2 id="key-results-across-tasks">Key Results Across Tasks</h2>
<h3 id="penalized-logp-task">Penalized LogP task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.56</td>
          <td>0.12</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>16.61</td>
          <td>0.09</td>
          <td>5.90</td>
          <td>0.43</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.63</td>
          <td>0.25</td>
          <td>0.42</td>
          <td>0.02</td>
          <td>36.43</td>
          <td>0.23</td>
          <td>2.35</td>
          <td>0.40</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.83</td>
          <td>0.18</td>
          <td>0.02</td>
          <td>0.01</td>
          <td>23.77</td>
          <td>0.09</td>
          <td>7.99</td>
          <td>0.84</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.97</td>
          <td>0.22</td>
          <td>0.47</td>
          <td>0.02</td>
          <td>44.43</td>
          <td>0.28</td>
          <td>5.04</td>
          <td>0.53</td>
      </tr>
  </tbody>
</table>
<p>RNN-based models achieve smaller Wasserstein distances across most properties. The authors attribute this to LogP being computed as a sum of atomic contributions (a local property), which aligns with RNNs&rsquo; strength in capturing local structural features. RNNs also generated ring counts closer to the training distribution (4.10 for SM-RNN vs. 4.04 for SM-Transformer, with training data at 4.21). The Transformer performed better on global structural similarity (higher Tanimoto similarity to training data).</p>
<h3 id="multidistribution-task">Multidistribution task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.16</td>
          <td>0.07</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>18.34</td>
          <td>0.02</td>
          <td>7.07</td>
          <td>0.81</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.46</td>
          <td>0.38</td>
          <td>0.55</td>
          <td>0.03</td>
          <td>110.72</td>
          <td>0.24</td>
          <td>10.00</td>
          <td>1.58</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.16</td>
          <td>0.16</td>
          <td>0.03</td>
          <td>0.01</td>
          <td>39.94</td>
          <td>0.02</td>
          <td>10.03</td>
          <td>1.28</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.73</td>
          <td>0.37</td>
          <td>0.63</td>
          <td>0.04</td>
          <td>107.46</td>
          <td>0.30</td>
          <td>17.57</td>
          <td>2.40</td>
      </tr>
  </tbody>
</table>
<p>Both SMILES-based models captured all four modes of the MW distribution well. While RNNs had smaller overall Wasserstein distances, the Transformer fitted the higher-MW modes better. This aligns with the observation that longer molecular sequences (which correlate with higher MW) favor the Transformer&rsquo;s global attention mechanism over the RNN&rsquo;s sequential processing.</p>
<h3 id="large-scale-task">Large-scale task</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>FCD</th>
          <th>LogP</th>
          <th>SA</th>
          <th>QED</th>
          <th>BCT</th>
          <th>NP</th>
          <th>MW</th>
          <th>TL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SM-RNN</td>
          <td>0.46</td>
          <td>1.89</td>
          <td>0.20</td>
          <td>0.01</td>
          <td>307.09</td>
          <td>0.03</td>
          <td>105.29</td>
          <td>12.05</td>
      </tr>
      <tr>
          <td>SF-RNN</td>
          <td>1.65</td>
          <td>1.78</td>
          <td>0.43</td>
          <td>0.01</td>
          <td>456.98</td>
          <td>0.14</td>
          <td>100.79</td>
          <td>15.26</td>
      </tr>
      <tr>
          <td>SM-Transformer</td>
          <td>0.36</td>
          <td>1.64</td>
          <td>0.07</td>
          <td>0.01</td>
          <td>172.93</td>
          <td>0.02</td>
          <td>59.04</td>
          <td>7.41</td>
      </tr>
      <tr>
          <td>SF-Transformer</td>
          <td>1.91</td>
          <td>2.82</td>
          <td>0.47</td>
          <td>0.01</td>
          <td>464.75</td>
          <td>0.18</td>
          <td>92.91</td>
          <td>11.57</td>
      </tr>
  </tbody>
</table>
<p>The Transformer demonstrates a clear advantage on large molecules. SM-Transformer achieves substantially lower Wasserstein distances than SM-RNN across nearly all properties, with particularly large improvements in BCT (172.93 vs. 307.09) and MW (59.04 vs. 105.29). The Transformer also produces better Tanimoto similarity scores and more accurate token length distributions.</p>
<h3 id="standard-metrics-across-all-tasks">Standard metrics across all tasks</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>SM-RNN</th>
          <th>SF-RNN</th>
          <th>SM-Transformer</th>
          <th>SF-Transformer</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>LogP</td>
          <td>Valid</td>
          <td>0.90</td>
          <td>1.00</td>
          <td>0.89</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Uniqueness</td>
          <td>0.98</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>LogP</td>
          <td>Novelty</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.71</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Valid</td>
          <td>0.95</td>
          <td>1.00</td>
          <td>0.97</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Uniqueness</td>
          <td>0.96</td>
          <td>1.00</td>
          <td>1.00</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Multi</td>
          <td>Novelty</td>
          <td>0.91</td>
          <td>0.98</td>
          <td>0.91</td>
          <td>0.98</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Valid</td>
          <td>0.84</td>
          <td>1.00</td>
          <td>0.88</td>
          <td>1.00</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Uniqueness</td>
          <td>0.99</td>
          <td>0.99</td>
          <td>0.98</td>
          <td>0.99</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>Novelty</td>
          <td>0.85</td>
          <td>0.92</td>
          <td>0.86</td>
          <td>0.94</td>
      </tr>
  </tbody>
</table>
<p>SELFIES achieves 100% validity across all tasks by construction, while SMILES validity drops for large molecules. The Transformer achieves slightly higher validity than the RNN for SMILES-based models, particularly on the large-scale task (0.88 vs. 0.84).</p>
<h2 id="conclusions-and-practical-guidelines">Conclusions and Practical Guidelines</h2>
<p>The central finding is that neither architecture universally dominates. The choice between RNNs and Transformers should depend on the characteristics of the molecular data:</p>
<ul>
<li>
<p><strong>RNNs are preferred</strong> when molecular properties depend on local structural features (e.g., LogP, ring counts) and when sequences are relatively short. They better capture local fragment distributions.</p>
</li>
<li>
<p><strong>Transformers are preferred</strong> when dealing with large molecules (high MW, long sequences), where global attention can capture the overall distribution more effectively. RNNs suffer from progressive information loss (forgetting) over long sequences.</p>
</li>
<li>
<p><strong>SMILES outperforms SELFIES</strong> on property distribution metrics across nearly all tasks and models. While SELFIES guarantees 100% syntactic validity, its generated molecules show worse distributional fidelity to training data. The authors argue that validity is a less important concern than property fidelity, since invalid SMILES can be filtered easily.</p>
</li>
</ul>
<p>The authors acknowledge that longer sequences remain challenging for both architectures. For Transformers, the quadratic growth of the attention matrix limits scalability. For RNNs, the vanishing gradient problem limits effective context length.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Task 1</td>
          <td>ZINC15 (penalized LogP &gt; 4.0)</td>
          <td>Not specified</td>
          <td>High penalized LogP molecules</td>
      </tr>
      <tr>
          <td>Task 2</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> + ZINC + CEP + POLYMERS</td>
          <td>~200K</td>
          <td>Multimodal MW distribution</td>
      </tr>
      <tr>
          <td>Task 3</td>
          <td>PubChem (&gt;100 heavy atoms)</td>
          <td>Not specified</td>
          <td>MW range 1250-5000</td>
      </tr>
  </tbody>
</table>
<p>Data processing code available at <a href="https://github.com/danielflamshep/genmoltasks">https://github.com/danielflamshep/genmoltasks</a> (from the original Flam-Shepherd et al. study).</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: Regex-based tokenizer (not character-by-character)</li>
<li><strong>Hyperparameter search</strong>: Random search over learning rate [0.0001, 0.001], hidden units, layers [3, 5], dropout [0.0, 0.5]</li>
<li><strong>Selection</strong>: Top 20% of configurations by the sum of validity, uniqueness, and novelty; final selection across all indicators</li>
<li><strong>Generation</strong>: 10K molecules per model per task</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN variants</td>
          <td>5.2M - 36.4M</td>
          <td>RNN (LSTM/GRU)</td>
      </tr>
      <tr>
          <td>Transformer variants</td>
          <td>5.3M - 36.4M</td>
          <td>Transformer decoder</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Wasserstein distance for property distributions (FCD, LogP, SA, QED, BCT, NP, MW, TL), Tanimoto similarity (molecular and scaffold), validity, uniqueness, novelty.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/viko-3/language_model">trans_language</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Transformer implementation by the authors</td>
      </tr>
      <tr>
          <td><a href="https://github.com/danielflamshep/genmoltasks">genmoltasks</a></td>
          <td>Code/Data</td>
          <td>Apache-2.0</td>
          <td>Dataset construction from Flam-Shepherd et al.</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chen, Y., Wang, Z., Zeng, X., Li, Y., Li, P., Ye, X., &amp; Sakurai, T. (2023). Molecular language models: RNNs or transformer? <em>Briefings in Functional Genomics</em>, 22(4), 392-400. <a href="https://doi.org/10.1093/bfgp/elad012">https://doi.org/10.1093/bfgp/elad012</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chen2023molecular,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular language models: RNNs or transformer?}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chen, Yangyang and Wang, Zixu and Zeng, Xiangxiang and Li, Yayang and Li, Pengyong and Ye, Xiucai and Sakurai, Tetsuya}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Functional Genomics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{392--400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bfgp/elad012}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Review: Deep Learning for Molecular Design (2019)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/</guid><description>A 2019 review surveying deep generative models for molecular design, covering RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-deep-generative-models-for-molecular-design">A Systematization of Deep Generative Models for Molecular Design</h2>
<p>This is a <strong>Systematization</strong> paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.</p>
<h2 id="the-challenge-of-navigating-vast-chemical-space">The Challenge of Navigating Vast Chemical Space</h2>
<p>The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.</p>
<p>By 2016, <a href="/notes/machine-learning/generative-models/">deep generative models</a> had shown strong results in producing original images, music, and text. The &ldquo;molecular autoencoder&rdquo; of <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016/2018)</a> first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.</p>
<h2 id="molecular-representations-and-architecture-taxonomy">Molecular Representations and Architecture Taxonomy</h2>
<p>The review&rsquo;s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The review categorizes representations into 3D and 2D graph-based schemes:</p>
<p><strong>3D representations</strong> include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.</p>
<p><strong>2D graph representations</strong> include:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings</strong>: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.</li>
<li><strong>Canonical SMILES</strong>: Unique but potentially encode grammar rules rather than chemical structure.</li>
<li><strong>Context-free grammars (CFGs)</strong>: Decompose SMILES into grammar rules to improve validity rates, though not to 100%.</li>
<li><strong>Tensor representations</strong>: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.</li>
<li><strong>Graph operations</strong>: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.</li>
</ul>
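<p>As a concrete sketch of the tensor scheme (with a toy three-type atom and bond alphabet, not the review&rsquo;s), formaldehyde&rsquo;s heavy-atom skeleton could be encoded as:</p>

```python
import numpy as np

atom_alphabet = ["C", "N", "O"]                 # |A| = 3 (toy alphabet)
bond_alphabet = ["single", "double", "triple"]  # Y = 3

atoms = ["C", "O"]            # heavy atoms of formaldehyde (H is implicit)
bonds = [(0, 1, "double")]    # the C=O bond
N = len(atoms)

# Vertex feature matrix X in R^{N x |A|}: one-hot atom types.
X = np.zeros((N, len(atom_alphabet)))
for i, a in enumerate(atoms):
    X[i, atom_alphabet.index(a)] = 1.0

# Adjacency tensor A in R^{N x N x Y}: one-hot bond types, symmetric.
A = np.zeros((N, N, len(bond_alphabet)))
for i, j, b in bonds:
    k = bond_alphabet.index(b)
    A[i, j, k] = A[j, i, k] = 1.0

print(X.shape, A.shape)  # (2, 3) (2, 2, 3)
```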
<h3 id="deep-learning-architectures">Deep Learning Architectures</h3>
<p><strong>Recurrent Neural Networks (RNNs)</strong> generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:</p>
<p>$$
L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid s_{1:t-1})
$$</p>
<p>Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$. RNNs achieved SMILES validity rates of 94-98%.</p>
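<p>Thermal rescaling is a one-line change at the sampling step; a minimal sketch with toy logits (not a trained model):</p>

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float, rng) -> int:
    """Sample a token index from thermally rescaled logits. T < 1 sharpens
    the distribution (higher validity, lower diversity); T > 1 flattens it."""
    scaled = logits / temperature
    scaled -= scaled.max()          # for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])   # toy next-token scores
cold = np.bincount([sample_token(logits, 0.2, rng) for _ in range(1000)], minlength=3)
hot = np.bincount([sample_token(logits, 5.0, rng) for _ in range(1000)], minlength=3)
print("T=0.2:", cold, " T=5.0:", hot)  # low T concentrates on the argmax
```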
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$
\mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x), p(z)]
$$</p>
<p>The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar VAEs</a> (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.</p>
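<p>For a diagonal-Gaussian encoder, the KL regularizer in the ELBO has a closed form, which a short sketch makes concrete:</p>

```python
import numpy as np

def kl_to_standard_normal(mu: np.ndarray, logvar: np.ndarray) -> float:
    """Closed-form D_KL[ N(mu, diag(exp(logvar))) || N(0, I) ],
    the regularization term of the ELBO for a diagonal-Gaussian encoder."""
    return 0.5 * float(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar))

# The KL vanishes exactly when the posterior equals the prior...
print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))       # 0.0
# ...and grows as the encoder's means move away from zero.
print(kl_to_standard_normal(0.5 * np.ones(8), np.zeros(8)))  # 1.0
```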
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> train a generator against a discriminator using the minimax objective:</p>
<p>$$
\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))]
$$</p>
<p>The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more &ldquo;balanced&rdquo; training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover&rsquo;s distance for more stable training:</p>
<p>$$
W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert
$$</p>
<p><strong>Reinforcement Learning</strong> recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:</p>
<p>$$
\nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right]
$$</p>
<p>To prevent RL fine-tuning from causing the generator to &ldquo;drift&rdquo; away from viable chemical structures, an augmented objective ties the fine-tuned model&rsquo;s likelihood to the pretrained prior&rsquo;s:</p>
<p>$$
R^{\prime}(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2}
$$</p>
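<p>A sketch of the formula above as a fine-tuning loss (the weight $\sigma = 60$ is illustrative, not a value from the review):</p>

```python
def augmented_objective(reward: float, logp_prior: float,
                        logp_current: float, sigma: float = 60.0) -> float:
    """Squared deviation between the current model's log-likelihood and the
    prior log-likelihood augmented by the task reward. Minimizing this keeps
    the fine-tuned generator close to the pretrained prior. sigma = 60 is an
    illustrative weight, not a value from the review."""
    return (sigma * reward + logp_prior - logp_current) ** 2

# Zero reward and an unchanged model give zero loss; a positive reward at
# the prior's likelihood creates pressure to move toward higher reward.
print(augmented_objective(0.0, -30.0, -30.0))  # 0.0
print(augmented_objective(0.5, -30.0, -30.0))  # (60 * 0.5)^2 = 900.0
```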
<h2 id="cataloging-45-models-and-their-design-choices">Cataloging 45 Models and Their Design Choices</h2>
<p>Rather than running new experiments, the review&rsquo;s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model&rsquo;s architecture, representation, training dataset, and dataset size. Key patterns include:</p>
<ul>
<li><strong>RNN-based models</strong> (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.</li>
<li><strong>VAE variants</strong> (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.</li>
<li><strong>GAN models</strong> (7 entries): Include <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.</li>
<li><strong>Other approaches</strong> (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.</li>
</ul>
<p>The review also catalogs 13 publicly available datasets (Table 3), ranging from QM9 (133K molecules with quantum chemical properties) to <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a> (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).</p>
<h3 id="metrics-and-reward-function-design">Metrics and Reward Function Design</h3>
<p>A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:</p>
<p><strong>Diversity</strong> using Tanimoto similarity over fingerprints:</p>
<p>$$
r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2})
$$</p>
<p><strong>Novelty</strong> measured from the overlap between the generated set $\mathcal{G}$ and a hold-out test set $\mathcal{T}$:</p>
<p>$$
r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{T}|}
$$</p>
<p><strong>Synthesizability</strong> primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.</p>
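<p>Given fingerprints as sets of on-bits, the diversity and novelty metrics are a few lines each; this sketch follows the formulas above, including the $|\mathcal{T}|$ normalization for novelty (normalization conventions for diversity vary across papers):</p>

```python
def tanimoto(fp1: set, fp2: set) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    inter = len(fp1 & fp2)
    return inter / (len(fp1) + len(fp2) - inter)

def diversity(fingerprints: list) -> float:
    """One minus the mean pairwise Tanimoto similarity over the generated set
    (here averaged over distinct pairs; conventions vary)."""
    pairs = [(i, j) for i in range(len(fingerprints))
             for j in range(i + 1, len(fingerprints))]
    mean_sim = sum(tanimoto(fingerprints[i], fingerprints[j])
                   for i, j in pairs) / len(pairs)
    return 1.0 - mean_sim

def novelty(generated: set, test: set) -> float:
    """1 - |G intersect T| / |T|, following the review's formula."""
    return 1.0 - len(generated & test) / len(test)

fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
print(round(diversity(fps), 3))
print(novelty({"c1ccccc1", "CCO"}, {"CCO", "CCC"}))  # 0.5
```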
<p>The review also discusses the <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, and DiversityNet.</p>
<h2 id="key-findings-and-future-directions">Key Findings and Future Directions</h2>
<p>The review identifies several major trends and conclusions:</p>
<p><strong>Shift from SMILES to graph-based representations.</strong> SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.</p>
<p><strong>Advantages of adversarial and RL training over MLE.</strong> The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.</p>
<p><strong>Genetic algorithms remain competitive.</strong> The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.</p>
<p><strong>Reward function design is underappreciated.</strong> Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.</p>
<p><strong>Need for standardized benchmarks.</strong> The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works; a representative subset is shown below:</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>977M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ZINC15</td>
          <td>750M+</td>
          <td>Commercially available compounds</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td><a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a></td>
          <td>50M</td>
          <td>Combinatorially generated library</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>ChEMBL</td>
          <td>2M</td>
          <td>Curated bioactive molecules</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>QM9</td>
          <td>133,885</td>
          <td>Small organic molecules with DFT properties</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>PubChemQC</td>
          <td>3.98M</td>
          <td>PubChem compounds with DFT data</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).</p>
<h3 id="evaluation">Evaluation</h3>
<p>Key evaluation frameworks discussed:</p>
<ul>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (molecular analog of FID)</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> benchmarking platform</li>
<li><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmarking suite</li>
<li>Validity rate, uniqueness, novelty, and internal diversity metrics</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Elton, D. C., Boukouvalas, Z., Fuge, M. D., &amp; Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. <em>Molecular Systems Design &amp; Engineering</em>, 4(4), 828-849. <a href="https://doi.org/10.1039/C9ME00039A">https://doi.org/10.1039/C9ME00039A</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{elton2019deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep Learning for Molecular Design -- A Review of the State of the Art}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Molecular Systems Design \&amp; Engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{828--849}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/C9ME00039A}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Re-evaluating Sample Efficiency in Molecule Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/</guid><description>Thomas et al. re-evaluate generative model benchmarks for de novo drug design, adding property filters and diversity metrics that re-rank model performance.</description><content:encoded><![CDATA[<h2 id="an-empirical-re-evaluation-of-generative-model-benchmarks">An Empirical Re-evaluation of Generative Model Benchmarks</h2>
<p>This is an <strong>Empirical</strong> paper. The primary contribution is a critical reassessment of the <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Practical Molecular Optimization (PMO)</a> benchmark for de novo molecule generation. Rather than proposing a new generative model, the authors modify existing benchmark metrics to account for chemical desirability (molecular weight, LogP, topological novelty) and molecular diversity. They then re-evaluate all 25 generative models from the original PMO benchmark plus the recently proposed <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb (AHC)</a> method.</p>
<h2 id="sample-efficiency-and-chemical-quality-in-drug-design">Sample Efficiency and Chemical Quality in Drug Design</h2>
<p>Deep generative models for de novo molecule generation often require large numbers of oracle evaluations (up to $10^5$ samples) to optimize toward a target objective. This is a practical limitation when using computationally expensive scoring functions like molecular docking. The <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO benchmark</a> by Gao et al. addressed this by reformulating performance as maximizing an objective within a fixed budget of 10,000 oracle calls, finding <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> to be the most sample-efficient model across 23 tasks.</p>
<p>However, the authors identify a key limitation: the PMO benchmark measures only sample efficiency without considering the chemical quality of proposed molecules. Investigating the top-performing REINVENT model on the <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a> task, they find that 4 of 5 replicate runs produce molecules with molecular weight and LogP distributions far outside the training data (ZINC250k). The resulting molecules contain large structures with repeating substructures that are undesirable from a medicinal chemistry perspective. This disconnect between benchmark performance and practical utility motivates the modified evaluation metrics.</p>
<h2 id="modified-metrics-property-filters-and-diversity-requirements">Modified Metrics: Property Filters and Diversity Requirements</h2>
<p>The core innovation is the introduction of three modified AUC Top-10 metrics that extend the original PMO benchmark evaluation:</p>
<p><strong>AUC Top-10 (Filtered)</strong>: Molecules are excluded if their molecular weight or LogP falls beyond 4 standard deviations from the mean of the ZINC250k pre-training dataset ($\mu \pm 4\sigma$, covering approximately 99.99% of a normal distribution). Molecules with more than 10% de novo (unobserved in ZINC250k) ECFP4 fingerprint bits are also filtered out. This ensures the generative model does not drift beyond its applicability domain.</p>
<p><strong>AUC Top-10 (Diverse)</strong>: The top 10 molecules are selected iteratively, where a molecule is only added if its Tanimoto similarity (by ECFP4 fingerprints) to any previously selected compound does not exceed 0.35. This threshold corresponds to an approximately 80% probability that more-similar molecules belong to the same bioactivity class, enforcing that distinct candidates possess different profiles.</p>
<p><strong>AUC Top-10 (Combined)</strong>: Applies both property filters and diversity filters simultaneously, providing the most stringent evaluation of practical performance.</p>
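<p>A minimal sketch of the two modifications, using hypothetical dict-based molecule records and set fingerprints (the actual evaluation uses RDKit ECFP4 fingerprints and property statistics computed from ZINC250k):</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprint bit sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def property_filter(molecules, mu, sigma, n_sigma=4.0):
    """Keep molecules whose property lies within mu +/- n_sigma * sigma."""
    lo, hi = mu - n_sigma * sigma, mu + n_sigma * sigma
    return [m for m in molecules if lo <= m["prop"] <= hi]

def diverse_top_k(molecules, k=10, max_sim=0.35):
    """Greedily take the best-scoring molecules, skipping any whose similarity
    to an already-selected molecule exceeds max_sim."""
    selected = []
    for m in sorted(molecules, key=lambda m: m["score"], reverse=True):
        if all(tanimoto(m["fp"], s["fp"]) <= max_sim for s in selected):
            selected.append(m)
        if len(selected) == k:
            break
    return selected

mols = [
    {"score": 0.9, "prop": 350, "fp": frozenset({1, 2, 3, 4})},
    {"score": 0.8, "prop": 360, "fp": frozenset({1, 2, 3, 5})},  # near-duplicate of the first
    {"score": 0.7, "prop": 900, "fp": frozenset({7, 8})},        # property outlier
    {"score": 0.6, "prop": 340, "fp": frozenset({7, 9})},
]
picks = diverse_top_k(property_filter(mols, mu=350, sigma=50), k=2)
print([m["score"] for m in picks])  # [0.9, 0.6]
```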
<h2 id="benchmark-setup-and-generative-models-evaluated">Benchmark Setup and Generative Models Evaluated</h2>
<h3 id="implementation-details">Implementation Details</h3>
<p>The authors re-implement the PMO benchmark using the original code and data (MIT license) with no changes beyond adding AHC and the new metrics. For Augmented Hill-Climb, the architecture follows REINVENT: an embedding layer of size 128 and 3 layers of Gated Recurrent Units (GRU) with size 512. The prior is trained on ZINC250k using SMILES notation with batch size 128 for 5 epochs.</p>
<p>Two AHC variants are benchmarked:</p>
<ul>
<li><strong>SMILES-AHC</strong>: Hyperparameters optimized via the standard PMO procedure, yielding batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>SMILES-AHC*</strong>: uses $\sigma = 60$, chosen based on prior knowledge that lower $\sigma$ values maintain better regularization and chemical quality</li>
</ul>
<p>Both omit diversity filters and non-unique penalization for standardized comparison, despite these being shown to improve performance in prior work.</p>
<h3 id="models-compared">Models Compared</h3>
<p>The benchmark includes 25 generative models from the original PMO paper spanning diverse architectures: REINVENT (RNN + RL), Graph GA (graph-based genetic algorithm), GP BO (Gaussian process Bayesian optimization), SMILES GA (SMILES-based genetic algorithm), SELFIES-based VAEs, and others. The 23 objective tasks derive primarily from the <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark.</p>
<h2 id="re-ranked-results-and-augmented-hill-climb-performance">Re-ranked Results and Augmented Hill-Climb Performance</h2>
<p>The modified metrics substantially re-order the ranking of generative models:</p>
<ol>
<li>
<p><strong>SMILES-AHC* achieves top performance on AUC Top-10 (Combined)</strong>, where both property filters and diversity are enforced. The use of domain-informed hyperparameter selection ($\sigma = 60$) proves critical.</p>
</li>
<li>
<p><strong>SMILES-AHC (data-driven hyperparameters) ranks first</strong> when accounting for property filters alone, diversity alone, or both combined, demonstrating that the AHC algorithm itself provides strong performance even without manual tuning.</p>
</li>
<li>
<p><strong>REINVENT retains its first-place rank under property filters alone</strong>, suggesting that the minority of compounds staying within acceptable property space still perform well. However, it drops when diversity is also required.</p>
</li>
<li>
<p><strong>Evolutionary algorithms (Graph GA, GP BO, SMILES GA) drop significantly</strong> under the new metrics. This is expected because rule-based methods are not constrained by the ZINC250k distribution and tend to propose molecules that diverge from drug-like chemical space.</p>
</li>
<li>
<p><strong>Both AHC variants excel on empirically difficult tasks</strong>, including isomer-based tasks, Zaleplon MPO, and Sitagliptin MPO, where other methods struggle.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Results are preliminary because generative models have not undergone hyperparameter optimization against the new metrics</li>
<li>Property filter thresholds are subjective, and the 10% de novo ECFP4 bit threshold was chosen by visual inspection</li>
<li>Comparing rule-based models against distribution-based models using ZINC250k similarity introduces a bias toward distribution-based approaches</li>
<li>Six objective task reference molecules sit in the lowest 0.01% of ZINC250k property space, raising questions about whether distribution-based models can reasonably optimize for these objectives</li>
<li>Property filters and diversity could alternatively be incorporated directly into the objective function as additional oracles, though this would not necessarily produce the same results</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ZINC250k</td>
          <td>~250K molecules</td>
          <td>Subset of ZINC15, provided by PMO benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">PMO</a> benchmark tasks</td>
          <td>23 objectives</td>
          <td>Derived primarily from <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Augmented Hill-Climb</strong>: RL strategy from Thomas et al. (2022), patience of 5</li>
<li><strong>Hyperparameters (SMILES-AHC)</strong>: batch size 256, $\sigma = 120$, $K = 0.25$</li>
<li><strong>Hyperparameters (SMILES-AHC*)</strong>: $\sigma = 60$ (domain-informed selection)</li>
<li><strong>Prior training</strong>: 5 epochs, batch size 128, SMILES notation</li>
<li><strong>Oracle budget</strong>: 10,000 evaluations per task</li>
<li><strong>Replicates</strong>: 5 per model per task</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Embedding (128) + 3x GRU (512), following REINVENT</li>
<li><strong>All 25 PMO benchmark models</strong> re-evaluated using original implementations</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-10 (Original)</td>
          <td>Area under curve of average top 10 molecules</td>
          <td>Standard PMO metric</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Filtered)</td>
          <td>Original with MW/LogP and ECFP4 novelty filters</td>
          <td>$\mu \pm 4\sigma$ from ZINC250k</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Diverse)</td>
          <td>Top 10 selected with Tanimoto &lt; 0.35 diversity</td>
          <td>ECFP4 fingerprints</td>
      </tr>
      <tr>
          <td>AUC Top-10 (Combined)</td>
          <td>Both filters and diversity applied</td>
          <td>Most stringent metric</td>
      </tr>
  </tbody>
</table>
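<p>Reading AUC Top-10 as the normalized area under the running top-10 average plotted against oracle calls (the PMO convention, with the curve held flat over any unused budget), a sketch:</p>

```python
import heapq

def auc_top_k(scores, budget, k=10):
    """Normalized area under the running mean of the best k oracle scores."""
    top, curve = [], []          # min-heap of the k best scores seen so far
    for s in scores[:budget]:
        heapq.heappush(top, s)
        if len(top) > k:
            heapq.heappop(top)
        curve.append(sum(top) / len(top))
    curve += [curve[-1]] * (budget - len(curve))  # hold last value if run stops early
    return sum(curve) / budget

print(auc_top_k([0.1, 0.5, 0.9], budget=4, k=2))  # (0.1 + 0.3 + 0.7 + 0.7) / 4 = 0.45
```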
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper. The benchmark uses 10,000 oracle evaluations per task with 5 replicates, which is computationally modest compared to standard generative model training.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Scoring and benchmarking framework by the first author</td>
      </tr>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Original benchmark code and data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2022). Re-evaluating sample efficiency in de novo molecule generation. <em>arXiv preprint arXiv:2212.01385</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{thomas2022reevaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Re-evaluating sample efficiency in de novo molecule generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2212.01385}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.48550/arxiv.2212.01385}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Inverse Molecular Design with ML Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/</guid><description>Review of inverse molecular design approaches including VAEs, GANs, and RL for navigating chemical space and generating novel molecules with desired properties.</description><content:encoded><![CDATA[<h2 id="a-foundational-systematization-of-inverse-molecular-design">A Foundational Systematization of Inverse Molecular Design</h2>
<p>This paper is a <strong>Systematization</strong> of the nascent field of inverse molecular design using machine learning generative models. Published in <em>Science</em> in 2018, it organizes and contextualizes the rapidly emerging body of work on using deep generative models (variational autoencoders, generative adversarial networks, and reinforcement learning) to navigate chemical space and propose novel molecules with targeted properties. Rather than introducing a new method, the paper synthesizes the conceptual framework connecting molecular representations, generative architectures, and inverse design objectives, establishing a reference point for the field at a critical early stage.</p>
<h2 id="the-challenge-of-navigating-chemical-space">The Challenge of Navigating Chemical Space</h2>
<p>The core problem is the sheer scale of chemical space. For pharmacologically relevant small molecules alone, the number of possible structures is estimated at $10^{60}$. Traditional approaches to materials discovery rely on trial and error or high-throughput virtual screening (HTVS), both of which are fundamentally limited by the need to enumerate and evaluate candidates from a predefined library.</p>
<p>The conventional materials discovery pipeline, from concept to commercial product, historically takes 15 to 20 years, involving iterative cycles of simulation, synthesis, device integration, and characterization. Inverse design offers a conceptual alternative: start from a desired functionality and search for molecular structures that satisfy it. This inverts the standard paradigm where a molecule is proposed first and its properties are computed or measured afterward.</p>
<p>The key distinction the authors draw is between discriminative and generative models. A discriminative model learns $p(y|x)$, the conditional probability of properties $y$ given a molecule $x$. A <a href="/notes/machine-learning/generative-models/">generative model</a> instead learns the joint distribution $p(x,y)$, which can be conditioned to yield either the direct design problem $p(y|x)$ or the inverse design problem $p(x|y)$.</p>
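<p>The distinction can be made concrete with a toy discrete joint distribution (invented numbers): the same table, conditioned along different axes, answers both the direct query $p(y|x)$ and the inverse query $p(x|y)$:</p>

```python
# Hypothetical joint p(x, y) over three "molecules" and one binary property.
joint = {
    ("mol_A", "soluble"): 0.30, ("mol_A", "insoluble"): 0.10,
    ("mol_B", "soluble"): 0.05, ("mol_B", "insoluble"): 0.35,
    ("mol_C", "soluble"): 0.15, ("mol_C", "insoluble"): 0.05,
}

def condition(joint, axis, value):
    """Condition on one variable: axis=0 fixes x (direct design, p(y|x));
    axis=1 fixes y (inverse design, p(x|y))."""
    kept = {k: v for k, v in joint.items() if k[axis] == value}
    z = sum(kept.values())
    return {k[1 - axis]: v / z for k, v in kept.items()}

print(condition(joint, 0, "mol_A"))    # p(y | x = mol_A)
print(condition(joint, 1, "soluble"))  # p(x | y = soluble): ranks candidate molecules
```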
<h2 id="three-pillars-vaes-gans-and-reinforcement-learning">Three Pillars: VAEs, GANs, and Reinforcement Learning</h2>
<p>The review organizes inverse molecular design approaches around three generative paradigms and the molecular representations they operate on.</p>
<h3 id="molecular-representations">Molecular Representations</h3>
<p>The paper surveys representations across three broad categories:</p>
<ul>
<li><strong>Discrete (text-based)</strong>: <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings encode molecular structure as 1D text following a grammar syntax. Their adoption has been driven by the availability of NLP deep learning tools.</li>
<li><strong>Continuous (vectors/tensors)</strong>: <a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrices</a>, bag of bonds, fingerprints, symmetry functions, and electronic density representations. These expose different physical symmetries (permutational, rotational, reflectional, translational invariance).</li>
<li><strong>Weighted graphs</strong>: Molecules as undirected graphs where atoms are nodes and bonds are edges, with vectorized features on edges and nodes (bonding type, aromaticity, charge, distance).</li>
</ul>
<p>An ideal representation for inverse design should be invertible, meaning it supports mapping back to a synthesizable molecular structure. SMILES strings and molecular graphs are invertible, while many continuous representations require lookup tables or auxiliary methods.</p>
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a> encode molecules into a continuous latent space and decode latent vectors back to molecular representations. The key insight is that by constraining the encoder to produce latent vectors following a Gaussian distribution, the model gains the ability to <a href="/posts/modern-variational-autoencoder-in-pytorch/">interpolate between molecules and sample novel structures</a>. The latent space encodes a geometry: nearby points decode to similar molecules, and gradient-based optimization over this continuous space enables direct property optimization.</p>
<p>The VAE loss function combines a reconstruction term with a KL divergence regularizer:</p>
<p>$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x) \,\|\, p(z)\big)$$</p>
<p>where $q(z|x)$ is the encoder (approximate posterior), $p(x|z)$ is the decoder, and $p(z)$ is the prior (typically Gaussian).</p>
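<p>For the usual diagonal-Gaussian posterior and standard-normal prior, the KL term has the closed form $\frac{1}{2}\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log \sigma_i^2)$; a minimal sketch:</p>

```python
import math

def kl_diag_gaussian(mu, sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

print(kl_diag_gaussian([0.0, 0.0], [1.0, 1.0]))  # 0.0: posterior equals the prior
print(kl_diag_gaussian([1.0], [1.0]))            # 0.5: penalty for shifting the mean
```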
<p>Semi-supervised variants jointly train on molecules and properties, reorganizing latent space so molecules with similar properties cluster together. <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al.</a> demonstrated local and global optimization across generated distributions using Bayesian optimization over latent space.</p>
<p>The review traces the evolution from character-level SMILES VAEs to <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">grammar-aware and syntax-directed variants</a> that improve the generation of syntactically valid structures.</p>
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p><a href="/posts/what-is-a-gan/">GANs</a> pit a generator against a discriminator in an adversarial training framework. The generator learns to produce synthetic molecules from noise, while the discriminator learns to distinguish synthetic from real molecules. Training convergence for GANs is challenging, suffering from mode collapse and generator-discriminator imbalance.</p>
<p>For molecular applications, dealing with discrete SMILES data introduces nondifferentiability, addressed through workarounds like SeqGAN&rsquo;s policy gradient approach and boundary-seeking GANs.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL treats molecule generation as a sequential decision process where an agent (the generator) takes actions (adding characters to a SMILES string) to maximize a reward (desired molecular properties). Since rewards can only be assigned after sequence completion, Monte Carlo Tree Search (MCTS) is used to simulate possible completions and weight paths based on their success.</p>
<p>Applications include generation of drug-like molecules and <a href="https://en.wikipedia.org/wiki/Retrosynthesis">retrosynthesis</a> planning. Notable examples cited include RL for optimizing putative <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> inhibitors and molecules active against <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor type 2</a>.</p>
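<p>The sequential-reward setup can be sketched with a deliberately tiny, context-free REINFORCE loop (toy vocabulary and reward, no MCTS, invented hyperparameters): the policy is a single categorical over tokens, whole sequences are sampled, and each taken action is reinforced by a baselined terminal reward:</p>

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(vocab, reward_fn, steps=2000, seq_len=5, lr=0.2, seed=0):
    """Context-free REINFORCE: one shared categorical policy over the vocabulary,
    updated after each sampled sequence using a running-mean baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(vocab)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        seq = rng.choices(range(len(vocab)), weights=probs, k=seq_len)
        reward = reward_fn([vocab[i] for i in seq])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for a in seq:                  # grad of log-softmax wrt logit j: 1{j=a} - p_j
            for j in range(len(vocab)):
                logits[j] += lr * advantage * ((1.0 if j == a else 0.0) - probs[j])
    return softmax(logits)

# Toy terminal reward: fraction of "C" tokens in the finished sequence.
probs = reinforce(["C", "O", "N"], lambda s: s.count("C") / len(s))
print(probs)  # probability mass concentrates on "C"
```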
<h3 id="hybrid-approaches">Hybrid Approaches</h3>
<p>The review highlights that these paradigms are not exclusive. Examples include druGAN (adversarial autoencoder) and <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a> (combined GAN and RL), which leverage strengths of multiple frameworks.</p>
<h2 id="survey-of-applications-and-design-paradigms">Survey of Applications and Design Paradigms</h2>
<p>Being a review paper, this work does not present new experiments but surveys existing applications across domains:</p>
<p><strong>Drug Discovery</strong>: Most generative model applications at the time of writing targeted pharmaceutical properties, including solubility, melting temperature, synthesizability, and target activity. Popova et al. optimized for JAK2 inhibitors, and Olivecrona et al. targeted dopamine receptor type 2.</p>
<p><strong>Materials Science</strong>: HTVS had been applied to organic photovoltaics (screening by frontier orbital energies and conversion efficiency), organic redox flow batteries (redox potential and solubility), organic LEDs (singlet-triplet gap), and inorganic materials via the Materials Project.</p>
<p><strong>Chemical Space Exploration</strong>: Evolution strategies had been applied to map chemical space, with structured search procedures incorporating genotype representations and mutation operations. Bayesian sampling with sequential Monte Carlo and gradient-based optimization of properties with respect to molecular systems represented alternative inverse design strategies.</p>
<p><strong>Graph-Based Generation</strong>: The paper notes the emerging extension of VAEs to molecular graphs (junction tree VAE) and message passing networks for incremental graph construction, though the graph isomorphism approximation problem remained a practical challenge.</p>
<h2 id="future-directions-and-open-challenges">Future Directions and Open Challenges</h2>
<p>The authors identify several open directions for the field:</p>
<p><strong>Closed-Loop Discovery</strong>: The ultimate goal is to concurrently propose, create, and characterize new materials with simultaneous data flow between components. At the time of writing, very few examples of successful closed-loop approaches existed.</p>
<p><strong>Active Learning</strong>: Combining inverse design with Bayesian optimization enables models that adapt as they explore chemical space, expanding in regions of high uncertainty and discovering molecular regions with desirable properties as a function of composition.</p>
<p><strong>Representation Learning</strong>: No single molecular representation works optimally for all properties. Graph and hierarchical representations were identified as areas needing further study. Representations that encode relevant physics tend to generalize better.</p>
<p><strong>Improved Architectures</strong>: Memory-augmented sequence generation models, Riemannian optimization methods exploiting latent space geometry, multi-level VAEs for structured latent spaces, and inverse RL for learning reward functions were highlighted as promising research directions.</p>
<p><strong>Integration into Education</strong>: The authors advocate for integrating ML into curricula across chemical, biochemical, medicinal, and materials sciences.</p>
<h3 id="limitations">Limitations</h3>
<p>As a review paper from 2018, this work captures the field at an early stage. Several limitations are worth noting:</p>
<ul>
<li>The survey is dominated by SMILES-based approaches, reflecting the state of the field at the time. Graph-based and 3D-aware generative models were just emerging.</li>
<li>Quantitative benchmarking of generative models was not yet standardized. The review does not provide systematic comparisons across methods.</li>
<li>The synthesis feasibility of generated molecules receives limited attention. The gap between computationally generated candidates and experimentally realizable molecules was (and remains) a significant challenge.</li>
<li>Transformer-based architectures, which would come to dominate chemical language modeling, are not discussed, as the Transformer had only been published the year prior.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a review/perspective paper, this work does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the cited primary works rather than the review itself.</p>
<h3 id="key-cited-methods-and-their-resources">Key Cited Methods and Their Resources</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Authors</th>
          <th>Type</th>
          <th>Availability</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">Automatic Chemical Design (VAE)</a></td>
          <td>Gomez-Bombarelli et al.</td>
          <td>Code + Data</td>
          <td>Published in ACS Central Science</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></td>
          <td>Kusner et al.</td>
          <td>Code</td>
          <td>arXiv:1703.01925</td>
      </tr>
      <tr>
          <td>Junction Tree VAE</td>
          <td>Jin et al.</td>
          <td>Code</td>
          <td>arXiv:1802.04364</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGANIC</a></td>
          <td>Sanchez-Lengeling et al.</td>
          <td>Code</td>
          <td>ChemRxiv preprint</td>
      </tr>
      <tr>
          <td>SeqGAN</td>
          <td>Yu et al.</td>
          <td>Code</td>
          <td>AAAI 2017</td>
      </tr>
      <tr>
          <td>Neural Message Passing</td>
          <td>Gilmer et al.</td>
          <td>Code</td>
          <td>arXiv:1704.01212</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Sánchez-Lengeling, B., &amp; Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative models for matter engineering. <em>Science</em>, 361(6400), 360-365. <a href="https://doi.org/10.1126/science.aat2663">https://doi.org/10.1126/science.aat2663</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{sanchez-lengeling2018inverse,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Inverse molecular design using machine learning: Generative models for matter engineering}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{S{\&#39;a}nchez-Lengeling, Benjamin and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{361}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6400}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{360--365}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Association for the Advancement of Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1126/science.aat2663}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Generative AI Survey for De Novo Molecule and Protein Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/generative-ai-drug-design-survey/</guid><description>Comprehensive survey of generative AI for de novo drug design covering molecule and protein generation with VAEs, GANs, diffusion, and flow models.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-generative-ai-for-drug-design">A Systematization of Generative AI for Drug Design</h2>
<p>This is a <strong>Systematization</strong> paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.</p>
<p>The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.</p>
<h2 id="the-challenge-of-navigating-de-novo-drug-design">The Challenge of Navigating De Novo Drug Design</h2>
<p>The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery pipelines are slow and expensive, with preclinical development costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.</p>
<p>AI-driven generative methods have gained traction in recent years: AI-focused biotech companies have advanced over 150 small-molecule drugs into the discovery phase and 15 into clinical trials, and the number of AI-fueled drug design programs has grown by almost 40% each year.</p>
<p>The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.</p>
<h2 id="unified-taxonomy-two-themes-seven-subtasks">Unified Taxonomy: Two Themes, Seven Subtasks</h2>
<p>The survey&rsquo;s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.</p>
<h3 id="generative-model-architectures">Generative Model Architectures</h3>
<p>The survey covers four main generative model families used across both molecule and protein generation:</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:</p>
<p>$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$</p>
<p>where the KL loss is:</p>
<p>$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log\left(\sigma_k^{(i)}\right)^2 - \left(\mu_k^{(i)}\right)^2 - \left(\sigma_k^{(i)}\right)^2\right)$$</p>
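<p>As a minimal, framework-free sketch (my own illustration, not code from the survey), the &beta;-weighted objective can be computed from encoder outputs parameterized as per-dimension means and log-variances:</p>

```python
import math

def kl_loss(mu, log_var):
    # KL divergence between N(mu, sigma^2) and the standard normal prior,
    # summed over latent dimensions, using the log-variance parameterization
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def vae_loss(recon_loss, mu, log_var, beta=1.0):
    # Total beta-VAE objective: reconstruction term plus beta-weighted KL term
    return recon_loss + beta * kl_loss(mu, log_var)
```

<p>Parameterizing the encoder output as $\log \sigma^2$ rather than $\sigma$ is a common numerical-stability choice, not something specific to the survey.</p>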
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:</p>
<p>$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$</p>
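<p>On a single batch of discriminator outputs, the two sides of this game can be sketched as follows (an illustration, not code from any cited model; note the generator term uses the widely adopted non-saturating variant $-\log D(G(z))$ rather than the literal $\log(1 - D(G(z)))$ minimax term):</p>

```python
import math

def gan_batch_losses(d_real, d_fake):
    # d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    # d_fake: discriminator outputs D(G(z)) on generated samples
    # Discriminator loss: negated average of log D(x) + log(1 - D(G(z)))
    d_loss = -sum(math.log(r) + math.log(1.0 - f)
                  for r, f in zip(d_real, d_fake)) / len(d_real)
    # Generator loss: non-saturating variant -log D(G(z))
    g_loss = -sum(math.log(f) for f in d_fake) / len(d_fake)
    return d_loss, g_loss
```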
<p><strong>Flow-Based Models</strong> generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:</p>
<p>$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|$$</p>
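<p>A one-dimensional affine flow makes the formula concrete (an illustrative sketch, not a model from the survey): inverting $x = f(z) = s z + b$ gives $z = (x - b)/s$, and the Jacobian term is $\log|s|$, subtracted because $f$ maps latent to data space:</p>

```python
import math

def affine_flow_logp(x, scale, shift):
    # Invert the flow: z = f^{-1}(x) = (x - shift) / scale
    z = (x - shift) / scale
    # Standard-normal base density log p_0(z)
    log_base = -0.5 * (z ** 2 + math.log(2 * math.pi))
    # Change of variables: subtract log|det df/dz| = log|scale|
    return log_base - math.log(abs(scale))
```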
<p><strong>Diffusion Models</strong> gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:</p>
<p>$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$</p>
<p>The training loss minimizes the difference between the true noise and the predicted noise:</p>
<p>$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$</p>
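<p>A scalar toy version of the forward chain and the noise-matching objective (a sketch under the step equation above, not any cited implementation):</p>

```python
import math
import random

def forward_diffuse(x0, betas):
    # Apply the forward noising chain one step at a time:
    # x_{t+1} = sqrt(1 - beta_t) * x_t + sqrt(beta_t) * eps
    x = x0
    for beta in betas:
        eps = random.gauss(0.0, 1.0)
        x = math.sqrt(1 - beta) * x + math.sqrt(beta) * eps
    return x

def denoising_loss(eps_true, eps_pred):
    # Squared error between the true noise and the network's prediction
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))
```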
<p>Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative methods to handle molecular and protein inputs: diffusion and flow-based models typically rely on GNNs for 2D/3D graph input, while VAEs and GANs are more often applied to 1D string representations.</p>
<h2 id="small-molecule-generation-tasks-datasets-and-models">Small Molecule Generation: Tasks, Datasets, and Models</h2>
<h3 id="target-agnostic-molecule-design">Target-Agnostic Molecule Design</h3>
<p>The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).</p>
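<p>A hedged sketch of how the uniqueness and novelty metrics are typically computed, assuming the generated molecules have already been validity-filtered and canonicalized to SMILES (e.g. with RDKit) upstream:</p>

```python
def generation_metrics(generated, training_set):
    # generated: list of canonical SMILES from the model (post validity filter)
    # training_set: canonical SMILES the model was trained on
    unique = set(generated)
    novel = unique - set(training_set)
    return {
        "uniqueness": len(unique) / len(generated),  # fraction non-duplicate
        "novelty": len(novel) / len(unique),         # fraction unseen in training
    }
```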
<p><strong>Datasets</strong>: QM9 (small stable molecules from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>) and <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug (more complex, drug-like molecules).</p>
<p>The field has shifted from SMILES-based VAEs (<a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a>, <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a>, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>EGNN, Diffusion</td>
          <td>99.8</td>
          <td>97.5</td>
          <td>97.9</td>
          <td>97.6</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>EGNN, VAE, Diffusion</td>
          <td>99.2</td>
          <td>89.6</td>
          <td>98.6</td>
          <td>94.6</td>
      </tr>
      <tr>
          <td>JODO</td>
          <td>EGNN, Diffusion</td>
          <td>99.2</td>
          <td>93.4</td>
          <td>99.0</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>VAE, Diffusion</td>
          <td>98.9</td>
          <td>89.4</td>
          <td>93.8</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>EGNN, Diffusion</td>
          <td>98.7</td>
          <td>82.0</td>
          <td>91.9</td>
          <td>90.7</td>
      </tr>
  </tbody>
</table>
<p>EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a &ldquo;relaxed&rdquo; EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.</p>
<p>On the larger GEOM-Drugs dataset, performance drops for most models:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>At Stb. (%)</th>
          <th>Mol Stb. (%)</th>
          <th>Valid (%)</th>
          <th>Val/Uniq. (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MiDi</td>
          <td>99.8</td>
          <td>91.6</td>
          <td>77.8</td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>MDM</td>
          <td>&ndash;</td>
          <td>62.2</td>
          <td>99.5</td>
          <td>99.0</td>
      </tr>
      <tr>
          <td>GeoLDM</td>
          <td>84.4</td>
          <td>&ndash;</td>
          <td>99.3</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>EDM</td>
          <td>81.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<p>MiDi distinguishes itself by generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on more complex GEOM-Drugs molecules.</p>
<h3 id="target-aware-molecule-design">Target-Aware Molecule Design</h3>
<p>Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.</p>
<p><strong>Datasets</strong>: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.</p>
<p><strong>Metrics</strong>: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).</p>
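<p>The diversity metric is typically an average pairwise Tanimoto distance over molecular fingerprints. A minimal sketch, representing each fingerprint as a set of on-bit indices (real pipelines would use e.g. RDKit Morgan fingerprints; the set representation here is an illustrative assumption):</p>

```python
def tanimoto(fp_a, fp_b):
    # Tanimoto similarity: |intersection| / |union| of on bits
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def diversity(fingerprints):
    # Average pairwise (1 - Tanimoto) over all generated molecules
    pairs = [(a, b) for i, a in enumerate(fingerprints)
             for b in fingerprints[i + 1:]]
    return sum(1 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```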
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Vina</th>
          <th>Affinity (%)</th>
          <th>QED</th>
          <th>SA</th>
          <th>Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffSBDD</td>
          <td>EGNN, Diffusion</td>
          <td>-7.333</td>
          <td>&ndash;</td>
          <td>0.467</td>
          <td>0.554</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Luo et al.</td>
          <td>SchNet</td>
          <td>-6.344</td>
          <td>29.09</td>
          <td>0.525</td>
          <td>0.657</td>
          <td>0.720</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>EGNN, Diffusion</td>
          <td>-6.3</td>
          <td>58.1</td>
          <td>0.48</td>
          <td>0.58</td>
          <td>0.72</td>
      </tr>
      <tr>
          <td>LiGAN</td>
          <td>CNN, VAE</td>
          <td>-6.144</td>
          <td>21.1</td>
          <td>0.39</td>
          <td>0.59</td>
          <td>0.66</td>
      </tr>
      <tr>
          <td>Pocket2Mol</td>
          <td>EGNN, MLP</td>
          <td>-5.14</td>
          <td>48.4</td>
          <td>0.56</td>
          <td>0.74</td>
          <td>0.69</td>
      </tr>
  </tbody>
</table>
<p>DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).</p>
<h3 id="molecular-conformation-generation">Molecular Conformation Generation</h3>
<p>Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations &ldquo;covered&rdquo; within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).</p>
<p><strong>Datasets</strong>: GEOM-QM9, GEOM-Drugs, ISO17.</p>
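<p>Under the recall-style definitions above, COV and MAT can be sketched from a reference-by-generated RMSD matrix (illustrative pure Python; the <code>rmsd[i][j]</code> layout is my own assumption):</p>

```python
def cov_mat(rmsd, threshold=1.25):
    # rmsd[i][j]: RMSD (in angstroms) between ground-truth conformer i
    # and generated conformer j
    best = [min(row) for row in rmsd]  # closest generated match per reference
    # COV: fraction of references covered within the RMSD threshold
    cov = sum(b <= threshold for b in best) / len(best)
    # MAT: mean RMSD to the closest generated conformer
    mat = sum(best) / len(best)
    return cov, mat
```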
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>GEOM-QM9 COV (%)</th>
          <th>GEOM-QM9 MAT</th>
          <th>GEOM-Drugs COV (%)</th>
          <th>GEOM-Drugs MAT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Torsional Diff.</td>
          <td>Diffusion</td>
          <td>92.8</td>
          <td>0.178</td>
          <td>72.7*</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>DGSM</td>
          <td>MPNN, Diffusion</td>
          <td>91.49</td>
          <td>0.2139</td>
          <td>78.73</td>
          <td>1.0154</td>
      </tr>
      <tr>
          <td>GeoDiff</td>
          <td>GFN, Diffusion</td>
          <td>90.07</td>
          <td>0.209</td>
          <td>89.13</td>
          <td>0.8629</td>
      </tr>
      <tr>
          <td>ConfGF</td>
          <td>GIN, Diffusion</td>
          <td>88.49</td>
          <td>0.2673</td>
          <td>62.15</td>
          <td>1.1629</td>
      </tr>
      <tr>
          <td>GeoMol</td>
          <td>MPNN</td>
          <td>71.26</td>
          <td>0.3731</td>
          <td>67.16</td>
          <td>1.0875</td>
      </tr>
  </tbody>
</table>
<p>*Torsional Diffusion uses a 0.75 &Aring; threshold instead of the standard 1.25 &Aring; for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.</p>
<p>Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.</p>
<h2 id="protein-generation-from-sequence-to-structure">Protein Generation: From Sequence to Structure</h2>
<h3 id="protein-representation-learning">Protein Representation Learning</h3>
<p>Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman&rsquo;s $\rho$).</p>
<p>Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.</p>
<h3 id="protein-structure-prediction">Protein Structure Prediction</h3>
<p>Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.</p>
<p>AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>CAMEO RMSD</th>
          <th>CAMEO TMScore</th>
          <th>CAMEO GDT-TS</th>
          <th>CAMEO lDDT</th>
          <th>CASP14 TMScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AlphaFold2</td>
          <td>Transformer</td>
          <td>3.30</td>
          <td>0.87</td>
          <td>0.86</td>
          <td>0.90</td>
          <td>0.38</td>
      </tr>
      <tr>
          <td>ESMFold</td>
          <td>Transformer</td>
          <td>3.99</td>
          <td>0.85</td>
          <td>0.83</td>
          <td>0.87</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>RoseTTAFold</td>
          <td>Transformer</td>
          <td>5.72</td>
          <td>0.77</td>
          <td>0.71</td>
          <td>0.79</td>
          <td>0.37</td>
      </tr>
      <tr>
          <td>EigenFold</td>
          <td>Diffusion</td>
          <td>7.37</td>
          <td>0.75</td>
          <td>0.71</td>
          <td>0.78</td>
          <td>&ndash;</td>
      </tr>
  </tbody>
</table>
<h3 id="sequence-generation-inverse-folding">Sequence Generation (Inverse Folding)</h3>
<p>Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of possible sequences is estimated at between $10^{65}$ and $10^{130}$.</p>
<p>Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):</p>
<p>$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$</p>
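<p>Concretely (a small sketch of the standard definition; note that perplexity is the exponential of the negative mean log-likelihood, so lower is better):</p>

```python
import math

def perplexity(log_probs):
    # log_probs: per-token log-probabilities log P(x_i | x_1..x_{i-1})
    # PPL = exp(-(1/N) * sum of log-probabilities)
    return math.exp(-sum(log_probs) / len(log_probs))
```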
<p>ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>AAR (%)</th>
          <th>Div.</th>
          <th>RMSD</th>
          <th>Non.</th>
          <th>Time (s)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ProteinMPNN</td>
          <td>MPNN</td>
          <td>48.7</td>
          <td>0.168</td>
          <td>1.019</td>
          <td>1.061</td>
          <td>112</td>
      </tr>
      <tr>
          <td>ESM-IF1</td>
          <td>Transformer</td>
          <td>47.7</td>
          <td>0.184</td>
          <td>1.265</td>
          <td>1.201</td>
          <td>1980</td>
      </tr>
      <tr>
          <td>GPD</td>
          <td>Transformer</td>
          <td>46.2</td>
          <td>0.219</td>
          <td>1.758</td>
          <td>1.333</td>
          <td>35</td>
      </tr>
      <tr>
          <td>ABACUS-R</td>
          <td>Transformer</td>
          <td>45.7</td>
          <td>0.124</td>
          <td>1.482</td>
          <td>0.968</td>
          <td>233280</td>
      </tr>
      <tr>
          <td>3D CNN</td>
          <td>CNN</td>
          <td>44.5</td>
          <td>0.272</td>
          <td>1.62</td>
          <td>1.027</td>
          <td>536544</td>
      </tr>
      <tr>
          <td>PiFold</td>
          <td>GNN</td>
          <td>42.8</td>
          <td>0.141</td>
          <td>1.592</td>
          <td>1.464</td>
          <td>221</td>
      </tr>
      <tr>
          <td>ProteinSolver</td>
          <td>GNN</td>
          <td>24.6</td>
          <td>0.186</td>
          <td>5.354</td>
          <td>1.389</td>
          <td>180</td>
      </tr>
  </tbody>
</table>
<p>Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.</p>
<h3 id="backbone-design">Backbone Design</h3>
<p>Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.</p>
<p>Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).</p>
<p>ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.</p>
<p>Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using &ldquo;self-conditioning&rdquo; on predicted structures. Protpardelle co-designs sequence and structure by creating a &ldquo;superposition&rdquo; over possible sidechain states and collapsing them during each iterative diffusion step.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>scTM (%)</th>
          <th>Design. (%)</th>
          <th>PPL</th>
          <th>AAR (%)</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RFDiffusion</td>
          <td>Diffusion</td>
          <td>&ndash;</td>
          <td>95.1</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Protpardelle</td>
          <td>Diffusion</td>
          <td>85</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FrameDiff</td>
          <td>Diffusion</td>
          <td>84</td>
          <td>48.3</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>Genie</td>
          <td>Diffusion</td>
          <td>81.5</td>
          <td>79.0</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>LatentDiff</td>
          <td>EGNN, Diffusion</td>
          <td>31.6</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>FoldingDiff</td>
          <td>Diffusion</td>
          <td>14.2</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
      </tr>
      <tr>
          <td>ProtDiff</td>
          <td>EGNN, Diffusion</td>
          <td>11.8</td>
          <td>&ndash;</td>
          <td>&ndash;</td>
          <td>12.47*</td>
          <td>8.01*</td>
      </tr>
  </tbody>
</table>
<p>*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.</p>
<h3 id="antibody-design">Antibody Design</h3>
<p>The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.</p>
<p>For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. Multiple sequence alignments (MSAs) cannot be used for antibody input, which makes general models like AlphaFold2 inefficient for antibody prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.</p>
<h3 id="peptide-design">Peptide Design</h3>
<p>The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).</p>
<h2 id="current-trends-challenges-and-future-directions">Current Trends, Challenges, and Future Directions</h2>
<h3 id="current-trends">Current Trends</h3>
<p>The survey identifies several parallel trends across molecule and protein generation:</p>
<ol>
<li>
<p><strong>Shift from sequence to structure</strong>: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.</p>
</li>
<li>
<p><strong>Dominance of E(3) equivariant architectures</strong>: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.</p>
</li>
<li>
<p><strong>Structure-based over ligand-based approaches</strong>: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.</p>
</li>
</ol>
<h3 id="challenges">Challenges</h3>
<p><strong>For small molecule generation:</strong></p>
<ul>
<li><strong>Complexity</strong>: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.</li>
<li><strong>Applicability</strong>: Generating molecules with high binding affinity to targets remains difficult.</li>
<li><strong>Explainability</strong>: Methods are black-box, offering no insight into why generated molecules have desired properties.</li>
</ul>
<p><strong>For protein generation:</strong></p>
<ul>
<li><strong>Benchmarking</strong>: Protein generative tasks lack a standard evaluative procedure, with variance between each model&rsquo;s metrics and testing conditions.</li>
<li><strong>Performance</strong>: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.</li>
</ul>
<p>The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p>As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.</p>
<h3 id="data">Data</h3>
<p>The survey catalogs the following key datasets across subtasks:</p>
<table>
  <thead>
      <tr>
          <th>Subtask</th>
          <th>Datasets</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Target-agnostic molecule</td>
          <td>QM9, <a href="/notes/chemistry/datasets/geom/">GEOM</a>-Drug</td>
          <td>QM9 from <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a>; GEOM-Drug for complex molecules</td>
      </tr>
      <tr>
          <td>Target-aware molecule</td>
          <td>CrossDocked2020, ZINC20, Binding MOAD</td>
          <td>CrossDocked2020 most used (22.5M pairs)</td>
      </tr>
      <tr>
          <td>Conformation generation</td>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a>-QM9, GEOM-Drugs, ISO17</td>
          <td>Conformer sets for molecules</td>
      </tr>
      <tr>
          <td>Protein structure prediction</td>
          <td>PDB, CASP14, CAMEO</td>
          <td>CASP biennial blind evaluation</td>
      </tr>
      <tr>
          <td>Protein sequence generation</td>
          <td>PDB, UniRef, UniParc, CATH, TS500</td>
          <td>CATH for domain classification</td>
      </tr>
      <tr>
          <td>Backbone design</td>
          <td>PDB, AlphaFoldDB, SCOP, CATH</td>
          <td>AlphaFoldDB for expanded structural coverage</td>
      </tr>
      <tr>
          <td>Antibody structure</td>
          <td>SAbDab, RAB</td>
          <td>SAbDab: all antibody structures from PDB</td>
      </tr>
      <tr>
          <td>Antibody CDR generation</td>
          <td>SAbDab, RAB, SKEMPI</td>
          <td>SKEMPI for affinity optimization</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gersteinlab/GenAI4Drug">GenAI4Drug</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Organized repository of all covered sources</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., &amp; Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. <em>Briefings in Bioinformatics</em>, 25(4), bbae338. <a href="https://doi.org/10.1093/bib/bbae338">https://doi.org/10.1093/bib/bbae338</a></p>
<p><strong>Publication</strong>: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2402.08703">arXiv: 2402.08703</a></li>
<li><a href="https://github.com/gersteinlab/GenAI4Drug">GitHub: GenAI4Drug</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11247410/">PMC: PMC11247410</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{tang2024survey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{25}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbae338}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbae338}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Chemical Language Models for De Novo Drug Design Review</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/clms-de-novo-drug-design-review/</guid><description>Review of chemical language models for de novo drug design covering string representations, architectures, training strategies, and experimental validation.</description><content:encoded><![CDATA[<h2 id="a-systematization-of-chemical-language-models-for-drug-design">A Systematization of Chemical Language Models for Drug Design</h2>
<p>This paper is a <strong>Systematization</strong> (minireview) that surveys the landscape of chemical language models (CLMs) for de novo drug design. It organizes the field along three axes: molecular string representations, deep learning architectures, and generation strategies (distribution learning, goal-directed, and conditional). The review also highlights experimental validations, current gaps, and future opportunities.</p>
<h2 id="why-chemical-language-models-matter-for-drug-design">Why Chemical Language Models Matter for Drug Design</h2>
<p>De novo drug design faces an enormous combinatorial challenge: the &ldquo;chemical universe&rdquo; is estimated to contain up to $10^{60}$ drug-like small molecules. Exhaustive enumeration is infeasible, and traditional design algorithms rely on hand-crafted assembly rules. Chemical language models address this by borrowing natural language processing techniques to learn the &ldquo;chemical language,&rdquo; generating molecules as string representations (<a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>, DeepSMILES) that satisfy both syntactic validity (chemically valid structures) and semantic correctness (desired pharmacological properties).</p>
<p>CLMs have gained traction because string representations are readily available for most molecular databases, generation is computationally cheap (one molecule per forward pass through a sequence model), and the same architecture can be applied to diverse tasks (property prediction, de novo generation, reaction prediction). At the time of this review, CLMs had produced experimentally validated bioactive molecules in several prospective studies, establishing them as practical tools for drug discovery.</p>
<h2 id="molecular-string-representations-smiles-deepsmiles-and-selfies">Molecular String Representations: SMILES, DeepSMILES, and SELFIES</h2>
<p>The review covers three main string representations used as input/output for CLMs:</p>
<p><strong>SMILES</strong> (Simplified Molecular-Input Line-Entry System) encodes hydrogen-depleted molecular graphs as strings in which atoms are denoted by atomic symbols, bonds and branching by punctuation, and ring openings/closures by numbers. SMILES strings are non-univocal (one molecule admits multiple valid strings), so canonicalization algorithms are needed to obtain unique representations. Multiple studies show that using randomized (non-canonical) SMILES for data augmentation improves CLM performance, with diminishing returns beyond 10- to 20-fold augmentation.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a></strong> modifies SMILES to improve machine-readability by replacing the paired ring-opening/closure digits with a count-based system and using closing parentheses only (no opening ones). This reduces the frequency of syntactically invalid strings but does not eliminate them entirely.</p>
<p><strong><a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a></strong> (Self-Referencing Embedded Strings) use a formal grammar that guarantees 100% syntactic validity of decoded molecules. Every SELFIES string maps to a valid molecular graph. However, SELFIES can produce chemically unrealistic molecules (e.g., highly strained ring systems), and the mapping between string edits and molecular changes is less intuitive than for SMILES.</p>
<p>The review notes a key tradeoff: SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.</p>
<h2 id="clm-architectures-and-training-strategies">CLM Architectures and Training Strategies</h2>
<h3 id="architectures">Architectures</h3>
<p>The review describes the main architectures used in CLMs:</p>
<p><strong>Recurrent Neural Networks (RNNs)</strong>, particularly LSTMs and GRUs, dominated early CLM work. These models process SMILES character-by-character and generate new strings autoregressively via next-token prediction. RNNs are computationally efficient and well-suited to the sequential nature of molecular strings.</p>
<p><strong><a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoders (VAEs)</a></strong> encode molecules into a continuous latent space and decode them back into strings. This enables smooth interpolation between molecules and latent-space optimization, but generated strings may be syntactically invalid.</p>
<p><strong><a href="/posts/what-is-a-gan/">Generative Adversarial Networks (GANs)</a></strong> have been adapted for molecular string generation (e.g., <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), though they face training instability and mode collapse challenges that limit their adoption.</p>
<p><strong>Transformers</strong> have emerged as an increasingly popular alternative, offering parallelized training and the ability to capture long-range dependencies in molecular strings. The review notes the growing relevance of Transformer-based CLMs, particularly for large-scale pretraining.</p>
<h3 id="generation-strategies">Generation Strategies</h3>
<p>The review organizes CLM generation into three categories:</p>
<ol>
<li>
<p><strong>Distribution learning</strong>: The model learns to reproduce the statistical distribution of a training set of molecules. No explicit scoring function is used during generation. The generated molecules are evaluated post-hoc by comparing their property distributions to the training set. This approach is end-to-end but provides no direct indication of individual molecule quality.</p>
</li>
<li>
<p><strong>Goal-directed generation</strong>: A pretrained CLM is steered toward molecules optimizing a specified scoring function (e.g., predicted bioactivity, physicochemical properties). Common approaches include reinforcement learning (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a> and variants), hill-climbing, and Bayesian optimization. Scoring functions provide direct quality signals but can introduce biases, shortcuts, and limited structural diversity.</p>
</li>
<li>
<p><strong>Conditional generation</strong>: An intermediate approach that learns a joint semantic space between molecular structures and desired properties. The desired property profile serves as an input &ldquo;prompt&rdquo; for generation (e.g., a protein target, gene expression signature, or 3D shape). This bypasses the need for external scoring functions but has seen limited experimental application.</p>
</li>
</ol>
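<p>The goal-directed loop described above can be sketched in a few lines. This is a toy illustration under assumed stand-ins (a four-letter token alphabet and a hypothetical scoring function that rewards one token type), not the actual SMILES-LSTM hill-climbing or REINVENT procedure:</p>

```python
import random

# Toy goal-directed hill-climbing loop (hypothetical setup): keep the
# top-scoring candidates, "generate" new ones by mutating them, repeat.

ALPHABET = "CNOS"  # stand-in token set, not a real SMILES grammar


def score(s: str) -> float:
    """Hypothetical scoring function: fraction of 'N' tokens, a proxy
    for whatever property the generator is being steered toward."""
    return s.count("N") / len(s)


def mutate(s: str) -> str:
    """Replace one random position with a random token."""
    i = random.randrange(len(s))
    return s[:i] + random.choice(ALPHABET) + s[i + 1:]


def hill_climb(population, n_steps=50, keep=10, offspring=5):
    for _ in range(n_steps):
        # select the best-scoring candidates (elitism: best never drops)...
        population.sort(key=score, reverse=True)
        elite = population[:keep]
        # ...then regenerate the population by mutating around them
        population = elite + [mutate(random.choice(elite))
                              for _ in range(keep * offspring)]
    return max(population, key=score)


random.seed(0)
start = ["".join(random.choices(ALPHABET, k=12)) for _ in range(50)]
best = hill_climb(start)
print(best, round(score(best), 2))
```

<p>A real system would replace <code>mutate</code> with samples from a fine-tuned generative model and <code>score</code> with a QSAR model or property predictor; the select-then-regenerate skeleton is the same.</p>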
<h3 id="transfer-learning-and-chemical-space-exploration">Transfer Learning and Chemical Space Exploration</h3>
<p>Transfer learning is the dominant paradigm for CLM-driven chemical space exploration. A large-scale pretraining step (on $10^5$ to $10^6$ molecules via next-character prediction) is followed by fine-tuning on a smaller set of molecules with desired properties (often 10 to $10^2$ molecules). Key findings from the literature:</p>
<ul>
<li>The minimum training set size depends on target molecule complexity and heterogeneity.</li>
<li>SMILES augmentation is most beneficial with small training sets (fewer than 10,000 molecules) and plateaus for large, structurally complex datasets.</li>
<li>Fine-tuning with as few as 10 to 100 molecules has produced experimentally validated bioactive designs.</li>
<li>Hyperparameter tuning has relatively little effect on overall CLM performance.</li>
</ul>
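<p>The pretrain-then-fine-tune recipe can be illustrated with a deliberately tiny stand-in for a CLM. The character-bigram &ldquo;model,&rdquo; the corpora, and the <code>weight</code> knob below are all hypothetical; a real CLM would be an RNN or Transformer trained by next-token prediction, but the two-stage data flow is the same:</p>

```python
import random
from collections import Counter, defaultdict

# Toy pretrain/fine-tune sketch: pretraining counts come from a large
# general corpus, fine-tuning overweights a small focused set.


def count_bigrams(strings):
    """Count next-character transitions, with ^ = start and $ = end."""
    counts = defaultdict(Counter)
    for s in strings:
        for a, b in zip("^" + s, s + "$"):
            counts[a][b] += 1
    return counts


def finetune(pretrain_counts, focused_set, weight=10):
    """Emulate extra fine-tuning epochs by adding weighted counts from
    the small focused set on top of the pretrained counts."""
    tuned = defaultdict(Counter)
    for a, c in pretrain_counts.items():
        tuned[a].update(c)
    for a, c in count_bigrams(focused_set).items():
        for b, n in c.items():
            tuned[a][b] += weight * n
    return tuned


def sample(counts, max_len=20, rng=random):
    """Autoregressively sample one string from the bigram model."""
    s, cur = "", "^"
    while len(s) < max_len:
        if not counts[cur]:
            break
        tokens, weights = zip(*counts[cur].items())
        cur = rng.choices(tokens, weights=weights)[0]
        if cur == "$":
            break
        s += cur
    return s


pretrained = count_bigrams(["CCO", "CCN", "CCC", "CCCO"])
model = finetune(pretrained, ["NNO"], weight=50)
print(sample(model, rng=random.Random(0)))
```

<p>After fine-tuning, sampled strings shift toward the statistics of the focused set while retaining the pretraining distribution as a prior, mirroring how a CLM fine-tuned on 10 to 100 actives biases generation toward their chemotype.</p>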
<h2 id="evaluating-clm-designs-and-experimental-validation">Evaluating CLM Designs and Experimental Validation</h2>
<p>The review identifies evaluation as a critical gap. CLMs are often benchmarked on &ldquo;toy&rdquo; properties such as calculated logP, molecular weight, or QED (quantitative estimate of drug-likeness). These metrics capture the ability to satisfy predefined criteria but fail to reflect real-world drug discovery complexity and may lead to trivial solutions.</p>
<p>Existing benchmarks (<a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a>) enable comparability across independently developed approaches but do not fully address the quality of generated compounds. The review emphasizes that experimental validation is the ultimate test. At the time of writing, only a few prospective applications had been published:</p>
<ul>
<li>Dual modulator of <a href="https://en.wikipedia.org/wiki/Retinoid_X_receptor">retinoid X</a> and <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> receptors (EC50 ranging from 0.06 to 2.3 uM)</li>
<li>Inhibitor of <a href="https://en.wikipedia.org/wiki/Pim_kinase">Pim1 kinase</a> and <a href="https://en.wikipedia.org/wiki/Cyclin-dependent_kinase_4">CDK4</a> (manually modified from generated design)</li>
<li>Natural-product-inspired <a href="https://en.wikipedia.org/wiki/RAR-related_orphan_receptor_gamma">RORgamma</a> agonist (EC50 = 0.68 uM)</li>
<li>Molecules designed via combined generative AI and on-chip synthesis</li>
</ul>
<p>The scarcity of experimental validations reflects the interdisciplinary expertise required and the time/cost of chemical synthesis.</p>
<h2 id="gaps-limitations-and-future-directions">Gaps, Limitations, and Future Directions</h2>
<p>The review identifies several key gaps and opportunities:</p>
<p><strong>Scoring function limitations</strong>: Current scoring functions struggle with activity cliffs and non-additive structure-activity relationships. Conditional generation methods may help overcome these limitations by learning direct structure-property mappings.</p>
<p><strong>Structure-based design</strong>: Generating molecules that match electrostatic and shape features of protein binding pockets holds promise for addressing unexplored targets. However, prospective applications have been limited, potentially due to bias in existing protein-ligand affinity datasets.</p>
<p><strong>Synthesizability</strong>: Improving the ability of CLMs to propose synthesizable molecules is expected to increase practical relevance. Automated synthesis platforms may help but could also limit accessible chemical space.</p>
<p><strong>Few-shot learning</strong>: Large-scale pretrained CLMs combined with few-shot learning approaches are expected to boost prospective applications.</p>
<p><strong>Extensions beyond small molecules</strong>: Extending chemical languages to more complex molecular entities (proteins with non-natural amino acids, crystals, supramolecular chemistry) is an open frontier.</p>
<p><strong>Failure modes</strong>: Several studies have documented failure modes in goal-directed generation, including model shortcuts (exploiting scoring function artifacts), limited structural diversity, and generation of chemically unrealistic molecules.</p>
<p><strong>Interdisciplinary collaboration</strong>: The review emphasizes that bridging deep learning, cheminformatics, and medicinal chemistry expertise is essential for translating CLM designs into real-world drug candidates.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a review paper and does not present novel experimental data. The paper surveys results from the literature.</p>
<h3 id="algorithms">Algorithms</h3>
<p>No novel algorithms are introduced. The review categorizes existing approaches (RNNs, VAEs, GANs, Transformers) and generation strategies (distribution learning, goal-directed, conditional).</p>
<h3 id="models">Models</h3>
<p>No new models are presented. The paper references existing implementations including REINVENT, ORGAN, and various RNN-based and Transformer-based CLMs.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The review discusses existing benchmarks:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong>: Benchmarking suite for de novo molecular design</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong>: Benchmarking platform for molecular generation models</li>
<li><strong>QED</strong>: Quantitative estimate of drug-likeness</li>
<li>Various physicochemical property metrics (logP, molecular weight)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (review paper).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grisoni, F. (2023). Chemical language models for de novo drug design: Challenges and opportunities. <em>Current Opinion in Structural Biology</em>, 79, 102527. <a href="https://doi.org/10.1016/j.sbi.2023.102527">https://doi.org/10.1016/j.sbi.2023.102527</a></p>
<p><strong>Publication</strong>: Current Opinion in Structural Biology, Volume 79, April 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grisoni2023chemical,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Chemical language models for de novo drug design: Challenges and opportunities}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grisoni, Francesca}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Current Opinion in Structural Biology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{102527}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.sbi.2023.102527}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Avoiding Failure Modes in Goal-Directed Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/avoiding-failure-modes-goal-directed-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/avoiding-failure-modes-goal-directed-generation/</guid><description>Langevin et al. show that apparent failure modes in goal-directed molecular generation stem from QSAR model disagreement, not algorithmic flaws.</description><content:encoded><![CDATA[<h2 id="reinterpreting-goal-directed-generation-failures-as-qsar-model-issues">Reinterpreting Goal-Directed Generation Failures as QSAR Model Issues</h2>
<p>This is an <strong>Empirical</strong> study that challenges a widely cited finding about failure modes in goal-directed molecular generation. <a href="/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/">Renz et al. (2019)</a> had shown that when molecules are optimized against a machine learning scoring function, control models trained on the same data distribution assign much lower scores to the generated molecules. This was interpreted as evidence that generation algorithms exploit model-specific biases. Langevin et al. demonstrate that this divergence is already present in the original data distribution and is attributable to disagreement among the QSAR classifiers, not to flaws in the generation algorithms themselves.</p>
<h2 id="why-qsar-model-agreement-matters-for-molecular-generation">Why QSAR Model Agreement Matters for Molecular Generation</h2>
<p>Goal-directed generation uses a scoring function (typically a <a href="https://en.wikipedia.org/wiki/Quantitative_structure%E2%80%93activity_relationship">QSAR</a> model) to guide the design of molecules that maximize predicted activity. In the experimental framework from Renz et al., three Random Forest classifiers are trained: an optimization model $C_{opt}$ on Split 1, a model control $C_{mc}$ on Split 1 with a different random seed, and a data control $C_{dc}$ on Split 2. Each returns a confidence score ($S_{opt}$, $S_{mc}$, $S_{dc}$). The expectation is that molecules with high $S_{opt}$ should also score highly under $S_{mc}$ and $S_{dc}$, since all three models are trained on the same data distribution for the same target.</p>
<p>Renz et al. observed that during optimization, $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$, reaching substantially lower values. This was interpreted as goal-directed generation exploiting biases unique to the optimization model. The recommendation was to halt generation when control scores stop increasing, requiring a held-out dataset for a control model, which may not be feasible in low-data regimes.</p>
<p>The key insight of Langevin et al. is that nobody had checked whether this score disagreement existed before generation even began. If the classifiers already disagree on high-scoring molecules in the original dataset, the divergence during generation is expected behavior, not evidence of algorithmic failure.</p>
<h2 id="pre-existing-classifier-disagreement-explains-the-divergence">Pre-Existing Classifier Disagreement Explains the Divergence</h2>
<p>The core contribution is showing that the gap between optimization and control scores is a property of the QSAR models, not of the generation algorithms.</p>
<p>The authors introduce a held-out test set (10% of the data, used for neither training split) and augment it via Topliss tree enumeration to produce structural analogs for smoother statistical estimates. On this held-out set, they compute the Mean Average Difference (MAD) between $S_{opt}$ and control scores as a function of $S_{opt}$:</p>
<p>$$
\text{MAD}(x) = \frac{1}{|\{i : S_{opt}(x_i) \geq x\}|} \sum_{S_{opt}(x_i) \geq x} |S_{opt}(x_i) - S_{dc}(x_i)|
$$</p>
<p>On the three original datasets (DRD2, EGFR, JAK2), the MAD between $S_{opt}$ and $S_{dc}$ grows substantially with $S_{opt}$, reaching approximately 0.3 for the highest-scoring molecules. For EGFR, even the top molecules (with $S_{opt}$ between 0.5 and 0.6) have $S_{dc}$ below 0.2. This disagreement exists entirely within the original data distribution, before any generative algorithm is applied.</p>
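<p>A direct implementation of this diagnostic is short; the scores below are made-up numbers for illustration, not values from the paper:</p>

```python
# Mean Average Difference (MAD): for a threshold x, average |S_opt - S_dc|
# over held-out molecules whose optimization score is at least x.


def mad(s_opt, s_dc, x):
    pairs = [(o, d) for o, d in zip(s_opt, s_dc) if o >= x]
    if not pairs:
        return float("nan")  # no molecule scores above the threshold
    return sum(abs(o - d) for o, d in pairs) / len(pairs)


# Hypothetical held-out scores: the models agree at low S_opt but
# disagree on the top scorers, as observed on DRD2/EGFR/JAK2.
s_opt = [0.1, 0.2, 0.5, 0.7, 0.9]
s_dc = [0.1, 0.2, 0.4, 0.4, 0.5]

print(mad(s_opt, s_dc, 0.0))  # disagreement averaged over everything
print(mad(s_opt, s_dc, 0.7))  # disagreement among top scorers only
```

<p>Rising MAD as the threshold increases is exactly the pattern the authors report: disagreement concentrated among the highest-scoring molecules, before any generation takes place.</p>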
<p>The authors formalize this with tolerance intervals. At each generation time step $t$, the distribution of optimization scores is $P_t[S_{opt}(x)]$. From the held-out set, the conditional distributions $P[S_{dc}(x) | S_{opt}(x)]$ and $P[S_{mc}(x) | S_{opt}(x)]$ are estimated empirically. The expected control scores at time $t$ are then:</p>
<p>$$
\mathbb{E}[S_{dc}] = \int P[S_{dc}(x) | S_{opt}(x)] \cdot P_t[S_{opt}(x)] \, dS_{opt}
$$</p>
<p>By sampling from these distributions, the authors construct 95% tolerance intervals for the expected control scores at each time step. The observed trajectories of $S_{mc}$ and $S_{dc}$ during generation fall within these intervals, demonstrating that the divergence is fully explained by pre-existing classifier disagreement.</p>
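<p>This construction can be mimicked with a small Monte-Carlo sketch. The conditional distribution below is a made-up Gaussian stand-in for the empirical $P[S_{dc}(x) | S_{opt}(x)]$, chosen so that control scores sit below $S_{opt}$ at the high end; the paper instead estimates this distribution from the held-out set:</p>

```python
import random
import statistics

# Monte-Carlo tolerance interval: given a sampler for the control score
# conditioned on S_opt, repeatedly draw control scores for the current
# population and read off an empirical interval on the mean.


def expected_control_interval(pop_s_opt, conditional_sampler,
                              n_rounds=2000, coverage=0.95, rng=None):
    rng = rng or random.Random(0)
    means = []
    for _ in range(n_rounds):
        draws = [conditional_sampler(s, rng) for s in pop_s_opt]
        means.append(statistics.fmean(draws))
    means.sort()
    lo = means[int((1 - coverage) / 2 * n_rounds)]
    hi = means[int((1 + coverage) / 2 * n_rounds) - 1]
    return lo, hi


def sampler(s_opt, rng):
    """Hypothetical conditional: control score centered below S_opt,
    mimicking pre-existing classifier disagreement at high scores."""
    return max(0.0, min(1.0, rng.gauss(0.7 * s_opt, 0.05)))


population = [0.8, 0.85, 0.9, 0.95]  # S_opt of generated molecules
lo, hi = expected_control_interval(population, sampler)
print(f"95% interval for mean control score: [{lo:.3f}, {hi:.3f}]")
```

<p>If the observed mean control score during generation falls inside this interval, the divergence from $S_{opt}$ is explained by pre-existing disagreement alone, which is what the authors find on all three original datasets.</p>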
<h2 id="experimental-setup-original-reproduction-and-corrected-experiments">Experimental Setup: Original Reproduction and Corrected Experiments</h2>
<h3 id="reproduction-of-renz-et-al">Reproduction of Renz et al.</h3>
<p>The original experimental framework uses three datasets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a> (842 molecules, 59 actives), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">EGFR</a> (842 molecules, 40 actives), and <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> (667 molecules, 140 actives). These are small, noisy, and chemically diverse. Three goal-directed generation algorithms are tested:</p>
<table>
  <thead>
      <tr>
          <th>Algorithm</th>
          <th>Type</th>
          <th>Mechanism</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph GA</td>
          <td>Genetic algorithm on molecular graphs</td>
          <td>Mutation and crossover of molecular graphs</td>
      </tr>
      <tr>
          <td>SMILES-LSTM</td>
          <td>Recurrent neural network</td>
          <td>Hill-climbing fine-tuning on best molecules</td>
      </tr>
      <tr>
          <td>MSO</td>
          <td>Particle swarm in CDDD latent space</td>
          <td>Multiple swarm optimization</td>
      </tr>
  </tbody>
</table>
<p>All algorithms are run for 151 epochs with 10 runs each. The reproduction confirms the findings of Renz et al.: $S_{mc}$ and $S_{dc}$ diverge from $S_{opt}$ during optimization.</p>
<h3 id="tolerance-interval-analysis">Tolerance interval analysis</h3>
<p>The held-out set is augmented using Topliss tree enumeration on phenyl rings, providing structural analogs that are reasonable from a medicinal chemistry perspective. The optimization score range is divided into 25 equal bins, and for each molecule at each time step, 10 samples from the conditional control score distribution are drawn to construct empirical tolerance intervals.</p>
<h3 id="corrected-experiments-with-adequate-models">Corrected experiments with adequate models</h3>
<p>To test whether generation algorithms actually exploit biases when the classifiers agree, the authors construct two tasks where optimization and control models correlate well:</p>
<ol>
<li><strong>ALDH1 dataset</strong>: 464 molecules from LIT-PCBA, split using similarity-based pairing to maximize intra-pair chemical similarity. This ensures both splits sample similar chemistry.</li>
<li><strong>Modified JAK2</strong>: The same JAK2 dataset but with Random Forest hyperparameters adjusted (200 trees instead of 100, minimum 3 samples per leaf instead of 1) to reduce overfitting to spurious correlations.</li>
</ol>
<p>In both cases, $S_{opt}$, $S_{mc}$, and $S_{dc}$ agree well on the held-out test set. The starting population for generation is set to the held-out test set (rather than random ChEMBL molecules) to avoid building in a distribution shift.</p>
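<p>The similarity-based pairing idea can be sketched with set-based fingerprints. This is a guess at the general mechanism (greedily pair the most similar molecules, then put one member of each pair in each split), not the paper's exact procedure, and the integer feature-ID fingerprints are hypothetical stand-ins for ECFPs:</p>

```python
from itertools import combinations

# Similarity-based split: pairing the most similar molecules and sending
# one member of each pair to each split ensures both splits sample
# similar chemistry, so the two classifiers see comparable data.


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity on set fingerprints: |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0


def similarity_split(fps):
    remaining = set(range(len(fps)))
    split1, split2 = [], []
    while len(remaining) >= 2:
        # pick the most similar remaining pair...
        i, j = max(combinations(remaining, 2),
                   key=lambda p: tanimoto(fps[p[0]], fps[p[1]]))
        # ...and send one member to each split
        split1.append(i)
        split2.append(j)
        remaining -= {i, j}
    return split1, split2, list(remaining)  # leftover if odd count


fps = [frozenset({1, 2, 3}), frozenset({1, 2, 4}),
       frozenset({7, 8, 9}), frozenset({7, 8, 5})]
s1, s2, rest = similarity_split(fps)
print(s1, s2, rest)
```

<p>The greedy pairwise search is quadratic per pair and only workable for small datasets like ALDH1's 464 molecules, which is consistent with the authors' note that scaling to larger datasets is not demonstrated.</p>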
<h2 id="findings-no-algorithmic-failure-when-models-agree">Findings: No Algorithmic Failure When Models Agree</h2>
<p>On the corrected experimental setups (ALDH1 and modified JAK2), there is no major divergence between optimization and control scores during generation. The three algorithms produce molecules that score similarly under all three classifiers.</p>
<p>Key findings:</p>
<ol>
<li>
<p><strong>Pre-existing disagreement explains divergence</strong>: On all three original datasets, the divergence between $S_{opt}$ and control scores during generation falls within the tolerance intervals predicted from the initial data distribution alone. The generation algorithms are not exploiting model-specific biases beyond what already exists in the data.</p>
</li>
<li>
<p><strong>Split similarity bias is also pre-existing</strong>: Renz et al. observed that generated molecules are more similar to Split 1 (used to train $C_{opt}$) than Split 2. The authors show this bias is already present in the top-5 percentile of the held-out set: on EGFR and DRD2, high-scoring molecules are inherently more similar to Split 1.</p>
</li>
<li>
<p><strong>Appropriate model design resolves the issue</strong>: When Random Forest hyperparameters are chosen to avoid overfitting (more trees, higher minimum samples per leaf), or when data splits are constructed to be chemically balanced, the classifiers agree and the generation algorithms behave as expected.</p>
</li>
<li>
<p><strong>Quality problems remain independent</strong>: Even when optimization and control scores align, the generated molecules can still be poor drug candidates (unreactive, unsynthesizable, containing unusual fragments). The score divergence issue and the chemical quality issue are separate problems.</p>
</li>
</ol>
<h3 id="limitations-acknowledged-by-the-authors">Limitations acknowledged by the authors</h3>
<ul>
<li>The study focuses on Random Forest classifiers with ECFP fingerprints. The behavior of other model types (e.g., graph neural networks) and descriptor types is not fully explored, though supplementary results show similar patterns with physico-chemical descriptors and Atom-Pair fingerprints.</li>
<li>The corrected ALDH1 task uses a relatively small dataset (464 molecules) with careful split construction. Scaling this approach to larger, more heterogeneous datasets is not demonstrated.</li>
<li>The authors note that their results do not prove generation algorithms never exploit biases; they show that the specific evidence from Renz et al. can be explained without invoking algorithmic failure.</li>
<li>The problem of low-quality generated molecules (poor synthesizability, unusual fragments) remains unresolved and is acknowledged as an open question.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Original tasks</td>
          <td>DRD2, EGFR, JAK2</td>
          <td>842, 842, 667 molecules</td>
          <td>Extracted from ChEMBL; small with few actives</td>
      </tr>
      <tr>
          <td>New task</td>
          <td>ALDH1</td>
          <td>464 molecules (173 with purine substructure)</td>
          <td>Extracted from LIT-PCBA; similarity-based split</td>
      </tr>
      <tr>
          <td>Augmentation</td>
          <td>Topliss tree analogs</td>
          <td>~10x augmentation of held-out set</td>
          <td>Structural analogs via phenyl ring enumeration</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Three goal-directed generation algorithms from the original Renz et al. study:</p>
<ul>
<li><strong>Graph GA</strong>: Genetic algorithm on molecular graphs (Jensen, 2019)</li>
<li><strong>SMILES-LSTM</strong>: Hill-climbing on LSTM-generated SMILES (Segler et al., 2018)</li>
<li><strong>MSO</strong>: Multi-Swarm Optimization in CDDD latent space (Winter et al., 2019)</li>
</ul>
<p>All run for 151 epochs, 10 runs each.</p>
<h3 id="models">Models</h3>
<p>Random Forest classifiers (scikit-learn) with:</p>
<ul>
<li>ECFP fingerprints (radius 2, 1024 bits, RDKit)</li>
<li>Default parameters for original tasks</li>
<li>Modified parameters for JAK2 correction: 200 trees, min 3 samples per leaf</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Purpose</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean Average Difference (MAD)</td>
          <td>Measures disagreement between optimization and control scores</td>
          <td>Computed as function of $S_{opt}$ on held-out set</td>
      </tr>
      <tr>
          <td>95% tolerance intervals</td>
          <td>Expected range of control scores given optimization scores</td>
          <td>Empirical, constructed from held-out set</td>
      </tr>
      <tr>
          <td>Tanimoto similarity</td>
          <td>Split bias assessment</td>
          <td>Morgan fingerprints, radius 2, 1024 bits</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classifier predictive performance</td>
          <td>Used to verify models have comparable accuracy</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Sanofi-Public/IDD-papers-avoiding_failure_modes">Code and datasets</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Fork of Renz et al. codebase with modifications</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Langevin, M., Vuilleumier, R., &amp; Bianciotto, M. (2022). Explaining and avoiding failure modes in goal-directed generation of small molecules. <em>Journal of Cheminformatics</em>, 14, 20. <a href="https://doi.org/10.1186/s13321-022-00601-y">https://doi.org/10.1186/s13321-022-00601-y</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{langevin2022explaining,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Explaining and avoiding failure modes in goal-directed generation of small molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Langevin, Maxime and Vuilleumier, Rodolphe and Bianciotto, Marc}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{14}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-022-00601-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPECTRA: Evaluating Generalizability of Molecular AI</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/spectra-evaluating-generalizability-molecular-ai/</guid><description>SPECTRA evaluates ML model generalizability on molecular datasets by plotting performance across a spectrum of train-test overlap levels.</description><content:encoded><![CDATA[<h2 id="a-spectral-framework-for-evaluating-molecular-ml-generalizability">A Spectral Framework for Evaluating Molecular ML Generalizability</h2>
<p>This is a <strong>Method</strong> paper that introduces SPECTRA (SPECtral framework for model evaluaTion on moleculaR dAtasets), a systematic approach for evaluating how well machine learning models generalize on molecular sequencing data. The primary contribution is a framework that generates train-test splits with controlled, decreasing levels of overlap, producing a spectral performance curve (SPC) and a single summary metric, the area under the spectral performance curve (AUSPC), for comparing model generalizability across tasks and architectures.</p>
<h2 id="why-existing-molecular-benchmarks-overestimate-generalizability">Why Existing Molecular Benchmarks Overestimate Generalizability</h2>
<p>Deep learning has achieved high performance on molecular sequencing benchmarks, but a persistent gap exists between benchmark performance and real-world deployment. The authors identify the root cause: existing evaluation approaches use either metadata-based (MB) splits or similarity-based (SB) splits, both of which provide an incomplete picture of generalizability.</p>
<p>MB splits partition data by metadata properties (e.g., temporal splits, random splits) without controlling sequence similarity between train and test sets. This means high train-test similarity can inflate performance metrics. SB splits control similarity at a single threshold, but the model&rsquo;s behavior at other similarity levels remains unknown.</p>
<p>For example, the TAPE benchmark&rsquo;s remote homology family split has 97% cross-split overlap, while the superfamily split has 71%. Model accuracy drops by 50% between these two points, yet the full curve of performance degradation is never characterized. This gap between evaluated and real-world overlap levels leads to overoptimistic deployment expectations, as demonstrated by the case of <a href="https://en.wikipedia.org/wiki/Rifampicin">rifampicin</a> resistance prediction in <em>M. tuberculosis</em>, where commercial genotypic assays later proved unreliable in specific geographic regions.</p>
<h2 id="the-spectra-framework-spectral-properties-graphs-and-performance-curves">The SPECTRA Framework: Spectral Properties, Graphs, and Performance Curves</h2>
<p>SPECTRA takes three inputs: a molecular sequencing dataset, a machine learning model, and a spectral property definition. A spectral property (SP) is a molecular sequence property expected to influence model generalizability for a specific task. For sequence-to-sequence datasets, the spectral property is typically sequence identity (proportion of aligned positions &gt; 0.3). For mutational scan datasets, it is defined by sample barcodes (string representations of mutations present in each sample).</p>
<h3 id="spectral-property-graph-construction">Spectral Property Graph Construction</h3>
<p>SPECTRA constructs a spectral property graph (SPG) where nodes represent samples and edges connect samples that share the spectral property. The goal is to generate train-test splits with controlled levels of cross-split overlap by finding approximate <a href="https://en.wikipedia.org/wiki/Maximal_independent_set">maximal independent sets</a> of this graph.</p>
<p>Finding the exact maximal independent set is NP-Hard, so SPECTRA uses a greedy randomized algorithm parameterized by a spectral parameter $\mathbf{SP} \in [0, 1]$:</p>
<ol>
<li>Randomly order SPG vertices</li>
<li>Select the first vertex and delete each neighbor with probability equal to $\mathbf{SP}$</li>
<li>Continue until no vertices remain</li>
</ol>
<p>When $\mathbf{SP} = 0$, this produces a random split (maximum cross-split overlap). When $\mathbf{SP} = 1$, it approximates the maximal independent set (minimum cross-split overlap). For each spectral parameter value (incremented by 0.05 from 0 to 1), three splits with different random seeds are generated.</p>
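<p>The three steps above can be sketched in pure Python. This is an illustrative approximation rather than the authors&rsquo; implementation: the SPG is assumed to be an adjacency dictionary, and the function name is hypothetical:</p>

```python
import random

def spectra_split(adjacency, sp, seed=0):
    """Greedy randomized independent-set selection over a spectral
    property graph (SPG), parameterized by spectral parameter sp in [0, 1].

    adjacency: dict mapping each node to the set of its SPG neighbors.
    Returns the set of selected (kept) nodes.
    """
    rng = random.Random(seed)
    remaining = set(adjacency)
    order = list(remaining)
    rng.shuffle(order)                 # step 1: random vertex order
    selected = set()
    for v in order:
        if v not in remaining:
            continue                   # vertex was deleted as a neighbor
        selected.add(v)                # step 2: select the next vertex...
        remaining.discard(v)
        for u in adjacency[v]:         # ...and delete each neighbor w.p. sp
            if u in remaining and rng.random() < sp:
                remaining.discard(u)
    return selected                    # step 3: loop ran until no vertices remain
```

<p>The selected samples are then partitioned into train and test sets; at $\mathbf{SP} = 1$ neighbors are always deleted, so the kept set is an approximate maximal independent set.</p>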
<h3 id="the-spectral-performance-curve-and-auspc">The Spectral Performance Curve and AUSPC</h3>
<p>The model is trained and evaluated on each split. Plotting test performance against the spectral parameter produces the spectral performance curve (SPC). The area under this curve, the AUSPC, serves as a single summary metric for model generalizability that captures behavior across the full spectrum of train-test overlap.</p>
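<p>Given per-split test performances, the AUSPC can be approximated with the trapezoidal rule. The paper does not spell out its exact integration scheme, so the following is a sketch:</p>

```python
def auspc(spectral_params, performances):
    """Area under the spectral performance curve via the trapezoidal rule.

    spectral_params: increasing spectral parameter values in [0, 1].
    performances: mean test performance at each spectral parameter.
    """
    area = 0.0
    for i in range(1, len(spectral_params)):
        width = spectral_params[i] - spectral_params[i - 1]
        area += width * (performances[i] + performances[i - 1]) / 2.0
    return area

# A model whose AUROC decays linearly from 0.9 to 0.5 across the spectrum:
sps = [0.0, 0.25, 0.5, 0.75, 1.0]
perf = [0.9, 0.8, 0.7, 0.6, 0.5]
print(round(auspc(sps, perf), 3))  # -> 0.7
```

<p>Because the spectral parameter spans the unit interval, the AUSPC of a linearly decaying curve equals its average performance.</p>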
<h3 id="handling-mutational-scan-datasets">Handling Mutational Scan Datasets</h3>
<p>For mutational scan datasets where sample barcodes map to multiple samples, SPECTRA introduces two modifications: (1) weighting nodes in the SPG by the number of samples they represent, and (2) running a subset sum algorithm to ensure 80/20 train-test splits by sample count.</p>
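<p>The paper runs a subset-sum algorithm for the 80/20 enforcement; the greedy largest-first heuristic below is a simplified stand-in to give the flavor. Names and the selection rule are illustrative, not the authors&rsquo; implementation:</p>

```python
def split_by_weight(group_sizes, test_frac=0.2):
    """Greedily assign barcode groups to the test set until it holds
    roughly `test_frac` of all samples, largest groups first.

    group_sizes: dict mapping barcode/group id -> number of samples.
    Returns (train_ids, test_ids).
    """
    target = test_frac * sum(group_sizes.values())
    test, filled = [], 0
    for gid, size in sorted(group_sizes.items(), key=lambda kv: -kv[1]):
        if filled + size <= target:   # group fits in the test budget
            test.append(gid)
            filled += size
    train = [g for g in group_sizes if g not in test]
    return train, test

groups = {"b1": 50, "b2": 30, "b3": 15, "b4": 5}   # 100 samples total
train, test = split_by_weight(groups)
print(test)  # -> ['b3', 'b4']  (20 of 100 samples)
```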
<h2 id="evaluation-across-18-datasets-and-19-models">Evaluation Across 18 Datasets and 19 Models</h2>
<p>The authors apply SPECTRA to 18 molecular sequencing datasets spanning three benchmarks (TAPE, PEER, ProteinGym) plus PDBBind, evaluating 19 models including CNNs, LSTMs, GNNs (GearNet), LLMs (ESM2), diffusion models (DiffDock), variational autoencoders (EVE), and logistic regression.</p>
<h3 id="benchmark-datasets">Benchmark Datasets</h3>
<p>The core evaluation covers five primary tasks:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Dataset</th>
          <th>Type</th>
          <th>Metric</th>
          <th>Samples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Rifampicin resistance (RIF)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>17,474</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Isoniazid">Isoniazid</a> resistance (INH)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>26,574</td>
      </tr>
      <tr>
          <td><a href="https://en.wikipedia.org/wiki/Pyrazinamide">Pyrazinamide</a> resistance (PZA)</td>
          <td>TB clinical isolates</td>
          <td>MSD</td>
          <td>AUROC</td>
          <td>12,146</td>
      </tr>
      <tr>
          <td>Fluorescence prediction</td>
          <td><a href="https://en.wikipedia.org/wiki/Green_fluorescent_protein">GFP</a> variants</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>54,024</td>
      </tr>
      <tr>
          <td>Vaccine escape</td>
          <td>SARS-CoV-2 RBD</td>
          <td>MSD</td>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>438,046</td>
      </tr>
  </tbody>
</table>
<p>Additional benchmarks include remote homology detection, secondary structure prediction, subcellular localization, and protein-ligand binding (PDBBind, Astex diverse set, Posebusters).</p>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight models were evaluated in depth across the five primary tasks: logistic regression, CNN, ESM2 (pretrained), ESM2-Finetuned, GearNet, GearNet-Finetuned, EVE, and SeqDesign. Additional models (LSTM, ResNet, DeepSF, Transformer, HHblits, Equibind, DiffDock, TankBind, Transception, MSA Transformer, ESM1v, Progen2) were evaluated on specific benchmark tasks.</p>
<h3 id="existing-splits-as-points-on-the-spc">Existing Splits as Points on the SPC</h3>
<p>SPECTRA reveals that existing benchmark splits correspond to specific points on the spectral performance curve. For instance:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Benchmark Split</th>
          <th>Cross-Split Overlap</th>
          <th>Spectral Parameter</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Remote homology</td>
          <td>TAPE family</td>
          <td>97%</td>
          <td>0.025</td>
      </tr>
      <tr>
          <td>Remote homology</td>
          <td>TAPE superfamily</td>
          <td>71%</td>
          <td>0.475</td>
      </tr>
      <tr>
          <td>Secondary structure</td>
          <td>CASP12</td>
          <td>48%</td>
          <td>0.5</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Equibind temporal</td>
          <td>76%</td>
          <td>0.55</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>LPPDBind similarity</td>
          <td>91%</td>
          <td>0.275</td>
      </tr>
      <tr>
          <td>Protein-ligand binding</td>
          <td>Posebusters</td>
          <td>70%</td>
          <td>0.575</td>
      </tr>
  </tbody>
</table>
<h2 id="performance-degradation-and-foundation-model-insights">Performance Degradation and Foundation Model Insights</h2>
<h3 id="universal-performance-decline">Universal Performance Decline</h3>
<p>All evaluated models demonstrate decreased performance as cross-split overlap decreases. Logistic regression drops from AUROC &gt; 0.9 to 0.5 for rifampicin resistance. ESM2-Finetuned decreases from Spearman&rsquo;s $\rho &gt; 0.9$ to less than 0.4 for GFP fluorescence prediction.</p>
<p>No single model achieves the highest AUSPC across all tasks. CNN maintains AUSPC &gt; 0.6 across all tasks but is surpassed by ESM2-Finetuned and ESM2 on rifampicin resistance. Some models retain reasonable performance even at $\mathbf{SP} = 1$ (minimal overlap): ESM2, ESM2-Finetuned, and CNN maintain AUROC &gt; 0.7 for RIF and PZA at this extreme.</p>
<h3 id="uncovering-hidden-spectral-properties">Uncovering Hidden Spectral Properties</h3>
<p>SPECTRA can detect unconsidered spectral properties through high variance in model performance at fixed spectral parameters. For rifampicin resistance, the CNN shows high variance at $\mathbf{SP} = 0.9$, $0.95$, and $1.0$ (standard deviations of 0.09, 0.10, and 0.08 respectively).</p>
<p>The authors trace this to the rifampicin resistance determining region (RRDR), a 26-amino-acid region of the rpoB gene. They define diff-RRDR as:</p>
<p>$$
\text{diff-RRDR} = \left(\max\left(\text{position}_{\text{train}}\right) - \max\left(\text{position}_{\text{test}}\right)\right) + \left(\min\left(\text{position}_{\text{train}}\right) - \min\left(\text{position}_{\text{test}}\right)\right)
$$</p>
<p>diff-RRDR correlates with CNN performance variance (Spearman&rsquo;s $\rho = -0.51$, p-value $= 1.79 \times 10^{-5}$) but not with ESM2 performance. The authors attribute this to ESM2&rsquo;s larger context window (512 positions vs. CNN&rsquo;s 12), making it more invariant to positional shifts in resistance-determining mutations.</p>
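<p>Given the mutated rpoB positions present in each split, diff-RRDR is a one-liner; the position lists below are hypothetical:</p>

```python
def diff_rrdr(train_positions, test_positions):
    """diff-RRDR: difference in the extremes of the mutated RRDR
    positions covered by the train vs. test split."""
    return ((max(train_positions) - max(test_positions))
            + (min(train_positions) - min(test_positions)))

# Hypothetical mutated rpoB positions seen in each split:
train_pos = [430, 445, 450, 452]
test_pos = [435, 440, 448]
print(diff_rrdr(train_pos, test_pos))  # -> (452-448) + (430-435) = -1
```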
<h3 id="foundation-model-generalizability">Foundation Model Generalizability</h3>
<p>For protein foundation models, SPECTRA reveals that AUSPC correlates with the similarity between task-specific datasets and the pretraining dataset. ESM2&rsquo;s AUSPC varies from 0.91 (RIF) to 0.26 (SARS-CoV-2). The correlation between UniRef50 overlap and AUSPC is strong (Spearman&rsquo;s $\rho = 0.9$, p-value $= 1.4 \times 10^{-27}$).</p>
<p>This finding holds across multiple foundation models (Transception, MSA Transformer, ESM1v, Progen2) evaluated on five ProteinGym datasets (Spearman&rsquo;s $\rho = 0.9$, p-value $= 0.04$). Fine-tuning improves AUSPC for tasks with low pretraining overlap (PZA, SARS-CoV-2, GFP).</p>
<h3 id="computational-cost">Computational Cost</h3>
<p>Generating SPECTRA splits ranges from 5 minutes (amyloid beta aggregation) to 9 hours (PDBBind). Generating spectral performance curves ranges from 1 hour (logistic regression) to 5 days (ESM2-Finetuned). The authors recommend releasing SPECTRA splits alongside new benchmarks to amortize this cost.</p>
<h2 id="limitations-and-future-directions">Limitations and Future Directions</h2>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Spectral property selection is pivotal</strong>: The choice of spectral property must be biologically informed and task-specific. Standardized definitions across the community are needed.</li>
<li><strong>Computational cost</strong>: Running SPECTRA is expensive, especially for large models. The authors mitigate this with multi-core CPU parallelization and multi-GPU training.</li>
<li><strong>Not a model ranking tool</strong>: SPECTRA is designed for understanding generalizability patterns, not for ranking models. Proper ranking requires averaging AUSPCs across many tasks in a standardized benchmark.</li>
<li><strong>Spectral parameter vs. cross-split overlap</strong>: The minimal achievable cross-split overlap varies across tasks, so SPECTRA plots performance against the spectral parameter rather than overlap directly. This means the AUSPC reflects relative impact on performance per unit decrease in overlap.</li>
</ul>
<p>The authors envision SPECTRA as a foundation for next-generation molecular benchmarks that explicitly characterize generalizability across the full spectrum of distribution shift, applicable beyond molecular data to small molecule therapeutics, inverse protein folding, and patient-level clinical datasets.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>All data used in this study is publicly available.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Evaluation</td>
          <td>TB RIF resistance</td>
          <td>17,474 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB INH resistance</td>
          <td>26,574 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TB PZA resistance</td>
          <td>12,146 isolates</td>
          <td>From Green et al. (2022)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>GFP fluorescence</td>
          <td>54,024 samples</td>
          <td>From Sarkisyan et al. (2016)</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>SARS-CoV-2 escape</td>
          <td>438,046 samples</td>
          <td>From Greaney et al. (2021)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>TAPE (remote homology, secondary structure)</td>
          <td>Various</td>
          <td>From Rao et al. (2019)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PEER (subcellular localization)</td>
          <td>13,949 samples</td>
          <td>From Xu et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>ProteinGym (amyloid, RRM)</td>
          <td>Various</td>
          <td>From Notin et al. (2022)</td>
      </tr>
      <tr>
          <td>Benchmark</td>
          <td>PDBBind (protein-ligand binding)</td>
          <td>14,993-16,742 complexes</td>
          <td>From Wang et al. (2005)</td>
      </tr>
  </tbody>
</table>
<p>Data is also available on <a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Spectral property comparison uses Biopython pairwise alignment (match=1, mismatch=-2, gap=-2.5) with a 0.3 similarity threshold for sequence-to-sequence datasets</li>
<li>Greedy randomized maximal independent set approximation for split generation</li>
<li>Spectral parameter incremented in 0.05 steps from 0 to 1</li>
<li>Three random seeds per spectral parameter value</li>
<li>80/20 train-test split ratio enforced via subset sum for mutational scan datasets</li>
</ul>
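<p>As a rough stand-in for the Biopython aligner, a textbook Needleman-Wunsch global alignment with the same scoring parameters (match=1, mismatch=-2, gap=-2.5) can produce the sequence identity that is thresholded at 0.3. This sketch uses a simple linear gap penalty, which may differ from the paper&rsquo;s exact gap handling:</p>

```python
def global_align_identity(a, b, match=1.0, mismatch=-2.0, gap=-2.5):
    """Needleman-Wunsch global alignment, returning sequence identity:
    the fraction of alignment columns where the two sequences agree."""
    n, m = len(a), len(b)
    # DP score matrix with a linear gap penalty.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback, counting matching columns and total columns.
    i, j, matches, cols = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
                match if a[i - 1] == b[j - 1] else mismatch):
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        cols += 1
    return matches / cols

# Two sequences differing by a single-residue deletion: 5 of 6 columns match.
print(round(global_align_identity("ACGTAC", "ACGAC"), 2))  # -> 0.83
```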
<h3 id="models">Models</h3>
<ul>
<li>ESM2: 650M parameter version from Lin et al. (2023)</li>
<li>ESM2-Finetuned: First 30 layers frozen, masked language head replaced with linear prediction layer</li>
<li>GearNet and GearNet-Finetuned: Protein structures generated via ESMFold</li>
<li>CNN: Architecture from Green et al. (2022), one-hot encoded sequences</li>
<li>Logistic regression: One-hot encoded mutational barcodes</li>
<li>EVE and SeqDesign: MSAs constructed via Jackhmmer against UniRef100</li>

</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUROC</td>
          <td>TB resistance (RIF, INH, PZA)</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Spearman&rsquo;s $\rho$</td>
          <td>GFP fluorescence, SARS-CoV-2 escape</td>
          <td>Regression tasks</td>
      </tr>
      <tr>
          <td>Accuracy</td>
          <td>Remote homology, secondary structure, subcellular localization</td>
          <td>Per-label/class accuracy</td>
      </tr>
      <tr>
          <td>RMSE</td>
          <td>Protein-ligand binding</td>
          <td>Predicted vs. actual complex</td>
      </tr>
      <tr>
          <td>AUSPC</td>
          <td>All tasks</td>
          <td>Area under spectral performance curve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Most models: 1x Tesla A10 GPU</li>
<li>ESM2-Finetuned: 4x Tesla A100 GPUs on Azure cluster</li>
<li>Hyperparameter optimization: Weights &amp; Biases random search over learning rate</li>
<li>All code in PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mims-harvard/SPECTRA">SPECTRA Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Framework implementation and reproduction scripts</td>
      </tr>
      <tr>
          <td><a href="https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W5UUNN">Harvard Dataverse</a></td>
          <td>Dataset</td>
          <td>CC0 1.0</td>
          <td>All datasets and generated splits</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Ektefaie, Y., Shen, A., Bykova, D., Marin, M. G., Zitnik, M., &amp; Farhat, M. (2024). Evaluating generalizability of artificial intelligence models for molecular datasets. <em>Nature Machine Intelligence</em>, 6(12), 1512-1524. <a href="https://doi.org/10.1038/s42256-024-00931-6">https://doi.org/10.1038/s42256-024-00931-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{ektefaie2024evaluating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evaluating generalizability of artificial intelligence models for molecular datasets}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian G. and Zitnik, Marinka and Farhat, Maha}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1512--1524}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-024-00931-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PMO: Benchmarking Sample-Efficient Molecular Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/</guid><description>PMO benchmarks 25 molecular optimization algorithms across 23 tasks under a 10K oracle budget, finding older methods like REINVENT still lead.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-optimization">A Standardized Benchmark for Molecular Optimization</h2>
<p>This is a <strong>Resource</strong> paper that introduces PMO (Practical Molecular Optimization), an open-source benchmark for evaluating molecular optimization algorithms with a focus on sample efficiency. The primary contribution is not a new algorithm but a comprehensive evaluation framework that exposes blind spots in how the field measures progress. By benchmarking 25 methods across 23 oracle functions under a fixed budget of 10,000 oracle calls, the authors provide a standardized protocol for transparent and reproducible comparison of molecular design methods.</p>
<h2 id="the-missing-dimension-oracle-budget-in-molecular-design">The Missing Dimension: Oracle Budget in Molecular Design</h2>
<p>Molecular optimization is central to drug and materials discovery, and the field has seen rapid growth in computational methods. Despite this progress, the authors identify three persistent problems with how methods are evaluated:</p>
<ol>
<li>
<p><strong>Lack of oracle budget control</strong>: Most papers do not report how many candidate molecules were evaluated by the oracle to achieve their results, despite this number spanning orders of magnitude. In practice, the most valuable oracles (wet-lab experiments, high-accuracy simulations) are expensive, making sample efficiency critical.</p>
</li>
<li>
<p><strong>Trivial or self-designed oracles</strong>: Many papers only report on easy objectives like QED or penalized LogP, or introduce custom tasks that make cross-method comparison impossible.</p>
</li>
<li>
<p><strong>Insufficient handling of randomness</strong>: Many algorithms are stochastic, yet existing benchmarks examined no more than five methods and rarely reported variance across independent runs.</p>
</li>
</ol>
<p>Prior benchmarks such as <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>, Therapeutics Data Commons (TDC), and Tripp et al.&rsquo;s analysis each suffer from at least one of these issues. PMO addresses all three simultaneously.</p>
<h2 id="the-pmo-benchmark-design">The PMO Benchmark Design</h2>
<p>The core innovation of PMO is its evaluation protocol rather than any single algorithmic contribution. The benchmark enforces three design principles:</p>
<p><strong>Oracle budget constraint</strong>: All methods are limited to 10,000 oracle calls. This is deliberately much smaller than the unconstrained budgets typical in the literature, reflecting the practical reality that experimental evaluations are costly.</p>
<p><strong>AUC-based metric</strong>: Instead of reporting only the final top-K score, PMO uses the area under the curve (AUC) of top-K average property value versus oracle calls:</p>
<p>$$
\text{AUC Top-}K = \int_{0}^{N} \bar{f}_{K}(n) \, dn
$$</p>
<p>where $\bar{f}_{K}(n)$ is the average property value of the top $K$ molecules found after $n$ oracle calls, and $N = 10{,}000$. The paper uses $K = 10$. This metric rewards methods that reach high property values quickly, not just those that eventually converge given enough budget. All AUC values are min-max scaled to [0, 1].</p>
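<p>The metric can be computed incrementally from the oracle call history. A sketch under two assumptions not fixed by the text above: oracle scores arrive already normalized to [0, 1], and the integral is divided by the budget $N$ as one way to realize the min-max scaling:</p>

```python
import heapq

def auc_top_k(scores, k=10, budget=10_000):
    """AUC of the running top-k mean property value vs. oracle calls,
    normalized by the budget so a run of all-1.0 oracle scores gives 1.0.

    scores: oracle values (in [0, 1]) in the order molecules were queried.
    """
    top = []          # min-heap holding the best k scores seen so far
    auc = 0.0
    for s in scores[:budget]:
        if len(top) < k:
            heapq.heappush(top, s)
        elif s > top[0]:
            heapq.heapreplace(top, s)
        auc += sum(top) / len(top)   # running top-k mean after this call
    # Calls past the end of `scores` keep the final top-k mean.
    if scores and len(scores) < budget:
        auc += (budget - len(scores)) * sum(top) / len(top)
    return auc / budget

# A method that finds score-1.0 molecules immediately scores a perfect 1.0:
print(auc_top_k([1.0] * 100, k=10, budget=100))  # -> 1.0
```

<p>This running formulation makes the metric&rsquo;s bias explicit: a method that wastes its early oracle calls is penalized for every subsequent call, even if it eventually converges to the same top-10.</p>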
<p><strong>Standardized data</strong>: All methods use only the ZINC 250K dataset (approximately 250,000 molecules) whenever a database is required, ensuring a level playing field.</p>
<p>The benchmark includes 23 oracle functions: QED, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">DRD2</a>, <a href="https://en.wikipedia.org/wiki/GSK-3">GSK3</a>-beta, <a href="https://en.wikipedia.org/wiki/C-Jun_N-terminal_kinase">JNK3</a>, and 19 oracles from <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> covering multi-property objectives (MPOs) based on similarity, molecular weight, CLogP, and other pharmaceutically relevant criteria. All oracle scores are normalized to [0, 1].</p>
<h2 id="25-methods-across-nine-algorithm-families">25 Methods Across Nine Algorithm Families</h2>
<p>The benchmark evaluates 25 molecular optimization methods organized along two dimensions: molecular assembly strategy (SMILES, SELFIES, atom-level graphs, fragment-level graphs, synthesis-based) and optimization algorithm (GA, MCTS, BO, VAE, GAN, score-based modeling, hill climbing, RL, gradient ascent). Each method was hyperparameter-tuned on two held-out tasks (zaleplon_mpo and perindopril_mpo) and then evaluated across all 23 oracles for 5 independent runs.</p>
<p>The following table summarizes the top 10 methods by sum of mean AUC Top-10 across all 23 tasks:</p>
<table>
  <thead>
      <tr>
          <th>Rank</th>
          <th>Method</th>
          <th>Assembly</th>
          <th>Sum AUC Top-10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1</td>
          <td><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></td>
          <td>SMILES</td>
          <td>14.196</td>
      </tr>
      <tr>
          <td>2</td>
          <td>Graph GA</td>
          <td>Fragments</td>
          <td>13.751</td>
      </tr>
      <tr>
          <td>3</td>
          <td>SELFIES-REINVENT</td>
          <td>SELFIES</td>
          <td>13.471</td>
      </tr>
      <tr>
          <td>4</td>
          <td>GP BO</td>
          <td>Fragments</td>
          <td>13.156</td>
      </tr>
      <tr>
          <td>5</td>
          <td><a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a></td>
          <td>SELFIES</td>
          <td>13.024</td>
      </tr>
      <tr>
          <td>6</td>
          <td>LSTM HC</td>
          <td>SMILES</td>
          <td>12.223</td>
      </tr>
      <tr>
          <td>7</td>
          <td>SMILES GA</td>
          <td>SMILES</td>
          <td>12.054</td>
      </tr>
      <tr>
          <td>8</td>
          <td>SynNet</td>
          <td>Synthesis</td>
          <td>11.498</td>
      </tr>
      <tr>
          <td>9</td>
          <td>DoG-Gen</td>
          <td>Synthesis</td>
          <td>11.456</td>
      </tr>
      <tr>
          <td>10</td>
          <td>DST</td>
          <td>Fragments</td>
          <td>10.989</td>
      </tr>
  </tbody>
</table>
<p>The bottom five methods by overall ranking were GFlowNet-AL, Pasithea, JT-VAE, Graph MCTS, and MolDQN.</p>
<p>REINVENT is ranked first across all six metrics considered (AUC Top-1, AUC Top-10, AUC Top-100, Top-1, Top-10, Top-100). Graph GA is consistently second. Both methods were released several years before many of the methods they outperform, yet they are rarely used as baselines in newer work.</p>
<h2 id="key-findings-older-methods-win-and-selfies-offers-limited-advantage">Key Findings: Older Methods Win and SELFIES Offers Limited Advantage</h2>
<p>The benchmark yields several findings with practical implications:</p>
<p><strong>No method solves optimization within realistic budgets.</strong> None of the 25 methods can optimize the included objectives within hundreds of oracle calls (the scale at which experimental evaluations would be feasible), except for trivially easy oracles like QED, DRD2, and osimertinib_mpo.</p>
<p><strong>Older algorithms remain competitive.</strong> REINVENT (2017) and Graph GA (2019) outperform all newer methods tested, including those published at top AI conferences. The absence of standardized benchmarking had obscured this fact.</p>
<p><strong>SMILES versus SELFIES.</strong> <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> was designed to guarantee syntactically valid molecular strings, but head-to-head comparisons show that SELFIES-based variants of language model methods (REINVENT, LSTM HC, VAE) generally do not outperform their <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> counterparts. Modern language models learn SMILES grammar well enough that syntactic invalidity is no longer a practical issue. The one exception is genetic algorithms, where SELFIES-based GAs (<a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED</a>) outperform SMILES-based GAs, likely because SELFIES provides more intuitive mutation operations.</p>
<p><strong>Model-based methods need careful design.</strong> Model-based variants (GP BO relative to Graph GA, GFlowNet-AL relative to GFlowNet) do not consistently outperform their model-free counterparts. GP BO outperformed Graph GA in 12 of 23 tasks but trailed it in the summed AUC ranking used for the overall comparison, and GFlowNet-AL underperformed GFlowNet in nearly every task. The bottleneck is the quality of the predictive surrogate model, and naive surrogate integration can actually hurt performance.</p>
<p><strong>Oracle landscape determines method suitability.</strong> Clustering analysis of relative AUC Top-10 scores reveals clear patterns. String-based GAs excel on isomer-type oracles (which are sums of atomic contributions), while RL-based and fragment-based methods perform better on similarity-based MPOs. This suggests there is no single best algorithm, and method selection should be informed by the optimization landscape.</p>
<p><strong>Hyperparameter tuning and multiple runs are essential.</strong> Optimal hyperparameters differed substantially from default values in original papers. For example, REINVENT&rsquo;s performance is highly sensitive to its sigma parameter, and the best value under the constrained-budget setting is much larger than originally suggested. Methods like Graph GA and GP BO also show high variance across runs, underscoring the importance of reporting distributional outcomes rather than single-run results.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations: they cannot exhaustively tune every hyperparameter or include every variant of each method; the conclusion may be biased toward similarity-based oracles (which dominate the 23 tasks); important quantities like synthesizability and diversity are not thoroughly evaluated; and oracle calls from pre-training data in model-based methods are counted against the budget, which may disadvantage methods that could leverage prior data collection. For a follow-up study that adds property filters and diversity requirements to the PMO evaluation, see <a href="/notes/chemistry/molecular-design/generation/evaluation/sample-efficiency-de-novo-generation/">Re-evaluating Sample Efficiency</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Molecule library</td>
          <td>ZINC 250K</td>
          <td>~250,000 molecules</td>
          <td>Used for screening, pre-training generative models, and fragment extraction</td>
      </tr>
      <tr>
          <td>Oracle functions</td>
          <td>TDC / GuacaMol</td>
          <td>23 tasks</td>
          <td>All scores normalized to [0, 1]</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>25 molecular optimization methods spanning 9 algorithm families and 5 molecular assembly strategies. Each method was hyperparameter-tuned on 2 held-out tasks (zaleplon_mpo, perindopril_mpo) using 3 independent runs, then evaluated on all 23 tasks with 5 independent runs each.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>AUC Top-K</td>
          <td>Area under curve of top-K average vs. oracle calls</td>
          <td>Primary metric; K=10; min-max scaled to [0, 1]</td>
      </tr>
      <tr>
          <td>Top-K</td>
          <td>Final top-K average property value at 10K calls</td>
          <td>Secondary metric</td>
      </tr>
      <tr>
          <td>Sum rank</td>
          <td>Sum of AUC Top-10 across all 23 tasks</td>
          <td>Used for overall ranking</td>
      </tr>
  </tbody>
</table>
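<p>As a concrete illustration of the primary metric, the sketch below computes AUC Top-K from a stream of oracle scores. This is not the reference implementation (the official PMO code may, for instance, sample the curve at fixed intervals of oracle calls); it assumes scores are already normalized to [0, 1] and pads early-stopping runs with their final value, one common convention.</p>

```python
import heapq

def auc_top_k(scores, k=10, budget=10_000):
    """Sketch of the PMO-style AUC Top-K metric: track the running
    mean of the best k oracle scores as a function of oracle calls,
    then average that curve over the full budget so the result lands
    in [0, 1] when scores are normalized to [0, 1]."""
    top_k = []   # min-heap holding the best k scores seen so far
    curve = []
    for s in scores[:budget]:
        if len(top_k) < k:
            heapq.heappush(top_k, s)
        elif s > top_k[0]:
            heapq.heapreplace(top_k, s)
        curve.append(sum(top_k) / len(top_k))
    if not curve:
        return 0.0
    # Pad runs that stop before exhausting the budget with their
    # final value, so early termination is not rewarded.
    curve += [curve[-1]] * (budget - len(curve))
    return sum(curve) / budget
```

<p>Since methods like Graph GA and GP BO show high run-to-run variance, this quantity would be computed per run and reported as a distribution (e.g., mean and standard deviation over the 5 independent runs) rather than as a single number.</p>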
<h3 id="hardware">Hardware</h3>
<p>The paper states hardware details are in Appendix C.2. The benchmark runs on standard compute infrastructure and does not require GPUs for most methods. Specific compute requirements vary by method.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/wenhao-gao/mol_opt">mol_opt</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Full benchmark implementation with all 25 methods</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark results</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>All experimental results from the paper</td>
      </tr>
      <tr>
          <td><a href="https://tdcommons.ai">TDC</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Oracle functions and evaluation infrastructure</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{gao2022sample,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Wenhao and Fu, Tianfan and Sun, Jimeng and Coley, Connor W.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{35}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{21342--21357}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, W., Fu, T., Sun, J., &amp; Coley, C. W. (2022). Sample Efficiency Matters: A Benchmark for Practical Molecular Optimization. <em>Advances in Neural Information Processing Systems</em>, 35, 21342-21357. <a href="https://arxiv.org/abs/2206.12411">https://arxiv.org/abs/2206.12411</a></p>
<p><strong>Publication</strong>: NeurIPS 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/wenhao-gao/mol_opt">PMO Benchmark Code (GitHub)</a></li>
<li><a href="https://figshare.com/articles/dataset/Results_for_practival_molecular_optimization_PMO_benchmark/20123453">Benchmark Results (Figshare)</a></li>
<li><a href="https://tdcommons.ai">Therapeutics Data Commons</a></li>
</ul>
]]></content:encoded></item><item><title>MolScore: Scoring and Benchmarking for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molscore-scoring-benchmarking-framework/</guid><description>MolScore provides a unified, open-source Python framework for scoring, evaluating, and benchmarking generative models applied to de novo drug design.</description><content:encoded><![CDATA[<h2 id="a-unified-resource-for-generative-molecular-design">A Unified Resource for Generative Molecular Design</h2>
<p>MolScore is a <strong>Resource</strong> paper that introduces an open-source Python framework for scoring, evaluating, and benchmarking generative models in de novo drug design. The primary contribution is the software itself: a modular, configurable platform that consolidates functionality previously scattered across multiple tools (GuacaMol, MOSES, MolOpt, REINVENT, TDC) into a single package. MolScore provides scoring functions for molecular optimization, evaluation metrics for assessing the quality of generated molecules, and a benchmark mode for standardized comparison of generative models.</p>
<h2 id="the-fragmented-landscape-of-generative-model-evaluation">The Fragmented Landscape of Generative Model Evaluation</h2>
<p>Generative models for molecular design have proliferated rapidly, but evaluating and comparing them remains difficult. Existing benchmarks each address only part of the problem:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> provides 20 fixed optimization objectives but cannot separate top-performing models on most tasks, and custom objectives require code modification.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong> focuses on distribution-learning metrics but does not support molecular optimization.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">MolOpt</a></strong> extends benchmark evaluation to 25 generative approaches but lacks evaluation of the quality of generated chemistry.</li>
<li><strong>Docking benchmarks</strong> (<a href="/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/">smina-docking-benchmark</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/">DOCKSTRING</a>, TDC) test structure-based scoring but often lack proper ligand preparation, leading generative models to exploit non-holistic objectives by generating large or greasy molecules.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong> provides configurable scoring functions but is tightly coupled to its own generative model architecture.</li>
</ul>
<p>No single tool offered configurable objectives, comprehensive evaluation metrics, generative-model-agnostic design, and graphical user interfaces together. This fragmentation forces practitioners to write custom glue code and makes reproducible comparison across methods difficult.</p>
<h2 id="modular-architecture-for-scoring-evaluation-and-benchmarking">Modular Architecture for Scoring, Evaluation, and Benchmarking</h2>
<p>MolScore is split into two sub-packages:</p>
<h3 id="molscore-molecule-scoring">molscore: Molecule Scoring</h3>
<p>The <code>molscore</code> sub-package handles iterative scoring of <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> generated by any generative model. The workflow for each iteration:</p>
<ol>
<li>Parse and validate SMILES via RDKit, canonicalize, and check intra-batch uniqueness.</li>
<li>Cross-reference against previously generated molecules to reuse cached scores (saving compute for expensive scoring functions like docking).</li>
<li>Run user-specified scoring functions on valid, unique molecules (invalid molecules receive a score of 0).</li>
<li>Transform each score to a 0-1 range using configurable transformation functions (normalize, linear threshold, Gaussian threshold, step threshold).</li>
<li>Aggregate transformed scores into a single desirability score using configurable aggregation (weighted sum, product, geometric mean, arithmetic mean, <a href="https://en.wikipedia.org/wiki/Pareto_front">Pareto front</a>, or auto-weighted variants).</li>
<li>Optionally apply diversity filters to penalize non-diverse molecules, or use any scoring function as a multiplicative filter.</li>
</ol>
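<p>The per-iteration loop above can be sketched as follows. This is an illustrative approximation, not MolScore's actual API: RDKit validation and canonicalization are replaced by simple deduplication, and the scoring functions, the linear-threshold transform, and the weighted geometric-mean aggregation are hypothetical stand-ins for the configurable options listed.</p>

```python
import math

def linear_threshold(x, lo, hi):
    """Map a raw score to [0, 1]: 0 below lo, 1 above hi, linear between."""
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def score_batch(smiles_batch, scorers, weights, cache):
    """One scoring iteration (illustrative only): enforce intra-batch
    uniqueness, reuse cached scores for previously seen molecules,
    score and transform, then aggregate with a weighted geometric
    mean. `scorers` maps a name to a (raw_fn, transform_fn) pair;
    all names here are hypothetical."""
    results = {}
    for smi in dict.fromkeys(smiles_batch):   # intra-batch uniqueness
        if smi in cache:                      # reuse expensive scores
            results[smi] = cache[smi]
            continue
        transformed = [tf(fn(smi)) for fn, tf in scorers.values()]
        total_w = sum(weights)
        # weighted geometric mean of transformed scores (clamped away
        # from zero so log() stays defined)
        agg = math.exp(sum(w * math.log(max(t, 1e-9))
                           for w, t in zip(weights, transformed)) / total_w)
        cache[smi] = agg
        results[smi] = agg
    return results
```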
<p>The full objective is specified in a single JSON configuration file, with a Streamlit GUI provided for interactive configuration writing. The available scoring functions span:</p>
<table>
  <thead>
      <tr>
          <th>Category</th>
          <th>Examples</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Descriptors</td>
          <td>RDKit descriptors, linker descriptors, penalized logP</td>
      </tr>
      <tr>
          <td>Similarity</td>
          <td>Fingerprint similarity, ROCS, Open3DAlign, substructure matching</td>
      </tr>
      <tr>
          <td>Predictive models</td>
          <td>Scikit-learn models, PIDGINv5 (2,337 ChEMBL31 targets), ChemProp, ADMET-AI</td>
      </tr>
      <tr>
          <td>Docking</td>
          <td>Glide, PLANTS, GOLD, OEDock, Smina, Gnina, Vina, rDock</td>
      </tr>
      <tr>
          <td>Synthesizability</td>
          <td>SA score, RA Score, AiZynthFinder, reaction filters</td>
      </tr>
  </tbody>
</table>
<p>Most scoring functions support multiprocessing, and computationally expensive functions (docking, ligand preparation) can be distributed across compute clusters via Dask.</p>
<h3 id="moleval-molecule-evaluation">moleval: Molecule Evaluation</h3>
<p>The <code>moleval</code> sub-package computes performance metrics on generated molecules relative to reference datasets. It extends the MOSES metric suite with additional intrinsic metrics (sphere exclusion diversity, scaffold uniqueness, functional group and ring system diversity, ZINC20 purchasability via molbloom) and extrinsic metrics (analogue similarity/coverage, functional group and ring system similarity, outlier bits or &ldquo;Silliness&rdquo;).</p>
<h3 id="benchmark-mode">Benchmark Mode</h3>
<p>A <code>MolScoreBenchmark</code> class iterates over a list of JSON configuration files, providing standardized comparison. Pre-built presets reimplement GuacaMol and MolOpt benchmarks, and users can define custom benchmark suites without writing code.</p>
<h2 id="case-studies-5-ht2a-ligand-design-and-fine-tuning-evaluation">Case Studies: 5-HT2A Ligand Design and Fine-Tuning Evaluation</h2>
<p>The authors demonstrate MolScore with a SMILES-based RNN generative model using <a href="/notes/chemistry/molecular-design/generation/rl-tuned/augmented-hill-climb-rl-molecule-generation/">Augmented Hill-Climb</a> for optimization, designing serotonin <a href="https://en.wikipedia.org/wiki/5-HT2A_receptor">5-HT2A</a> receptor ligands across three objective sets of increasing complexity.</p>
<h3 id="first-objective-set-basic-drug-properties">First Objective Set: Basic Drug Properties</h3>
<p>Four objectives combine predicted 5-HT2A activity (via PIDGINv5 random forest models at 1 uM) with synthesizability (RAscore) and/or <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a> permeability property ranges (<a href="https://en.wikipedia.org/wiki/Polar_surface_area">TPSA</a> &lt; 70, HBD &lt; 2, logP 2-4, MW &lt; 400). All objectives were optimized successfully, with diversity filters preventing mode collapse. Counterintuitively, the single objective of 5-HT2A activity alone proved hardest, primarily because the diversity filter penalized similar molecules more heavily on this relatively easy task.</p>
<h3 id="second-objective-set-selectivity">Second Objective Set: Selectivity</h3>
<p>Six objectives incorporate selectivity proxies using PIDGINv5 models for off-target prediction against <a href="https://en.wikipedia.org/wiki/G_protein-coupled_receptor">Class A GPCR</a> membrane receptors (266 models), the <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">D2 dopamine receptor</a>, dopamine receptor family, serotonin receptor subtypes, and combinations. These proved substantially harder: selectivity against dopamine and serotonin receptor families combined was barely improved during optimization. Even with imperfect predictive models, the PIDGINv5 ensemble correctly identified 95 of 126 known selective 5-HT2A ligands. Nearest-neighbor analysis of de novo molecules (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto similarity</a> 0.3-0.6) showed they tended to be structurally simpler versions of known selective ligands.</p>
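<p>The nearest-neighbor analysis above relies on Tanimoto similarity between fingerprints; over fingerprints represented as sets of on-bit indices it reduces to the Jaccard index:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two molecular
    fingerprints given as sets of on-bit indices:
    |A intersect B| / |A union B|."""
    if not fp_a and not fp_b:
        return 0.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)
```

<p>In practice the fingerprints would come from a cheminformatics toolkit such as RDKit; the set representation here is just the minimal form of the same computation.</p>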
<h3 id="third-objective-set-structure-based-docking">Third Objective Set: Structure-Based Docking</h3>
<p>Two objectives use molecular docking via GlideSP into 5-HT2A (PDB: 6A93) and D2 (PDB: 6CM4) crystal structures with full ligand preparation (LigPrep for stereoisomer/tautomer/protonation state enumeration). Multi-parameter optimization includes docking score, D155 polar interaction constraint, formal charge, and consecutive rotatable bond limits. Single-target docking scores reached the mean of known ligands within 200 steps, but optimizing for divergent 5-HT2A vs D2 docking scores was much harder due to binding pocket similarity. Protein-ligand interaction fingerprint analysis (ProLIF) revealed that molecules optimized for selectivity avoided specific binding pocket regions shared between the two receptors.</p>
<h3 id="evaluation-case-study-fine-tuning-epochs">Evaluation Case Study: Fine-Tuning Epochs</h3>
<p>The moleval sub-package was used to track metrics across fine-tuning epochs of a SMILES RNN on A2A receptor ligands, showing that just one or two epochs sufficed to increase similarity to the fine-tuning set, while further epochs reduced novelty and diversity.</p>
<h2 id="configurable-benchmarking-with-practical-drug-design-relevance">Configurable Benchmarking with Practical Drug Design Relevance</h2>
<p>MolScore provides a more comprehensive platform than any single existing tool. Compared to prior work:</p>
<table>
  <thead>
      <tr>
          <th>Feature</th>
          <th>GuacaMol</th>
          <th>MOSES</th>
          <th>MolOpt</th>
          <th>TDC</th>
          <th>REINVENT</th>
          <th>MolScore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Configurable objectives</td>
          <td>No</td>
          <td>N/A</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Optimization objectives</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Evaluation metrics</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>Model-agnostic</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>Yes</td>
          <td>No</td>
          <td>Yes</td>
      </tr>
      <tr>
          <td>GUI</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>No</td>
          <td>Yes</td>
          <td>Yes</td>
      </tr>
  </tbody>
</table>
<p>The framework integrates into any Python-based generative model in three lines of code. Dependency conflicts between scoring function libraries are handled by running conflicting components as local servers from isolated conda environments.</p>
<p>Key limitations acknowledged by the authors include: the assumption of conda for environment management, the inherent difficulty of designing non-exploitable objectives, and the fact that ligand-based predictive models may have limited applicability domains for out-of-distribution de novo molecules.</p>
<p>Future directions include accepting 3D molecular conformations as inputs, structure interaction fingerprint rescoring, and dynamic configuration files for curriculum learning.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>ChEMBL compounds</td>
          <td>Not specified</td>
          <td>Standard ChEMBL training set for SMILES RNN</td>
      </tr>
      <tr>
          <td>Evaluation reference</td>
          <td>5-HT2A ligands from ChEMBL31</td>
          <td>3,771 compounds</td>
          <td>Extracted for score distribution comparison</td>
      </tr>
      <tr>
          <td>Activity models</td>
          <td>PIDGINv5 on ChEMBL31</td>
          <td>2,337 target models</td>
          <td>Random forest classifiers at various concentration thresholds</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>A2A receptor ligands</td>
          <td>Not specified</td>
          <td>Used for moleval case study</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>The generative model used in case studies is a SMILES-based RNN with Augmented Hill-Climb reinforcement learning. Diversity filters penalize non-diverse molecules during optimization. Score transformation functions (normalize, linear threshold, Gaussian threshold, step threshold) map raw scores to 0-1 range. Aggregation functions (arithmetic mean, weighted sum, product, geometric mean, Pareto front) combine multi-parameter objectives.</p>
<h3 id="models">Models</h3>
<p>PIDGINv5 provides 2,337 pre-trained random forest classifiers on ChEMBL31 targets. RAscore provides pre-trained synthesizability prediction. ADMET-AI and ChemProp models are supported via isolated environments. Docking uses GlideSP with LigPrep for ligand preparation in the structure-based case study.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Intrinsic metrics: validity, uniqueness, scaffold uniqueness, internal diversity, sphere exclusion diversity, Solow-Polasky diversity, scaffold diversity, functional group diversity, ring system diversity, MCF and <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> filters, ZINC20 purchasability.</p>
<p>Extrinsic metrics: novelty, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">FCD</a>, analogue similarity/coverage, functional group similarity, ring system similarity, SNN similarity, fragment similarity, scaffold similarity, outlier bits, Wasserstein distance on LogP/SA/NP/QED/MW.</p>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Docking-based objectives can be distributed across compute clusters via Dask.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore">MolScore</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Main framework, installable via pip</td>
      </tr>
      <tr>
          <td><a href="https://github.com/MorganCThomas/MolScore_examples">MolScore Examples</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Integration examples with SMILES-RNN, CReM, GraphGA</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Thomas, M., O&rsquo;Boyle, N. M., Bender, A., &amp; de Graaf, C. (2024). MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design. <em>Journal of Cheminformatics</em>, 16(1), 64. <a href="https://doi.org/10.1186/s13321-024-00861-w">https://doi.org/10.1186/s13321-024-00861-w</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{thomas2024molscore,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolScore: a scoring, evaluation and benchmarking framework for generative models in de novo drug design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Thomas, Morgan and O&#39;Boyle, Noel M. and Bender, Andreas and de Graaf, Chris}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{64}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{BioMed Central}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-024-00861-w}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenBench: Benchmarking Molecular Generative Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgenbench-molecular-generative-models/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgenbench-molecular-generative-models/</guid><description>MolGenBench benchmarks 17 molecular generative models across 120 protein targets using novel metrics for target awareness, hit rates, and lead optimization.</description><content:encoded><![CDATA[<h2 id="a-comprehensive-benchmark-for-structure-based-molecular-generation">A Comprehensive Benchmark for Structure-Based Molecular Generation</h2>
<p>MolGenBench is a <strong>Resource</strong> paper that provides a large-scale, application-oriented benchmark for evaluating molecular generative models in the context of structure-based drug design (SBDD). The primary contribution is a dataset of 220,005 experimentally validated active molecules across 120 protein targets, organized into 5,433 chemical series, along with a suite of novel evaluation metrics. The benchmark addresses both <a href="https://en.wikipedia.org/wiki/De_novo_drug_design">de novo molecular design</a> and hit-to-lead (H2L) optimization, a critical drug discovery stage that existing benchmarks largely ignore.</p>
<h2 id="gaps-in-existing-molecular-generation-benchmarks">Gaps in Existing Molecular Generation Benchmarks</h2>
<p>Despite rapid progress in deep generative models for drug discovery, the evaluation landscape has not kept pace. The authors identify four categories of limitations in existing benchmarks:</p>
<ol>
<li>
<p><strong>Dataset construction</strong>: Existing benchmarks use overly stringent activity cutoffs and too few protein targets. The widely used CrossDocked2020 dataset contains very few reference ligands per target, making it difficult to evaluate whether a model can rediscover the full distribution of active compounds.</p>
</li>
<li>
<p><strong>Model selection</strong>: Prior benchmark studies evaluate a narrow range of architectures and do not systematically examine the effects of training data composition, prior knowledge integration, or architectural paradigm.</p>
</li>
<li>
<p><strong>Evaluation scenarios</strong>: Existing benchmarks focus exclusively on de novo generation. Hit-to-lead optimization, where a hit compound is refined through R-group modifications, remains unstandardized.</p>
</li>
<li>
<p><strong>Evaluation metrics</strong>: Standard metrics (QED, Vina score, SA score) correlate strongly with atom count and fail to assess target-specific generation capacity. The AddCarbon model illustrates this: simply adding random carbon atoms to training molecules achieves near-perfect scores on standard metrics while producing nonsensical chemistry.</p>
</li>
</ol>
<h2 id="novel-metrics-for-evaluating-molecular-generation">Novel Metrics for Evaluating Molecular Generation</h2>
<p>MolGenBench introduces three key metrics designed to capture aspects of model performance that existing metrics miss.</p>
<h3 id="target-aware-score-tascore">Target-Aware Score (TAScore)</h3>
<p>The TAScore measures whether a model generates target-specific molecules rather than generic structures. It compares the ratio of active molecule or scaffold recovery on a specific target to the background recovery across all targets:</p>
<p>$$
\text{TAScore}_{\text{label}, i} = \frac{S_{i} / S_{\text{all}}}{R_{i} / R_{\text{all}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>For target $i$: $R_{\text{all}}$ is the total number of distinct molecules generated across all 120 targets; $R_{i}$ is the subset matching known actives for target $i$ (without conditioning on target $i$); $S_{\text{all}}$ is the total generated when conditioned on target $i$; and $S_{i}$ is the subset matching known actives for target $i$. A TAScore above 1 indicates the model uses target-specific information effectively.</p>
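<p>In code, the definition above is a direct ratio of hit fractions (a sketch with hypothetical counts):</p>

```python
def ta_score(s_i, s_all, r_i, r_all):
    """TAScore for target i: the hit fraction when conditioning on
    target i (S_i / S_all) divided by the background hit fraction
    across all targets (R_i / R_all). Values above 1 indicate the
    model exploits target-specific information."""
    return (s_i / s_all) / (r_i / r_all)
```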
<h3 id="hit-rate">Hit Rate</h3>
<p>The hit rate quantifies the efficiency of active compound discovery:</p>
<p>$$
\text{HitRate}_{\text{label}} = \frac{\mathcal{M}_{\text{active}}}{\mathcal{M}_{\text{sampled}}}; \quad \text{label} \in \{\text{SMILES}, \text{scaffold}\}
$$</p>
<p>where $\mathcal{M}_{\text{active}}$ is the number of unique active molecules or scaffolds found, and $\mathcal{M}_{\text{sampled}}$ is the total number of generated molecules.</p>
<h3 id="mean-normalized-affinity-mna-score">Mean Normalized Affinity (MNA) Score</h3>
<p>For H2L optimization, the MNA Score measures whether models generate compounds with improved potency relative to the known activity range within each chemical series:</p>
<p>$$
\text{NA}_{g} = \frac{\text{Affinity}_{g}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}{\text{Affinity}_{\max}^{\text{series}} - \text{Affinity}_{\min}^{\text{series}}}
$$</p>
<p>$$
\text{MNAScore} = \frac{1}{G} \sum_{g}^{G} \text{NA}_{g}
$$</p>
<p>This normalizes affinities to [0, 1] within each series, enabling cross-series comparison.</p>
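<p>A direct translation of the two equations above, using hypothetical affinity values. Note that a normalized affinity can exceed 1 when a generated compound beats the best known affinity in its series:</p>

```python
def mna_score(generated_affinities, series_min, series_max):
    """Mean Normalized Affinity for one series: min-max normalize
    each generated compound's affinity against the series' known
    activity range, then average over the G generated compounds."""
    span = series_max - series_min
    norm = [(a - series_min) / span for a in generated_affinities]
    return sum(norm) / len(norm)
```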
<h2 id="systematic-evaluation-of-17-generative-models-across-two-drug-discovery-scenarios">Systematic Evaluation of 17 Generative Models Across Two Drug Discovery Scenarios</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The MolGenBench dataset was built from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL v33</a>. Ligands failing RDKit validation were discarded, along with entries where binding affinity exceeded 10 uM. The 120 protein targets were selected based on minimum thresholds: at least 50 active molecules, at least 50 unique Bemis-Murcko scaffolds, and at least 20 distinct chemical series per target. For H2L optimization, maximum common substructures (MCS) were identified per series, with dual thresholds requiring the MCS to appear in over 80% of molecules and cover more than one-third of each molecule&rsquo;s atoms. The top 5 series per target (ranked by dockable ligands) formed the H2L test set: 600 compound series across 120 targets.</p>
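<p>The dual MCS thresholds can be expressed as a small check. This is a sketch with hypothetical inputs: the actual pipeline derives the MCS and per-molecule matches via substructure search (e.g., with RDKit), which is omitted here.</p>

```python
def mcs_passes(mcs_atoms, mol_atom_counts, match_flags,
               min_frac_mols=0.8, min_frac_atoms=1 / 3):
    """Check the dual MCS thresholds: the MCS must appear in more
    than min_frac_mols of the series' molecules, and must cover more
    than min_frac_atoms of each matched molecule's atoms.
    `match_flags[j]` says whether molecule j contains the MCS."""
    matched = [n for n, hit in zip(mol_atom_counts, match_flags) if hit]
    if len(matched) / len(mol_atom_counts) <= min_frac_mols:
        return False
    return all(mcs_atoms / n > min_frac_atoms for n in matched)
```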
<h3 id="evaluated-models">Evaluated Models</h3>
<p><strong>De novo models (10)</strong>: Pocket2Mol, TargetDiff, FLAG, DecompDiff, SurfGen, PocketFlow, MolCraft, <a href="/notes/chemistry/molecular-design/generation/target-aware/tamgen-target-aware-molecule-generation/">TamGen</a>, DiffSBDD-M (trained on BindingMOAD), DiffSBDD-C (trained on CrossDock). These span autoregressive, diffusion, and Bayesian flow network architectures.</p>
<p><strong>H2L models (7)</strong>: Fragment-based (DiffSBDD-M/C inpainting, Delete, DiffDec) and ligand-based (ShEPhERD, ShapeMol, PGMG). These use pharmacophore, surface, or shape priors.</p>
<p>Models were further stratified by whether test proteins appeared in their CrossDock training set (&ldquo;Proteins in CrossDock&rdquo; vs. &ldquo;Proteins Not in CrossDock&rdquo;), enabling direct measurement of generalization.</p>
<h3 id="evaluation-dimensions">Evaluation Dimensions</h3>
<p>The benchmark evaluates six dimensions:</p>
<table>
  <thead>
      <tr>
          <th>Dimension</th>
          <th>Key Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Basic molecular properties</td>
          <td>Validity, QED, SA score, uniqueness, diversity, JSD alignment</td>
      </tr>
      <tr>
          <td>Chemical safety</td>
          <td>Industry-standard filter pass rates (Eli Lilly, Novartis, ChEMBL rules)</td>
      </tr>
      <tr>
          <td>Conformational quality</td>
          <td>PoseBusters pass rate, strain energy, steric clash frequency</td>
      </tr>
      <tr>
          <td>Active compound recovery</td>
          <td>Hit rate, hit fraction, active molecule and scaffold recovery counts</td>
      </tr>
      <tr>
          <td>Target awareness</td>
          <td>TAScore at molecule and scaffold levels</td>
      </tr>
      <tr>
          <td>Lead optimization</td>
          <td>MNA Score, number of series with hits</td>
      </tr>
  </tbody>
</table>
<h3 id="key-results-basic-properties-and-chemical-safety">Key Results: Basic Properties and Chemical Safety</h3>
<p>Most models generate drug-like molecules with reasonable QED (0.4-0.6) and SA scores (0.5-0.8). However, two models (FLAG, SurfGen) showed validity below 0.4. TamGen exhibited low uniqueness (~27%), suggesting overreliance on pretrained patterns.</p>
<p>Chemical filter pass rates revealed a more concerning picture: only TamGen and PGMG exceeded 50% of molecules passing all industry-standard filters. Most models fell below 40%, and some (FLAG, SurfGen) below 5%. Nearly 70% of reference active molecules passed the same filters, indicating models frequently generate high-risk compounds.</p>
<h3 id="key-results-conformational-quality">Key Results: Conformational Quality</h3>
<p>MolCraft achieved the highest PoseBusters validity (0.783 PB-valid score among valid molecules). PocketFlow, despite perfect SMILES validity, had fewer than half of its valid molecules pass conformational checks. Most models produced conformations with higher <a href="https://en.wikipedia.org/wiki/Strain_(chemistry)">strain energy</a> than those from <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>. Some models (MolCraft for de novo, DiffDec for H2L) surpassed Vina in minimizing steric clashes, suggesting advanced architectures can exceed the patterns in their training data.</p>
<h3 id="key-results-active-compound-recovery-and-hit-rates">Key Results: Active Compound Recovery and Hit Rates</h3>
<p>De novo models exhibited very low hit rates. The highest molecular hit rate among de novo models was 0.124% on proteins in CrossDock, dropping to 0.024% on unseen proteins. Scaffold-level hit rates were 10-fold higher, showing that generating pharmacologically plausible scaffolds is considerably easier than generating fully active molecules.</p>
<p>After removing molecules overlapping with the CrossDock training set, TamGen&rsquo;s recovery dropped substantially (from 30.3 to 18.7 targets), confirming significant memorization effects. On proteins not in CrossDock, half of the de novo models failed to recover any active molecules at all.</p>
<p>Fragment-based H2L models substantially outperformed both de novo models and ligand-based H2L approaches. Delete recovered active molecules in 44.3 series (out of 600), and DiffDec in 34.7 series.</p>
<h3 id="key-results-target-awareness">Key Results: Target Awareness</h3>
<p>Most de novo models failed the TAScore evaluation. PocketFlow showed the strongest target awareness at the scaffold level, with only 27% of targets showing TAScore &lt; 1 (indicating no target specificity). At the molecular level, results were even weaker: TamGen achieved TAScore &gt; 1 for only 30.6% of CrossDock-seen targets and just 4 out of 35 unseen targets. Most models generated structurally similar molecules regardless of which target they were conditioned on.</p>
<h3 id="key-results-h2l-optimization-mna-score">Key Results: H2L Optimization (MNA Score)</h3>
<p>DiffDec achieved the highest total active hits (121.7) and the best MNA Score (0.523), followed by Delete (104.7 hits, MNA Score 0.482). Ligand-based models (ShEPhERD, PGMG) recovered fewer hits but showed higher MNA Scores per hit, suggesting pharmacophore-based priors help prioritize more potent molecules when actives are found. The most successful model (Delete) achieved a hit in only 9.6% of series (57/600), indicating substantial room for improvement.</p>
<h2 id="critical-findings-and-limitations-of-current-molecular-generative-models">Critical Findings and Limitations of Current Molecular Generative Models</h2>
<p>The benchmark reveals several consistent limitations:</p>
<ol>
<li>
<p><strong>Low screening efficiency</strong>: De novo models achieve molecular hit rates below 0.13%, far from practical utility. Scaffold recovery is more feasible but still limited.</p>
</li>
<li>
<p><strong>Weak target awareness</strong>: Most SBDD models fail to use protein structural information effectively, generating similar molecules across different targets. This raises concerns about off-target effects.</p>
</li>
<li>
<p><strong>Conformational prediction remains difficult</strong>: Most models produce conformations with higher strain energy than classical docking, and only a small fraction (typically below 23%) of generated poses match redocked conformations within 2 Å RMSD.</p>
</li>
<li>
<p><strong>Generalization gap</strong>: Performance consistently drops on proteins not in the training set, and prior benchmarks that do not stratify by training data exposure overestimate real-world utility.</p>
</li>
<li>
<p><strong>Inference-time scaling does not solve the problem</strong>: Sampling up to 100,000 molecules increased the absolute number of active discoveries but with diminishing efficiency. Without better scoring functions, scaling sampling offers limited practical value.</p>
</li>
<li>
<p><strong>Chemical safety</strong>: Most models produce a majority of molecules that fail industry-standard reactivity and promiscuity filters.</p>
</li>
</ol>
<p>The authors acknowledge that the benchmark&rsquo;s 220,005 active molecules represent a biased subset of bioactive chemical space. A model&rsquo;s failure to rediscover known actives for a given target may therefore reflect sampling limitations rather than evidence that the molecules it generates are inactive.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Active compounds</td>
          <td>ChEMBL v33</td>
          <td>220,005 molecules, 120 targets</td>
          <td>Filtered at 10 μM affinity threshold</td>
      </tr>
      <tr>
          <td>H2L series</td>
          <td>ChEMBL v33 + PDB</td>
          <td>5,433 series (600 used for H2L test)</td>
          <td>MCS-based series construction</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">PDB</a></td>
          <td>120 targets</td>
          <td>One PDB entry per target</td>
      </tr>
      <tr>
          <td>Training (most models)</td>
          <td>CrossDocked2020</td>
          <td>Varies</td>
          <td>Standard SBDD training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>De novo models sampled 1,000 molecules per target; H2L models sampled 200 per series</li>
<li>All experiments repeated three times with different random seeds</li>
<li>Docking performed with AutoDock Vina using standard parameters</li>
<li>Chemical filters applied via the medchem library</li>
<li>Conformational quality assessed with PoseBusters and PoseCheck</li>
<li>Interaction scores computed via ProLIF with frequency-weighted normalization</li>
</ul>
<h3 id="models">Models</h3>
<p>All 17 models were obtained from their official GitHub repositories and run with default configurations. The benchmark does not introduce new model architectures.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Summary of key metrics across the best-performing models in each category:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best De Novo</th>
          <th>Value</th>
          <th>Best H2L</th>
          <th>Value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PB-valid score</td>
          <td>MolCraft</td>
          <td>0.783</td>
          <td>DiffSBDD-M</td>
          <td>0.597</td>
      </tr>
      <tr>
          <td>Molecular hit rate (in CrossDock)</td>
          <td>TamGen</td>
          <td>0.124%</td>
          <td>DiffDec</td>
          <td>Higher than de novo</td>
      </tr>
      <tr>
          <td>Scaffold hit rate (in CrossDock)</td>
          <td>PocketFlow</td>
          <td>&gt;10%</td>
          <td>Delete</td>
          <td>Lower than PocketFlow</td>
      </tr>
      <tr>
          <td>TAScore scaffold (% targets &gt;1)</td>
          <td>PocketFlow</td>
          <td>73%</td>
          <td>N/A</td>
          <td>N/A</td>
      </tr>
      <tr>
          <td>MNA Score</td>
          <td>N/A</td>
          <td>N/A</td>
          <td>DiffDec</td>
          <td>0.523</td>
      </tr>
      <tr>
          <td>Filter pass rate</td>
          <td>TamGen</td>
          <td>&gt;50%</td>
          <td>PGMG</td>
          <td>&gt;50%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Specific hardware requirements are not detailed in the paper. Models were run using their default configurations from official repositories.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CAODH/MolGenBench">MolGenBench</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark evaluation framework</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/17572553">Zenodo dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-NC-ND 4.0</td>
          <td>Processed data and source data for all results</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cao, D., Fan, Z., Yu, J., Chen, M., Jiang, X., Sheng, X., Wang, X., Zeng, C., Luo, X., Teng, D., &amp; Zheng, M. (2025). Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench. <em>bioRxiv</em>. <a href="https://doi.org/10.1101/2025.11.03.686215">https://doi.org/10.1101/2025.11.03.686215</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cao2025molgenbench,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Benchmarking Real-World Applicability of Molecular Generative Models from De novo Design to Lead Optimization with MolGenBench}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cao, Duanhua and Fan, Zhehuan and Yu, Jie and Chen, Mingan and Jiang, Xinyu and Sheng, Xia and Wang, Xingyou and Zeng, Chuanlong and Luo, Xiaomin and Teng, Dan and Zheng, Mingyue}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{bioRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1101/2025.11.03.686215}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>GuacaMol: Benchmarking Models for De Novo Molecular Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/</guid><description>GuacaMol introduces a standardized benchmark suite for evaluating de novo molecular design models across distribution learning and goal-directed optimization.</description><content:encoded><![CDATA[<h2 id="a-standardized-benchmark-for-molecular-design">A Standardized Benchmark for Molecular Design</h2>
<p>GuacaMol is a <strong>Resource</strong> paper. Its primary contribution is a standardized, open-source benchmarking framework for evaluating models for de novo molecular design. The framework defines 5 distribution-learning benchmarks and 20 goal-directed optimization benchmarks, implemented as a Python package. The authors also provide baseline results for several classical and neural generative models, establishing reference performance levels for future comparisons.</p>
<h2 id="the-need-for-consistent-evaluation-in-generative-chemistry">The Need for Consistent Evaluation in Generative Chemistry</h2>
<p>By 2018, deep generative models for molecular design (<a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">VAEs</a>, RNNs, <a href="/posts/what-is-a-gan/">GANs</a>) had shown promising results, but the field lacked consistent evaluation standards. Different papers used different tasks, different datasets, and different metrics, making it difficult to compare models or assess real progress. Comparative studies between neural approaches and well-established algorithms like genetic algorithms were rare.</p>
<p>In other areas of machine learning, standardized benchmarks (ImageNet for vision, GLUE for NLP) had driven rapid progress by enabling fair comparisons. The de novo design community lacked an equivalent. Additionally, many existing evaluations focused on easily optimizable properties (logP, QED) that could not differentiate between models, since even simple baselines achieved near-perfect scores on those tasks.</p>
<h2 id="benchmark-design-distribution-learning-and-goal-directed-optimization">Benchmark Design: Distribution Learning and Goal-Directed Optimization</h2>
<p>GuacaMol separates evaluation into two independent dimensions, reflecting the two main use cases of generative models.</p>
<h3 id="distribution-learning-benchmarks">Distribution-Learning Benchmarks</h3>
<p>These five benchmarks assess how well a model learns to generate molecules similar to a training set (a standardized subset of ChEMBL 24):</p>
<ol>
<li><strong>Validity</strong>: Fraction of generated molecules that are chemically valid (parseable by RDKit), measured over 10,000 generated samples.</li>
<li><strong>Uniqueness</strong>: Fraction of unique canonical <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> among 10,000 valid generated molecules.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set, measured over 10,000 unique samples.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures distributional similarity between generated and reference molecules using hidden representations from ChemNet (trained on biological activity prediction). The FCD score is transformed as:</li>
</ol>
<p>$$S = \exp(-0.2 \cdot \text{FCD})$$</p>
<ol start="5">
<li><strong>KL Divergence</strong>: Compares distributions of nine physicochemical descriptors (BertzCT, MolLogP, MolWt, TPSA, NumHAcceptors, NumHDonors, NumRotatableBonds, NumAliphaticRings, NumAromaticRings) plus maximum nearest-neighbor ECFP4 similarity. The final score aggregates per-descriptor KL divergences:</li>
</ol>
<p>$$S = \frac{1}{k} \sum_{i}^{k} \exp(-D_{\text{KL}, i})$$</p>
<p>where the sum runs over the $k$ compared distributions (the nine physicochemical descriptors plus the similarity distribution).</p>
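<p>Both score transforms are simple to implement (a minimal sketch; the official GuacaMol package provides its own implementations):</p>

```python
import math

def fcd_score(fcd: float) -> float:
    # exp(-0.2 * FCD): maps raw FCD (lower is better) onto (0, 1],
    # with 1 meaning identical distributions.
    return math.exp(-0.2 * fcd)

def kl_score(kl_divergences: list[float]) -> float:
    # Average of exp(-D_KL) over the compared descriptor distributions;
    # a perfect match (all divergences zero) scores 1.
    return sum(math.exp(-d) for d in kl_divergences) / len(kl_divergences)
```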
<h3 id="goal-directed-benchmarks">Goal-Directed Benchmarks</h3>
<p>The 20 goal-directed benchmarks evaluate a model&rsquo;s ability to generate molecules that maximize a given scoring function. These span several categories:</p>
<ul>
<li><strong>Rediscovery</strong> (3 tasks): Regenerate a specific target molecule (Celecoxib, Troglitazone, Thiothixene) using Tanimoto similarity on ECFP4 fingerprints.</li>
<li><strong>Similarity</strong> (3 tasks): Generate many molecules similar to a target (Aripiprazole, Albuterol, Mestranol) above a threshold of 0.75.</li>
<li><strong>Isomers</strong> (2 tasks): Generate molecules matching a target molecular formula ($\text{C}_{11}\text{H}_{24}$ and $\text{C}_9\text{H}_{10}\text{N}_2\text{O}_2\text{PF}_2\text{Cl}$).</li>
<li><strong>Median molecules</strong> (2 tasks): Maximize similarity to two reference molecules simultaneously (camphor/menthol and tadalafil/sildenafil).</li>
<li><strong>Multi-property optimization</strong> (7 tasks): Optimize combinations of similarity, physicochemical properties, and structural features for drug-relevant molecules (Osimertinib, Fexofenadine, Ranolazine, Perindopril, Amlodipine, Sitagliptin, Zaleplon).</li>
<li><strong>SMARTS-based</strong> (1 task): Target molecules containing specific substructure patterns with constrained physicochemical properties (Valsartan SMARTS).</li>
<li><strong>Scaffold/decorator hop</strong> (2 tasks): Modify molecular scaffolds while preserving substituent patterns, or vice versa.</li>
</ul>
<p>The benchmark score for most goal-directed tasks combines top-1, top-10, and top-100 molecule scores:</p>
<p>$$S = \frac{1}{3}\left(s_1 + \frac{1}{10}\sum_{i=1}^{10} s_i + \frac{1}{100}\sum_{i=1}^{100} s_i\right)$$</p>
<p>where $s_i$ are molecule scores sorted in decreasing order.</p>
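<p>The combined top-1/top-10/top-100 score can be sketched as follows (assumes at least 100 scored molecules, each scored in [0, 1]):</p>

```python
def goal_directed_score(molecule_scores: list[float]) -> float:
    # Average of the single best score, the mean of the top 10,
    # and the mean of the top 100 (scores sorted in decreasing order).
    s = sorted(molecule_scores, reverse=True)
    return (s[0] + sum(s[:10]) / 10 + sum(s[:100]) / 100) / 3
```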
<h3 id="score-modifiers">Score Modifiers</h3>
<p>Raw molecular properties are transformed via modifier functions to restrict scores to [0, 1]:</p>
<ul>
<li><strong>Gaussian($\mu$, $\sigma$)</strong>: Targets a specific property value</li>
<li><strong>MinGaussian($\mu$, $\sigma$)</strong>: Full score below $\mu$, decreasing above</li>
<li><strong>MaxGaussian($\mu$, $\sigma$)</strong>: Full score above $\mu$, decreasing below</li>
<li><strong>Thresholded($t$)</strong>: Full score above threshold $t$, linear decrease below</li>
</ul>
<p>Multi-property objectives use either arithmetic or geometric means to combine individual scores.</p>
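<p>The four modifier shapes can be sketched as follows (illustrative forms consistent with the descriptions above; the exact functions in the GuacaMol package may differ in detail):</p>

```python
import math

def gaussian(x: float, mu: float, sigma: float) -> float:
    # Full score only at the target value mu, decaying on both sides.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def min_gaussian(x: float, mu: float, sigma: float) -> float:
    # Full score at or below mu, Gaussian decay above.
    return 1.0 if x <= mu else gaussian(x, mu, sigma)

def max_gaussian(x: float, mu: float, sigma: float) -> float:
    # Full score at or above mu, Gaussian decay below.
    return 1.0 if x >= mu else gaussian(x, mu, sigma)

def thresholded(x: float, t: float) -> float:
    # Full score above threshold t, linear decrease toward zero below.
    return 1.0 if x >= t else max(0.0, x / t)
```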
<h2 id="baseline-models-and-experimental-setup">Baseline Models and Experimental Setup</h2>
<p>The authors evaluate six baseline models spanning different paradigms:</p>
<p><strong>Distribution-learning baselines:</strong></p>
<ul>
<li><strong>Random sampler</strong>: Samples molecules directly from the dataset (provides upper/lower bounds).</li>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM (hidden size 1024) trained to predict next SMILES characters.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search building molecules atom-by-atom.</li>
<li><strong>VAE</strong>: Variational autoencoder on SMILES representations.</li>
<li><strong>AAE</strong>: Adversarial autoencoder.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a></strong>: Objective-reinforced generative adversarial network.</li>
</ul>
<p><strong>Goal-directed baselines:</strong></p>
<ul>
<li><strong>Best of dataset</strong>: Scores all training molecules and returns the best (virtual screening baseline).</li>
<li><strong>SMILES LSTM</strong>: Same model with 20 iterations of hill-climbing (8192 samples per iteration, top 1024 for fine-tuning).</li>
<li><strong>SMILES GA</strong>: Genetic algorithm operating on SMILES strings with grammar-based mutations.</li>
<li><strong>Graph GA</strong>: Genetic algorithm operating on molecular graphs with crossover and mutation.</li>
<li><strong>Graph MCTS</strong>: Monte Carlo Tree Search with 40 simulations per molecule.</li>
</ul>
<p>The training dataset is ChEMBL 24 after filtering: salt removal, charge neutralization, SMILES length cap of 100, element restrictions, and removal of molecules similar (ECFP4 &gt; 0.323) to 10 held-out drug molecules used in benchmarks.</p>
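<p>The similarity-based holdout filter boils down to a Tanimoto comparison (a sketch over fingerprints represented as sets of on-bit indices; generating the ECFP4 bits themselves, e.g. with RDKit, is assumed to happen elsewhere):</p>

```python
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    # Tanimoto similarity: |A ∩ B| / |A ∪ B| over on-bit indices.
    if not fp_a and not fp_b:
        return 1.0
    inter = len(fp_a & fp_b)
    return inter / (len(fp_a) + len(fp_b) - inter)

def too_similar_to_holdout(fp: set[int], holdout_fps: list[set[int]],
                           threshold: float = 0.323) -> bool:
    # Flag a training molecule for removal if it exceeds the similarity
    # threshold against any of the 10 held-out drug molecules.
    return any(tanimoto(fp, h) > threshold for h in holdout_fps)
```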
<h3 id="distribution-learning-results">Distribution-Learning Results</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Random</th>
          <th>SMILES LSTM</th>
          <th>Graph MCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
      </tr>
  </tbody>
</table>
<h3 id="goal-directed-results-selected">Goal-Directed Results (Selected)</h3>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th>SMILES LSTM</th>
          <th>SMILES GA</th>
          <th>Graph GA</th>
          <th>Graph MCTS</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.732</td>
          <td>1.000</td>
          <td>0.355</td>
      </tr>
      <tr>
          <td>Osimertinib MPO</td>
          <td>0.839</td>
          <td>0.907</td>
          <td>0.886</td>
          <td>0.953</td>
          <td>0.784</td>
      </tr>
      <tr>
          <td>Sitagliptin MPO</td>
          <td>0.509</td>
          <td>0.545</td>
          <td>0.689</td>
          <td>0.891</td>
          <td>0.458</td>
      </tr>
      <tr>
          <td>Scaffold Hop</td>
          <td>0.738</td>
          <td>0.998</td>
          <td>0.885</td>
          <td>1.000</td>
          <td>0.478</td>
      </tr>
      <tr>
          <td><strong>Total (20 tasks)</strong></td>
          <td><strong>12.144</strong></td>
          <td><strong>17.340</strong></td>
          <td><strong>14.396</strong></td>
          <td><strong>17.983</strong></td>
          <td><strong>9.009</strong></td>
      </tr>
  </tbody>
</table>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-findings">Main Findings</h3>
<p>The Graph GA achieves the highest total score across goal-directed benchmarks (17.983), followed closely by the SMILES LSTM (17.340). This result is notable because genetic algorithms are well-established methods, and the LSTM-based neural approach nearly matches their optimization performance.</p>
<p>However, compound quality tells a different story. When examining the top 100 molecules per task through chemical quality filters (SureChEMBL, Glaxo, PAINS rules), 77% of LSTM-generated molecules pass, matching the Best of ChEMBL baseline. In contrast, Graph GA produces only 40% passing molecules, and Graph MCTS only 22%. This suggests that neural models benefit from pre-training on real molecular distributions, which encodes implicit knowledge about what constitutes a &ldquo;reasonable&rdquo; molecule.</p>
<p><a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> performs poorly across all distribution-learning tasks, with more than half its generated molecules being invalid. This is consistent with mode collapse, a known problem in GAN training.</p>
<p>Simpler generative models (LSTM, VAE) outperform more complex architectures (ORGAN, AAE) on distribution learning. Graph MCTS struggles with both distribution learning and goal-directed optimization, suggesting that single-molecule search trees are less effective than population-based approaches.</p>
<h3 id="limitations">Limitations</h3>
<p>The authors explicitly identify several issues:</p>
<ul>
<li><strong>Compound quality is hard to quantify</strong>: The rule-based filters used are acknowledged as &ldquo;high precision, low recall&rdquo; surrogates. They catch some problematic molecules but cannot encode the full breadth of medicinal chemistry expertise.</li>
<li><strong>Some benchmarks are too easy</strong>: The trivially optimizable tasks (logP, QED, CNS MPO) cannot differentiate between models. All baselines achieve near-perfect scores on these.</li>
<li><strong>Sample efficiency and runtime are not benchmarked</strong>: The framework does not penalize models for requiring excessive scoring function calls.</li>
<li><strong>Synthesis accessibility is not addressed</strong>: Generated molecules may be valid but impractical to synthesize.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors call for harder benchmark tasks, better compound quality metrics, attention to sample efficiency and runtime constraints, and further development of graph-based neural generative models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ChEMBL 24 (post-processed)</td>
          <td>~1.6M molecules</td>
          <td>Salt removal, neutralization, SMILES length cap, element restrictions</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>10 held-out drug molecules</td>
          <td>10</td>
          <td>Removed from training set via ECFP4 similarity threshold</td>
      </tr>
      <tr>
          <td>Quality filters</td>
          <td>SureChEMBL, Glaxo, PAINS, in-house rules</td>
          <td>N/A</td>
          <td>Applied via rd_filters</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES LSTM</strong>: 3-layer LSTM, hidden size 1024; hill-climbing with 20 iterations, 8192 samples per iteration, top 1024 for fine-tuning</li>
<li><strong>Graph GA</strong>: Population of 100, mating pool of 200, crossover + mutation (probability 0.5), 1000 epochs max</li>
<li><strong>SMILES GA</strong>: Population of 300, offspring of 600, SMILES grammar-based mutations, 1000 epochs max</li>
<li><strong>Graph MCTS</strong>: 40 simulations per molecule, 25 children per step, rollout to 60 atoms, starting from CC</li>
</ul>
<h3 id="models">Models</h3>
<p>All baseline implementations are released as open-source code. VAE, AAE, and ORGAN implementations are from the <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> repository.</p>
<h3 id="evaluation">Evaluation</h3>
<p>All distribution-learning benchmarks sample 10,000 molecules. Goal-directed benchmarks use combinations of top-1, top-10, and top-100 scores. Compound quality is assessed via the percentage of top-100 molecules passing chemical filters.</p>
<h3 id="hardware">Hardware</h3>
<p>Hardware requirements are not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol">GuacaMol</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmarking framework and scoring functions</td>
      </tr>
      <tr>
          <td><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol Baselines</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Baseline model implementations</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/projects/GuacaMol/56639">ChEMBL dataset</a></td>
          <td>Dataset</td>
          <td>CC-BY-SA 3.0</td>
          <td>Post-processed ChEMBL 24 for benchmarks</td>
      </tr>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD package</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Fréchet ChemNet Distance implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Brown, N., Fiscato, M., Segler, M. H. S., &amp; Vaucher, A. C. (2019). GuacaMol: Benchmarking Models for De Novo Molecular Design. <em>Journal of Chemical Information and Modeling</em>, 59(3), 1096-1108. <a href="https://doi.org/10.1021/acs.jcim.8b00839">https://doi.org/10.1021/acs.jcim.8b00839</a></p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BenevolentAI/guacamol">GuacaMol Python package</a></li>
<li><a href="https://github.com/BenevolentAI/guacamol_baselines">GuacaMol baselines</a></li>
<li><a href="https://figshare.com/projects/GuacaMol/56639">Post-processed ChEMBL datasets</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{brown2019guacamol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{GuacaMol: Benchmarking Models for de Novo Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Brown, Nathan and Fiscato, Marco and Segler, Marwin H. S. and Vaucher, Alain C.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{59}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1096--1108}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00839}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Frechet ChemNet Distance for Molecular Generation</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/</guid><description>FCD uses ChemNet activations and the Wasserstein-2 distance to evaluate molecular generative models for chemical validity, biological relevance, and diversity.</description><content:encoded><![CDATA[<h2 id="a-unified-evaluation-metric-for-molecular-generation">A Unified Evaluation Metric for Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Frechet ChemNet Distance (FCD), a single scalar metric for evaluating generative models that produce molecules for drug discovery. FCD adapts the Frechet Inception Distance (FID) from image generation to the molecular domain. By comparing distributions of learned representations from a drug-activity prediction network (ChemNet), FCD simultaneously captures whether generated molecules are chemically valid, biologically relevant, and structurally diverse.</p>
<h2 id="inconsistent-evaluation-of-molecular-generative-models">Inconsistent Evaluation of Molecular Generative Models</h2>
<p>At the time of this work (2018), deep generative models for molecules were proliferating: RNNs combined with <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoders</a>, reinforcement learning, and <a href="/posts/what-is-a-gan/">GANs</a> all produced <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings representing novel molecules. The evaluation landscape was fragmented. Different papers reported different metrics: percentage of valid SMILES, mean logP, druglikeness, synthetic accessibility (SA) scores, or internal diversity via Tanimoto distance.</p>
<p>This inconsistency created several problems. First, method comparison across publications was difficult because no common metric existed. Second, simple metrics like &ldquo;fraction of valid SMILES&rdquo; could be trivially maximized by generating short, simple molecules (e.g., &ldquo;CC&rdquo; or &ldquo;CCC&rdquo;). Third, individual property metrics (logP, druglikeness) each captured only one dimension of quality. A model could score well on logP but produce molecules that were not diverse or not biologically meaningful.</p>
<p>The authors argued that a good metric should capture three properties simultaneously: (1) chemical validity and similarity to real drug-like molecules, (2) biological relevance, and (3) diversity within the generated set.</p>
<h2 id="core-innovation-frechet-distance-over-chemnet-activations">Core Innovation: Frechet Distance over ChemNet Activations</h2>
<p>The key insight is to use a neural network trained on biological activity prediction as a feature extractor for molecules, then compare distributions of these features using the Frechet (Wasserstein-2) distance.</p>
<h3 id="chemnet-architecture">ChemNet Architecture</h3>
<p>ChemNet is a multi-task neural network trained to predict bioactivities across approximately 6,000 assays from three major drug discovery databases (ChEMBL, ZINC, PubChem). The architecture processes one-hot encoded SMILES strings through:</p>
<ol>
<li>Two 1D convolutional layers with SELU activations</li>
<li>A max-pooling layer</li>
<li>Two stacked LSTM layers</li>
<li>A fully connected output layer</li>
</ol>
<p>The penultimate layer (the second LSTM&rsquo;s hidden state after processing the full input sequence) serves as the molecular representation. Because ChemNet was trained to predict drug activities, its internal representations encode both chemical structure (from the input side) and biological function (from the output side).</p>
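<p>The described layer stack can be sketched in PyTorch. This is an illustrative reimplementation, not the authors&rsquo; code: the channel counts, kernel sizes, and hidden width below are assumptions, since the summary above does not specify them.</p>

```python
import torch
import torch.nn as nn

class ChemNetSketch(nn.Module):
    """Illustrative sketch of the ChemNet layout described above.
    Layer sizes (32/64 channels, kernel 4/6, hidden 512) are assumptions."""

    def __init__(self, vocab_size=35, hidden=512, n_tasks=6000):
        super().__init__()
        # Two 1D convolutions with SELU activations, then max-pooling
        self.conv = nn.Sequential(
            nn.Conv1d(vocab_size, 32, kernel_size=4), nn.SELU(),
            nn.Conv1d(32, 64, kernel_size=6), nn.SELU(),
            nn.MaxPool1d(2),
        )
        # Two stacked LSTM layers
        self.lstm = nn.LSTM(64, hidden, num_layers=2, batch_first=True)
        # Fully connected output over ~6,000 bioactivity assays
        self.out = nn.Linear(hidden, n_tasks)

    def forward(self, x):
        # x: (batch, vocab, seq_len) one-hot encoded SMILES
        h = self.conv(x).transpose(1, 2)        # -> (batch, seq', 64)
        _, (h_n, _) = self.lstm(h)
        feats = h_n[-1]                         # final hidden state of the
                                                # second LSTM: the FCD features
        return self.out(feats), feats
```

The tuple&rsquo;s second element is the penultimate-layer representation that FCD compares between molecule sets.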
<h3 id="the-fcd-formula">The FCD Formula</h3>
<p>Given a set of real molecules and a set of generated molecules, FCD is computed as follows:</p>
<ol>
<li>Pass each molecule (as a SMILES string) through ChemNet and extract penultimate-layer activations.</li>
<li>Fit a multivariate Gaussian to each set by computing the mean $\mathbf{m}$ and covariance $\mathbf{C}$ for the generated set, and mean $\mathbf{m}_w$ and covariance $\mathbf{C}_w$ for the real set.</li>
<li>Compute the squared Frechet distance:</li>
</ol>
<p>$$
d^{2}\left((\mathbf{m}, \mathbf{C}), (\mathbf{m}_w, \mathbf{C}_w)\right) = \left\lVert \mathbf{m} - \mathbf{m}_w \right\rVert_2^{2} + \mathrm{Tr}\left(\mathbf{C} + \mathbf{C}_w - 2(\mathbf{C}\mathbf{C}_w)^{1/2}\right)
$$</p>
<p>The Gaussian assumption is justified by the maximum entropy principle: the Gaussian is the maximum-entropy distribution for given mean and covariance. A lower FCD indicates that the generated distribution is closer to the real distribution.</p>
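<p>The formula above can be computed with plain NumPy. This is a minimal sketch, not the official implementation (see the linked <code>bioinf-jku/FCD</code> repository for that); the function name is my own. It uses the identity that $\mathrm{Tr}\big((\mathbf{C}\mathbf{C}_w)^{1/2}\big)$ equals the sum of the square roots of the eigenvalues of $\mathbf{C}\mathbf{C}_w$.</p>

```python
import numpy as np

def frechet_distance(act_gen, act_ref):
    """Sketch of the squared Frechet distance between two sets of
    activations, each of shape (n_molecules, n_features)."""
    # Fit a multivariate Gaussian to each activation set
    m, mw = act_gen.mean(axis=0), act_ref.mean(axis=0)
    C = np.cov(act_gen, rowvar=False)
    Cw = np.cov(act_ref, rowvar=False)
    # Tr((C Cw)^{1/2}) = sum of sqrt of eigenvalues of C @ Cw;
    # clip tiny negative values arising from numerical error
    eigvals = np.linalg.eigvals(C @ Cw).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    mean_term = float(((m - mw) ** 2).sum())
    return mean_term + float(np.trace(C) + np.trace(Cw)) - 2.0 * tr_sqrt
```

Identical sets give a distance of (numerically) zero, and a mean shift in the activations inflates the first term, matching the &ldquo;lower is closer&rdquo; interpretation.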
<h3 id="why-not-just-fingerprints">Why Not Just Fingerprints?</h3>
<p>The authors also define a Frechet Fingerprint Distance (FFD) that replaces ChemNet activations with 2048-bit ECFP_4 fingerprints. FFD captures chemical structure but not biological function. The experimental comparison shows that FCD produces more distinct separations between biased and unbiased molecule sets, particularly for biologically meaningful biases.</p>
<h2 id="detecting-flaws-in-generative-models">Detecting Flaws in Generative Models</h2>
<p>The experiments evaluate whether FCD can detect specific failure modes in generative models. The authors simulate five types of biased generators by selecting molecules from real databases that exhibit particular properties, then compare FCD against individual metrics (logP, druglikeness, SA score, internal diversity) and FFD.</p>
<h3 id="simulated-bias-experiments">Simulated Bias Experiments</h3>
<p>Each experiment uses samples of 5,000 molecules, drawn 5 times. The reference distribution is 200,000 randomly drawn real molecules not used for ChemNet training.</p>
<table>
  <thead>
      <tr>
          <th>Bias Type</th>
          <th>logP</th>
          <th>Druglikeness</th>
          <th>SA Score</th>
          <th>Int. Diversity</th>
          <th>FFD</th>
          <th>FCD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Low druglikeness (&lt;5th pct)</td>
          <td>-</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>High logP (&gt;95th pct)</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Low SA score (&lt;5th pct)</td>
          <td>-</td>
          <td>Partial</td>
          <td>-</td>
          <td>Partial</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Mode collapse (cluster)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
      <tr>
          <td>Kinase inhibitors (PLK1)</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>-</td>
          <td>Detects</td>
          <td>Detects</td>
      </tr>
  </tbody>
</table>
<p>FCD is the only metric that detects all five bias types. The biological bias test (kinase inhibitors for PLK1-PBD from PubChem AID 720504) is particularly notable: only FFD and FCD detect this bias, and FCD provides a more distinct separation. This validates the hypothesis that incorporating biological information through ChemNet activations improves evaluation beyond purely chemical descriptors.</p>
<h3 id="sample-size-requirements">Sample Size Requirements</h3>
<p>The authors tested FCD convergence with varying sample sizes (5 to 300,000 molecules). Mean FCD values for samples drawn from the real distribution:</p>
<table>
  <thead>
      <tr>
          <th>Sample Size</th>
          <th>Mean FCD</th>
          <th>Std Dev</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>5</td>
          <td>76.46</td>
          <td>5.03</td>
      </tr>
      <tr>
          <td>50</td>
          <td>31.86</td>
          <td>0.75</td>
      </tr>
      <tr>
          <td>500</td>
          <td>4.41</td>
          <td>0.03</td>
      </tr>
      <tr>
          <td>5,000</td>
          <td>0.42</td>
          <td>0.01</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>0.05</td>
          <td>0.00</td>
      </tr>
      <tr>
          <td>300,000</td>
          <td>0.02</td>
          <td>0.00</td>
      </tr>
  </tbody>
</table>
<p>A sample size of 5,000 molecules is sufficient for reliable estimation, with the mean FCD approaching zero and negligible variance.</p>
<h3 id="benchmarking-published-generative-models">Benchmarking Published Generative Models</h3>
<p>The authors computed FCD for several published generative methods:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>FCD</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Random real molecules</td>
          <td>0.22</td>
          <td>Baseline (near zero as expected)</td>
      </tr>
      <tr>
          <td>Segler et al. (LSTM)</td>
          <td>1.62</td>
          <td>Trained to approximate full ChEMBL distribution</td>
      </tr>
      <tr>
          <td>DRD2-targeted methods</td>
          <td>24.14 to 47.85</td>
          <td>Olivecrona, RL, and ORGAN agents</td>
      </tr>
      <tr>
          <td>Rule-based baseline</td>
          <td>58.76</td>
          <td>Random concatenation of C, N, O atoms</td>
      </tr>
  </tbody>
</table>
<p>The ranking matches expectations. The Segler model, trained to approximate the overall molecule distribution, achieves the lowest FCD (1.62). Models optimized for a specific target (DRD2), including the Olivecrona RL agents, the RL method by Benhenda, and ORGAN, produce higher FCD values (24.14 to 47.85) against the general distribution. More training iterations push these models further from the general distribution, as they become increasingly DRD2-specific. The canonical and reduced Olivecrona agents learn similar chemical spaces, consistent with the original authors&rsquo; conclusions. The rule-based system scores worst (58.76), confirming FCD as a meaningful quality metric.</p>
<h2 id="conclusions-and-impact">Conclusions and Impact</h2>
<p>FCD provides a single metric that unifies the evaluation of chemical validity, biological relevance, and diversity for molecular generative models. Its main advantages are:</p>
<ol>
<li>It captures multiple quality dimensions in one score, simplifying method comparison.</li>
<li>It detects biases that no single existing metric can catch alone.</li>
<li>It requires only SMILES strings as input, making it applicable to any generative method (including graph-based approaches via SMILES conversion).</li>
<li>It incorporates biological information through ChemNet, distinguishing it from purely chemical metrics like FFD.</li>
</ol>
<p><strong>Limitations</strong>: The metric depends on the ChemNet model, which was trained on a specific set of bioactivity assays. Molecules outside ChemNet&rsquo;s training distribution may not be well represented. The Gaussian assumption for the activation distributions may not hold perfectly. FCD measures distance to a reference set, so it evaluates how well a generator approximates a given distribution rather than the absolute quality of individual molecules. When using FCD for targeted generation (e.g., molecules active against a specific protein), the reference set should be chosen accordingly rather than defaulting to the general drug-like molecule distribution.</p>
<p>FCD has since become a standard evaluation metric in the molecular generation community, adopted by benchmarking platforms like <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ChemNet training</td>
          <td>ChEMBL, ZINC, PubChem</td>
          <td>~6,000 assays</td>
          <td>Two-thirds for training, one-third for testing</td>
      </tr>
      <tr>
          <td>Reference distribution</td>
          <td>Combined databases</td>
          <td>200,000 molecules</td>
          <td>Excluded from ChemNet training</td>
      </tr>
      <tr>
          <td>Bias simulations</td>
          <td>Subsets of combined databases</td>
          <td>5,000 per experiment</td>
          <td>5 repetitions each</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>ChemNet: 2x 1D-conv (SELU), max-pool, 2x stacked LSTM, FC output</li>
<li>FCD: Squared Frechet distance between Gaussian-fitted ChemNet penultimate-layer activations</li>
<li>FFD: Same as FCD but using 2048-bit ECFP_4 fingerprints instead of ChemNet activations</li>
<li>Molecular property calculations: RDKit (logP, druglikeness, SA score, Morgan fingerprints with radius 2)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FCD</td>
          <td>Frechet distance over ChemNet activations (lower = closer to reference)</td>
      </tr>
      <tr>
          <td>FFD</td>
          <td>Frechet distance over ECFP_4 fingerprints</td>
      </tr>
      <tr>
          <td>logP</td>
          <td>Mean partition coefficient</td>
      </tr>
      <tr>
          <td>Druglikeness</td>
          <td>Geometric mean of desired molecular properties (QED)</td>
      </tr>
      <tr>
          <td>SA Score</td>
          <td>Synthetic accessibility score</td>
      </tr>
      <tr>
          <td>Internal Diversity</td>
          <td>Tanimoto distance within generated set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not provided in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/bioinf-jku/FCD">FCD Implementation</a></td>
          <td>Code</td>
          <td>LGPL-3.0</td>
          <td>Official Python implementation; requires only SMILES input</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S., &amp; Klambauer, G. (2018). Fréchet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. <em>Journal of Chemical Information and Modeling</em>, 58(9), 1736-1741.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{preuer2018frechet,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Fr{\&#39;e}chet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Preuer, Kristina and Renz, Philipp and Unterthiner, Thomas and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{58}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{9}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1736--1741}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.8b00234}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Failure Modes in Molecule Generation &amp; Optimization</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/failure-modes-molecule-generation/</guid><description>Renz et al. show trivial models fool distribution-learning metrics and ML scoring functions introduce exploitable biases in goal-directed molecule generation.</description><content:encoded><![CDATA[<h2 id="an-empirical-critique-of-molecular-generation-evaluation">An Empirical Critique of Molecular Generation Evaluation</h2>
<p>This is an <strong>Empirical</strong> paper that critically examines evaluation practices for molecular generative models. Rather than proposing a new generative method, the paper exposes systematic weaknesses in both distribution-learning metrics and goal-directed optimization scoring functions. The primary contributions are: (1) demonstrating that a trivially simple &ldquo;AddCarbon&rdquo; model can achieve near-perfect scores on widely used distribution-learning benchmarks, and (2) introducing an experimental framework with optimization scores and control scores that reveals model-specific and data-specific biases when ML models serve as scoring functions for goal-directed generation.</p>
<h2 id="evaluation-gaps-in-de-novo-molecular-design">Evaluation Gaps in De Novo Molecular Design</h2>
<p>The rapid growth of deep learning methods for molecular generation (RNN-based SMILES generators, VAEs, GANs, graph neural networks) created a need for standardized evaluation. Benchmarking suites like <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> introduced metrics for validity, uniqueness, novelty, KL divergence over molecular properties, and <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Frechet ChemNet Distance (FCD)</a>. For goal-directed generation, penalized logP became a common optimization target.</p>
<p>However, these metrics leave significant blind spots. Distribution-learning metrics do not detect whether a model merely copies training molecules with minimal modifications. Goal-directed benchmarks often use scoring functions that fail to capture the full requirements of drug discovery (synthetic feasibility, drug-likeness, absence of reactive substructures). When ML models serve as scoring functions, the problem worsens because generated molecules can exploit artifacts of the learned model rather than exhibiting genuinely desirable properties.</p>
<p>At the time of writing, wet-lab validations of generative models remained scarce, with only a handful of studies (Merk et al., Zhavoronkov et al.) demonstrating in vitro activity for generated compounds. The lack of rigorous evaluation left the field unable to distinguish meaningfully innovative methods from those that simply exploit metric weaknesses.</p>
<h2 id="the-copy-problem-and-control-score-framework">The Copy Problem and Control Score Framework</h2>
<p>The paper introduces two key conceptual contributions.</p>
<h3 id="the-addcarbon-model-for-distribution-learning">The AddCarbon Model for Distribution-Learning</h3>
<p>The AddCarbon model is deliberately trivial: it samples a molecule from the training set, inserts a single carbon atom at a random position in its SMILES string, and returns the result if it produces a valid, novel molecule. This model achieves near-perfect scores across most <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> distribution-learning benchmarks:</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>RS</th>
          <th>LSTM</th>
          <th>GraphMCTS</th>
          <th>AAE</th>
          <th>ORGAN</th>
          <th>VAE</th>
          <th>AddCarbon</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>1.000</td>
          <td>0.959</td>
          <td>1.000</td>
          <td>0.822</td>
          <td>0.379</td>
          <td>0.870</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>0.997</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.841</td>
          <td>0.999</td>
          <td>0.999</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>0.000</td>
          <td>0.912</td>
          <td>0.994</td>
          <td>0.998</td>
          <td>0.687</td>
          <td>0.974</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>KL divergence</td>
          <td>0.998</td>
          <td>0.991</td>
          <td>0.522</td>
          <td>0.886</td>
          <td>0.267</td>
          <td>0.982</td>
          <td>0.982</td>
      </tr>
      <tr>
          <td>FCD</td>
          <td>0.929</td>
          <td>0.913</td>
          <td>0.015</td>
          <td>0.529</td>
          <td>0.000</td>
          <td>0.863</td>
          <td>0.871</td>
      </tr>
  </tbody>
</table>
<p>The AddCarbon model beats all baselines except the LSTM on the FCD metric, despite being practically useless. This exposes what the authors call the &ldquo;copy problem&rdquo;: current metrics check only for exact matches to training molecules, so minimal edits evade novelty detection. The authors argue that likelihood-based evaluation on hold-out test sets, analogous to standard practice in NLP, would provide a more comprehensive metric.</p>
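<p>The AddCarbon procedure is simple enough to sketch in a few lines. This is an illustrative stdlib-only version: the paper&rsquo;s model additionally verifies chemical validity with RDKit before accepting a candidate, which is omitted here.</p>

```python
import random

def add_carbon(training_smiles, seed=None):
    """Sketch of the AddCarbon baseline: copy a training molecule and
    insert a single 'C' at a random position in its SMILES string.
    (RDKit validity checking, used in the paper, is omitted here.)"""
    rng = random.Random(seed)
    smiles = rng.choice(training_smiles)
    pos = rng.randrange(len(smiles) + 1)      # any position, incl. the end
    candidate = smiles[:pos] + "C" + smiles[pos:]
    # Keep only novel strings, i.e. not exact copies of training molecules
    return candidate if candidate not in training_smiles else None
```

Because novelty is checked only by exact string match, such one-atom edits pass the benchmark&rsquo;s novelty filter while contributing nothing new chemically, which is precisely the copy problem.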
<h3 id="control-scores-for-goal-directed-generation">Control Scores for Goal-Directed Generation</h3>
<p>For goal-directed generation, the authors introduce a three-score experimental design:</p>
<ul>
<li><strong>Optimization Score (OS)</strong>: Output of a classifier trained on data split 1, used to guide the molecular optimizer.</li>
<li><strong>Model Control Score (MCS)</strong>: Output of a second classifier trained on split 1 with a different random seed. Divergence between OS and MCS quantifies model-specific biases.</li>
<li><strong>Data Control Score (DCS)</strong>: Output of a classifier trained on data split 2. Divergence between OS and DCS quantifies data-specific biases.</li>
</ul>
<p>This mirrors the training/test split paradigm in supervised learning. If a generator truly produces molecules with the desired bioactivity, the control scores should track the optimization score. Divergence between them indicates the optimizer is exploiting artifacts of the specific model or training data rather than learning generalizable chemical properties.</p>
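<p>A minimal version of this three-score setup can be sketched with scikit-learn. The data here is schematic (the paper uses 1024-bit ECFP4 fingerprints of ChEMBL actives/inactives), and the function name is my own; the essential point is that OS and MCS share training data but differ in random seed, while DCS is trained on the second split.</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def make_scoring_functions(X1, y1, X2, y2):
    """Sketch of the OS / MCS / DCS construction described above.
    X1, y1: fingerprints and activity labels for split 1;
    X2, y2: the same for split 2."""
    os_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X1, y1)
    mcs_clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X1, y1)  # same data, new seed
    dcs_clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X2, y2)  # different split

    def score_with(clf):
        # Predicted probability of the "active" class guides/controls optimization
        return lambda X: clf.predict_proba(X)[:, 1]

    return score_with(os_clf), score_with(mcs_clf), score_with(dcs_clf)
```

During optimization only the OS is maximized; tracking the MCS and DCS on the same generated molecules reveals whether apparent gains generalize or merely exploit one classifier instance or one data split.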
<h2 id="experimental-setup-three-targets-three-generators">Experimental Setup: Three Targets, Three Generators</h2>
<h3 id="targets-and-data">Targets and Data</h3>
<p>The authors selected three biological targets from ChEMBL: <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">Janus kinase 2</a> (JAK2), <a href="https://en.wikipedia.org/wiki/Epidermal_growth_factor_receptor">epidermal growth factor receptor</a> (EGFR), and <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a> (DRD2). For each target, the data was split into two halves (split 1 and split 2) with balanced active/inactive ratios. Random forest classifiers using binary folded ECFP4 fingerprints (radius 2, size 1024) were trained to produce three scoring functions per target: the OS and MCS on split 1 (different random seeds), and the DCS on split 2.</p>
<h3 id="generators">Generators</h3>
<p>Three molecular generators were evaluated:</p>
<ol>
<li><strong>Graph-based Genetic Algorithm (GA)</strong>: Iteratively applies random mutations and crossovers to a population of molecules, retaining the best in each generation. One of the top performers in GuacaMol.</li>
<li><strong>SMILES-LSTM</strong>: An autoregressive model that generates SMILES character by character, optimized via hill climbing (iteratively sampling, keeping top molecules, fine-tuning). Also a top GuacaMol performer.</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle Swarm Optimization</a> (PS)</strong>: Optimizes molecules in the continuous latent space of a SMILES-based sequence-to-sequence model.</li>
</ol>
<p>Each optimizer was run 10 times per target dataset.</p>
<h2 id="score-divergence-and-exploitable-biases">Score Divergence and Exploitable Biases</h2>
<h3 id="optimization-vs-control-score-divergence">Optimization vs. Control Score Divergence</h3>
<p>Across all three targets and all three generators, the OS consistently outpaced both control scores during optimization. The DCS sometimes stagnated or even decreased while the OS continued to climb. This divergence demonstrates that the generators exploit biases in the scoring function rather than discovering genuinely active compounds.</p>
<p>The MCS also diverged from the OS despite being trained on exactly the same data, confirming model-specific biases: the optimization exploits features unique to the particular random forest instance. The larger gap between OS and DCS (compared to OS and MCS) indicates that data-specific biases contribute more to the divergence than model-specific biases.</p>
<h3 id="chemical-space-migration">Chemical Space Migration</h3>
<p>Optimized molecules migrated toward the region of split 1 actives (used to train the OS), as shown by t-SNE embeddings and nearest-neighbor Tanimoto similarity analysis. Optimized molecules had more similar neighbors in split 1 than in split 2, confirming data-specific bias. By the end of optimization, generated molecules occupied different regions of chemical space than known actives when measured by logP and molecular weight, with compounds from the same optimization run forming distinct clusters.</p>
<h3 id="quality-of-generated-molecules">Quality of Generated Molecules</h3>
<p>High-scoring generated molecules frequently contained problematic substructures: reactive dienes, nitrogen-fluorine bonds, long heteroatom chains that are synthetically infeasible, and highly uncommon functional groups. The LSTM optimizer showed a bias toward high molecular weight, low diversity, and high logP values. These molecules would be rejected by medicinal chemists despite their high optimization scores.</p>
<h3 id="key-takeaways">Key Takeaways</h3>
<p>The authors emphasize several practical implications:</p>
<ol>
<li><strong>Early stopping</strong>: Control scores can indicate when further optimization is exploiting biases rather than finding better molecules. Optimization should stop when control scores plateau.</li>
<li><strong>Scoring function iteration</strong>: In practice, generative models are &ldquo;highly adept at exploiting&rdquo; incomplete scoring functions, necessitating several iterations of generation and scoring function refinement.</li>
<li><strong>Synthetic accessibility</strong>: Even high-scoring molecules are useless if they cannot be synthesized. The authors consider this a major challenge for practical adoption.</li>
<li><strong>Likelihood-based evaluation</strong>: For distribution-learning, the authors recommend reporting test-set likelihoods for likelihood-based models, following standard NLP practice.</li>
</ol>
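<p>The early-stopping recommendation in point 1 amounts to a plateau test on the control-score trajectory. The sketch below is one possible formulation, not the paper&rsquo;s; the window size and tolerance are arbitrary assumptions.</p>

```python
def should_stop(control_scores, window=5, tol=1e-3):
    """Sketch of an early-stopping heuristic: halt optimization once the
    control score stops improving, i.e. the mean over the most recent
    window no longer exceeds the mean over the window before it."""
    if len(control_scores) < 2 * window:
        return False  # not enough history to judge a plateau
    recent = sum(control_scores[-window:]) / window
    previous = sum(control_scores[-2 * window:-window]) / window
    return recent - previous < tol
```

Applied to the MCS or DCS rather than the OS, this flags the point where further optimization is likely exploiting scoring-function biases instead of finding genuinely better molecules.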
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bioactivity data</td>
          <td>ChEMBL (JAK2, EGFR, DRD2)</td>
          <td>See Table S1</td>
          <td>Binary classification tasks, split 50/50</td>
      </tr>
      <tr>
          <td>Distribution-learning</td>
          <td>GuacaMol training set</td>
          <td>Subset of ChEMBL</td>
          <td>Used as starting population for GA and PS</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Scoring function</strong>: Random forest classifier (scikit-learn) on binary ECFP4 fingerprints (size 1024, radius 2, RDKit)</li>
<li><strong>GA</strong>: Graph-based genetic algorithm from Jensen (2019)</li>
<li><strong>LSTM</strong>: SMILES-LSTM with hill climbing, pretrained model from GuacaMol</li>
<li><strong>PS</strong>: Particle swarm optimization in latent space of a sequence-to-sequence model (Winter et al. 2019)</li>
<li>Each optimizer run 10 times per target</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Optimization Score (OS)</td>
          <td>RF classifier on split 1</td>
          <td>Guides optimization</td>
      </tr>
      <tr>
          <td>Model Control Score (MCS)</td>
          <td>RF on split 1, different seed</td>
          <td>Detects model-specific bias</td>
      </tr>
      <tr>
          <td>Data Control Score (DCS)</td>
          <td>RF on split 2</td>
          <td>Detects data-specific bias</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> metrics</td>
          <td>Validity, uniqueness, novelty, KL div, FCD</td>
          <td>For distribution-learning</td>
      </tr>
  </tbody>
</table>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/ml-jku/mgenerators-failure-modes">ml-jku/mgenerators-failure-modes</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Data, code, and results</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{renz2019failure,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{On failure modes in molecule generation and optimization}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Renz, Philipp and Van Rompaey, Dries and Wegner, J{\&#34;o}rg Kurt and Hochreiter, Sepp and Klambauer, G{\&#34;u}nter}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Drug Discovery Today: Technologies}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{32-33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{55--63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Elsevier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1016/j.ddtec.2020.09.003}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Renz, P., Van Rompaey, D., Wegner, J. K., Hochreiter, S., &amp; Klambauer, G. (2019). On failure modes in molecule generation and optimization. <em>Drug Discovery Today: Technologies</em>, 32-33, 55-63. <a href="https://doi.org/10.1016/j.ddtec.2020.09.003">https://doi.org/10.1016/j.ddtec.2020.09.003</a></p>
<p><strong>Publication</strong>: Drug Discovery Today: Technologies, Volume 32-33, 2019</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/ml-jku/mgenerators-failure-modes">Code and data (GitHub)</a></li>
</ul>
]]></content:encoded></item><item><title>DOCKSTRING: Docking-Based Benchmarks for Drug Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/dockstring-docking-benchmarks-ligand-design/</guid><description>DOCKSTRING provides an open-source Python docking package, 15M+ score dataset across 58 targets, and benchmark tasks for ML-driven drug design.</description><content:encoded><![CDATA[<h2 id="a-three-part-resource-for-docking-based-ml-benchmarks">A Three-Part Resource for Docking-Based ML Benchmarks</h2>
<p>DOCKSTRING is a <strong>Resource</strong> paper that delivers three integrated components for benchmarking machine learning models in drug discovery using molecular docking. The primary contributions are: (1) an open-source Python package wrapping <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a> for deterministic docking from <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings, (2) a dataset of over 15 million docking scores and poses covering 260,000+ molecules docked against 58 medically relevant protein targets, and (3) a suite of benchmark tasks spanning regression, <a href="https://en.wikipedia.org/wiki/Virtual_screening">virtual screening</a>, and de novo molecular design. The paper additionally provides baseline results across classical and deep learning methods.</p>
<h2 id="why-existing-molecular-benchmarks-fall-short">Why Existing Molecular Benchmarks Fall Short</h2>
<p>ML methods for drug discovery are frequently evaluated using simple physicochemical properties such as penalized logP or QED (quantitative estimate of druglikeness). These properties are computationally cheap and easy to optimize, but they do not depend on the interaction between a candidate compound and a protein target. As a result, strong performance on logP or QED benchmarks does not necessarily translate to strong performance on real drug design tasks.</p>
<p><a href="https://en.wikipedia.org/wiki/Docking_(molecular)">Molecular docking</a> offers a more realistic evaluation objective because docking scores depend on the 3D structure of the ligand-target complex. Docking is routinely used by medicinal chemists to estimate binding affinities during hit discovery and lead optimization. Several prior efforts attempted to bring docking into ML benchmarking, but each had limitations:</p>
<ul>
<li><strong>VirtualFlow and DockStream</strong> require manually prepared target files and domain expertise.</li>
<li><strong>TDC and Cieplinski et al.</strong> provide SMILES-to-score wrappers but lack proper ligand protonation and randomness control, and cover very few targets (one and four, respectively).</li>
<li><strong>DUD-E</strong> is easily overfit by ML models that memorize actives vs. decoys.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></strong> rely on physicochemical properties or similarity functions that miss 3D structural subtleties.</li>
<li><strong><a href="/notes/chemistry/molecular-design/property-prediction/moleculenet-benchmark-molecular-ml/">MoleculeNet</a></strong> compiles experimental datasets but does not support on-the-fly label computation needed for transfer learning or de novo design.</li>
</ul>
<p>DOCKSTRING addresses all of these gaps: it standardizes the docking procedure, automates ligand and target preparation, controls randomness for reproducibility, and provides a large, diverse target set.</p>
<h2 id="core-innovation-standardized-end-to-end-docking-pipeline">Core Innovation: Standardized End-to-End Docking Pipeline</h2>
<p>The key innovation is a fully automated, deterministic docking pipeline that produces reproducible scores from a SMILES string in four lines of Python code. The pipeline consists of three stages:</p>
<p><strong>Target Preparation.</strong> 57 of the 58 protein targets originate from the Directory of Useful Decoys Enhanced (DUD-E). PDB files are standardized with <a href="https://en.wikipedia.org/wiki/Open_Babel">Open Babel</a>, polar hydrogens are added, and conversion to PDBQT format is performed with AutoDock Tools. Search boxes are derived from crystallographic ligands with 12.5 Å padding and a minimum side length of 30 Å. The 58th target (DRD2, <a href="https://en.wikipedia.org/wiki/Dopamine_receptor_D2">dopamine receptor D2</a>) was prepared separately following the same protocol.</p>
<p><strong>Ligand Preparation.</strong> Ligands are protonated at pH 7.4 with Open Babel, embedded into 3D conformations using the ETKDG algorithm in RDKit, refined with the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94 force field</a>, and assigned Gasteiger partial charges. Stereochemistry of determined stereocenters is maintained, while undetermined stereocenters are assigned randomly but consistently across runs.</p>
<p><strong>Docking.</strong> AutoDock Vina runs with default exhaustiveness (8), up to 9 binding modes, and an energy range of 3 kcal/mol. The authors verified that fixing the random seed yields docking score variance of less than 0.1 kcal/mol across runs, making the pipeline fully deterministic.</p>
<p>The three de novo design objective functions incorporate a QED penalty to enforce druglikeness:</p>
<p>$$
f_{\text{F2}}(l) = s(l, \text{F2}) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{PPAR}}(l) = \max_{t \in \text{PPAR}} s(l, t) + 10(1 - \text{QED}(l))
$$</p>
<p>$$
f_{\text{JAK2}}(l) = s(l, \text{JAK2}) - \min(s(l, \text{LCK}), -8.1) + 10(1 - \text{QED}(l))
$$</p>
<p>The F2 task optimizes binding to a single protease. The Promiscuous <a href="https://en.wikipedia.org/wiki/Peroxisome_proliferator-activated_receptor">PPAR</a> task requires strong binding to three nuclear receptors simultaneously. The Selective <a href="https://en.wikipedia.org/wiki/Janus_kinase_2">JAK2</a> task is adversarial, requiring strong JAK2 binding while avoiding <a href="https://en.wikipedia.org/wiki/Tyrosin-protein_kinase_Lck">LCK</a> binding (two kinases with a score correlation of 0.80).</p>
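<p>Because docking scores are negative for strong binders, all three objectives are minimized. Under that convention, they can be sketched in plain Python (function names and target keys such as <code>PPARA</code> are illustrative; docking scores and QED values are assumed precomputed):</p>

```python
# Sketch of the three de novo objective functions, assuming docking
# scores (kcal/mol, lower = stronger binding) and QED values are
# precomputed; names and dict keys are illustrative, not the paper's API.

def f_f2(scores, qed):
    """F2 task: bind a single protease, with a druglikeness penalty."""
    return scores["F2"] + 10 * (1 - qed)

def f_ppar(scores, qed):
    """Promiscuous PPAR task: worst (max) score over the PPAR subtypes,
    so all three nuclear receptors must be bound strongly."""
    return max(scores[t] for t in ("PPARA", "PPARD", "PPARG")) + 10 * (1 - qed)

def f_jak2(scores, qed):
    """Selective JAK2 task: bind JAK2 while avoiding LCK (clipped at -8.1)."""
    return scores["JAK2"] - min(scores["LCK"], -8.1) + 10 * (1 - qed)
```

<p>The <code>max</code> in the PPAR objective and the subtracted, clipped LCK score in the JAK2 objective are what make those tasks promiscuous and selective, respectively.</p>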
<h2 id="experimental-setup-regression-virtual-screening-and-de-novo-design">Experimental Setup: Regression, Virtual Screening, and De Novo Design</h2>
<h3 id="dataset-construction">Dataset Construction</h3>
<p>The dataset draws its molecules from ExCAPE-DB, which curates PubChem and ChEMBL bioactivity assays. The authors selected all molecules with active labels against targets having at least 1,000 experimental actives, plus 150,000 inactive-only molecules. After discarding the 1.8% of molecules that failed ligand preparation, the final dataset contains 260,155 compounds docked against all 58 targets, producing over 15 million docking scores and poses. Generating the dataset required over 500,000 CPU hours.</p>
<p>Cluster analysis using <a href="https://en.wikipedia.org/wiki/DBSCAN">DBSCAN</a> (<a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard distance</a> threshold of 0.25 on RDKit fingerprints) found 52,000 clusters, and Bemis-Murcko scaffold decomposition identified 102,000 scaffolds, confirming high molecular diversity. Train/test splitting follows cluster labels to prevent data leakage.</p>
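<p>The Jaccard distance underlying the clustering reduces to a one-liner when fingerprints are represented as sets of &ldquo;on&rdquo; bit indices; a minimal sketch (the set representation is an assumption for illustration):</p>

```python
# Jaccard distance between binary fingerprints, represented as sets of
# "on" bit indices; pairs within distance 0.25 may be merged by DBSCAN.

def jaccard_distance(fp_a, fp_b):
    if not fp_a and not fp_b:
        return 0.0  # two empty fingerprints are identical by convention
    return 1.0 - len(fp_a & fp_b) / len(fp_a | fp_b)
```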
<h3 id="regression-baselines">Regression Baselines</h3>
<p>Five targets of varying difficulty were selected: <a href="https://en.wikipedia.org/wiki/Poly_(ADP-ribose)_polymerase">PARP1</a> (easy), F2 (easy-medium), KIT (medium), ESR2 (hard), and PGR (hard). Baselines include Ridge, Lasso, XGBoost, exact GP, sparse GP, MPNN, and Attentive FP.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Ridge</th>
          <th>Lasso</th>
          <th>XGBoost</th>
          <th>GP (exact)</th>
          <th>GP (sparse)</th>
          <th>MPNN</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>0.640</td>
          <td>0.640</td>
          <td>0.734</td>
          <td>0.707</td>
          <td>0.716</td>
          <td>0.953</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.519</td>
          <td>0.483</td>
          <td>0.660</td>
          <td>0.640</td>
          <td>0.598</td>
          <td>0.901</td>
          <td>0.981</td>
      </tr>
      <tr>
          <td>ESR2</td>
          <td>0.421</td>
          <td>0.416</td>
          <td>0.497</td>
          <td>0.441</td>
          <td>0.508</td>
          <td>0.506</td>
          <td>0.627</td>
      </tr>
      <tr>
          <td>F2</td>
          <td>0.672</td>
          <td>0.663</td>
          <td>0.688</td>
          <td>0.705</td>
          <td>0.744</td>
          <td>0.798</td>
          <td>0.880</td>
      </tr>
      <tr>
          <td>KIT</td>
          <td>0.604</td>
          <td>0.594</td>
          <td>0.674</td>
          <td>0.637</td>
          <td>0.684</td>
          <td>0.755</td>
          <td>0.806</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>0.706</td>
          <td>0.700</td>
          <td>0.723</td>
          <td>0.743</td>
          <td>0.772</td>
          <td>0.815</td>
          <td>0.910</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>0.242</td>
          <td>0.245</td>
          <td>0.345</td>
          <td>0.291</td>
          <td>0.387</td>
          <td>0.324</td>
          <td>0.678</td>
      </tr>
  </tbody>
</table>
<p>Values are mean $R^2$ over three runs. Attentive FP achieves the best performance on every target but remains well below perfect prediction on the harder targets, confirming that docking score regression is a meaningful benchmark.</p>
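<p>For reference, the reported metric is the standard coefficient of determination (this is the textbook definition, not code from the paper):</p>

```python
# Coefficient of determination R^2: fraction of variance in the docking
# scores explained by the model's predictions.

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot
```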
<h3 id="virtual-screening-baselines">Virtual Screening Baselines</h3>
<p>Models trained on PARP1, KIT, and PGR docking scores rank all molecules in <a href="/notes/chemistry/datasets/zinc-22/">ZINC20</a> (~1 billion compounds). The top 5,000 predictions are docked, and the enrichment factor (EF) is computed relative to a 0.1 percentile activity threshold.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Threshold (kcal/mol)</th>
          <th>FSS</th>
          <th>Ridge</th>
          <th>Attentive FP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>KIT</td>
          <td>-10.7</td>
          <td>239.2</td>
          <td>451.6</td>
          <td>766.5</td>
      </tr>
      <tr>
          <td>PARP1</td>
          <td>-12.1</td>
          <td>313.1</td>
          <td>325.9</td>
          <td>472.2</td>
      </tr>
      <tr>
          <td>PGR</td>
          <td>-10.1</td>
          <td>161.4</td>
          <td>120.5</td>
          <td>461.3</td>
      </tr>
  </tbody>
</table>
<p>The maximum possible EF is 1,000. Attentive FP substantially outperforms fingerprint similarity search (FSS) and Ridge regression across all targets.</p>
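<p>The enrichment-factor calculation itself is simple; a sketch, using the KIT threshold from the table and the 0.1th-percentile library hit fraction (0.001), which is what caps EF at 1,000:</p>

```python
# Enrichment factor: hit rate among the selected (docked) molecules
# divided by the library-wide hit rate. With a 0.1th-percentile activity
# threshold the library hit fraction is 0.001, so the maximum EF is
# 1 / 0.001 = 1,000. Default threshold is the KIT value from the table.

def enrichment_factor(selected_scores, threshold=-10.7,
                      library_hit_fraction=0.001):
    hits = sum(1 for s in selected_scores if s <= threshold)
    return (hits / len(selected_scores)) / library_hit_fraction
```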
<h3 id="de-novo-design-baselines">De Novo Design Baselines</h3>
<p>Four optimization methods were tested: <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> GA, <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, GP-BO with UCB acquisition ($\beta = 10$), and GP-BO with expected improvement (EI), each with a budget of 5,000 objective function evaluations. Without QED penalties, all methods easily surpass the best training set molecules but produce large, lipophilic, undrug-like compounds. With QED penalties, the tasks become substantially harder: GP-BO with EI is the only method that finds 25 molecules better than the training set across all three tasks.</p>
<p>The Selective JAK2 task proved hardest due to the high correlation between JAK2 and LCK scores. Pose analysis of the top de novo molecule revealed a dual binding mode: type V inhibitor behavior in JAK2 (binding distant N- and C-terminal lobe regions) and type I behavior in LCK (hinge-binding), suggesting a plausible selectivity mechanism.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ol>
<li>Docking scores are substantially harder to predict than logP or QED, making them more suitable for benchmarking high-performing ML models. Graph neural networks (Attentive FP) achieve near-perfect $R^2$ on logP but only 0.63-0.91 on docking targets.</li>
<li>In-distribution regression difficulty does not necessarily predict out-of-distribution virtual screening difficulty. PARP1 is easiest for regression, but KIT is easiest for virtual screening.</li>
<li>Adding a QED penalty to de novo design objectives transforms trivially solvable tasks into meaningful benchmarks. The adversarial Selective JAK2 objective, which exploits correlated docking scores, may be an effective way to avoid docking score biases toward large and lipophilic molecules.</li>
<li>Docking scores from related protein targets are highly correlated, supporting the biological meaningfulness of the dataset and enabling multiobjective and transfer learning tasks.</li>
</ol>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ul>
<li>Docking scores are approximate heuristics. They use static binding sites and force fields with limited calibration for certain metal ions. DOCKSTRING benchmarks should not substitute for rational drug design and experimental validation.</li>
<li>The pipeline relies on AutoDock Vina specifically; other docking programs may produce different rankings.</li>
<li>Top de novo molecules for F2 and Promiscuous PPAR contain conjugated ring structures uncommon in successful drugs.</li>
<li>Platform support is primarily Linux, with noted scoring inconsistencies on macOS.</li>
</ul>
<p><strong>Future directions</strong> mentioned include multiobjective tasks (transfer learning, few-shot learning), improved objective functions for better pharmacokinetic properties and synthetic feasibility, and multifidelity optimization tasks combining docking with more expensive computational methods.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Ligand source</td>
          <td>ExCAPE-DB (PubChem + ChEMBL)</td>
          <td>260,155 molecules</td>
          <td>Actives against 58 targets + 150K inactive-only</td>
      </tr>
      <tr>
          <td>Docking scores</td>
          <td>DOCKSTRING dataset</td>
          <td>15M+ scores and poses</td>
          <td>Full matrix across all molecule-target pairs</td>
      </tr>
      <tr>
          <td>Virtual screening library</td>
          <td>ZINC20</td>
          <td>~1 billion molecules</td>
          <td>Used for out-of-distribution evaluation</td>
      </tr>
      <tr>
          <td>Target structures</td>
          <td>DUD-E + PDB 6CM4 (DRD2)</td>
          <td>58 targets</td>
          <td>Kinases (22), enzymes (12), nuclear receptors (9), proteases (7), GPCRs (5), cytochromes (2), chaperone (1)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Docking engine</strong>: AutoDock Vina with default exhaustiveness (8), up to 9 binding modes, energy range of 3 kcal/mol</li>
<li><strong>Ligand preparation</strong>: Open Babel (protonation at pH 7.4), RDKit ETKDG (3D embedding), MMFF94 (force field refinement), Gasteiger charges</li>
<li><strong>Regression models</strong>: Ridge, Lasso, XGBoost (hyperparameters via 20-configuration random search with 5-fold CV), exact GP and sparse GP (Tanimoto kernel on fingerprints), MPNN, Attentive FP (DeepChem defaults, 10 epochs)</li>
<li><strong>Optimization</strong>: Graph GA (population 250, offspring 25, mutation rate 0.01), SELFIES GA (same population/offspring settings), GP-BO with UCB ($\beta = 10$) or EI (batch size 5, 1000 offspring, 25 generations per iteration)</li>
</ul>
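<p>The two GP-BO acquisition functions can be sketched for a minimization objective, given the GP posterior mean and standard deviation at a candidate molecule (standard textbook forms, not the paper's implementation; beta = 10 matches the UCB setting above):</p>

```python
import math

# Acquisition functions for GP-BO on a minimization objective; mu and
# sigma are the GP posterior mean and std at a candidate point.

def ucb(mu, sigma, beta=10.0):
    # Lower confidence bound: optimistic (low) estimates are preferred.
    return mu - beta * sigma

def expected_improvement(mu, sigma, f_best):
    # Expected amount by which the candidate improves on the incumbent.
    if sigma == 0.0:
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (f_best - mu) * cdf + sigma * pdf
```

<p>A large beta weights the sigma term heavily, so UCB with beta = 10 is strongly exploratory, while EI concentrates evaluations near the current best objective value.</p>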
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Setting</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$R^2$ (coefficient of determination)</td>
          <td>Regression</td>
          <td>Cluster-split train/test</td>
      </tr>
      <tr>
          <td>EF (enrichment factor)</td>
          <td>Virtual screening</td>
          <td>Top 5,000 from ZINC20, 0.1 percentile threshold</td>
      </tr>
      <tr>
          <td>Objective value trajectory</td>
          <td>De novo design</td>
          <td>5,000 function evaluation budget</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The dataset required over 500,000 CPU hours to compute, using the University of Cambridge Research Computing Service (EPSRC and DiRAC funded). Per-target docking takes approximately 15 seconds on 8 CPUs.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">DOCKSTRING Python package</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Wraps AutoDock Vina; available via conda-forge and PyPI</td>
      </tr>
      <tr>
          <td><a href="https://dockstring.github.io">DOCKSTRING dataset</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>15M+ docking scores and poses for 260K molecules x 58 targets</td>
      </tr>
      <tr>
          <td><a href="https://github.com/dockstring/dockstring">Benchmark baselines</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Regression, virtual screening, and de novo design baseline implementations</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: García-Ortegón, M., Simm, G. N. C., Tripp, A. J., Hernández-Lobato, J. M., Bender, A., &amp; Bacallado, S. (2022). DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. <em>Journal of Chemical Information and Modeling</em>, 62(15), 3486-3502. <a href="https://doi.org/10.1021/acs.jcim.1c01334">https://doi.org/10.1021/acs.jcim.1c01334</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling, 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://dockstring.github.io">DOCKSTRING Project Page</a></li>
<li><a href="https://github.com/dockstring/dockstring">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{garciaortegon2022dockstring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{DOCKSTRING}: Easy Molecular Docking Yields Better Benchmarks for Ligand Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Garc{\&#39;\i}a-Orteg{\&#39;o}n, Miguel and Simm, Gregor N. C. and Tripp, Austin J. and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and Bender, Andreas and Bacallado, Sergio}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{62}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3486--3502}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.1c01334}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Tartarus: Realistic Inverse Molecular Design Benchmarks</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/tartarus-inverse-molecular-design/</guid><description>Tartarus provides physics-based benchmark tasks for inverse molecular design spanning materials, drugs, and reactions with algorithm-domain dependencies.</description><content:encoded><![CDATA[<h2 id="a-resource-for-realistic-molecular-design-evaluation">A Resource for Realistic Molecular Design Evaluation</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is Tartarus, a modular benchmarking platform for inverse molecular design that provides physically grounded evaluation tasks across four application domains: organic photovoltaics, organic emitters, protein ligands, and chemical reaction substrates. Each task pairs a curated reference dataset with a computational simulation workflow that evaluates proposed molecular structures using established methods from computational chemistry (<a href="https://en.wikipedia.org/wiki/Force_field_(chemistry)">force fields</a>, semi-empirical quantum chemistry, <a href="https://en.wikipedia.org/wiki/Density_functional_theory">density functional theory</a>, and <a href="https://en.wikipedia.org/wiki/Docking_(molecular)">molecular docking</a>).</p>
<h2 id="the-problem-with-existing-molecular-design-benchmarks">The Problem with Existing Molecular Design Benchmarks</h2>
<p>Inverse molecular design, the challenge of crafting molecules with specific optimal properties, is central to drug, catalyst, and materials discovery. Many algorithms have been proposed for this task, but the benchmarks used to evaluate them have significant limitations:</p>
<ul>
<li><strong>Penalized logP</strong>, one of the most common benchmarks, depends heavily on molecule size and chain composition, limiting its informativeness.</li>
<li><strong>QED maximization</strong> has reached saturation, with numerous models achieving near-perfect scores.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a></strong> often yields near-perfect scores across models, obscuring meaningful performance differences. <a href="/notes/chemistry/molecular-design/generation/evaluation/pmo-sample-efficient-molecular-optimization/">Gao et al. (2022)</a> traced this saturation to the assumption of unlimited property evaluations; once evaluation budgets are imposed, much larger performance disparities emerge.</li>
<li><strong>MOSES</strong> evaluates distribution-matching ability, but the emergence of <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> and simple algorithms has made these tasks relatively straightforward.</li>
<li><strong>Molecular docking</strong> benchmarks are gaining popularity, but tend to favor reactive or unstable molecules and typically cover only drug design.</li>
</ul>
<p>These benchmarks share a common weakness: they rely on cheap, approximate property estimators (often QSAR models or simple heuristics) rather than physics-based simulations. This makes them poor proxies for real molecular design campaigns, where properties must be validated through computational or experimental workflows. Tartarus addresses this by providing benchmark tasks grounded in established simulation methods.</p>
<h2 id="physics-based-simulation-workflows-as-benchmark-oracles">Physics-Based Simulation Workflows as Benchmark Oracles</h2>
<p>The core innovation in Tartarus is the use of computational chemistry simulation pipelines as objective functions for benchmarking. Rather than relying on learned property predictors, each benchmark task runs a full simulation workflow to evaluate proposed molecules:</p>
<ol>
<li><strong>Organic Photovoltaics (OPV)</strong>: Starting from a <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> string, the workflow generates 3D coordinates with Open Babel, performs conformer search with CREST at the GFN-FF level, optimizes geometry at GFN2-xTB, and computes <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO/LUMO</a> energies. Power conversion efficiency (PCE) is estimated via the Scharber model for single-junction <a href="https://en.wikipedia.org/wiki/Organic_solar_cell">organic solar cells</a>. HOMO and LUMO energies are calibrated against DFT results from the Harvard Clean Energy Project Database using <a href="https://en.wikipedia.org/wiki/Theil%E2%80%93Sen_estimator">Theil-Sen regression</a>:</li>
</ol>
<p>$$
E_{\text{HOMO, calibrated}} = E_{\text{HOMO, GFN2-xTB}} \cdot 0.8051 + 2.5377 \text{ eV}
$$</p>
<p>$$
E_{\text{LUMO, calibrated}} = E_{\text{LUMO, GFN2-xTB}} \cdot 0.8788 + 3.7913 \text{ eV}
$$</p>
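<p>The two calibration maps above, written as plain functions (coefficients are the quoted Theil-Sen fits; all energies in eV):</p>

```python
# GFN2-xTB -> DFT calibration of frontier orbital energies (eV), using
# the Theil-Sen regression coefficients quoted above.

def calibrate_homo(e_homo_xtb):
    return e_homo_xtb * 0.8051 + 2.5377

def calibrate_lumo(e_lumo_xtb):
    return e_lumo_xtb * 0.8788 + 3.7913
```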
<ol start="2">
<li>
<p><strong>Organic Emitters (OLED)</strong>: The workflow uses conformer search via CREST, geometry optimization at GFN0-xTB, and TD-DFT single-point calculations at the B3LYP/6-31G* level with PySCF to extract singlet-triplet gaps, <a href="https://en.wikipedia.org/wiki/Oscillator_strength">oscillator strengths</a>, and vertical excitation energies.</p>
</li>
<li>
<p><strong>Protein Ligands</strong>: The workflow generates 3D coordinates, applies structural filters (<a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a>, reactive moiety checks), and performs molecular docking using QuickVina2 with re-scoring via smina against three protein targets: 1SYH (ionotropic glutamate receptor), 6Y2F (<a href="https://en.wikipedia.org/wiki/3C-like_protease">SARS-CoV-2 main protease</a>), and 4LDE (beta-2 adrenoceptor).</p>
</li>
<li>
<p><strong>Chemical Reaction Substrates</strong>: The workflow models the intramolecular double hydrogen transfer in syn-sesquinorbornenes using the SEAM force field approach at the GFN-FF/GFN2-xTB level to compute activation and reaction energies.</p>
</li>
</ol>
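<p>The Rule-of-Five stage of the ligand workflow reduces to four descriptor thresholds; a minimal sketch, assuming the descriptors (molecular weight, logP, H-bond donor/acceptor counts) have already been computed with a cheminformatics toolkit such as RDKit:</p>

```python
# Lipinski Rule-of-Five check on precomputed descriptors; descriptor
# values would normally come from a toolkit such as RDKit.

def passes_lipinski(mol_weight, logp, h_donors, h_acceptors):
    return (mol_weight <= 500
            and logp <= 5
            and h_donors <= 5
            and h_acceptors <= 10)
```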
<p>Each benchmark also includes a curated reference dataset for training generative models and a standardized evaluation protocol: train on 80% of the dataset, use 20% for hyperparameter optimization, then optimize structures starting from the best reference molecule with a constrained budget of 5,000 proposed compounds, a 24-hour runtime cap, and five independent repetitions.</p>
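<p>The evaluation-budget constraint can be enforced by wrapping the simulation workflow in a counting oracle; a hypothetical harness (class and function names are illustrative, not part of the Tartarus API):</p>

```python
# Hypothetical oracle wrapper enforcing the 5,000-evaluation budget;
# `oracle` stands in for a physics-based simulation workflow.

class BudgetExhausted(Exception):
    pass

class BudgetedOracle:
    def __init__(self, oracle, budget=5000):
        self.oracle = oracle
        self.budget = budget
        self.calls = 0
        self.best = None  # (score, smiles) of the best proposal so far

    def __call__(self, smiles):
        if self.calls >= self.budget:
            raise BudgetExhausted(f"budget of {self.budget} evaluations spent")
        self.calls += 1
        score = self.oracle(smiles)
        if self.best is None or score > self.best[0]:
            self.best = (score, smiles)
        return score
```

<p>Any generative model can then be scored by the best objective value found before <code>BudgetExhausted</code> is raised, which keeps comparisons fair across sample-hungry and sample-efficient optimizers.</p>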
<h2 id="benchmark-tasks-datasets-and-model-comparisons">Benchmark Tasks, Datasets, and Model Comparisons</h2>
<h3 id="models-evaluated">Models Evaluated</h3>
<p>Eight generative models spanning major algorithm families were tested:</p>
<ul>
<li><strong>VAEs</strong>: SMILES-VAE and SELFIES-VAE</li>
<li><strong>Flow models</strong>: MoFlow</li>
<li><strong>Reinforcement learning</strong>: <a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></li>
<li><strong>LSTM-based hill climbing</strong>: SMILES-LSTM-HC and SELFIES-LSTM-HC</li>
<li><strong>Genetic algorithms</strong>: <a href="/notes/chemistry/molecular-design/generation/search-based/graph-based-genetic-algorithm-chemical-space/">GB-GA</a> and JANUS</li>
</ul>
<h3 id="organic-photovoltaics-results">Organic Photovoltaics Results</h3>
<p>The reference dataset (CEP_SUB) contains approximately 25,000 molecules from the Harvard Clean Energy Project Database. Two objectives combine PCE with synthetic accessibility (SAscore):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>PCE_PCBM - SAscore</th>
          <th>PCE_PCDTBT - SAscore</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>7.57</td>
          <td>31.71</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>7.44 +/- 0.28</td>
          <td>10.23 +/- 11.14</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>7.05 +/- 0.66</td>
          <td>29.24 +/- 0.65</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>7.08 +/- 0.31</td>
          <td>29.81 +/- 0.37</td>
      </tr>
      <tr>
          <td>SMILES-LSTM-HC</td>
          <td>6.69 +/- 0.40</td>
          <td>31.79 +/- 0.15</td>
      </tr>
      <tr>
          <td>SELFIES-LSTM-HC</td>
          <td>7.40 +/- 0.41</td>
          <td>30.71 +/- 1.20</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>7.48 +/- 0.11</td>
          <td>30.47 +/- 0.44</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>7.78 +/- 0.02</td>
          <td>30.24 +/- 0.80</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>7.59 +/- 0.14</td>
          <td>31.34 +/- 0.74</td>
      </tr>
  </tbody>
</table>
<p>GB-GA achieves the best score on the first task (7.78), while SMILES-LSTM-HC leads on the second (31.79). Most models can marginally improve PCE but struggle to simultaneously improve PCE and reduce SAscore.</p>
<h3 id="organic-emitters-results">Organic Emitters Results</h3>
<p>The reference dataset (GDB-13_SUB) contains approximately 380,000 molecules filtered for conjugated pi-systems from <a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a>. Three objectives target singlet-triplet gap minimization, oscillator strength maximization, and a combined multi-objective:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(S1-T1)</th>
          <th>f12</th>
          <th>Multi-objective</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>0.020</td>
          <td>2.97</td>
          <td>-0.04</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>0.071 +/- 0.003</td>
          <td>0.50 +/- 0.27</td>
          <td>-0.57 +/- 0.33</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>0.016 +/- 0.001</td>
          <td>0.36 +/- 0.31</td>
          <td>0.17 +/- 0.10</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>0.013 +/- 0.001</td>
          <td>0.81 +/- 0.11</td>
          <td>-0.04 +/- 0.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>0.012 +/- 0.002</td>
          <td>2.14 +/- 0.45</td>
          <td>0.07 +/- 0.03</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>0.008 +/- 0.001</td>
          <td>2.07 +/- 0.16</td>
          <td>0.02 +/- 0.05</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS, GB-GA, and SELFIES-VAE generate compounds comparable to or improving upon the best training molecules. JANUS achieves the lowest singlet-triplet gap (0.008 eV), while SELFIES-VAE achieves the highest multi-objective fitness (0.17). Some proposed structures contain reactive moieties, likely because stability is not explicitly penalized in the objective functions.</p>
<h3 id="protein-ligand-results">Protein Ligand Results</h3>
<p>The reference dataset contains approximately 152,000 molecules from the DTP Open Compound Collection, filtered for drug-likeness. Docking is performed against three protein targets using both QuickVina2 and smina re-scoring:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>1SYH (smina)</th>
          <th>6Y2F (smina)</th>
          <th>4LDE (smina)</th>
          <th>SR (1SYH)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>-10.2</td>
          <td>-8.2</td>
          <td>-13.1</td>
          <td>100.0%</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>-10.4 +/- 0.6</td>
          <td>-8.9 +/- 0.8</td>
          <td>-11.1 +/- 0.4</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>SELFIES-VAE</td>
          <td>-10.9 +/- 0.3</td>
          <td>-10.1 +/- 0.4</td>
          <td>-11.9 +/- 0.2</td>
          <td>34.8%</td>
      </tr>
      <tr>
          <td>REINVENT</td>
          <td>-12.1 +/- 0.2</td>
          <td>-11.4 +/- 0.3</td>
          <td>-13.7 +/- 0.5</td>
          <td>77.8%</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>-12.0 +/- 0.2</td>
          <td>-11.0 +/- 0.2</td>
          <td>-13.8 +/- 0.4</td>
          <td>72.6%</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>-11.9 +/- 0.2</td>
          <td>-11.9 +/- 0.4</td>
          <td>-13.6 +/- 0.5</td>
          <td>68.4%</td>
      </tr>
  </tbody>
</table>
<p>No single model consistently achieves the best docking score across all three targets. REINVENT leads on 1SYH, JANUS on 6Y2F, and GB-GA on 4LDE. Both VAE models show low success rates for structural filter compliance (12-39%), while REINVENT, GAs, and LSTMs achieve 68-78%.</p>
<h3 id="chemical-reaction-substrates-results">Chemical Reaction Substrates Results</h3>
<p>The reference dataset (SNB-60K) contains approximately 60,000 syn-sesquinorbornene derivatives generated via <a href="/notes/chemistry/molecular-design/generation/search-based/stoned-selfies-chemical-space-exploration/">STONED-SELFIES</a> mutations. Four objectives target activation energy, reaction energy, and two combined metrics:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Delta E(activation)</th>
          <th>Delta E(reaction)</th>
          <th>Delta E(act) + Delta E(rxn)</th>
          <th>-Delta E(act) + Delta E(rxn)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Dataset</td>
          <td>64.94</td>
          <td>-34.39</td>
          <td>56.48</td>
          <td>-95.25</td>
      </tr>
      <tr>
          <td>SMILES-VAE</td>
          <td>76.81 +/- 0.25</td>
          <td>-10.96 +/- 0.71</td>
          <td>71.01 +/- 0.62</td>
          <td>-90.94 +/- 1.04</td>
      </tr>
      <tr>
          <td>MoFlow</td>
          <td>70.12 +/- 2.13</td>
          <td>-20.21 +/- 4.13</td>
          <td>63.21 +/- 0.69</td>
          <td>-92.82 +/- 3.06</td>
      </tr>
      <tr>
          <td>GB-GA</td>
          <td>56.04 +/- 3.07</td>
          <td>-41.39 +/- 5.76</td>
          <td>45.20 +/- 6.78</td>
          <td>-100.07 +/- 1.35</td>
      </tr>
      <tr>
          <td>JANUS</td>
          <td>47.56 +/- 2.19</td>
          <td>-45.37 +/- 7.90</td>
          <td>39.22 +/- 3.99</td>
          <td>-97.14 +/- 1.13</td>
      </tr>
  </tbody>
</table>
<p>Only JANUS and GB-GA consistently outperform the best reference compounds. Both VAE models fail to surpass the dataset baseline on any objective. JANUS achieves the best single-objective scores for activation energy (47.56) and reaction energy (-45.37), and the best combined score (39.22).</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="central-finding-algorithm-performance-is-domain-dependent">Central Finding: Algorithm Performance is Domain-Dependent</h3>
<p>The most important result from Tartarus is that no single generative model consistently outperforms the others across all benchmark domains. This has several implications:</p>
<ul>
<li><strong>Genetic algorithms (GB-GA and JANUS) show the most consistently strong performance</strong> across benchmarks, despite being among the simplest approaches and requiring minimal pre-conditioning time (seconds vs. hours for deep models).</li>
<li><strong>VAE-based models (SMILES-VAE and SELFIES-VAE) show the weakest overall performance</strong>, often failing to surpass the best molecules in the reference datasets. Their reliance on the available training data appears to limit their effectiveness.</li>
<li><strong>REINVENT performs competitively on protein ligand tasks</strong> but shows weaker performance on other benchmarks.</li>
<li><strong>Representation matters</strong>: SELFIES-based models generally outperform their SMILES-based counterparts (e.g., SELFIES-VAE vs. SMILES-VAE), consistent with SELFIES providing 100% validity guarantees.</li>
</ul>
<h3 id="timing-analysis">Timing Analysis</h3>
<p>Training time varies dramatically across models. Both VAEs require over 9 hours of GPU training, with estimated CPU-only training times of approximately 25 days. REINVENT and MoFlow train in under 1 hour. Both GAs complete pre-conditioning in seconds and require no GPU.</p>
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ul>
<li>Benchmark domains covered are not comprehensive and need expansion.</li>
<li>3D generative models are not well supported, as proposed conformers are ignored in favor of simulation-derived geometries.</li>
<li>The chemical reaction substrate benchmark requires specialized geometries (reactant, product, transition state) that most 3D generative models cannot produce.</li>
<li>Results depend heavily on both model hyperparameters and benchmark settings (compute budget, number of evaluations).</li>
<li>Objective functions may need revision when undesired structures are promoted.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPV Training</td>
          <td>CEP_SUB (Harvard Clean Energy Project subset)</td>
          <td>~25,000 molecules</td>
          <td>From HIPS/neural-fingerprint repository</td>
      </tr>
      <tr>
          <td>Emitter Training</td>
          <td>GDB-13_SUB (filtered GDB-13)</td>
          <td>~380,000 molecules</td>
          <td>Conjugated pi-system filter applied</td>
      </tr>
      <tr>
          <td>Ligand Training</td>
          <td>DTP Open Compound Collection (filtered)</td>
          <td>~152,000 molecules</td>
          <td>Drug-likeness and structural filters applied</td>
      </tr>
      <tr>
          <td>Reaction Training</td>
          <td>SNB-60K (STONED-SELFIES mutations)</td>
          <td>~60,000 molecules</td>
          <td>Generated from syn-sesquinorbornene core</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>All eight algorithms are implemented in the Tartarus repository with configuration files and installation instructions. The evaluation protocol specifies an 80/20 train/validation split, a population size of 5,000, a 24-hour runtime cap, and five independent runs per model.</p>
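<p>The protocol can be captured as a small configuration sketch (the key names below are ours, not the repository's):</p>

```python
# Illustrative configuration mirroring the Tartarus evaluation protocol;
# key names are invented for this sketch.
PROTOCOL = {
    "train_val_split": (0.8, 0.2),   # 80/20 train/validation
    "population_size": 5_000,
    "runtime_cap_hours": 24,
    "independent_runs": 5,
}

def split_sizes(n_molecules, split=PROTOCOL["train_val_split"]):
    """Sizes of the train/validation partitions for a reference dataset."""
    n_train = int(n_molecules * split[0])
    return n_train, n_molecules - n_train
```

<p>For example, the ~25,000-molecule CEP_SUB dataset would split into 20,000 training and 5,000 validation molecules under this protocol.</p>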
<h3 id="models">Models</h3>
<p>Pre-trained model checkpoints are not provided. Training must be performed from scratch using the provided reference datasets and hyperparameter configurations documented in the Supporting Information.</p>
<h3 id="evaluation">Evaluation</h3>
<p>Properties are evaluated through physics-based simulation workflows (not learned surrogates). Each workflow accepts a SMILES string and returns computed properties. Key software dependencies include: Open Babel, CREST, xTB, PySCF, QuickVina2, smina, and RDKit.</p>
<h3 id="hardware">Hardware</h3>
<p>Training and sampling benchmarks were conducted using 24 CPU cores (AMD Rome 7532 @ 2.40 GHz) and a single Tesla A100 GPU. Simulations were run on the Beluga, Narval, Niagara, Cedar, and Sherlock supercomputing clusters.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Benchmark tasks, simulation workflows, model configs</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Archive</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Reference datasets for all four benchmark domains</td>
      </tr>
      <tr>
          <td><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Discussion and collaboration channel</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Nigam, A., Pollice, R., Tom, G., Jorner, K., Willes, J., Thiede, L. A., Kundaje, A., &amp; Aspuru-Guzik, A. (2023). Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design. <em>Advances in Neural Information Processing Systems 36</em>, 3263-3306.</p>
<p><strong>Publication</strong>: NeurIPS 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/aspuru-guzik-group/Tartarus">Tartarus GitHub Repository</a></li>
<li><a href="https://zenodo.org/badge/latestdoi/444879123">Zenodo Dataset Archive</a></li>
<li><a href="https://discord.gg/KypwPXTY2s">Discord Community</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{nigam2023tartarus,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Tartarus: A Benchmarking Platform for Realistic And Practical Inverse Molecular Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Nigam, AkshatKumar and Pollice, Robert and Tom, Gary and Jorner, Kjell and Willes, John and Thiede, Luca A. and Kundaje, Anshul and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3263--3306}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SMINA Docking Benchmark for De Novo Drug Design Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/smina-docking-benchmark/</guid><description>A docking-based benchmark for evaluating de novo drug design generative models, using SMINA scoring across eight protein targets from ChEMBL.</description><content:encoded><![CDATA[<h2 id="a-docking-based-benchmark-for-de-novo-drug-design">A Docking-Based Benchmark for De Novo Drug Design</h2>
<p>This is a <strong>Resource</strong> paper. Its primary contribution is a standardized benchmark for evaluating generative models in de novo drug design. Rather than introducing a new generative method, the paper provides a reusable evaluation framework built around molecular docking, a widely used computational proxy for predicting protein-ligand binding. The benchmark uses SMINA (a fork of <a href="https://en.wikipedia.org/wiki/AutoDock">AutoDock Vina</a>) to score generated molecules against eight protein targets, offering a more realistic evaluation than commonly used proxy metrics like logP or QED.</p>
<h2 id="why-existing-benchmarks-fall-short">Why Existing Benchmarks Fall Short</h2>
<p>De novo drug design methods are typically evaluated using simple proxy tasks that do not reflect the complexity of real drug discovery. The octanol-water partition coefficient (logP) can be trivially optimized by producing unrealistic molecules. The QED drug-likeness score suffers from the same issue. Neural network-based bioactivity predictors are similarly exploitable.</p>
<p>As Coley et al. (2020) note: &ldquo;The current evaluations for generative models do not reflect the complexity of real discovery problems.&rdquo;</p>
<p>More realistic evaluation approaches exist in adjacent domains (photovoltaics, excitation energies), where physical calculations are used to both train and evaluate models. Yet de novo drug design has largely relied on the same simplistic proxies. This gap between proxy task performance and real-world utility motivates the development of a docking-based benchmark that, while still a proxy, captures more of the structural complexity involved in protein-ligand interactions.</p>
<h2 id="benchmark-design-smina-docking-with-the-vinardo-scoring-function">Benchmark Design: SMINA Docking with the Vinardo Scoring Function</h2>
<p>The benchmark is defined by three components: (1) docking software that computes a ligand&rsquo;s pose in the binding site, (2) a scoring function that evaluates the pose, and (3) a training set of compounds with precomputed docking scores.</p>
<p>The concrete instantiation uses SMINA v. 2017.11.9 with the Vinardo scoring function:</p>
<p>$$S = -0.045 \cdot G + 0.8 \cdot R - 0.035 \cdot H - 0.6 \cdot B$$</p>
<p>where $S$ is the docking score, $G$ is the gauss term, $R$ is repulsion, $H$ is the hydrophobic term, and $B$ is the non-directional hydrogen bond term. The gauss and repulsion terms measure steric interactions between the ligand and the protein, while the hydrophobic and hydrogen bond terms capture favorable non-covalent contacts.</p>
<p>The benchmark includes three task variants:</p>
<ol>
<li><strong>Docking Score Function</strong>: Optimize the full Vinardo docking score (lower is better).</li>
<li><strong>Repulsion</strong>: Minimize only the repulsion component, defined as:</li>
</ol>
<p>$$
R(a_1, a_2) = \begin{cases}
d(a_1, a_2)^2 &amp; d(a_1, a_2) &lt; 0 \\
0 &amp; \text{otherwise}
\end{cases}
$$</p>
<p>where $d(a_1, a_2)$ is the inter-atomic distance minus the sum of <a href="https://en.wikipedia.org/wiki/Van_der_Waals_radius">van der Waals radii</a>.</p>
<ol start="3">
<li><strong>Hydrogen Bonding</strong>: Maximize the hydrogen bond term:</li>
</ol>
<p>$$
B(a_1, a_2) = \begin{cases}
0 &amp; (a_1, a_2) \text{ do not form H-bond} \\
1 &amp; d(a_1, a_2) &lt; -0.6 \\
0 &amp; d(a_1, a_2) \geq 0 \\
\frac{d(a_1, a_2)}{-0.6} &amp; \text{otherwise}
\end{cases}
$$</p>
<p>Scores are averaged over the top 5 binding poses for stability. Generated compounds are filtered by <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and a minimum molecular weight of 100. Each model must generate 250 unique molecules per target.</p>
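<p>The composite score and the two piecewise terms above can be written out directly. The following is an illustrative sketch of the per-atom-pair terms and the weighted combination (function names are ours; this is not a docking engine, just the arithmetic):</p>

```python
def repulsion_term(d):
    """Repulsion R(a1, a2): squared overlap when the surface distance d
    (inter-atomic distance minus the sum of vdW radii) is negative."""
    return d * d if d < 0 else 0.0

def hbond_term(d, forms_hbond=True):
    """Non-directional hydrogen bond term B(a1, a2): 0 outside an H-bond,
    ramping linearly from 0 at d = 0 to 1 at d = -0.6."""
    if not forms_hbond:
        return 0.0
    if d < -0.6:
        return 1.0
    if d >= 0:
        return 0.0
    return d / -0.6

def vinardo_score(gauss, repulsion, hydrophobic, hbond):
    """Weighted Vinardo combination; lower (more negative) is better."""
    return -0.045 * gauss + 0.8 * repulsion - 0.035 * hydrophobic - 0.6 * hbond

# The benchmark averages over the top 5 binding poses for stability:
poses = [-10.2, -10.0, -9.8, -9.7, -9.5]
mean_top5 = sum(poses) / len(poses)
```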
<p>Training data comes from <a href="https://en.wikipedia.org/wiki/ChEMBL">ChEMBL</a>, covering eight drug targets: 5-HT1B, 5-HT2B, ACM2, CYP2D6, ADRB1, MOR, A2A, and D2. Dataset sizes range from 1,082 (ADRB1) to 10,225 (MOR) molecules.</p>
<h2 id="experimental-evaluation-of-three-generative-models">Experimental Evaluation of Three Generative Models</h2>
<h3 id="models-tested">Models Tested</h3>
<p>Three popular generative models were evaluated:</p>
<ul>
<li><strong><a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">CVAE</a></strong> (Chemical Variational Autoencoder): A VAE operating on <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GVAE</a></strong> (Grammar Variational Autoencoder): Extends CVAE by enforcing grammatical correctness of generated SMILES.</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a></strong>: A recurrent neural network trained first on ChEMBL in a supervised manner, then fine-tuned with reinforcement learning using docking scores as rewards.</li>
</ul>
<p>For CVAE and GVAE, molecules are generated by sampling from the latent space and taking 50 gradient steps to optimize an MLP that predicts the docking score. For REINVENT, a random forest model predicts docking scores from ECFP fingerprints, and the reward combines this prediction with the QED score.</p>
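<p>The latent-space optimization step can be illustrated with a toy differentiable surrogate. Here a hand-coded quadratic (with an analytic gradient) stands in for the MLP docking-score predictor; the real setup backpropagates through a trained network:</p>

```python
def predicted_score(z):
    """Toy stand-in for the MLP docking-score predictor,
    minimized at z = (1.0, -2.0)."""
    return (z[0] - 1.0) ** 2 + (z[1] + 2.0) ** 2

def gradient(z):
    """Analytic gradient of the toy surrogate."""
    return [2.0 * (z[0] - 1.0), 2.0 * (z[1] + 2.0)]

# 50 gradient steps in latent space, mirroring the benchmark protocol
# (lower predicted docking score is better, so we descend).
z = [0.0, 0.0]
lr = 0.1
for _ in range(50):
    g = gradient(z)
    z = [z[0] - lr * g[0], z[1] - lr * g[1]]
```

<p>After 50 steps the latent point has essentially converged to the surrogate's minimum; decoding such an optimized point is what produces the candidate molecule in the CVAE/GVAE pipeline.</p>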
<h3 id="baselines">Baselines</h3>
<p>Two baselines provide context:</p>
<ul>
<li><strong>Training set</strong>: The top 50%, 10%, and 1% of docking scores from the ChEMBL training set.</li>
<li><strong><a href="/notes/chemistry/datasets/zinc-22/">ZINC</a> subset</strong>: A random sample of ~9.2 million drug-like molecules from ZINC, with the same percentile breakdowns.</li>
</ul>
<p>Diversity is measured as the mean <a href="https://en.wikipedia.org/wiki/Jaccard_index">Tanimoto distance</a> (using 1024-bit ECFP with radius 2) between all pairs of generated molecules.</p>
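<p>As a sketch of this diversity metric (pure Python, with sets of on-bit indices standing in for RDKit's 1024-bit ECFP vectors):</p>

```python
from itertools import combinations

def tanimoto_distance(fp_a, fp_b):
    """1 - Jaccard/Tanimoto similarity between two fingerprints,
    each given as a set of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / union

def mean_pairwise_diversity(fingerprints):
    """Mean Tanimoto distance over all molecule pairs; higher = more diverse."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Three toy fingerprints: two similar, one disjoint from both.
fps = [{1, 5, 9}, {1, 5, 20}, {30, 40, 50}]
```

<p>With these toy fingerprints the mean pairwise distance is (0.5 + 1.0 + 1.0) / 3 ≈ 0.83, in the range of the diverse baselines reported above; a mode-collapsed generator would push this value toward 0.</p>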
<h3 id="key-results">Key Results</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Model</th>
          <th>5-HT1B Score</th>
          <th>5-HT1B Diversity</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking Score</td>
          <td>CVAE</td>
          <td>-4.647</td>
          <td>0.907</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>GVAE</td>
          <td>-4.955</td>
          <td>0.901</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>REINVENT</td>
          <td>-9.774</td>
          <td>0.506</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (10%)</td>
          <td>-9.894</td>
          <td>0.862</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>ZINC (1%)</td>
          <td>-10.496</td>
          <td>0.861</td>
      </tr>
      <tr>
          <td>Docking Score</td>
          <td>Train (10%)</td>
          <td>-10.837</td>
          <td>0.749</td>
      </tr>
  </tbody>
</table>
<p>On the full docking score task, CVAE and GVAE fail to match even the mean ZINC docking score. REINVENT performs substantially better (e.g., -9.774 on 5-HT1B) but still falls short of the top 10% ZINC scores (-9.894) in most cases. The exception is ACM2, where REINVENT&rsquo;s score (-9.775) exceeds the ZINC 10% threshold (-8.282).</p>
<p>On the repulsion task, all three models fail to outperform the top 10% ZINC scores. On the hydrogen bonding task (the easiest), GVAE and REINVENT nearly match the top 1% ZINC scores, suggesting that optimizing individual scoring components is more tractable than the full docking score.</p>
<p>A consistent finding across all experiments is that REINVENT generates substantially less diverse molecules than the training set (e.g., 0.506 vs. 0.787 mean Tanimoto distance on 5-HT1B). The t-SNE visualizations show generated molecules clustering in a single dense region, separate from the training data, regardless of optimization target.</p>
<p>The paper also notes a moderately strong correlation between docking scores and molecular weight or the number of rotatable bonds. Generated compounds achieve better docking scores at the same molecular weight after optimization, suggesting the models learn some structural preferences rather than simply exploiting molecular size.</p>
<h2 id="limitations-of-current-generative-models-for-drug-design">Limitations of Current Generative Models for Drug Design</h2>
<p>The main finding is negative: popular generative models for de novo drug design struggle to generate molecules that dock well when trained on realistically sized datasets (1,000 to 10,000 compounds). Even the best-performing model (REINVENT) generally cannot outperform the top 10% of a random ZINC subset on the full docking score task.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>Docking is itself a proxy</strong>: The SMINA docking score is only an approximation of true binding affinity. The fact that even this simpler proxy is challenging should raise concerns about these models&rsquo; readiness for real drug discovery pipelines.</li>
<li><strong>Limited model selection</strong>: Only three models were tested (CVAE, GVAE, REINVENT). The authors note that CVAE and GVAE were not designed for small training sets, and REINVENT may not represent the state of the art in all respects.</li>
<li><strong>ML-based scoring surrogate</strong>: All models use an ML model (MLP or random forest) to predict docking scores during generation, rather than running SMINA directly. This introduces an additional approximation layer.</li>
<li><strong>No similarity constraints</strong>: The benchmark does not impose constraints on the distance between generated and training molecules. A trivial baseline is to simply return the training set.</li>
</ul>
<p>On a more positive note, the tested models perform well on the simplest subtask (hydrogen bonding), suggesting that optimizing docking scores from limited data is attainable but challenging. The benchmark has already been adopted by other groups, notably Nigam et al. (2021) for evaluating their JANUS genetic algorithm.</p>
<p>Future directions include adding similarity constraints, extending to additional protein targets, and using the benchmark to evaluate newer structure-based generative models that employ equivariant neural networks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>ChEMBL (8 targets)</td>
          <td>1,082-10,225 molecules per target</td>
          <td>90/10 train/test split</td>
      </tr>
      <tr>
          <td>Baseline</td>
          <td>ZINC 15 subset</td>
          <td>~9.2M drug-like molecules</td>
          <td>In-stock, standard reactivity, drug-like</td>
      </tr>
      <tr>
          <td>Protein structures</td>
          <td><a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a></td>
          <td>8 structures</td>
          <td>Cleaned with Schrodinger modeling package</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>CVAE/GVAE: Fine-tuned 5 epochs on target data, then 50 gradient steps in latent space to optimize MLP-predicted score</li>
<li>REINVENT: Pretrained on ChEMBL, fine-tuned with RL; reward = random forest prediction * QED score</li>
<li>All docking performed with SMINA v. 2017.11.9 using Vinardo scoring function in score_only mode</li>
<li>Scores averaged over top 5 binding poses</li>
<li>Filtering: Lipinski Rule of Five, minimum molecular weight 100</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean docking score</td>
          <td>Average over 250 generated molecules</td>
          <td>Lower is better for docking score and repulsion</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>Mean Tanimoto distance (ECFP, r=2)</td>
          <td>Higher is more diverse</td>
      </tr>
      <tr>
          <td>ZINC percentile baselines</td>
          <td>Top 50%, 10%, 1% from random ZINC subset</td>
          <td>Task considered &ldquo;solved&rdquo; if generated score exceeds ZINC 1%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">smina-docking-benchmark</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Benchmark code, data, evaluation notebooks</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Cieplinski, T., Danel, T., Podlewska, S., &amp; Jastrzebski, S. (2023). Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark. <em>Journal of Chemical Information and Modeling</em>, 63(11), 3238-3247. <a href="https://doi.org/10.1021/acs.jcim.2c01355">https://doi.org/10.1021/acs.jcim.2c01355</a></p>
<p><strong>Publication</strong>: Journal of Chemical Information and Modeling 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/cieplinski-tobiasz/smina-docking-benchmark">GitHub Repository</a></li>
</ul>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{cieplinski2023generative,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generative Models Should at Least Be Able to Design Molecules That Dock Well: A New Benchmark}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Cieplinski, Tobiasz and Danel, Tomasz and Podlewska, Sabina and Jastrzebski, Stanislaw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Chemical Information and Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{63}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3238--3247}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acs.jcim.2c01355}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolGenSurvey: Systematic Survey of ML for Molecule Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</link><pubDate>Mon, 23 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molgensurvey-molecule-design/</guid><description>Survey of ML molecule design methods across 1D string, 2D graph, and 3D geometry representations with deep generative and optimization approaches.</description><content:encoded><![CDATA[<h2 id="a-taxonomy-for-ml-driven-molecule-design">A Taxonomy for ML-Driven Molecule Design</h2>
<p>This is a <strong>Systematization</strong> paper that reviews machine learning approaches for molecule design across all three major molecular representations (1D string, 2D graph, 3D geometry) and both deep generative and combinatorial optimization paradigms. Prior surveys (including <a href="/notes/chemistry/molecular-design/generation/evaluation/inverse-molecular-design-ml-review/">Sánchez-Lengeling &amp; Aspuru-Guzik, 2018</a>, <a href="/notes/chemistry/molecular-design/generation/evaluation/deep-learning-molecular-design-review/">Elton et al., 2019</a>, Xue et al. 2019, Vanhaelen et al. 2020, Alshehri et al. 2020, Jiménez-Luna et al. 2020, and Axelrod et al. 2022) each covered subsets of the literature (e.g., only generative methods, or only specific task types). MolGenSurvey extends these by unifying the field into a single taxonomy based on input type, output type, and generation goal, identifying eight distinct molecule generation tasks. It catalogs over 100 methods across these categories and provides a structured comparison of evaluation metrics, datasets, and experimental setups.</p>
<p>The chemical space of drug-like molecules is estimated at $10^{23}$ to $10^{60}$, making exhaustive enumeration computationally infeasible. Traditional high-throughput screening searches existing databases but is slow and expensive. ML-based generative approaches offer a way to intelligently explore this space, either by learning continuous latent representations (deep generative models) or by directly searching the discrete chemical space (combinatorial optimization methods).</p>
<h2 id="molecular-representations">Molecular Representations</h2>
<p>The survey identifies three mainstream featurization approaches for molecules, each carrying different tradeoffs for generation tasks.</p>
<h3 id="1d-string-descriptions">1D String Descriptions</h3>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> and <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a> are the two dominant string representations. SMILES encodes molecules as character strings following grammar rules for bonds, branches, and ring closures. Its main limitation is that arbitrary strings are often chemically invalid. SELFIES augments the encoding rules for branches and rings to achieve 100% validity by construction.</p>
<p>Other string representations exist (InChI, SMARTS) but are less commonly used for generation. Representation learning over strings has adopted CNNs, RNNs, and Transformers from NLP.</p>
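<p>To make the validity point concrete: full SMILES validation requires a cheminformatics toolkit such as RDKit, but even two purely syntactic conditions (balanced branch parentheses, paired single-digit ring closures) already rule out many arbitrary strings. The toy check below is a necessary-but-not-sufficient sketch, not a parser:</p>

```python
def plausible_smiles(s):
    """Toy syntactic check on a SMILES-like string: branch parentheses
    must balance, and each ring-closure digit must occur an even number
    of times (one open + one close). Real validity needs full chemistry."""
    depth = 0
    ring_counts = {}
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            ring_counts[ch] = ring_counts.get(ch, 0) + 1
    return depth == 0 and all(n % 2 == 0 for n in ring_counts.values())
```

<p>Strings like <code>CC(C</code> or <code>C1CC</code> fail even this weak check, which is exactly the failure mode SELFIES eliminates by making every string decodable to a valid molecule.</p>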
<h3 id="2d-molecular-graphs">2D Molecular Graphs</h3>
<p>Molecules naturally map to graphs where atoms are nodes and bonds are edges. Graph neural networks (GNNs), particularly those following the message-passing neural network (MPNN) framework, have become the standard representation method. The MPNN updates each node&rsquo;s representation by aggregating messages from its immediate neighbors; stacking $K$ such rounds gives each node a $K$-hop receptive field. Notable architectures include D-MPNN (directional message passing), PNA (diverse aggregation methods), AttentiveFP (attention-based), and Graphormer (transformer-based).</p>
<h3 id="3d-molecular-geometry">3D Molecular Geometry</h3>
<p>Molecules are inherently 3D objects with conformations (3D structures at local energy minima) that determine function. Representing 3D geometry requires models that respect E(3) or SE(3) equivariance (invariance to rotation and translation). The survey catalogs architectures along this line including SchNet, DimeNet, EGNN, SphereNet, and PaiNN.</p>
<p>Additional featurization methods (molecular fingerprints/descriptors, 3D density maps, 3D surface meshes, and chemical images) are noted but have seen limited use in generation tasks.</p>
<h2 id="deep-generative-models">Deep Generative Models</h2>
<p>The survey covers six families of deep generative models applied to molecule design.</p>
<h3 id="autoregressive-models-ars">Autoregressive Models (ARs)</h3>
<p>ARs factorize the joint distribution of a molecule as a product of conditional distributions over its subcomponents:</p>
<p>$$p(\boldsymbol{x}) = \prod_{i=1}^{d} p(x_i \mid x_1, x_2, \ldots, x_{i-1})$$</p>
<p>For molecular graphs, this means sequentially predicting the next atom or bond conditioned on the partial structure built so far. RNNs, Transformers, and BERT-style models all implement this paradigm.</p>
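<p>The factorization above can be made concrete with a toy bigram model over a three-token vocabulary. The table of conditionals below is invented for illustration; a real chemical language model conditions on the entire prefix via an RNN or Transformer rather than only the previous token.</p>

```python
import math

# A toy autoregressive model over a tiny SMILES-like vocabulary.
# p(x) = prod_i p(x_i | x_{i-1}), with "^" as start and "$" as end token.
# Each row of conditionals sums to 1.
cond = {
    ("^", "C"): 0.7, ("^", "O"): 0.3,
    ("C", "C"): 0.5, ("C", "O"): 0.2, ("C", "$"): 0.3,
    ("O", "C"): 0.6, ("O", "$"): 0.4,
}

def log_prob(tokens):
    """Sum log conditionals from the start token to the end token."""
    total = 0.0
    prev = "^"
    for t in tokens + ["$"]:
        total += math.log(cond[(prev, t)])
        prev = t
    return total

# p("CO") = p(C|^) * p(O|C) * p($|O) = 0.7 * 0.2 * 0.4
print(round(math.exp(log_prob(["C", "O"])), 4))  # 0.056
```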
<h3 id="variational-autoencoders-vaes">Variational Autoencoders (VAEs)</h3>
<p>VAEs learn a continuous latent space by maximizing the evidence lower bound (ELBO):</p>
<p>$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \| p(\boldsymbol{z}))$$</p>
<p>The first term is the reconstruction objective, and the second is a KL-divergence regularizer encouraging diverse, disentangled latent codes. Key molecular VAEs include <a href="/notes/chemistry/molecular-design/generation/latent-space/automatic-chemical-design-vae/">ChemVAE</a> (SMILES-based), JT-VAE (junction tree graphs), and <a href="/notes/chemistry/molecular-design/generation/latent-space/grammar-variational-autoencoder/">GrammarVAE</a> (grammar-constrained SMILES).</p>
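<p>For the common choice of a diagonal Gaussian encoder and a standard normal prior, the KL regularizer in the ELBO has a closed form, sketched here for a two-dimensional latent code.</p>

```python
import math

def kl_diag_gaussian(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ):
    0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

# At the prior (mu = 0, var = 1) the KL vanishes;
# shifting one mean component by 1 costs 0.5.
print(kl_diag_gaussian([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_diag_gaussian([1.0, 0.0], [0.0, 0.0]))  # 0.5
```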
<h3 id="normalizing-flows-nfs">Normalizing Flows (NFs)</h3>
<p>NFs model $p(\boldsymbol{x})$ via an invertible, deterministic mapping between data and latent space, using the change-of-variable formula with Jacobian determinants. Molecular applications include GraphNVP, MoFlow (one-shot graph generation), GraphAF (autoregressive flow), and GraphDF (discrete flow).</p>
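<p>A minimal sketch of the change-of-variable formula, using a 1D affine flow $x = az + b$ where the Jacobian determinant reduces to the scalar $|a|$; flows like GraphNVP replace it with the log-determinant of a full Jacobian.</p>

```python
import math

def log_prob_flow(x, a=2.0, b=1.0):
    """log p(x) = log N(z; 0, 1) - log|a| with z = (x - b) / a,
    i.e. the change-of-variable formula for an invertible affine map."""
    z = (x - b) / a
    log_base = -0.5 * (z * z + math.log(2 * math.pi))
    return log_base - math.log(abs(a))

def log_normal(x, mean, std):
    """Direct Gaussian log-density, for checking the flow."""
    return -0.5 * (((x - mean) / std) ** 2 + math.log(2 * math.pi)) - math.log(std)

# The affine flow of a standard normal is exactly N(b, a^2).
print(abs(log_prob_flow(3.0) - log_normal(3.0, 1.0, 2.0)) < 1e-12)  # True
```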
<h3 id="generative-adversarial-networks-gans">Generative Adversarial Networks (GANs)</h3>
<p>GANs use a generator-discriminator game where the generator produces molecules and the discriminator distinguishes real from generated samples. Molecular GANs include MolGAN (graph-based with RL reward), <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a> (SMILES-based with RL), and Mol-CycleGAN (molecule-to-molecule translation).</p>
<h3 id="diffusion-models">Diffusion Models</h3>
<p>Diffusion models learn to reverse a gradual noising process. The forward process adds Gaussian noise over $T$ steps; a neural network learns to denoise at each step. The training objective reduces to predicting the noise added at each step:</p>
<p>$$\mathcal{L}_t = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}_t}\left[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\boldsymbol{\epsilon}_t, t)\|^2\right]$$</p>
<p>Diffusion has been particularly successful for 3D conformation generation (ConfGF, GeoDiff, DGSM).</p>
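<p>A sketch of the forward process for a scalar feature under a linear $\beta$ schedule (the schedule endpoints are illustrative): the pair returned by <code>noise_sample</code> is exactly the (noisy input, regression target) used in the noise-prediction objective.</p>

```python
import math, random

random.seed(0)

# Linear beta schedule and cumulative product abar_t = prod_s (1 - beta_s),
# so that q(x_t | x_0) = N( sqrt(abar_t) x_0, (1 - abar_t) ).
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = []
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

def noise_sample(x0, t):
    """Sample x_t from q(x_t | x_0); eps is the denoiser's target."""
    eps = random.gauss(0.0, 1.0)
    x_t = math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps
    return x_t, eps

x_t, eps = noise_sample(1.0, t=99)
print(abar[0] > abar[99])  # signal decays monotonically: True
```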
<h3 id="energy-based-models-ebms">Energy-Based Models (EBMs)</h3>
<p>EBMs define $p(\boldsymbol{x}) = \frac{\exp(-E_\theta(\boldsymbol{x}))}{A}$ where $E_\theta$ is a learned energy function. The challenge is computing the intractable partition function $A$, addressed via contrastive divergence, noise-contrastive estimation, or score matching.</p>
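<p>On a toy discrete space the partition function can be enumerated exactly, which makes the definition concrete. The energies below are made up for illustration; real molecular spaces are far too large for this enumeration, which is what motivates the approximate training schemes.</p>

```python
import math

# p(x) = exp(-E(x)) / A over a four-element toy space,
# with A = sum_x exp(-E(x)) computed exactly.
space = ["CC", "CO", "CN", "OO"]
energy = {"CC": 0.0, "CO": 0.5, "CN": 1.0, "OO": 3.0}

A = sum(math.exp(-energy[x]) for x in space)          # partition function
p = {x: math.exp(-energy[x]) / A for x in space}      # normalized density

print(abs(sum(p.values()) - 1.0) < 1e-12)  # normalized: True
print(p["CC"] > p["OO"])                   # low energy => high probability: True
```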
<h2 id="combinatorial-optimization-methods">Combinatorial Optimization Methods</h2>
<p>Unlike DGMs that learn from data distributions, combinatorial optimization methods (COMs) search directly over discrete chemical space using oracle calls to evaluate candidate molecules.</p>
<h3 id="reinforcement-learning-rl">Reinforcement Learning (RL)</h3>
<p>RL formulates molecule generation as a Markov Decision Process: states are partial molecules, actions are adding/removing atoms or bonds, and rewards come from property oracles. Methods include GCPN (graph convolutional policy network), MolDQN (deep Q-network), RationaleRL (property-aware substructure assembly), and REINVENT (SMILES-based policy gradient).</p>
<h3 id="genetic-algorithms-ga">Genetic Algorithms (GA)</h3>
<p>GAs maintain a population of molecules and evolve them through mutation and crossover operations. GB-GA operates on molecular graphs, GA+D uses SELFIES with adversarial discriminator enhancement, and JANUS uses SELFIES with parallel exploration strategies.</p>
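<p>A minimal GA skeleton over token lists standing in for SELFIES strings. The oracle, vocabulary, and truncation-selection scheme are illustrative stand-ins, not GB-GA&rsquo;s or JANUS&rsquo;s actual operators; a real run would call a property predictor or docking oracle instead.</p>

```python
import random

random.seed(1)

VOCAB = ["[C]", "[O]", "[N]"]

def oracle(mol):
    """Toy objective: count of carbon tokens (max 8 for length-8 molecules)."""
    return mol.count("[C]")

def mutate(mol):
    """Replace one random token."""
    i = random.randrange(len(mol))
    return mol[:i] + [random.choice(VOCAB)] + mol[i + 1:]

def crossover(a, b):
    """Single-point crossover between two parents of equal length."""
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + b[cut:]

pop = [[random.choice(VOCAB) for _ in range(8)] for _ in range(20)]
for _ in range(30):
    pop.sort(key=oracle, reverse=True)
    parents = pop[:10]                       # truncation selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

best = max(pop, key=oracle)
print(oracle(best))  # typically 7-8 after 30 generations
```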
<h3 id="bayesian-optimization-bo">Bayesian Optimization (BO)</h3>
<p>BO builds a Gaussian process surrogate of the objective function and uses an acquisition function to decide which molecules to evaluate next. It is often combined with VAE latent spaces (Constrained-BO-VAE, MSO) to enable continuous optimization.</p>
<h3 id="monte-carlo-tree-search-mcts">Monte Carlo Tree Search (MCTS)</h3>
<p>MCTS explores the molecular construction tree by branching and evaluating promising intermediates. ChemTS and MP-MCTS combine MCTS with autoregressive SMILES generators.</p>
<h3 id="mcmc-sampling">MCMC Sampling</h3>
<p>MCMC methods (MIMOSA, MARS) formulate molecule optimization as sampling from a target distribution defined by multiple property objectives, using graph neural networks as proposal distributions.</p>
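<p>The sampling formulation can be sketched with Metropolis-Hastings over fixed-length token strings, targeting $p(x) \propto \exp(\mathrm{score}(x))$. The toy score and the random single-token proposal stand in for the multi-objective targets and learned GNN proposal distributions of MIMOSA and MARS.</p>

```python
import math, random

random.seed(0)

VOCAB = ["C", "O", "N"]

def score(mol):
    """Toy stand-in for a multi-objective property score."""
    return mol.count("C") - mol.count("N")

def propose(mol):
    """Symmetric proposal: resample one random position."""
    i = random.randrange(len(mol))
    return mol[:i] + random.choice(VOCAB) + mol[i + 1:]

x = "NNNNNNNN"
best, best_score = x, score(x)
for _ in range(2000):
    y = propose(x)
    # Symmetric proposal => accept with probability min(1, p(y)/p(x)).
    if math.log(random.random()) < score(y) - score(x):
        x = y
    if score(x) > best_score:
        best, best_score = x, score(x)

print(best, best_score)  # the chain drifts toward high-scoring molecules
```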
<h3 id="other-approaches">Other Approaches</h3>
<p>The survey also identifies two additional paradigms that do not fit neatly into either DGM or COM categories. <strong>Optimal Transport (OT)</strong> is used when matching between groups of molecules, particularly for conformation generation where each molecule has multiple associated 3D structures (e.g., GeoMol, EquiBind). <strong>Differentiable Learning</strong> formulates discrete molecules as differentiable objects, enabling gradient-based continuous optimization directly on molecular graphs (e.g., DST).</p>
<h2 id="task-taxonomy-eight-molecule-generation-tasks">Task Taxonomy: Eight Molecule Generation Tasks</h2>
<p>The survey&rsquo;s central organizational contribution is a unified taxonomy of eight distinct molecule design tasks, defined by three axes: (1) whether generation is <em>de novo</em> (from scratch, no reference molecule) or conditioned on an input molecule, (2) whether the goal is <em>generation</em> (distribution learning, producing valid and diverse molecules) or <em>optimization</em> (goal-directed search for molecules with specific properties), and (3) the input/output data representation (1D string, 2D graph, 3D geometry). The paper&rsquo;s Table 2 maps all combinations of these axes, showing that many are not meaningful (e.g., 1D string input to 2D graph output with no goal). Only eight combinations correspond to active research areas.</p>
<h3 id="1d2d-tasks">1D/2D Tasks</h3>
<ul>
<li><strong>De novo 1D/2D molecule generation</strong>: Generate new molecules from scratch to match a training distribution. Methods span VAEs (ChemVAE, JT-VAE), flows (GraphNVP, MoFlow, GraphAF), GANs (MolGAN, <a href="/notes/chemistry/molecular-design/generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN</a>), ARs (<a href="/notes/chemistry/molecular-design/generation/rl-tuned/molecularrnn-graph-generation-optimized-properties/">MolecularRNN</a>), and EBMs (GraphEBM).</li>
<li><strong>De novo 1D/2D molecule optimization</strong>: Generate molecules with optimal properties from scratch, using oracle feedback. Methods include RL (GCPN, MolDQN), GA (GB-GA, JANUS), MCTS (ChemTS), and MCMC (MIMOSA, MARS).</li>
<li><strong>1D/2D molecule optimization</strong>: Optimize properties of a given input molecule via local search. Methods include graph-to-graph translation (VJTNN, CORE, MOLER), VAE+BO (MSO, Constrained-BO-VAE), GANs (Mol-CycleGAN, <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>), and differentiable approaches (DST).</li>
</ul>
<h3 id="3d-tasks">3D Tasks</h3>
<ul>
<li><strong>De novo 3D molecule generation</strong>: Generate novel 3D molecular structures from scratch, respecting geometric validity. Methods include ARs (G-SchNet, G-SphereNet), VAEs (3DMolNet), flows (E-NFs), and RL (MolGym).</li>
<li><strong>De novo 3D conformation generation</strong>: Generate 3D conformations from given 2D molecular graphs. Methods include VAEs (CVGAE, ConfVAE), diffusion models (ConfGF, GeoDiff, DGSM), and optimal transport (GeoMol).</li>
<li><strong>De novo binding-based 3D molecule generation</strong>: Design 3D molecules for specific protein binding pockets. Methods include density-based VAEs (liGAN), RL (DeepLigBuilder), and ARs (3DSBDD).</li>
<li><strong>De novo binding-pose conformation generation</strong>: Find the appropriate 3D conformation of a given molecule for a given protein pocket. Methods include EBMs (DeepDock) and optimal transport (EquiBind).</li>
<li><strong>3D molecule optimization</strong>: Optimize 3D molecular properties (scaffold replacement, conformation refinement). Methods include BO (BOA), ARs (3D-Scaffold, cG-SchNet), and VAEs (Coarse-GrainingVAE).</li>
</ul>
<h2 id="evaluation-metrics">Evaluation Metrics</h2>
<p>The survey organizes evaluation metrics into four categories.</p>
<h3 id="generation-evaluation">Generation Evaluation</h3>
<p>Basic metrics assess the quality of generated molecules:</p>
<ul>
<li><strong>Validity</strong>: fraction of chemically valid molecules among all generated molecules</li>
<li><strong>Novelty</strong>: fraction of generated molecules absent from the training set</li>
<li><strong>Uniqueness</strong>: fraction of distinct molecules among generated samples</li>
<li><strong>Quality</strong>: fraction passing a predefined chemical rule filter</li>
<li><strong>Diversity</strong> (internal/external): measured via pairwise similarity (Tanimoto, scaffold, or fragment) within generated set and between generated and training sets</li>
</ul>
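<p>The first four metrics reduce to simple set operations once a validity oracle is fixed. The sketch below uses a lookup-table &ldquo;oracle&rdquo; and character-set Tanimoto similarity in place of RDKit parsing and real molecular fingerprints.</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity between two molecules via their character sets
    (a crude stand-in for fingerprint-based Tanimoto)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def metrics(generated, valid_oracle, train):
    valid = [m for m in generated if m in valid_oracle]
    unique = set(valid)
    sims = [tanimoto(x, y) for i, x in enumerate(valid)
            for y in valid[i + 1:]]
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid),
        "novelty": len(unique - set(train)) / len(unique),
        # Internal diversity = 1 - mean pairwise similarity.
        "internal_diversity": 1.0 - sum(sims) / len(sims),
    }

train = ["CCO", "CCN"]
generated = ["CCO", "CCC", "CCC", "C(("]   # last one is invalid
m = metrics(generated, {"CCO", "CCC"}, train)
print(m["validity"], m["novelty"])  # validity 3/4, novelty 1/2
```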
<h3 id="distribution-evaluation">Distribution Evaluation</h3>
<p>Metrics measuring how well generated molecules capture the training distribution: KL divergence over physicochemical descriptors, <a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD), and Maximum Mean Discrepancy (MMD).</p>
<h3 id="optimization-evaluation">Optimization Evaluation</h3>
<p>Property oracles used as optimization targets: Synthetic Accessibility (SA), Quantitative Estimate of Drug-likeness (QED), LogP, kinase inhibition scores (GSK3-beta, JNK3), DRD2 activity, <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark oracles, and Vina docking scores. Constrained optimization additionally considers structural similarity to reference molecules via Tanimoto, scaffold, or fragment similarity.</p>
<h3 id="3d-evaluation">3D Evaluation</h3>
<p>3D-specific metrics include stability (matching valence rules in 3D), RMSD and Kabsch-RMSD (conformation alignment), and Coverage/Matching scores for conformation ensembles.</p>
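<p>Plain RMSD is a direct computation once atom correspondences are fixed; Kabsch-RMSD additionally finds the optimal rigid rotation and translation first. A sketch assuming pre-aligned conformations with matching atom order:</p>

```python
import math

def rmsd(conf_a, conf_b):
    """Root-mean-square deviation between two conformations given as
    lists of (x, y, z) coordinates with identical atom ordering."""
    n = len(conf_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(conf_a, conf_b))
    return math.sqrt(sq / n)

a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0)]   # second atom displaced by 0.1 A
print(round(rmsd(a, b), 4))  # 0.0707
```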
<h2 id="datasets">Datasets</h2>
<p>The survey catalogs 12 major datasets spanning 1D/2D and 3D molecule generation:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Scale</th>
          <th>Dimensionality</th>
          <th>Purpose</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC</td>
          <td>250K</td>
          <td>1D/2D</td>
          <td>Virtual screening compounds</td>
      </tr>
      <tr>
          <td>ChEMBL</td>
          <td>2.1M</td>
          <td>1D/2D</td>
          <td>Bioactive molecules</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a></td>
          <td>1.9M</td>
          <td>1D/2D</td>
          <td>Benchmarking generation</td>
      </tr>
      <tr>
          <td>CEPDB</td>
          <td>4.3M</td>
          <td>1D/2D</td>
          <td>Organic photovoltaics</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/gdb-13/">GDB-13</a></td>
          <td>970M</td>
          <td>1D/2D</td>
          <td>Enumerated small molecules</td>
      </tr>
      <tr>
          <td>QM9</td>
          <td>134K</td>
          <td>1D/2D/3D</td>
          <td>Quantum chemistry properties</td>
      </tr>
      <tr>
          <td><a href="/notes/chemistry/datasets/geom/">GEOM</a></td>
          <td>450K/37M</td>
          <td>1D/2D/3D</td>
          <td>Conformer ensembles</td>
      </tr>
      <tr>
          <td>ISO17</td>
          <td>200/431K</td>
          <td>1D/2D/3D</td>
          <td>Molecule-conformation pairs</td>
      </tr>
      <tr>
          <td>Molecule3D</td>
          <td>3.9M</td>
          <td>1D/2D/3D</td>
          <td>DFT ground-state geometries</td>
      </tr>
      <tr>
          <td>CrossDock2020</td>
          <td>22.5M</td>
          <td>1D/2D/3D</td>
          <td>Docked ligand poses</td>
      </tr>
      <tr>
          <td>scPDB</td>
          <td>16K</td>
          <td>1D/2D/3D</td>
          <td>Binding sites</td>
      </tr>
      <tr>
          <td>DUD-E</td>
          <td>23K</td>
          <td>1D/2D/3D</td>
          <td>Active compounds with decoys</td>
      </tr>
  </tbody>
</table>
<h2 id="challenges-and-opportunities">Challenges and Opportunities</h2>
<h3 id="challenges">Challenges</h3>
<ol>
<li><strong>Out-of-distribution generation</strong>: Most deep generative models imitate known molecule distributions and struggle to explore truly novel chemical space.</li>
<li><strong>Unrealistic problem formulation</strong>: Many task setups do not respect real-world chemistry constraints.</li>
<li><strong>Expensive oracle calls</strong>: Methods typically assume unlimited access to property evaluators, which is unrealistic in drug discovery.</li>
<li><strong>Lack of interpretability</strong>: Few methods explain why generated molecules have desired properties. Quantitative interpretability evaluation remains an open problem.</li>
<li><strong>No unified evaluation protocols</strong>: The field lacks consensus on what defines a &ldquo;good&rdquo; drug candidate and how to fairly compare methods.</li>
<li><strong>Insufficient benchmarking</strong>: Despite the enormous chemical space ($10^{23}$ to $10^{60}$ drug-like molecules), available benchmarks use only small fractions of large databases.</li>
<li><strong>Low-data regime</strong>: Many real-world applications have limited training data, and generating molecules under data scarcity remains difficult.</li>
</ol>
<h3 id="opportunities">Opportunities</h3>
<ol>
<li><strong>Extension to complex structured data</strong>: Techniques from small molecule generation may transfer to proteins, antibodies, genes, crystal structures, and polysaccharides.</li>
<li><strong>Connection to later drug development phases</strong>: Bridging the gap between molecule design and preclinical/clinical trial outcomes could improve real-world impact.</li>
<li><strong>Knowledge discovery</strong>: Generative models over molecular latent spaces could reveal chemical rules governing molecular properties, and graph structure learning could uncover implicit non-bonded interactions.</li>
</ol>
<h2 id="limitations">Limitations</h2>
<ul>
<li>The survey was published in March 2022, so it does not cover subsequent advances in diffusion models for molecules (e.g., EDM, DiffSBDD), large language models applied to chemistry, or flow matching approaches.</li>
<li>Coverage focuses on small molecules. Macromolecule design (proteins, nucleic acids) is noted as a future direction rather than surveyed.</li>
<li>The survey catalogs methods but does not provide head-to-head experimental comparisons across all 100+ methods. Empirical discussion relies on individual papers&rsquo; reported results.</li>
<li>1D string-based methods receive less detailed coverage than graph and geometry-based approaches, reflecting the field&rsquo;s shift toward structured representations at the time of writing.</li>
<li>As a survey, this paper produces no code, models, or datasets. The surveyed methods&rsquo; individual repositories are referenced in their original publications but are not aggregated here.</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Du, Y., Fu, T., Sun, J., &amp; Liu, S. (2022). MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. <em>arXiv preprint arXiv:2203.14500</em>.</p>
<p><strong>Publication</strong>: arXiv preprint, March 2022. <strong>Note</strong>: This survey covers literature through early 2022 and does not include subsequent advances in diffusion models, LLMs for chemistry, or flow matching.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://arxiv.org/abs/2203.14500">arXiv: 2203.14500</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{du2022molgensurvey,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Du, Yuanqi and Fu, Tianfan and Sun, Jimeng and Liu, Shengchao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2203.14500}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>UnCorrupt SMILES: Post Hoc Correction for De Novo Design</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/uncorrupt-smiles/</guid><description>A transformer-based SMILES corrector that fixes invalid outputs from molecular generators, recovering 60-95% of erroneous SMILES strings.</description><content:encoded><![CDATA[<h2 id="a-transformer-based-smiles-error-corrector">A Transformer-Based SMILES Error Corrector</h2>
<p>This is a <strong>Method</strong> paper that proposes a post hoc approach to fixing invalid SMILES produced by de novo molecular generators. Rather than trying to prevent invalid outputs through alternative representations (<a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or constrained architectures (graph models), the authors train a transformer model to translate invalid SMILES into valid ones. The corrector is framed as a sequence-to-sequence translation task, drawing on techniques from grammatical error correction (GEC) in natural language processing.</p>
<h2 id="the-problem-of-invalid-smiles-in-molecular-generation">The Problem of Invalid SMILES in Molecular Generation</h2>
<p><a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a>-based generative models produce a fraction of outputs that cannot be converted to molecules. The invalidity rate varies substantially across model types:</p>
<ul>
<li><strong>RNN models</strong> (DrugEx): 5.7% invalid (pretrained) and 4.7% invalid (target-directed)</li>
<li><strong>GANs</strong> (ORGANIC): 9.5% invalid</li>
<li><strong>VAEs</strong> (GENTRL): 88.9% invalid</li>
</ul>
<p>These invalid outputs represent wasted computation and potentially introduce bias toward molecules that are easier to generate correctly. Previous approaches to this problem include using alternative representations (<a href="/notes/chemistry/molecular-representations/notations/deepsmiles-adaptation-for-ml/">DeepSMILES</a>, <a href="/notes/chemistry/molecular-representations/notations/selfies/">SELFIES</a>) or graph-based models, but these either limit the search space or increase computational cost. The authors propose a complementary strategy: fix the errors after generation.</p>
<h2 id="error-taxonomy-across-generator-types">Error Taxonomy Across Generator Types</h2>
<p>The paper classifies invalid SMILES errors into six categories based on RDKit error messages:</p>
<ol>
<li><strong>Syntax errors</strong>: malformed SMILES grammar</li>
<li><strong>Unclosed rings</strong>: unmatched ring closure digits</li>
<li><strong>Parentheses errors</strong>: unbalanced open/close parentheses</li>
<li><strong>Bond already exists</strong>: duplicate bonds between the same atoms</li>
<li><strong>Aromaticity errors</strong>: atoms incorrectly marked as aromatic or kekulization failures</li>
<li><strong>Valence errors</strong>: atoms exceeding their maximum bond count</li>
</ol>
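<p>The two purely syntactic categories (parentheses errors and unclosed rings) can be detected without any chemistry, as sketched below for single-digit ring labels. The paper itself derives labels from RDKit parser error messages, which is also what catches valence, aromaticity, and duplicate-bond errors.</p>

```python
def classify_syntax_errors(smiles: str):
    """Detect unbalanced parentheses and unclosed single-digit ring
    labels; valence/aromaticity/bond errors require a real parser."""
    errors = []
    depth = 0
    ring_open = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch.isdigit():
            ring_open.symmetric_difference_update(ch)  # toggle open/closed
    if depth != 0:
        errors.append("parentheses")
    if ring_open:
        errors.append("unclosed ring")
    return errors

print(classify_syntax_errors("C1CC(C"))  # ['parentheses', 'unclosed ring']
print(classify_syntax_errors("C1CCC1"))  # []
```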
<p>The distribution of error types differs across generators. RNN-based models primarily produce aromaticity errors, suggesting they learn SMILES grammar well but struggle with chemical validity. The GAN (ORGANIC) produces mostly valence errors. The VAE (GENTRL) produces more grammar-level errors (syntax, parentheses, unclosed rings), indicating that sampling from the continuous latent space often produces sequences that violate basic SMILES structure.</p>
<h2 id="architecture-and-training">Architecture and Training</h2>
<p>The SMILES corrector uses a standard encoder-decoder transformer architecture based on Vaswani et al., with learned positional encodings. Key specifications:</p>
<ul>
<li>Embedding dimension: 256</li>
<li>Encoder/decoder layers: 3 each</li>
<li>Attention heads: 8</li>
<li>Feed-forward dimension: 512</li>
<li>Dropout: 0.1</li>
<li>Optimizer: Adam (learning rate 0.0005)</li>
<li>Training: 20 epochs, batch size 16</li>
</ul>
<p>Since no dataset of manually corrected invalid-valid SMILES pairs exists, the authors create synthetic training data by introducing errors into valid SMILES from the Papyrus bioactivity dataset (approximately 1.3M pairs). Errors are introduced through random perturbations following SMILES syntax rules: character substitutions, bond order changes, fragment additions from the <a href="/notes/chemistry/datasets/gdb-11/">GDB</a>-8 database to atoms with full valence, and other structural modifications.</p>
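<p>A simplified sketch of the pair-construction step: corrupt a valid SMILES with $n$ random single-character edits to obtain an (invalid input, valid target) training pair. The token list and edit operations here are illustrative; the paper&rsquo;s scheme additionally changes bond orders and grafts GDB-8 fragments onto atoms with full valence.</p>

```python
import random

random.seed(7)

TOKENS = list("CNOcno()=#123")

def corrupt(smiles: str, n_errors: int) -> str:
    """Apply n random single-character edits (substitute/insert/delete)."""
    s = list(smiles)
    for _ in range(n_errors):
        op = random.choice(["substitute", "insert", "delete"])
        i = random.randrange(len(s))
        if op == "substitute":
            s[i] = random.choice(TOKENS)
        elif op == "insert":
            s.insert(i, random.choice(TOKENS))
        elif len(s) > 1:           # delete, but never empty the string
            del s[i]
    return "".join(s)

valid = "CC(=O)Oc1ccccc1C(=O)O"                 # aspirin
pair = (corrupt(valid, n_errors=12), valid)     # (corrupted input, target)
print(pair[0])
print(pair[0] != valid)  # True
```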
<h2 id="training-with-multiple-errors-improves-correction">Training with Multiple Errors Improves Correction</h2>
<p>A key finding is that training the corrector on inputs with multiple errors per SMILES substantially improves performance on real generator outputs. The baseline model (1 error per input) fixes 35-80% of invalid outputs depending on the generator. Increasing errors per training input to 12 raises this to 62-95%:</p>
<table>
  <thead>
      <tr>
          <th>Generator</th>
          <th>1 error/input</th>
          <th>12 errors/input</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RNN (DrugEx)</td>
          <td>~60% fixed</td>
          <td>62% fixed</td>
      </tr>
      <tr>
          <td>Target-directed RNN</td>
          <td>~60% fixed</td>
          <td>68% fixed</td>
      </tr>
      <tr>
          <td>GAN (ORGANIC)</td>
          <td>~80% fixed</td>
          <td>95% fixed</td>
      </tr>
      <tr>
          <td>VAE (GENTRL)</td>
          <td>~35% fixed</td>
          <td>80% fixed</td>
      </tr>
  </tbody>
</table>
<p>Training beyond 12 errors per input yields diminishing returns (80% average at 20 errors vs. 78% at 12). The improvement from multi-error training is consistent with GEC literature, where models learn to &ldquo;distrust&rdquo; inputs more when exposed to higher error rates.</p>
<p>The model also shows low overcorrection: only 14% of valid SMILES are altered during translation, comparable to overcorrection rates in spelling correction systems.</p>
<h2 id="fixed-molecules-are-comparable-to-generator-outputs">Fixed Molecules Are Comparable to Generator Outputs</h2>
<p>The corrected molecules are evaluated against both the training set and the readily generated (valid) molecules from each generator:</p>
<ul>
<li><strong>Uniqueness</strong>: 97% of corrected molecules are unique</li>
<li><strong>Novelty vs. generated</strong>: 97% of corrected molecules are novel compared to the valid generator outputs</li>
<li><strong>Similarity to nearest neighbor (SNN)</strong>: 0.45 between fixed and generated sets, indicating the corrected molecules explore different parts of chemical space</li>
<li><strong>Property distributions</strong>: KL divergence scores between fixed molecules and the training set are comparable to those between generated molecules and the training set</li>
</ul>
<p>This demonstrates that SMILES correction produces molecules that are as chemically reasonable as the generator&rsquo;s valid outputs while exploring complementary regions of chemical space.</p>
<h2 id="local-chemical-space-exploration-via-error-introduction">Local Chemical Space Exploration via Error Introduction</h2>
<p>Beyond fixing generator errors, the authors propose using the SMILES corrector for analog generation. The workflow is:</p>
<ol>
<li>Take a known active molecule</li>
<li>Introduce random errors into its SMILES (repeated 1000 times)</li>
<li>Correct the errors using the trained corrector</li>
</ol>
<p>This &ldquo;local sequence exploration&rdquo; generates novel analogs with 97% validity. The uniqueness (39%) and novelty (16-37%) are lower than for generator correction because the corrector often regenerates the original molecule. However, the approach produces molecules that are structurally similar to the starting compound (SNN of 0.85 to known ligands).</p>
<p>The authors demonstrate this on selective <a href="https://en.wikipedia.org/wiki/Aurora_kinase_B">Aurora kinase B</a> (AURKB) inhibitors. The generated analogs occupy the same binding site region as the co-crystallized ligand VX-680 in docking studies, with predicted bioactivities similar to known compounds. Compared to target-directed RNN generation, SMILES exploration produces molecules closer to known actives (higher SNN, scaffold similarity, and KL divergence scores).</p>
<h2 id="limitations">Limitations</h2>
<p>The corrector&rsquo;s performance drops on real generator outputs compared to synthetic test data, because the synthetic error distribution does not perfectly match the errors that generators actually produce. Generator-specific correctors trained on actual invalid outputs could improve performance. The local exploration approach has limited novelty since the corrector frequently regenerates the original molecule. The evaluation uses predicted rather than experimental bioactivities for the Aurora kinase case study.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">LindeSchoenmaker/SMILES-corrector</a></td>
          <td>Code + Data</td>
          <td>MIT</td>
          <td>Training code, synthetic error generation, and evaluation scripts</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: Synthetic training pairs derived from the Papyrus bioactivity dataset (v5.5). Approximately 1.3M invalid-valid pairs per error-count setting.</p>
<p><strong>Code</strong>: Transformer implemented in PyTorch, adapted from Ben Trevett&rsquo;s seq2seq tutorial. Generative model baselines use DrugEx, GENTRL, and ORGANIC.</p>
<p><strong>Evaluation</strong>: Validity assessed with RDKit. Similarity metrics (SNN, fragment, scaffold) and KL divergence computed following <a href="/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/">MOSES</a> and <a href="/notes/chemistry/molecular-design/generation/evaluation/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> benchmark protocols.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Schoenmaker, L., Béquignon, O. J. M., Jespers, W., &amp; van Westen, G. J. P. (2023). UnCorrupt SMILES: a novel approach to de novo design. <em>Journal of Cheminformatics</em>, 15, 22.</p>
<p><strong>Publication</strong>: Journal of Cheminformatics, 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/LindeSchoenmaker/SMILES-corrector">GitHub: LindeSchoenmaker/SMILES-corrector</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{schoenmaker2023uncorrupt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{UnCorrupt SMILES: a novel approach to de novo design}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Schoenmaker, Linde and B{\&#39;e}quignon, Olivier J. M. and Jespers, Willem and van Westen, Gerard J. P.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-023-00696-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Molecular Sets (MOSES): A Generative Modeling Benchmark</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/</link><pubDate>Mon, 16 Feb 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-design/generation/evaluation/molecular-sets-moses/</guid><description>MOSES provides a standardized benchmarking platform for molecular generative models, featuring datasets, metrics, and baselines.</description><content:encoded><![CDATA[<h2 id="the-role-of-moses-a-benchmarking-resource">The Role of MOSES: A Benchmarking Resource</h2>
<p>This is a <strong>Resource and Benchmarking</strong> paper. It introduces Molecular Sets (MOSES), a platform designed to standardize the training, comparison, and evaluation of molecular generative models. It provides a standardized dataset, a suite of evaluation metrics, and a collection of baseline models to serve as reference points for the field.</p>
<h2 id="motivation-the-reproducibility-crisis-in-generative-chemistry">Motivation: The Reproducibility Crisis in Generative Chemistry</h2>
<p>Generative models are increasingly popular for drug discovery and material design, capable of exploring the vast chemical space ($10^{23}$ to $10^{80}$ compounds) more efficiently than traditional methods. However, the field faces a significant reproducibility crisis:</p>
<ol>
<li><strong>Lack of Standardization</strong>: There is no consensus on how to properly compare and rank the efficacy of different generative models.</li>
<li><strong>Inconsistent Metrics</strong>: Different papers use different metrics or distinct implementations of the same metrics.</li>
<li><strong>Data Variance</strong>: Models are often trained on different subsets of chemical databases (like ZINC), making direct comparison impossible.</li>
</ol>
<p>MOSES aims to solve these issues by providing a unified &ldquo;measuring stick&rdquo; for distribution learning models in chemistry.</p>
<h2 id="core-innovation-standardizing-chemical-distribution-learning">Core Innovation: Standardizing Chemical Distribution Learning</h2>
<p>The core contribution is a <strong>standardized definition of distribution learning</strong> for molecular generation. Why focus on distribution learning? Rule-based filters enforce hard boundaries like molecular weight limits; distribution learning complements them by letting chemists impose <strong>implicit or soft restrictions</strong>. Together, the two ensure that generated molecules satisfy hard constraints while also reflecting complex chemical realities defined by the training distribution, such as the prevalence of certain substructures and the avoidance of unstable motifs.</p>
<p>MOSES specifically targets distribution learning by providing:</p>
<ol>
<li><strong>A Clean, Standardized Dataset</strong>: A specific subset of the ZINC Clean Leads collection with rigorous filtering.</li>
<li><strong>Diverse Metrics</strong>: A comprehensive suite of metrics that measure validity alongside novelty, diversity (internal and external), chemical properties (properties distribution), and substructure similarity.</li>
<li><strong>Open Source Platform</strong>: A Python library <code>molsets</code> that decouples the data and evaluation logic from the model implementation, ensuring everyone measures performance exactly the same way.</li>
</ol>
<h2 id="experimental-setup-and-baseline-generative-models">Experimental Setup and Baseline Generative Models</h2>
<p>The authors benchmarked a wide variety of generative models against the MOSES dataset to establish baselines:</p>
<ul>
<li><strong>Baselines</strong>: Character-level RNN (CharRNN), <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">Variational Autoencoder</a> (VAE), Adversarial Autoencoder (AAE), Junction Tree VAE (JTN-VAE), and <a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>.</li>
<li><strong>Non-Neural Baselines</strong>: HMM, n-gram models, and a combinatorial generator (randomly connecting fragments).</li>
<li><strong>Evaluation</strong>: Models were trained on the standard set and evaluated on:
<ul>
<li><strong>Validity/Uniqueness</strong>: Can the model generate valid, non-duplicate SMILES? Uniqueness is measured at $k = 1{,}000$ and $k = 10{,}000$ samples.</li>
<li><strong>Filters</strong>: What fraction of generated molecules pass the same medicinal chemistry and PAINS filters used for dataset construction?</li>
<li><strong>Feature Distribution</strong>: Do generated molecules match the physicochemical properties of the training set? Evaluated using the <strong>Wasserstein-1 distance</strong> on 1D distributions of:
<ul>
<li><strong>LogP</strong>: Octanol-water partition coefficient (lipophilicity).</li>
<li><strong>SA</strong>: Synthetic Accessibility score (ease of synthesis).</li>
<li><strong>QED</strong>: Quantitative Estimation of Drug-likeness.</li>
<li><strong>MW</strong>: Molecular Weight.</li>
</ul>
</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Measures similarity in biological/chemical space using the penultimate-layer (second-to-last layer) activations of a pre-trained network (ChemNet).</li>
<li><strong>Similarity to Nearest Neighbor (SNN)</strong>: Measures the precision of generation by checking the closest match in the training set (Tanimoto similarity).</li>
</ul>
</li>
</ul>
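<p>To make the property-distribution comparison concrete: for two equal-size 1D samples, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values (the quantile coupling). A minimal pure-Python sketch with toy numbers — MOSES itself computes these via its <code>molsets</code> package on RDKit descriptors:</p>

```python
def wasserstein1(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For equal-size samples, the optimal transport plan pairs sorted values,
    so W1 is simply the mean absolute difference of the sorted sequences.
    """
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# e.g. generated vs. test LogP values (toy numbers, not real descriptors)
gen_logp = [1.2, 2.5, 3.1, 0.8]
test_logp = [1.0, 2.4, 3.3, 1.1]
print(wasserstein1(gen_logp, test_logp))
```

<p>A small distance means the generated property histogram closely tracks the test set's; MOSES reports this separately for MW, LogP, SA, and QED.</p>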
<h2 id="key-findings-and-metric-trade-offs">Key Findings and Metric Trade-offs</h2>
<ul>
<li><strong>CharRNN Performance</strong>: The simple character-level RNN (CharRNN) outperformed more complex models (like VAEs and <a href="/posts/what-is-a-gan/">GANs</a>) on many metrics, achieving the best FCD scores ($0.073$).</li>
<li><strong>Metric Trade-offs</strong>: No single metric captures &ldquo;quality.&rdquo;
<ul>
<li>The <strong>Combinatorial Generator</strong> achieved 100% validity and high diversity but struggled with distribution learning metrics (FCD), indicating that it explores chemical space broadly without capturing the natural distribution of the training data.</li>
<li><strong>VAEs</strong> often achieve high <strong>Similarity to Nearest Neighbor (SNN)</strong> while exhibiting low novelty. The authors suggest this pattern may indicate overfitting to training set prototypes, though they treat this as a hypothesis rather than a proven mechanism.</li>
</ul>
</li>
<li><strong>Implicit Constraints</strong>: A major finding was that neural models successfully learned implicit chemical rules (like avoiding <a href="https://en.wikipedia.org/wiki/Pan-assay_interference_compounds">PAINS</a> structures) purely from the data distribution.</li>
<li><strong>Recommendation</strong>: The authors suggest using FCD/Test for general model ranking, while emphasizing the importance of checking specific metrics (validity, diversity) to diagnose model failure modes.</li>
<li><strong>Limitations of the Benchmark</strong>: MOSES focuses on distribution learning and uses FCD as a primary ranking metric. As the authors note, FCD captures multiple aspects of other metrics in a single number but does not give insights into specific issues, so more interpretable metrics are necessary for thorough investigation. The benchmark evaluates only 1D (SMILES) and 2D molecular features, without assessing 3D conformational properties.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The benchmark uses a curated subset of the <strong>ZINC Clean Leads</strong> collection.</p>
<ul>
<li><strong>Source Size</strong>: ~4.6M molecules (4,591,276 after initial extraction).</li>
<li><strong>Final Size</strong>: 1,936,962 molecules.</li>
<li><strong>Splits</strong>: Train (1,584,664), Test (176,075), Scaffold Test (176,226).
<ul>
<li><strong>Scaffold Test Split</strong>: This split is crucial for testing generalization. It contains molecules whose <a href="https://pubs.acs.org/doi/10.1021/jm9602928">Bemis-Murcko scaffolds</a> are <em>completely absent</em> from the training and test sets, so evaluating on it directly probes a model&rsquo;s ability to generate novel chemical structures.</li>
</ul>
</li>
<li><strong>Filters Applied</strong>:
<ul>
<li>Molecular weight: 250 to 350 Da</li>
<li>Rotatable bonds: $\leq 7$</li>
<li>XlogP: $\leq 3.5$</li>
<li>Atom types: C, N, S, O, F, Cl, Br, H</li>
<li>No charged atoms or cycles &gt; 8 atoms</li>
<li>Medicinal Chemistry Filters (MCF) and PAINS filters applied.</li>
</ul>
</li>
</ul>
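<p>The filter criteria above can be expressed as a single predicate. The sketch below is hypothetical: it assumes the descriptors (molecular weight, rotatable bonds, XlogP, element set, charge, largest ring size) have already been computed — in practice via RDKit — and the dictionary keys are illustrative, not the MOSES API. The MCF/PAINS substructure filters are omitted:</p>

```python
ALLOWED_ATOMS = {"C", "N", "S", "O", "F", "Cl", "Br", "H"}

def passes_moses_filters(props: dict) -> bool:
    """Apply the published MOSES dataset filter criteria to precomputed
    descriptors. The `props` keys are hypothetical; a real pipeline would
    derive these values from an RDKit Mol object."""
    return (
        250 <= props["mol_weight"] <= 350          # molecular weight in Da
        and props["rotatable_bonds"] <= 7
        and props["xlogp"] <= 3.5
        and set(props["atoms"]) <= ALLOWED_ATOMS   # no exotic elements
        and not props["has_charged_atoms"]
        and props["largest_ring"] <= 8             # no cycles > 8 atoms
    )

candidate = {"mol_weight": 300.2, "rotatable_bonds": 4, "xlogp": 2.1,
             "atoms": ["C", "N", "O", "H"], "has_charged_atoms": False,
             "largest_ring": 6}
print(passes_moses_filters(candidate))  # True for this toy descriptor set
```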
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>MOSES introduces a standard suite of metrics. Key definitions:</p>
<ul>
<li><strong>Validity</strong>: Fraction of valid <a href="/posts/visualizing-smiles-and-selfies-strings/">SMILES</a> strings (via <a href="https://www.rdkit.org/">RDKit</a>).</li>
<li><strong>Unique@k</strong>: Fraction of unique molecules in the first $k$ valid samples ($k = 1{,}000$ and $k = 10{,}000$).</li>
<li><strong>Filters</strong>: Fraction of generated molecules passing the MCF and PAINS filters used during dataset construction. High scores here indicate the model learned implicit chemical validity constraints from the data distribution.</li>
<li><strong>Novelty</strong>: Fraction of generated molecules not present in the training set.</li>
<li><strong>Internal Diversity (IntDiv)</strong>: Average Tanimoto distance between generated molecules ($G$), useful for detecting mode collapse:
$$ \text{IntDiv}_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p} $$</li>
<li><strong>Fragment Similarity (Frag)</strong>: Cosine similarity of fragment frequency vectors (BRICS decomposition) between generated and test sets.</li>
<li><strong>Scaffold Similarity (Scaff)</strong>: Cosine similarity of Bemis-Murcko scaffold frequency vectors between sets. Measures how well the model captures higher-level structural motifs.</li>
<li><strong>Similarity to Nearest Neighbor (SNN)</strong>: The average Tanimoto similarity between a generated molecule&rsquo;s fingerprint and its nearest neighbor in the reference set. This serves as a measure of precision; high SNN suggests the model produces molecules very similar to the training distribution, potentially indicating memorization if novelty is low.
$$ \text{SNN}(G, R) = \frac{1}{|G|} \sum_{m_G \in G} \max_{m_R \in R} T(m_G, m_R) $$</li>
<li><strong><a href="/notes/chemistry/molecular-design/generation/evaluation/frechet-chemnet-distance/">Fréchet ChemNet Distance</a> (FCD)</strong>: Fréchet distance between the Gaussian approximations (mean and covariance) of penultimate-layer activations from ChemNet. This measures how close the distribution of generated molecules is to the real distribution in chemical/biological space. The authors note that FCD correlates with other metrics: for example, if the generated structures are not diverse enough, or if the model produces too many duplicates, FCD will increase because the generated variance falls below that of the test set. The authors suggest using FCD for hyperparameter tuning and final model selection.
$$ \text{FCD}(G, R) = \|\mu_G - \mu_R\|^2 + \text{Tr}\left(\Sigma_G + \Sigma_R - 2(\Sigma_G \Sigma_R)^{1/2}\right) $$</li>
<li><strong>Properties Distribution (Wasserstein-1)</strong>: The 1D <a href="/posts/what-is-a-gan/#wasserstein-gan-wgan-a-mathematical-revolution">Wasserstein-1 distance</a> between the distributions of molecular properties (MW, LogP, SA, <a href="https://www.nature.com/articles/nchem.1243">QED</a>) in the generated and test sets.</li>
</ul>
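<p>To ground these definitions, here is a hedged pure-Python sketch of several of the metrics, with fingerprints idealized as sets of &ldquo;on&rdquo; bit indices. MOSES itself uses RDKit Morgan fingerprints and, for FCD, 512-dimensional ChemNet activations with full covariance matrices; the 1D Fréchet distance below is a deliberate simplification of the matrix formula:</p>

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def unique_at_k(smiles, k):
    """Unique@k: fraction of distinct molecules among the first k samples."""
    head = smiles[:k]
    return len(set(head)) / len(head)

def internal_diversity(fps, p=1):
    """IntDiv_p(G) = 1 - (mean of T(m1, m2)^p over all ordered pairs)^(1/p)."""
    n = len(fps)
    mean_pow = sum(tanimoto(a, b) ** p for a in fps for b in fps) / n**2
    return 1 - mean_pow ** (1 / p)

def snn(gen_fps, ref_fps):
    """SNN(G, R): average similarity of each generated molecule to its
    nearest neighbor in the reference set."""
    return sum(max(tanimoto(g, r) for r in ref_fps) for g in gen_fps) / len(gen_fps)

def frechet_1d(xs, ys):
    """1D special case of the FCD formula:
    (mu_G - mu_R)^2 + (sigma_G - sigma_R)^2."""
    def stats(v):
        mu = sum(v) / len(v)
        return mu, (sum((x - mu) ** 2 for x in v) / len(v)) ** 0.5
    (mx, sx), (my, sy) = stats(xs), stats(ys)
    return (mx - my) ** 2 + (sx - sy) ** 2

# A mode-collapsed sample: identical fingerprints give IntDiv = 0.
collapsed = [frozenset({1, 2, 3})] * 4
print(internal_diversity(collapsed))  # 0.0
```

<p>The mode-collapse example shows why IntDiv is reported alongside FCD: a generator repeating one good molecule can score well on per-molecule metrics while IntDiv drops to zero.</p>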
<h3 id="models--baselines">Models &amp; Baselines</h3>
<p>The paper selects baselines to represent different theoretical approaches to distribution learning:</p>
<ol>
<li><strong>Explicit Density Models</strong>: Models where the probability mass function $P(x)$ can be computed analytically.
<ul>
<li><strong>N-gram</strong>: Simple statistical models. They failed to generate valid molecules reliably due to limited long-range dependency modeling.</li>
</ul>
</li>
<li><strong>Implicit Density Models</strong>: Models that sample from the distribution without explicitly computing $P(x)$.
<ul>
<li><strong>VAE/AAE</strong>: Optimizes a lower bound on the log-likelihood (ELBO) or uses adversarial training.</li>
<li><strong>GANs (<a href="/notes/chemistry/molecular-design/generation/latent-space/latentgan-de-novo-molecular-generation/">LatentGAN</a>)</strong>: Directly minimizes the distance between real and generated distributions via a discriminator.</li>
</ul>
</li>
</ol>
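<p>The n-gram baseline&rsquo;s weakness is easy to see in code: a character-level bigram model over SMILES conditions each token only on the previous one, so it cannot balance ring-closure digits or parentheses. A minimal, illustrative sketch (not the MOSES implementation, which uses higher-order n-grams):</p>

```python
import random
from collections import defaultdict

def train_bigram(smiles_list):
    """Count character-bigram transitions, with ^ / $ as start/end tokens."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_list:
        tokens = ["^"] + list(s) + ["$"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def sample(counts, rng, max_len=60):
    """Sample one string. Only the previous character conditions each step,
    which is why long-range constraints are routinely violated."""
    out, cur = [], "^"
    for _ in range(max_len):
        options = list(counts[cur])
        weights = [counts[cur][t] for t in options]
        cur = rng.choices(options, weights=weights)[0]
        if cur == "$":
            break
        out.append(cur)
    return "".join(out)

model = train_bigram(["CCO", "CCN", "CCC"])
print(sample(model, random.Random(0)))
```

<p>Trained on real SMILES, such a model emits locally plausible character sequences that frequently fail RDKit parsing — consistent with the low validity the paper reports for n-gram baselines.</p>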
<p>Models are also distinguished by their data representation:</p>
<ul>
<li><strong>String-based (SMILES)</strong>: Models like <strong>CharRNN</strong>, <strong>VAE</strong>, and <strong>AAE</strong> treat molecules as SMILES strings. SMILES encodes a molecular graph by traversing a spanning tree in depth-first order, storing atom and edge tokens.</li>
<li><strong>Graph-based</strong>: <strong>JTN-VAE</strong> operates directly on molecular subgraphs (junction tree), ensuring chemical validity by construction but often requiring more complex training.</li>
</ul>
<p>Key baselines implemented in PyTorch (hyperparameters are detailed in Supplementary Information 3 of the original paper):</p>
<ul>
<li><strong>CharRNN</strong>: LSTM-based sequence model (3 layers, 768 hidden units). Trained with Adam (learning rate $10^{-3}$, batch size 64, 80 epochs, learning rate halved every 10 epochs).</li>
<li><strong>VAE</strong>: Encoder-decoder architectures (bidirectional GRU encoder, 3-layer GRU decoder with 512 hidden units) with KL regularization.</li>
<li><strong>AAE</strong>: Encoder (single-layer bidirectional LSTM with 512 units) and decoder (2-layer LSTM with 512 units) trained with an adversarial formulation.</li>
<li><strong>LatentGAN</strong>: GAN (5-layer fully connected generator) trained on the latent space of a pre-trained heteroencoder.</li>
<li><strong>JTN-VAE</strong>: Tree-structured graph generation.</li>
</ul>
<h3 id="code--hardware-requirements">Code &amp; Hardware Requirements</h3>
<ul>
<li><strong>Code Repository</strong>: Available at <a href="https://github.com/molecularsets/moses">github.com/molecularsets/moses</a> as well as the PyPI library <code>molsets</code>. The platform provides standard scripts (<code>scripts/run.py</code> to evaluate models end-to-end, and <code>scripts/run_all_models.sh</code> for multi-seed evaluations).</li>
<li><strong>Hardware</strong>: The repository supports GPU acceleration via <code>nvidia-docker</code> (defaulting to 10GB shared memory). However, specific training times and exact GPU models used by the authors for the baselines are not formally documented in the source text.</li>
<li><strong>Model Weights</strong>: Pre-trained model checkpoints are not natively pre-packaged as standalone downloads; practitioners are expected to re-train the default baselines using the provided scripts.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/molecularsets/moses">molecularsets/moses</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official benchmark platform with baseline models and evaluation metrics</td>
      </tr>
      <tr>
          <td><a href="https://pypi.org/project/molsets/">molsets (PyPI)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>pip-installable package for dataset access and metric computation</td>
      </tr>
      <tr>
          <td>ZINC Clean Leads subset</td>
          <td>Dataset</td>
          <td>See ZINC terms</td>
          <td>Curated dataset of 1,936,962 molecules distributed via the repository</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Polykovskiy, D., Zhebrak, A., Sanchez-Lengeling, B., Golovanov, S., Tatanov, O., Belyaev, S., Kurbanov, R., Artamonov, A., Aladinskiy, V., Veselov, M., Kadurin, A., Johansson, S., Chen, H., Nikolenko, S., Aspuru-Guzik, A., and Zhavoronkov, A. (2020). Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. <em>Frontiers in Pharmacology</em>, 11, 565644. <a href="https://doi.org/10.3389/fphar.2020.565644">https://doi.org/10.3389/fphar.2020.565644</a></p>
<p><strong>Publication</strong>: Frontiers in Pharmacology, 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{polykovskiy2020moses,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecular Sets (MOSES): A benchmarking platform for molecular generation models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Frontiers in Pharmacology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{565644}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Frontiers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.3389/fphar.2020.565644}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>