Evaluation, Benchmarks & Surveys

Benchmark suites, scoring frameworks, evaluation studies, and surveys of the molecular generation field.

Benchmark Suites & Scoring

Paper	Year	Key Idea
GuacaMol	2019	Distribution-learning and goal-directed generation benchmarks
MOSES	2020	Distribution-learning benchmark with curated ZINC subset and distributional metrics
FCD	2018	Adapts FID from image generation to molecules using learned chemical embeddings
PMO	2022	Sample-efficient molecular optimization benchmark comparing 25 methods under a fixed oracle budget
MolScore	2024	Unified scoring framework wrapping objectives from GuacaMol, MOSES, and others
Tartarus	2023	Realistic inverse design benchmarks using physics-based oracles (DFT, xTB)
SPECTRA	2025	Out-of-domain generalizability evaluation via spectral analysis
MolGenBench	2025	Evaluation across distribution learning, property optimization, and constrained optimization

Docking Benchmarks

Paper	Year	Key Idea
DOCKSTRING	2022	Docking-based benchmarks for ligand design with precomputed scores
SMINA Benchmark	2023	SMINA docking evaluation on realistic binding tasks

Failure Analysis & Tools

Paper	Year	Key Idea
Failure Modes	2019	Trivial models fool distribution-learning metrics; ML scoring functions have exploitable biases
Sample Efficiency	2022	Property filters and diversity metrics substantially re-rank model performance
Avoiding Failure Modes	2022	Apparent failures stem from QSAR model disagreement, not algorithmic exploitation
UnCorrupt SMILES	2023	Transformer-based corrector recovers 60-95% of invalid generator outputs

Surveys & Reviews

Paper	Year	Key Idea
Deep Learning for Molecular Design	2019	Survey of RNNs, VAEs, GANs, and RL approaches with SMILES and graph representations
CLMs for De Novo Drug Design	2023	Review of chemical language models covering architectures and training strategies
Inverse Molecular Design	2018	Review of VAE, GAN, and RL approaches for navigating chemical space
RNNs vs Transformers	2023	Empirical comparison of RNN and Transformer architectures for molecular generation
MolGenSurvey	2022	Survey across 1D string, 2D graph, and 3D geometry representations
Generative AI Drug Design	2024	Comprehensive survey covering VAEs, GANs, diffusion, and flow models

Computational Chemistry

Two-panel plot showing score divergence with disagreeing classifiers vs convergence with agreeing classifiers

Avoiding Failure Modes in Goal-Directed Generation

Shows that divergence between optimization and control scores during goal-directed molecular generation is explained by pre-existing disagreement among QSAR models on the training distribution, not by algorithmic exploitation of model-specific biases.
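
One way to make "pre-existing disagreement" concrete is to compare two QSAR models' predictions on the same held-out molecules before any optimization is run. The sketch below is an illustration only, not the paper's procedure: the synthetic scores, the noise levels, and the helper name `model_agreement` are all assumptions.

```python
import numpy as np

def model_agreement(scores_a, scores_b):
    """Pearson correlation between two models' predicted scores
    for the same set of molecules."""
    return float(np.corrcoef(scores_a, scores_b)[0, 1])

rng = np.random.default_rng(0)
signal = rng.normal(size=500)  # hypothetical shared activity signal

# Stand-ins for model predictions on held-out molecules (synthetic data).
optimization_model = signal + 0.1 * rng.normal(size=500)
agreeing_control = signal + 0.1 * rng.normal(size=500)
disagreeing_control = 0.2 * signal + rng.normal(size=500)

high = model_agreement(optimization_model, agreeing_control)
low = model_agreement(optimization_model, disagreeing_control)
print(f"agreeing pair: {high:.2f}, disagreeing pair: {low:.2f}")
```

If the control model already correlates poorly with the optimization model on the training distribution, a gap between optimization and control scores during generation would be expected even without any exploitation by the generator.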

Grouped bar chart showing CLM architectures (RNN, VAE, GAN, Transformer) across generation strategies

Chemical Language Models for De Novo Drug Design Review

A minireview of chemical language models for de novo molecule design, covering SMILES and SELFIES representations, RNN and Transformer architectures, distribution learning, goal-directed and conditional generation, and prospective experimental validation.

Taxonomy diagram showing four generative model families (VAE, GAN, Diffusion, Flow) connecting to small molecule generation and protein generation subtasks

Generative AI Survey for De Novo Molecule and Protein Design

This survey organizes generative AI for de novo drug design into two themes: small molecule generation (target-agnostic, target-aware, conformation) and protein generation (structure prediction, sequence generation, backbone design, antibody, peptide). It covers four generative model families (VAEs, GANs, diffusion, flow-based), catalogs key datasets and benchmarks, and provides 12 comparative benchmark tables across all subtasks.

Schematic of inverse molecular design paradigm mapping desired properties to molecular structures through generative models

Inverse Molecular Design with ML Generative Models

A foundational review surveying how deep generative models (VAEs, GANs, reinforcement learning) enable inverse molecular design, covering molecular representations, chemical space navigation, and applications from drug discovery to materials engineering.

Bar chart comparing PMO benchmark scores with and without chemical quality filters across five generative methods

Re-evaluating Sample Efficiency in Molecule Generation

A critical reassessment of the PMO benchmark for de novo molecule generation, showing that adding molecular weight, LogP, and diversity filters substantially re-ranks generative models, with Augmented Hill-Climb emerging as the top method.
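
The re-ranking effect can be illustrated with a toy top-k score computed with and without a property filter. This is a minimal sketch under stated assumptions: the candidate values, the thresholds in `drug_like`, and both function names are hypothetical and are not taken from the paper or from the PMO code.

```python
def top_k_mean(candidates, k, keep=lambda c: True):
    """Mean oracle score of the k best candidates that pass `keep`."""
    kept = sorted((c["score"] for c in candidates if keep(c)), reverse=True)[:k]
    return sum(kept) / len(kept) if kept else 0.0

def drug_like(c, mw_range=(150.0, 650.0), logp_max=4.5):
    """Hypothetical quality filter; the thresholds are illustrative only."""
    return mw_range[0] <= c["mw"] <= mw_range[1] and c["logp"] <= logp_max

# Invented candidates: method A scores high but with implausible properties.
method_a = [
    {"score": 0.95, "mw": 820.0, "logp": 7.1},
    {"score": 0.90, "mw": 790.0, "logp": 6.8},
    {"score": 0.40, "mw": 310.0, "logp": 2.2},
]
method_b = [
    {"score": 0.80, "mw": 350.0, "logp": 3.0},
    {"score": 0.75, "mw": 410.0, "logp": 2.5},
    {"score": 0.70, "mw": 390.0, "logp": 3.4},
]

raw_a, raw_b = top_k_mean(method_a, 2), top_k_mean(method_b, 2)
filt_a = top_k_mean(method_a, 2, keep=drug_like)
filt_b = top_k_mean(method_b, 2, keep=drug_like)
print(raw_a > raw_b, filt_a > filt_b)
```

In this toy data, method A leads on raw scores but falls behind once molecules outside the filter are discarded, which is the kind of rank change the entry describes.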

Bar chart showing deep generative architecture types for molecular design: RNN, VAE, GAN, RL, and hybrid methods

Review: Deep Learning for Molecular Design (2019)

An early and influential review cataloging 45 papers on deep generative modeling for molecules, comparing RNN, VAE, GAN, and reinforcement learning architectures across SMILES and graph-based representations.

Bar chart comparing RNN and Transformer Wasserstein distances across drug-like, peptide-like, and polymer-like generation tasks

RNNs vs Transformers for Molecular Generation Tasks

Compares RNN-based and Transformer-based chemical language models across three molecular generation tasks of increasing complexity, finding that RNNs excel at local features while Transformers handle large molecules better.
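
The figure description for this entry mentions Wasserstein distances. As a rough illustration of that metric, the sketch below computes the 1D Wasserstein-1 distance between two equal-size samples of a scalar property (for example, molecular weight of training versus generated molecules). The function name and the equal-size restriction are assumptions made for brevity, and this is not the paper's evaluation code.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1D samples.

    For equal-size empirical distributions, the optimal transport plan
    pairs the sorted values, so the distance is the mean absolute
    difference between order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal-size samples")
    return float(np.mean(np.abs(x - y)))

same = wasserstein_1d([1.0, 2.0, 3.0], [3.0, 1.0, 2.0])
shifted = wasserstein_1d([1.0, 2.0, 3.0], [3.0, 4.0, 2.0])
print(same, shifted)
```

A lower value means the generated property distribution sits closer to the reference distribution.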

Stylized visualization of protein-ligand docking and benchmark performance bars across five drug targets

DOCKSTRING: Docking-Based Benchmarks for Drug Design

DOCKSTRING bundles an AutoDock Vina wrapper, a 260K-molecule docking dataset across 58 protein targets, and pharmaceutically relevant benchmarks for regression, virtual screening, and de novo design.

Diagram showing divergence between optimization score and control scores during molecular optimization

Failure Modes in Molecule Generation & Optimization

Identifies failure modes in molecular generative models, showing that trivial edits fool distribution-learning benchmarks and that ML-based scoring functions introduce exploitable model-specific and data-specific biases during goal-directed optimization.

Two Gaussian distributions in ChemNet activation space with the Fréchet distance shown between them

Fréchet ChemNet Distance for Molecular Generation

Introduces the Fréchet ChemNet Distance (FCD), a single metric that captures chemical validity, biological relevance, and diversity of generated molecules by comparing distributions of learned ChemNet representations.
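
The distance itself is the Fréchet distance between two Gaussians fitted to activation vectors. The sketch below shows that computation in its usual squared form, assuming the activations have already been extracted. Random arrays stand in for ChemNet activations here, and the function name is hypothetical rather than part of any FCD package.

```python
import numpy as np

def frechet_distance(act_ref, act_gen):
    """Squared Frechet distance between Gaussians fitted to two sets of
    activation vectors, each of shape (n_molecules, n_features)."""
    mu1, mu2 = act_ref.mean(axis=0), act_gen.mean(axis=0)
    cov1 = np.cov(act_ref, rowvar=False)
    cov2 = np.cov(act_gen, rowvar=False)
    # Tr((cov1 cov2)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov1 @ cov2, which are real and non-negative for
    # positive semi-definite covariances (clipped to absorb round-off).
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
ref = rng.normal(size=(2000, 8))   # placeholder for reference activations
gen_same = ref.copy()              # identical distribution
gen_shifted = ref + 1.0            # every feature mean moved by 1

d_same = frechet_distance(ref, gen_same)
d_shifted = frechet_distance(ref, gen_shifted)
print(d_same, d_shifted)
```

Identical sets give a distance near zero, while shifting all 8 feature means by 1 leaves the covariance unchanged and contributes only the squared mean difference.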

Grid of six GuacaMol benchmark target molecules: Celecoxib, Troglitazone, Thiothixene, Aripiprazole, Osimertinib, and Sitagliptin

GuacaMol: Benchmarking Models for De Novo Molecular Design

GuacaMol provides an open-source benchmarking framework with 5 distribution-learning and 20 goal-directed tasks to standardize evaluation of de novo molecular design models.
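
GuacaMol's distribution-learning tasks include validity, uniqueness, and novelty alongside KL divergence and FCD. The sketch below shows the commonly used definitions of the first three. It assumes SMILES parsing and canonicalization happened upstream (failed parses are marked as `None`) and does not reproduce GuacaMol's sampling sizes or implementation; the function name is hypothetical.

```python
def distribution_metrics(generated, training):
    """Return (validity, uniqueness, novelty) as fractions.

    generated: list of canonical SMILES strings, with None marking
               outputs that failed to parse.
    training:  iterable of canonical SMILES from the training set.
    """
    valid = [s for s in generated if s is not None]
    unique = set(valid)
    novel = unique - set(training)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty

generated = ["CCO", "CCO", "c1ccccc1", None, "CCN"]
training = ["CCO", "CC"]
validity, uniqueness, novelty = distribution_metrics(generated, training)
print(validity, uniqueness, novelty)
```

Here 4 of 5 outputs are valid, 3 of those 4 are unique, and 2 of the 3 unique molecules do not appear in the training set.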

Bar chart comparing molecular generative model performance across six evaluation dimensions including validity, safety, and hit rates

MolGenBench: Benchmarking Molecular Generative Models

MolGenBench introduces a comprehensive benchmark for evaluating molecular generative models in realistic drug discovery settings, spanning de novo design and hit-to-lead optimization across 120 protein targets with 220,005 experimentally validated actives.
