Benchmark

GuacaMol: Benchmarking Models for De Novo Molecular Design

GuacaMol provides an open-source benchmarking framework with 5 distribution-learning and 20 goal-directed tasks to standardize evaluation of de novo molecular design models.

Predictive Chemistry

Overview of MoleculeNet dataset categories and task counts across quantum mechanics, physical chemistry, biophysics, and physiology

MoleculeNet: Benchmarking Molecular Machine Learning

MoleculeNet introduces a large-scale benchmark suite for molecular machine learning, curating over 700,000 compounds across 17 datasets with standardized metrics, data splits, and featurization methods integrated into the DeepChem open-source library.

Molecular Generation

Bar chart comparing molecular generative model performance across six evaluation dimensions including validity, safety, and hit rates

MolGenBench: Benchmarking Molecular Generative Models

MolGenBench introduces a comprehensive benchmark for evaluating molecular generative models in realistic drug discovery settings, spanning de novo design and hit-to-lead optimization across 120 protein targets with 220,005 experimentally validated actives.

Molecular Generation

Diagram showing MolScore framework components: scoring functions, evaluation metrics, and benchmark modes

MolScore: Scoring and Benchmarking for Drug Design

MolScore is an open-source framework that unifies scoring functions, evaluation metrics, and benchmarks for generative molecular design, with configurable objectives and GUI support.

Molecular Generation

Sample efficiency curves showing different molecular optimization algorithm families converging at different rates under a fixed oracle budget

PMO: Benchmarking Sample-Efficient Molecular Design

PMO benchmarks 25 molecular optimization methods on 23 oracles under constrained oracle budgets, showing that sample efficiency is a critical and often neglected dimension of evaluation.
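To make "constrained oracle budget" concrete, here is a minimal sketch of a budget-limited scoring wrapper. It is an illustration only, not the benchmark's code: `BudgetedOracle` and `toy_score` are hypothetical names, the scoring function is a stand-in that counts characters, and treating repeated queries as free is an assumption of this sketch.

```python
class BudgetedOracle:
    """Wraps a scoring function and refuses new queries once the budget is spent."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.cache = {}         # candidate -> score; repeated queries are free (assumption)
        self.best_so_far = []   # best score seen after each paid call

    def __call__(self, candidate):
        if candidate in self.cache:
            return self.cache[candidate]
        if len(self.cache) >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        score = self.score_fn(candidate)
        self.cache[candidate] = score
        previous = self.best_so_far[-1] if self.best_so_far else float("-inf")
        self.best_so_far.append(max(previous, score))
        return score


def toy_score(s):
    # Hypothetical stand-in for a property oracle: fraction of 'C' characters.
    return s.count("C") / len(s)


oracle = BudgetedOracle(toy_score, budget=3)
for candidate in ["CCO", "CCC", "CCO", "NNO"]:
    oracle(candidate)  # the second "CCO" is served from the cache
```

Plotting `best_so_far` against the number of paid calls gives a sample-efficiency curve of the kind the figure describes; an optimizer that reaches a high score in fewer calls is more sample efficient.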

Molecular Generation

Spectral performance curve showing model accuracy declining as train-test overlap decreases

SPECTRA: Evaluating Generalizability of Molecular AI

SPECTRA is a framework that generates spectral performance curves to measure how ML model accuracy degrades as train-test overlap decreases across molecular sequencing tasks.

Predictive Chemistry

QSPR surface roughness comparison across molecular representations, showing smooth fingerprint surfaces versus rougher pretrained model surfaces

ROGI-XD: Roughness of Pretrained Molecular Representations

This paper introduces ROGI-XD, a reformulation of the ROuGhness Index that enables fair comparison of QSPR surface roughness across molecular representations of different dimensionalities. Evaluating VAE, GIN, ChemBERTa, and ChemGPT representations, the authors show that pretrained chemical models do not produce smoother structure-property landscapes than simple molecular fingerprints or descriptors.

Molecular Generation

Diagram showing a genetic algorithm for molecules where a parent albuterol molecule undergoes mutation to produce two child molecules, with a selection and repeat loop

Genetic Algorithms as Baselines for Molecule Generation

This position paper demonstrates that genetic algorithms (GAs) perform surprisingly well on molecular generation benchmarks, often outperforming complex deep learning methods. The authors propose the GA criterion: new molecule generation algorithms should demonstrate a clear advantage over GAs.
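The mutate, select, and repeat loop of such a baseline can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: it operates on plain strings rather than molecules, and `toy_fitness`, `mutate`, `run_ga`, and the four-letter alphabet are all hypothetical choices made for this example.

```python
import random

ALPHABET = "CNOF"  # hypothetical symbol set; real GAs edit molecular graphs or strings


def toy_fitness(s):
    # Hypothetical objective: number of 'C' characters (not a chemical property).
    return s.count("C")


def mutate(s, rng):
    # Replace one randomly chosen position with a random symbol.
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]


def run_ga(population, generations, n_children, rng):
    size = len(population)
    for _ in range(generations):
        # Mutation: each child is a mutated copy of a randomly chosen parent.
        children = [mutate(rng.choice(population), rng) for _ in range(n_children)]
        # Elitist selection: keep the best `size` individuals among parents and children.
        population = sorted(population + children, key=toy_fitness, reverse=True)[:size]
    return population


rng = random.Random(0)
start = ["NNNNNN", "OOOOOO", "FFFFFF", "NOFNOF"]
final = run_ga(start, generations=50, n_children=8, rng=rng)
```

Because selection keeps parents alongside children, the best fitness in the population never decreases from one generation to the next, which is part of why this kind of loop makes a hard-to-beat baseline.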

Molecular Generation

Bar chart comparing SMINA docking scores of CVAE, GVAE, and REINVENT against a random ZINC 10% baseline across eight protein targets

SMINA Docking Benchmark for De Novo Drug Design Models

This paper proposes a benchmark for de novo drug design using SMINA docking scores across eight drug targets, revealing that popular generative models fail to outperform random ZINC subsets.

Molecular Generation

2D structure of a phenyl-quaterthiophene, a conjugated organic molecule representative of the photovoltaic donor materials benchmarked in the Tartarus platform

Tartarus: Realistic Inverse Molecular Design Benchmarks

Tartarus introduces a modular suite of realistic molecular design benchmarks grounded in computational chemistry simulations. Benchmarking eight generative models reveals that no single algorithm dominates all tasks, and simple genetic algorithms often outperform deep generative models.

Predictive Chemistry

Diagram of the tied two-way transformer architecture with shared encoder, retro and forward decoders, latent variables, and cycle consistency, alongside USPTO-50K accuracy and validity results

Tied Two-Way Transformers for Diverse Retrosynthesis

This paper couples a retrosynthesis transformer with a forward reaction transformer through parameter sharing, cycle consistency checks, and multinomial latent variables. The combined approach reduces top-1 SMILES invalidity to 0.1% on USPTO-50K, improves top-10 accuracy to 78.5%, and achieves 87.3% pathway coverage on a multi-pathway in-house dataset.

Molecular Representations

BARTSmiles ablation study summary showing impact of pre-training strategies on downstream task performance

BARTSmiles: BART Pre-Training for Molecular SMILES

BARTSmiles pre-trains a BART-large model on 1.7 billion SMILES strings from ZINC20 and achieves the best reported results on 11 classification, regression, and generation benchmarks.