Computational Chemistry
Radar chart comparing LLM and human chemist performance across chemistry topics in ChemBench

ChemBench: Evaluating LLM Chemistry Against Experts

ChemBench introduces an automated benchmark of 2,700+ chemistry questions to evaluate LLMs against human expert chemists, revealing that frontier models outperform domain experts on average while struggling with basic tasks and confidence calibration.

Hierarchical pyramid showing ChemEval's four evaluation levels from basic knowledge QA to scientific knowledge deduction

ChemEval: Fine-Grained LLM Evaluation for Chemistry

ChemEval is a four-level, 62-task benchmark for evaluating LLMs across chemical knowledge, literature understanding, molecular reasoning, and scientific deduction, revealing that general LLMs excel at comprehension while chemistry-specific models perform better on domain tasks.

Bar chart comparing LLM safety and quality scores across chemistry benchmark tasks

ChemSafetyBench: Benchmarking LLM Safety in Chemistry

ChemSafetyBench is a benchmark of 30K+ samples that evaluates LLM safety on chemistry tasks including chemical properties, usage legality, and synthesis planning, with jailbreak testing via name hacking, AutoDAN, and chain-of-thought prompting.

Stylized visualization of protein-ligand docking and benchmark performance bars across five drug targets

DOCKSTRING: Docking-Based Benchmarks for Drug Design

DOCKSTRING bundles an AutoDock Vina wrapper, a 260K-molecule docking dataset across 58 protein targets, and pharmaceutically relevant benchmarks for regression, virtual screening, and de novo design.

Diagram showing divergence between optimization score and control scores during molecular optimization

Failure Modes in Molecule Generation & Optimization

Identifies failure modes in molecular generative models, showing that trivial edits fool distribution-learning benchmarks and that ML-based scoring functions introduce exploitable model-specific and data-specific biases during goal-directed optimization.

Two Gaussian distributions in ChemNet activation space with the Fréchet distance shown between them

Fréchet ChemNet Distance for Molecular Generation

Introduces the Fréchet ChemNet Distance (FCD), a single metric that captures chemical validity, biological relevance, and diversity of generated molecules by comparing distributions of learned ChemNet representations.

Grid of six GuacaMol benchmark target molecules: Celecoxib, Troglitazone, Thiothixene, Aripiprazole, Osimertinib, and Sitagliptin

GuacaMol: Benchmarking Models for De Novo Molecular Design

GuacaMol provides an open-source benchmarking framework with 5 distribution-learning and 20 goal-directed tasks to standardize evaluation of de novo molecular design models.

Overview of MoleculeNet dataset categories and task counts across quantum mechanics, physical chemistry, biophysics, and physiology

MoleculeNet: Benchmarking Molecular Machine Learning

MoleculeNet introduces a large-scale benchmark suite for molecular machine learning, curating over 700,000 compounds across 17 datasets with standardized metrics, data splits, and featurization methods integrated into the DeepChem open-source library.

Bar chart comparing molecular generative model performance across six evaluation dimensions including validity, safety, and hit rates

MolGenBench: Benchmarking Molecular Generative Models

MolGenBench introduces a comprehensive benchmark for evaluating molecular generative models in realistic drug discovery settings, spanning de novo design and hit-to-lead optimization across 120 protein targets with 220,005 experimentally validated actives.

Diagram showing MolScore framework components: scoring functions, evaluation metrics, and benchmark modes

MolScore: Scoring and Benchmarking for Drug Design

MolScore is an open-source framework that unifies scoring functions, evaluation metrics, and benchmarks for generative molecular design, with configurable objectives and GUI support.

Sample efficiency curves showing different molecular optimization algorithm families converging at different rates under a fixed oracle budget

PMO: Benchmarking Sample-Efficient Molecular Design

PMO is a large-scale benchmark of 25 molecular optimization methods on 23 oracles under constrained oracle budgets, showing that sample efficiency is a critical and often neglected dimension of evaluation.

Spectral performance curve showing model accuracy declining as train-test overlap decreases

SPECTRA: Evaluating Generalizability of Molecular AI

Introduces SPECTRA, a framework that generates spectral performance curves to measure how ML model accuracy degrades as train-test overlap decreases across molecular sequencing tasks.