Computational Chemistry
Radar chart comparing LLM and human chemist performance across chemistry topics in ChemBench

ChemBench: Evaluating LLM Chemistry Against Experts

ChemBench is an automated benchmark of more than 2,700 chemistry questions for evaluating LLMs against human expert chemists. It finds that frontier models outperform domain experts on average yet struggle with basic tasks and with confidence calibration.

Hierarchical pyramid showing ChemEval's four evaluation levels from basic knowledge QA to scientific knowledge deduction

ChemEval: Fine-Grained LLM Evaluation for Chemistry

ChemEval is a four-level, 62-task benchmark for evaluating LLMs across chemical knowledge, literature understanding, molecular reasoning, and scientific deduction, revealing that general LLMs excel at comprehension while chemistry-specific models perform better on domain tasks.

Bar chart comparing LLM safety and quality scores across chemistry benchmark tasks

ChemSafetyBench: Benchmarking LLM Safety in Chemistry

ChemSafetyBench is a benchmark of 30K+ samples that evaluates LLM safety on three chemistry tasks: chemical properties, usage legality, and synthesis planning. It also tests jailbreaks via name hacking, AutoDAN, and chain-of-thought prompting.

Stylized visualization of protein-ligand docking and benchmark performance bars across five drug targets

DOCKSTRING: Docking-Based Benchmarks for Drug Design

DOCKSTRING bundles an AutoDock Vina wrapper, a 260K-molecule docking dataset across 58 protein targets, and pharmaceutically relevant benchmarks for regression, virtual screening, and de novo design.

Diagram showing divergence between optimization score and control scores during molecular optimization

Failure Modes in Molecule Generation & Optimization

This paper identifies failure modes in molecular generative models. It shows that trivial edits fool distribution-learning benchmarks and that ML-based scoring functions introduce exploitable model-specific and data-specific biases during goal-directed optimization.

Two Gaussian distributions in ChemNet activation space with the Frechet distance shown between them

Frechet ChemNet Distance for Molecular Generation

This paper introduces the Frechet ChemNet Distance (FCD), a single metric that captures the chemical validity, biological relevance, and diversity of generated molecules by comparing the distributions of their learned ChemNet representations with those of reference molecules.
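
As a reminder of what the "Frechet distance" between two Gaussians looks like, the standard closed form is written out below. The symbols are assumptions chosen for illustration: m_g and C_g are the mean and covariance of ChemNet activations for generated molecules, and m_r and C_r are the same statistics for reference molecules.

```latex
d^2\big((m_g, C_g), (m_r, C_r)\big)
  = \lVert m_g - m_r \rVert_2^2
  + \operatorname{Tr}\!\left( C_g + C_r - 2\,(C_g C_r)^{1/2} \right)
```

The distance is zero when the two Gaussians coincide and grows as either the means or the covariances drift apart.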

Comparison bar chart showing penalized logP scores for GB-GA, GB-GM-MCTS, and ML-based molecular optimization methods

Graph-Based GA and MCTS Generative Model for Molecules

This paper presents a graph-based genetic algorithm (GB-GA) and a graph-based generative model with Monte Carlo tree search (GB-GM-MCTS) for molecular optimization. Both match or outperform ML-based generative approaches while being orders of magnitude faster.

Grid of six GuacaMol benchmark target molecules: Celecoxib, Troglitazone, Thiothixene, Aripiprazole, Osimertinib, and Sitagliptin

GuacaMol: Benchmarking Models for De Novo Molecular Design

GuacaMol provides an open-source benchmarking framework with 5 distribution-learning and 20 goal-directed tasks to standardize evaluation of de novo molecular design models.
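
To give a feel for what a distribution-learning task measures, here is a minimal sketch of two commonly used quantities, uniqueness and novelty. It is not GuacaMol's own code: the function names are hypothetical, and it assumes the input SMILES have already been validated and canonicalized (for example with RDKit), so that string equality means molecule equality.

```python
def uniqueness(valid_smiles):
    """Fraction of valid generated molecules that are distinct.

    Assumes every entry is a valid, canonical SMILES string.
    """
    if not valid_smiles:
        return 0.0
    return len(set(valid_smiles)) / len(valid_smiles)


def novelty(valid_smiles, training_smiles):
    """Fraction of distinct generated molecules absent from the training set."""
    unique = set(valid_smiles)
    if not unique:
        return 0.0
    return len(unique - set(training_smiles)) / len(unique)


# Toy data, invented for illustration only.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO"]
print(uniqueness(generated))
print(novelty(generated, training))
```

Both scores lie between 0 and 1; a model that memorizes its training set can score well on uniqueness while scoring poorly on novelty.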

Overview of MoleculeNet dataset categories and task counts across quantum mechanics, physical chemistry, biophysics, and physiology

MoleculeNet: Benchmarking Molecular Machine Learning

MoleculeNet introduces a large-scale benchmark suite for molecular machine learning, curating over 700,000 compounds across 17 datasets with standardized metrics, data splits, and featurization methods integrated into the DeepChem open-source library.
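
One split type associated with molecular benchmarks of this kind is the scaffold split, which keeps molecules sharing a core structure in the same subset. The sketch below illustrates the idea only and is not DeepChem's implementation: `scaffold_split` and its parameters are hypothetical names, and the scaffold key for each molecule is assumed to have been computed beforehand (for example as a Murcko scaffold SMILES).

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split molecule indices so that each scaffold lands in one subset.

    scaffolds: one precomputed scaffold key per molecule (assumed input).
    Returns (train, valid, test) lists of indices.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest scaffold groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Toy scaffold keys, invented for illustration only.
keys = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
train, valid, test = scaffold_split(keys)
print(len(train), len(valid), len(test))
```

Because rare scaffolds end up in the validation and test subsets, this kind of split is generally harder than a random split and gives a more conservative estimate of generalization.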

Bar chart comparing molecular generative model performance across six evaluation dimensions including validity, safety, and hit rates

MolGenBench: Benchmarking Molecular Generative Models

MolGenBench introduces a comprehensive benchmark for evaluating molecular generative models in realistic drug discovery settings, spanning de novo design and hit-to-lead optimization across 120 protein targets with 220,005 experimentally validated actives.

Diagram showing MolScore framework components: scoring functions, evaluation metrics, and benchmark modes

MolScore: Scoring and Benchmarking for Drug Design

MolScore is an open-source framework that unifies scoring functions, evaluation metrics, and benchmarks for generative molecular design, with configurable objectives and GUI support.

Scatter plot showing molecules ranked by perplexity score with color coding for task-relevant (positive delta) versus pretraining-biased (negative delta) generations

Perplexity for Molecule Ranking and CLM Bias Detection

This study applies perplexity, a model-intrinsic metric from NLP, to rank de novo molecular designs generated by SMILES-based chemical language models and introduces a delta score to detect pretraining bias in transfer-learned CLMs.
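
A minimal sketch of the two quantities follows. Perplexity is computed with its standard definition, the exponential of the negative mean token log-likelihood. The delta score is written here as a rank difference between the pretrained and fine-tuned models, which is an assumption made for illustration rather than the study's confirmed formula; all function names and the toy numbers are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity of one SMILES string from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def rank_by_perplexity(scores):
    """Map each molecule to its rank, where rank 1 is the lowest perplexity."""
    ordered = sorted(scores, key=scores.get)
    return {smiles: i + 1 for i, smiles in enumerate(ordered)}


def delta_scores(finetuned_ppl, pretrained_ppl):
    """Assumed form of the delta score: pretrained rank minus fine-tuned rank.

    Positive values mean the fine-tuned model ranks the molecule better
    than the pretrained model does.
    """
    ft = rank_by_perplexity(finetuned_ppl)
    pt = rank_by_perplexity(pretrained_ppl)
    return {smiles: pt[smiles] - ft[smiles] for smiles in ft}


# Toy perplexities, invented for illustration only.
finetuned = {"A": 1.2, "B": 3.0, "C": 2.0}
pretrained = {"A": 4.0, "B": 1.5, "C": 2.0}
print(delta_scores(finetuned, pretrained))
```

Under this assumed sign convention, molecule "A" gets a positive delta because only the fine-tuned model finds it likely, while "B" gets a negative delta because the pretrained model already favored it.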