Benchmark

Bar chart showing retrieval accuracy of chemical language models across four SMILES augmentation types

AMORE: Testing ChemLLM Robustness to SMILES Variants

Introduces AMORE, an embedding-based retrieval framework that evaluates whether chemical language models can recognize the same molecule across different SMILES representations. Results show current models are not robust to identity-preserving augmentations.

Computational Chemistry

Heatmap showing LLM accuracy across nine chemistry coding task categories for four models, with green indicating high accuracy and red indicating low accuracy

Benchmarking Chemistry Knowledge in Code-Gen LLMs

A benchmark of 84 chemistry coding tasks evaluating code-generating LLMs like Codex, showing 72% accuracy with prompt engineering strategies that improve performance by 30 percentage points.

Computational Chemistry

Bar chart comparing LLM, DeBERTa, GCN, and GIN performance on three OGB molecular classification benchmarks

Benchmarking LLMs for Molecular Property Prediction

Benchmarks large language models on six molecular property prediction datasets, finding that LLMs lag behind GNNs but can augment ML models when used collaboratively.

Predictive Chemistry

Bar chart comparing fixed molecular representations (RF, SVM, XGBoost) against learned representations (MolBERT, GROVER) across six property prediction benchmarks under scaffold split

Benchmarking Molecular Property Prediction at Scale

This study trains over 62,000 models to systematically evaluate molecular representations and models for property prediction, finding that traditional ML on fixed descriptors often outperforms deep learning approaches.

Computational Chemistry

Radar chart comparing LLM and human chemist performance across chemistry topics in ChemBench

ChemBench: Evaluating LLM Chemistry Against Experts

ChemBench introduces an automated benchmark of 2,700+ chemistry questions to evaluate LLMs against human expert chemists, revealing that frontier models outperform domain experts on average while struggling with basic tasks and confidence calibration.

Computational Chemistry

Hierarchical pyramid showing ChemEval's four evaluation levels from basic knowledge QA to scientific knowledge deduction

ChemEval: Fine-Grained LLM Evaluation for Chemistry

ChemEval is a four-level, 62-task benchmark for evaluating LLMs across chemical knowledge, literature understanding, molecular reasoning, and scientific deduction, revealing that general LLMs excel at comprehension while chemistry-specific models perform better on domain tasks.

Computational Chemistry

Bar chart comparing LLM safety and quality scores across chemistry benchmark tasks

ChemSafetyBench: Benchmarking LLM Safety in Chemistry

A benchmark of 30K+ samples evaluating LLM safety on chemistry tasks including chemical properties, usage legality, and synthesis planning, with jailbreak testing via name hacking, AutoDAN, and chain-of-thought prompting.

Molecular Generation

Stylized visualization of protein-ligand docking and benchmark performance bars across five drug targets

DOCKSTRING: Docking-Based Benchmarks for Drug Design

DOCKSTRING bundles an AutoDock Vina wrapper, a 260K-molecule docking dataset across 58 protein targets, and pharmaceutically relevant benchmarks for regression, virtual screening, and de novo design.

Molecular Generation

Diagram showing divergence between optimization score and control scores during molecular optimization

Failure Modes in Molecule Generation & Optimization

Identifies failure modes in molecular generative models, showing that trivial edits fool distribution-learning benchmarks and that ML-based scoring functions introduce exploitable model-specific and data-specific biases during goal-directed optimization.

Molecular Generation

Two Gaussian distributions in ChemNet activation space with the Frechet distance shown between them

Frechet ChemNet Distance for Molecular Generation

Introduces the Frechet ChemNet Distance (FCD), a single metric that captures chemical validity, biological relevance, and diversity of generated molecules by comparing distributions of learned ChemNet representations.

Molecular Generation

Grid of six GuacaMol benchmark target molecules: Celecoxib, Troglitazone, Thiothixene, Aripiprazole, Osimertinib, and Sitagliptin

GuacaMol: Benchmarking Models for De Novo Molecular Design

GuacaMol provides an open-source benchmarking framework with 5 distribution-learning and 20 goal-directed tasks to standardize evaluation of de novo molecular design models.

Predictive Chemistry

Overview of MoleculeNet dataset categories and task counts across quantum mechanics, physical chemistry, biophysics, and physiology

MoleculeNet: Benchmarking Molecular Machine Learning

MoleculeNet introduces a large-scale benchmark suite for molecular machine learning, curating over 700,000 compounds across 17 datasets with standardized metrics, data splits, and featurization methods integrated into the DeepChem open-source library.