Computational Chemistry
Diagram showing back translation workflow with forward and reverse models mapping between source and target molecular domains, augmented by unlabeled ZINC molecules

Back Translation for Semi-Supervised Molecule Generation

Adapts back translation from NLP to molecular generation, using unlabeled molecules from ZINC to create synthetic training pairs that improve property optimization and retrosynthesis prediction across Transformer and graph-based architectures.

Computational Chemistry
Heatmap showing LLM accuracy across nine chemistry coding task categories for four models, with green indicating high accuracy and red indicating low accuracy

Benchmarking Chemistry Knowledge in Code-Gen LLMs

A benchmark of 84 chemistry coding tasks evaluating code-generating LLMs like Codex, showing 72% accuracy with prompt engineering strategies that improve performance by 30 percentage points.

Computational Chemistry
Bar chart comparing LLM, DeBERTa, GCN, and GIN performance on three OGB molecular classification benchmarks

Benchmarking LLMs for Molecular Property Prediction

Benchmarks large language models on six molecular property prediction datasets, finding that LLMs lag behind GNNs but can augment ML models when used collaboratively.

Computational Chemistry
Bar chart comparing fixed molecular representations (RF, SVM, XGBoost) against learned representations (MolBERT, GROVER) across six property prediction benchmarks under scaffold split

Benchmarking Molecular Property Prediction at Scale

This study trains over 62,000 models to systematically evaluate molecular representations and models for property prediction, finding that traditional ML on fixed descriptors often outperforms deep learning approaches.

Computational Chemistry
Radar chart comparing LLM and human chemist performance across chemistry topics in ChemBench

ChemBench: Evaluating LLM Chemistry Against Experts

ChemBench introduces an automated benchmark of 2,700+ chemistry questions to evaluate LLMs against human expert chemists, revealing that frontier models outperform domain experts on average while struggling with basic tasks and confidence calibration.

Computational Chemistry
Hierarchical pyramid showing ChemEval's four evaluation levels from basic knowledge QA to scientific knowledge deduction

ChemEval: Fine-Grained LLM Evaluation for Chemistry

ChemEval is a four-level, 62-task benchmark for evaluating LLMs across chemical knowledge, literature understanding, molecular reasoning, and scientific deduction, revealing that general LLMs excel at comprehension while chemistry-specific models perform better on domain tasks.

Computational Chemistry
Bar chart comparing LLM safety and quality scores across chemistry benchmark tasks

ChemSafetyBench: Benchmarking LLM Safety in Chemistry

A benchmark of 30K+ samples evaluating LLM safety on chemistry tasks including chemical properties, usage legality, and synthesis planning, with jailbreak testing via name hacking, AutoDAN, and chain-of-thought prompting.

Computational Chemistry
Stylized visualization of protein-ligand docking and benchmark performance bars across five drug targets

DOCKSTRING: Docking-Based Benchmarks for Drug Design

DOCKSTRING bundles an AutoDock Vina wrapper, a 260K-molecule docking dataset across 58 protein targets, and pharmaceutically relevant benchmarks for regression, virtual screening, and de novo design.

Computational Chemistry
Diagram showing divergence between optimization score and control scores during molecular optimization

Failure Modes in Molecule Generation & Optimization

Identifies failure modes in molecular generative models, showing that trivial edits fool distribution-learning benchmarks and that ML-based scoring functions introduce exploitable model-specific and data-specific biases during goal-directed optimization.

Computational Chemistry
Two Gaussian distributions in ChemNet activation space with the Frechet distance shown between them

Frechet ChemNet Distance for Molecular Generation

Introduces the Frechet ChemNet Distance (FCD), a single metric that captures chemical validity, biological relevance, and diversity of generated molecules by comparing distributions of learned ChemNet representations.

Computational Chemistry
Comparison bar chart showing penalized logP scores for GB-GA, GB-GM-MCTS, and ML-based molecular optimization methods

Graph-Based GA and MCTS Generative Model for Molecules

A graph-based genetic algorithm (GB-GA) and a graph-based generative model with Monte Carlo tree search (GB-GM-MCTS) for molecular optimization that match or outperform ML-based generative approaches while being orders of magnitude faster.

Computational Chemistry
Grid of six GuacaMol benchmark target molecules: Celecoxib, Troglitazone, Thiothixene, Aripiprazole, Osimertinib, and Sitagliptin

GuacaMol: Benchmarking Models for De Novo Molecular Design

GuacaMol provides an open-source benchmarking framework with 5 distribution-learning and 20 goal-directed tasks to standardize evaluation of de novo molecular design models.