Datasets

Diagram showing AllChem's combinatorial synthon assembly pipeline: 7,000 building blocks transformed by 100 reactions into 5 million synthons, which combine in A-B-C topology to represent 10^20 structures

AllChem: Generating and Searching 10^20 Structures

AllChem generates ~5 million synthons by recursively applying ~100 reactions to ~7,000 building blocks, combinatorially representing up to 10^20 complete structures with an A-B-C topology. Topomer shape similarity enables efficient searching of this space, and every hit comes with a proposed synthetic route.
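The 10^20 figure can be sanity-checked with a simple counting argument. The sketch below assumes, purely for illustration, that any of the ~5 million synthons may occupy each of the three positions in the A-B-C topology. The real count depends on which synthons are chemically compatible, which is not modeled here. The function name `abc_upper_bound` is hypothetical.

```python
import math


def abc_upper_bound(n_synthons: int, positions: int = 3) -> int:
    """Upper bound on assembled structures, ASSUMING every position
    may hold any synthon (no compatibility rules are applied)."""
    return n_synthons ** positions


# ~5 million synthons, three positions (A, B, C)
bound = abc_upper_bound(5_000_000)
order = math.floor(math.log10(bound))
print(f"{bound:.3e} structures, order of magnitude 10^{order}")
```

Under this assumption the bound is 1.25 x 10^20, which matches the "up to 10^20" order of magnitude quoted above.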

Predictive Chemistry

CHX8 enumeration pipeline from 77,524 structures to 31,497 stable molecules, example strained scaffolds with RSE values, and box plots of relative strain energy distribution by heavy atom count

CHX8: Complete Eight-Carbon Hydrocarbon Space

CHX8 exhaustively enumerates all mathematically feasible hydrocarbons with up to eight carbon atoms (77,524 structures), then DFT-optimizes them to identify 31,497 stable molecules. A universal relative strain energy (RSE) metric referenced to cyclohexane serves as a synthesizability proxy. CHX8 covers 16x more C8 hydrocarbons than GDB-13 and reveals that over 90% of novel structures should be synthetically accessible.

Computational Chemistry

FDB-17 filtering pipeline from GDB-17 (166.4B) through fragment filters (4.6B) to even sampling (10M), with bar charts comparing size distribution and Fsp3 shape complexity against commercial fragments

FDB-17: Fragment Database (10M Molecules)

FDB-17 contains 10 million fragment-like molecules selected from GDB-17’s 166.4 billion entries. Fragment-likeness filters reduce GDB-17 by 36x to 4.6 billion molecules, then even sampling across (HAC, heteroatoms, stereocenters) triplets produces a 460x further reduction to a manageable, diverse library enriched in 3D-shaped molecules.
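The two reduction factors and the idea of even sampling can be illustrated with a short sketch. It assumes each molecule is a record carrying its heavy atom count, heteroatom count, and stereocenter count, and that "even sampling" means drawing at most a fixed quota from each (HAC, heteroatoms, stereocenters) bin. The actual FDB-17 sampling rules may differ; the field names, `even_sample`, and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def reduction_factor(before: float, after: float) -> float:
    """How many times smaller the set becomes after a filtering step."""
    return before / after


def even_sample(molecules, key, per_bin, seed=0):
    """Draw at most `per_bin` molecules from every bin defined by `key`.

    ASSUMPTION: a fixed per-bin quota; this is an illustrative scheme,
    not necessarily the one used to build FDB-17.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for mol in molecules:
        bins[key(mol)].append(mol)
    sample = []
    for bin_key in sorted(bins):
        members = bins[bin_key]
        # Small bins are kept whole; large bins are subsampled.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


# Reduction factors from the figures quoted in the text.
filter_factor = reduction_factor(166.4e9, 4.6e9)   # GDB-17 -> fragment-like
sampling_factor = reduction_factor(4.6e9, 10e6)    # fragment-like -> FDB-17

# Toy data: three bins of very different sizes (hypothetical records).
toy = (
    [{"hac": 10, "hetero": 2, "stereo": 0}] * 100
    + [{"hac": 12, "hetero": 3, "stereo": 1}] * 3
    + [{"hac": 17, "hetero": 4, "stereo": 2}] * 50
)
sample = even_sample(
    toy, key=lambda m: (m["hac"], m["hetero"], m["stereo"]), per_bin=5
)
print(round(filter_factor), round(sampling_factor), len(sample))
```

With a per-bin quota, overpopulated bins no longer dominate the library, which is one way a sampled subset can end up more diverse than a uniform random draw.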

Computational Chemistry

GDBMedChem pipeline from GDB-17 through medicinal chemistry filters to 10M molecules, with Venn diagram showing 97% unique substructures and property comparison against known drugs

GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)

GDBMedChem applies medicinal chemistry-inspired functional group and structural complexity filters to GDB-17, reducing 166.4 billion molecules to 17.8 billion, then evenly samples across molecular size, stereochemistry, and polarity to produce 10 million drug-like molecules. Of its substructures, 97% are absent from known molecule databases.

Computational Chemistry

Simulated QM9 property landscape scatter plot of HOMO-LUMO gap vs dipole moment, colored by heavy atom count, with example molecules rendered alongside

QM9: Quantum Chemistry Properties of 134k Molecules

QM9 provides B3LYP/6-31G(2df,p)-level geometric, energetic, electronic, and thermodynamic properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) drawn from the GDB-17 chemical universe. It is one of the most widely used benchmarks in molecular machine learning.

Predictive Chemistry

Grid of heteroaromatic ring systems rendered with RDKit, showing known ring systems in blue-tinted panels and predicted tractable rings in amber-tinted panels

VEHICLe: Heteroaromatic Rings of the Future

VEHICLe (Virtual Exploratory Heterocyclic Library) is a complete enumeration of 24,867 mono- and bicyclic heteroaromatic ring systems built from C, N, O, S, and H. Of these, only 1,701 have ever appeared in published compounds. A random forest classifier trained on known vs. unknown ring systems predicts that over 3,000 additional ring systems are synthetically tractable.

Computational Chemistry

VQM24 overview showing 9 included elements with valencies, combinatorial scaling of molecular geometries with heavy atom count, and ML learning curves comparing VQM24 vs QM9 difficulty

VQM24: 836k Molecules at DFT and Diffusion QMC

VQM24 exhaustively enumerates all neutral closed-shell molecules with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, Br, yielding 258k constitutional isomers and 578k conformers (836k total). Properties are computed at the ωB97X-D3/cc-pVDZ level, with diffusion QMC energies for 10,793 molecules with up to 4 heavy atoms. ML models show up to 8x higher errors than on QM9, making VQM24 a more challenging benchmark.

Natural Language Processing

Bar chart comparing average benchmark accuracy across seven domain combination configurations showing diversity improves performance

SlimPajama-DC: Data Combinations for LLM Training

Shen et al. empirically analyze how different domain combinations and deduplication strategies in the SlimPajama dataset affect the performance of 1.3B-parameter models. Global deduplication (across all sources) outperforms local deduplication (within each source separately), and increasing domain diversity consistently improves average accuracy, with findings transferring to the 7B-parameter scale.
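The difference between local and global deduplication can be shown with a toy sketch. It assumes exact matching on a whitespace- and case-normalized hash, which is a simplification standing in for whatever near-duplicate detection is used in practice. The source names and helper functions are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of the lowercased, whitespace-normalized text.
    ASSUMPTION: exact-match hashing as a stand-in for near-duplicate detection."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_local(sources):
    """Remove duplicates within each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # reset per source
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def dedup_global(sources):
    """Remove duplicates across all sources with one shared set of fingerprints."""
    seen = set()  # shared by every source
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical two-source corpus with one cross-source duplicate.
corpus = {
    "web": ["The cat sat.", "the  cat sat.", "Unique web page."],
    "code": ["print('hi')", "The cat sat."],
}
local_total = sum(len(docs) for docs in dedup_local(corpus).values())
global_total = sum(len(docs) for docs in dedup_global(corpus).values())
print(local_total, global_total)
```

Local deduplication leaves the copy of "The cat sat." in the second source because each source keeps its own record of seen documents. Global deduplication removes it, so the same text is not counted in two domains.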

Computational Chemistry

ChemLLM pipeline from ChemData structured templates through fine-tuned InternLM2 to ChemBench evaluation

ChemLLM: A Chemical Large Language Model Framework

ChemLLM presents a comprehensive framework for chemistry-specific language modeling, including a 7M-sample instruction tuning dataset (ChemData), a 4,100-question benchmark (ChemBench), and a two-stage fine-tuned model that matches GPT-4 on core chemical tasks.

Computational Chemistry

Radar chart comparing LLM and human chemist performance across chemistry topics in ChemBench

ChemBench: Evaluating LLM Chemistry Against Experts

ChemBench introduces an automated benchmark of 2,700+ chemistry questions to evaluate LLMs against human expert chemists, revealing that frontier models outperform domain experts on average while struggling with basic tasks and confidence calibration.

Molecular Generation

Stylized visualization of protein-ligand docking and benchmark performance bars across five drug targets

DOCKSTRING: Docking-Based Benchmarks for Drug Design

DOCKSTRING bundles an AutoDock Vina wrapper, a 260K-molecule docking dataset across 58 protein targets, and pharmaceutically relevant benchmarks for regression, virtual screening, and de novo design.

Predictive Chemistry

Overview of MoleculeNet dataset categories and task counts across quantum mechanics, physical chemistry, biophysics, and physiology

MoleculeNet: Benchmarking Molecular Machine Learning

MoleculeNet introduces a large-scale benchmark suite for molecular machine learning, curating over 700,000 compounds across 17 datasets with standardized metrics, data splits, and featurization methods integrated into the DeepChem open-source library.