Molecular Databases & Datasets

Dataset cards covering large-scale molecular enumeration and make-on-demand databases (GDB-11/13/17, ZINC-22) and GDB-17 subsets (FDB-17, GDBMedChem) for virtual screening and drug discovery, quantum chemistry property datasets (QM9, VQM24), and conformer ensemble datasets (GEOM, MARCEL) for molecular property prediction and 3D modeling.

Year	Dataset	Key Idea
2007	GDB-11: Chemical Universe Database (26.4M Molecules)	Systematic enumeration of 26.4M small organic molecules up to 11 heavy atoms
2009	GDB-13: Chemical Universe Database (970M Molecules)	Extension to 970M molecules up to 13 heavy atoms
2012	GDB-17: Chemical Universe Database (166.4B Molecules)	Largest enumeration database with 166.4B molecules up to 17 heavy atoms
2014	QM9: Quantum Chemistry Properties of 134k Molecules	DFT-computed properties for 134k small organic molecules (up to 9 heavy atoms) from GDB-17
2017	FDB-17: Fragment Database (10M Molecules)	10M fragment-like molecules evenly sampled from GDB-17 across size, polarity, and complexity
2019	GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)	10M drug-like molecules from GDB-17 filtered by medicinal chemistry criteria, 97% novel substructures
2022	GEOM: Energy-Annotated Molecular Conformations Dataset	Energy-annotated molecular conformer ensembles for 3D modeling
2023	ZINC-22: A Multi-Billion Scale Database for Ligand Discovery	Over 37B make-on-demand molecules for virtual screening
2024	MARCEL: Molecular Conformer Ensemble Learning Benchmark	722K+ conformers across 76K+ molecules for conformer ensemble learning
2025	VQM24: 836k Molecules at DFT and Diffusion QMC	Exhaustive enumeration of 836k molecules (9 elements, up to 5 heavy atoms) with DFT and DMC properties

Computational Chemistry

FDB-17 filtering pipeline from GDB-17 (166.4B) through fragment filters (4.6B) to even sampling (10M), with bar charts comparing size distribution and Fsp3 shape complexity against commercial fragments

FDB-17: Fragment Database (10M Molecules)

FDB-17 contains 10 million fragment-like molecules selected from GDB-17’s 166.4 billion entries. Fragment-likeness filters reduce GDB-17 36-fold to 4.6 billion molecules. Even sampling across (heavy atom count, heteroatom count, stereocenter count) triplets then gives a further 460-fold reduction, producing a manageable, diverse library enriched in 3D-shaped molecules.
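The even-sampling step can be illustrated with a minimal Python sketch. It assumes each molecule is a dictionary with precomputed `hac` (heavy atom count), `heteroatoms`, and `stereocenters` values, and it caps every triplet bin at a fixed size. The field names, the `per_bin` cap, and the toy records are illustrative assumptions, not the published FDB-17 procedure.

```python
import random
from collections import defaultdict


def even_sample(records, per_bin, seed=0):
    """Keep at most `per_bin` records from each (hac, heteroatoms, stereocenters) bin."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bins = defaultdict(list)
    for rec in records:
        key = (rec["hac"], rec["heteroatoms"], rec["stereocenters"])
        bins[key].append(rec)

    sampled = []
    for key in sorted(bins):
        members = bins[key]
        if len(members) <= per_bin:
            sampled.extend(members)  # sparse bins are kept whole
        else:
            sampled.extend(rng.sample(members, per_bin))  # crowded bins are thinned
    return sampled


# Toy input: one crowded bin (50 records) and one sparse bin (3 records).
toy = [
    {"smiles": f"A{i}", "hac": 17, "heteroatoms": 3, "stereocenters": 0}
    for i in range(50)
] + [
    {"smiles": f"B{i}", "hac": 10, "heteroatoms": 1, "stereocenters": 2}
    for i in range(3)
]

subset = even_sample(toy, per_bin=5)
print(len(subset))
```

Capping each bin keeps rare triplets in full while thinning heavily populated ones, which flattens the distribution across size, polarity, and stereochemical complexity.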

GDBMedChem pipeline from GDB-17 through medicinal chemistry filters to 10M molecules, with Venn diagram showing 97% unique substructures and property comparison against known drugs

GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)

GDBMedChem applies medicinal chemistry-inspired functional group and structural complexity filters to GDB-17, reducing 166.4 billion molecules to 17.8 billion, then evenly samples across molecular size, stereochemistry, and polarity to produce 10 million drug-like molecules. 97% of its substructures are absent from known molecule databases.
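The size of each reduction follows directly from the counts quoted above. A short Python sketch makes the arithmetic explicit; the helper name `reduction_factors` is hypothetical.

```python
def reduction_factors(stage_counts):
    """Return the fold reduction between each pair of consecutive pipeline stages."""
    return [before / after for before, after in zip(stage_counts, stage_counts[1:])]


# Stage sizes quoted in the text: GDB-17, after filters, after even sampling.
gdbmedchem_stages = [166.4e9, 17.8e9, 10e6]
filter_fold, sampling_fold = reduction_factors(gdbmedchem_stages)

print(f"filters:  {filter_fold:.1f}x")
print(f"sampling: {sampling_fold:.0f}x")
```

With these figures the filters give roughly a 9.3-fold reduction, and even sampling keeps about one in 1,780 of the surviving molecules.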

Simulated QM9 property landscape scatter plot of HOMO-LUMO gap vs dipole moment, colored by heavy atom count, with example molecules rendered alongside

QM9: Quantum Chemistry Properties of 134k Molecules

QM9 provides B3LYP/6-31G(2df,p)-level geometric, energetic, electronic, and thermodynamic properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) drawn from the GDB-17 chemical universe. It is one of the most widely used benchmarks in molecular machine learning.
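As a small illustration of working with electronic properties of this kind, the sketch below derives a HOMO-LUMO gap from two orbital energies and converts it from hartree to electronvolts. It assumes the energies are given in hartree. The input values are placeholders chosen for illustration, not QM9 entries, and the function name is hypothetical.

```python
HARTREE_TO_EV = 27.211386  # conversion factor, rounded to six decimals


def homo_lumo_gap_ev(homo_hartree, lumo_hartree):
    """Return the HOMO-LUMO gap in eV from orbital energies given in hartree."""
    gap_hartree = lumo_hartree - homo_hartree
    if gap_hartree < 0:
        raise ValueError("LUMO energy must not lie below HOMO energy")
    return gap_hartree * HARTREE_TO_EV


# Placeholder orbital energies (hartree), for illustration only.
gap_ev = homo_lumo_gap_ev(homo_hartree=-0.25, lumo_hartree=0.05)
print(f"{gap_ev:.2f} eV")
```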

VQM24 overview showing 9 included elements with valencies, combinatorial scaling of molecular geometries with heavy atom count, and ML learning curves comparing VQM24 vs QM9 difficulty

VQM24: 836k Molecules at DFT and Diffusion QMC

VQM24 exhaustively enumerates all neutral closed-shell molecules with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, Br, yielding 258k constitutional isomers and 578k conformers (836k total). Properties are computed at the ωB97X-D3/cc-pVDZ level, with diffusion quantum Monte Carlo (DMC) energies for 10,793 molecules up to 4 heavy atoms. ML models show up to 8x higher errors than on QM9, making VQM24 a more challenging benchmark.
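To give a feel for how quickly the enumeration space grows with heavy atom count, the following Python sketch counts unordered heavy-atom compositions drawn from the nine elements. It is only a combinatorial illustration: it ignores valence rules, hydrogens, connectivity, and conformers, so the counts are not the number of VQM24 molecules.

```python
from itertools import combinations_with_replacement

ELEMENTS = ["C", "N", "O", "F", "Si", "P", "S", "Cl", "Br"]


def composition_counts(max_heavy_atoms):
    """Count unordered heavy-atom compositions for each size from 1 to max_heavy_atoms."""
    counts = {}
    for n in range(1, max_heavy_atoms + 1):
        # Each multiset of n elements is one composition, e.g. ('C', 'C', 'O').
        counts[n] = sum(1 for _ in combinations_with_replacement(ELEMENTS, n))
    return counts


counts = composition_counts(5)
for n, count in counts.items():
    print(n, count)
```

Even before bonding patterns and geometries are considered, the number of candidate compositions rises from 9 for one heavy atom to 1,287 for five.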

ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: A Multi-Billion Scale Database for Ligand Discovery

ZINC-22 is a public database of over 37 billion make-on-demand molecules. It relies on distributed infrastructure and specialized search algorithms to support ultra-large virtual screening campaigns.
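A database of this size is typically browsed in partitions rather than as one list. The sketch below groups molecules into tranches and counts each one, assuming a tranche is keyed by heavy-atom count and a calculated-logP bin. The bin width, the key format, the field names, and the toy records are all hypothetical and are not taken from the ZINC-22 schema.

```python
import math
from collections import Counter


def tranche_key(heavy_atoms, logp, logp_bin_width=0.5):
    """Map a molecule to a (heavy_atoms, logP bin index) key; the bin width is an assumption."""
    return (heavy_atoms, math.floor(logp / logp_bin_width))


def tranche_counts(molecules):
    """Count how many molecules fall into each tranche."""
    return Counter(tranche_key(m["heavy_atoms"], m["logp"]) for m in molecules)


# Toy records with made-up property values, for illustration only.
toy = [
    {"id": "mol-a", "heavy_atoms": 20, "logp": 1.2},
    {"id": "mol-b", "heavy_atoms": 20, "logp": 1.4},
    {"id": "mol-c", "heavy_atoms": 20, "logp": 2.6},
    {"id": "mol-d", "heavy_atoms": 25, "logp": -0.3},
]

counts = tranche_counts(toy)
for key in sorted(counts):
    print(key, counts[key])
```

Partitioning by simple physical properties lets a screening campaign select only the tranches it needs and process them independently.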

MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Conformer Ensemble Learning Benchmark

MARCEL is a benchmark for molecular representation learning with 722K+ conformers of 76K+ molecules across four subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.
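One simple way to turn a conformer ensemble into a single molecule-level representation is to pool per-conformer feature vectors. The sketch below uses weighted mean pooling in plain Python. The feature vectors, weights, and function name are hypothetical, and this is one baseline strategy rather than a method the benchmark prescribes.

```python
def pool_conformers(features, weights=None):
    """Weighted mean of per-conformer feature vectors; uniform weights by default."""
    if not features:
        raise ValueError("at least one conformer is required")
    n = len(features)
    if weights is None:
        weights = [1.0] * n
    if len(weights) != n:
        raise ValueError("one weight per conformer is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    dim = len(features[0])
    pooled = [0.0] * dim
    for vec, w in zip(features, weights):
        if len(vec) != dim:
            raise ValueError("all feature vectors must have the same length")
        for j, x in enumerate(vec):
            pooled[j] += (w / total) * x  # normalise weights on the fly
    return pooled


# Three conformers with two-dimensional placeholder features.
ensemble = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = pool_conformers(ensemble)
weighted = pool_conformers(ensemble, weights=[2.0, 1.0, 1.0])
print(uniform, weighted)
```

Comparing a pooled-ensemble model against a single-conformer model on the same task is the kind of question a conformer ensemble benchmark is meant to answer.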

GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations Dataset

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.
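Energy annotations make it possible to weight conformers by their thermal population. The following sketch computes Boltzmann weights from relative conformer energies, assuming energies in kcal/mol and a temperature of 298.15 K. The energies are placeholder values and the function name is hypothetical.

```python
import math

GAS_CONSTANT_KCAL = 0.0019872041  # kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature_k=298.15):
    """Return normalised Boltzmann populations for a list of conformer energies."""
    if not energies_kcal:
        raise ValueError("at least one energy is required")
    kt = GAS_CONSTANT_KCAL * temperature_k
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    partition = sum(factors)
    return [f / partition for f in factors]


# Placeholder relative energies (kcal/mol), for illustration only.
weights = boltzmann_weights([0.0, 0.5, 2.0])
print([round(w, 3) for w in weights])
```

Lower-energy conformers receive larger weights, so an ensemble average computed this way is dominated by the conformers a molecule is most likely to adopt.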

GDB-11 molecule structure showing FC1C2OC1c3c(F)coc23

GDB-11: Chemical Universe Database (26.4M Molecules)

GDB-11 contains 26.4 million systematically generated small organic molecules with up to 11 heavy atoms, establishing the methodology for exploring drug-like chemical space computationally.
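The size limit refers to non-hydrogen atoms, which can be checked on the example structures shown on this page. The sketch below counts heavy atoms with a regular expression. It is a simplification that only handles bracket-free SMILES over common organic-subset elements; a cheminformatics toolkit is the robust choice for real data.

```python
import re

# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(smiles):
    """Count non-hydrogen atoms in a bracket-free SMILES string."""
    if "[" in smiles or "]" in smiles:
        raise ValueError("bracket atoms are not supported by this sketch")
    return len(_ATOM.findall(smiles))


gdb11_example = "FC1C2OC1c3c(F)coc23"   # GDB-11 structure shown on this page
gdb13_example = "CCCC(O)(CO)CC1CC1CN"   # GDB-13 structure shown on this page

print(heavy_atom_count(gdb11_example))
print(heavy_atom_count(gdb13_example))
```

Ring-closure digits, parentheses, and bond symbols are skipped by the pattern, so only atom symbols contribute to the count.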

GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

GDB-13 contains 970 million systematically generated small organic molecules with up to 13 heavy atoms, reaching near-billion-scale chemical space exploration while maintaining drug-like properties.

GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166.4B Molecules)

GDB-17 contains 166.4 billion systematically generated small organic molecules with up to 17 heavy atoms. It is the largest of the GDB enumerations and the most comprehensive computational exploration of drug-relevant chemical space among them.