
GDB-11: Chemical Universe Database (26.4M Molecules)
GDB-11 contains 26.4 million systematically generated small organic molecules with up to 11 atoms, establishing the methodology for exploring drug-like chemical space computationally.

ICML 2025 methodological paper introducing SPHNet, which uses adaptive network sparsification to overcome the computational bottleneck of tensor products in SE(3)-equivariant networks, achieving up to 7x speedup and 75% memory reduction on DFT Hamiltonian prediction tasks.

Lennard-Jones’s 1932 theoretical paper applying quantum mechanical potential energy surfaces to gas-solid interactions, providing the first unified framework explaining both physisorption and chemisorption as different regions of the same energy landscape.

GDB-13 contains nearly 1 billion systematically generated small organic molecules with up to 13 atoms, achieving billion-scale chemical space exploration while maintaining drug-like properties.

GDB-17 contains 166.4 billion systematically generated small organic molecules with up to 17 atoms. It represents the most comprehensive exploration of drug-relevant chemical space achieved through computational enumeration.

Get a practical overview of the GEOM dataset and learn how it’s advancing 3D molecular machine learning by bridging static graphs and dynamic reality.

A 2025 Vision-Language Model for OCSR that uses graph traversal chain-of-thought reasoning and a two-stage SFT plus GRPO training scheme to handle both printed molecules (including chemical abbreviations like Ph and Et) and hand-drawn structures, achieving strong performance on the new MolRec-Bench benchmark.

SubGrapher introduces a visual fingerprinting approach to Optical Chemical Structure Recognition that detects functional groups directly from images, enabling chemical database searches without full structure reconstruction and handling complex patent images including Markush structures.

Proposes the ‘Optical Chemical Structure Understanding’ (OCSU) task to translate molecular images into multi-level descriptions (motifs, IUPAC, SMILES). Introduces the Vis-CheBI20 dataset and two paradigms: DoubleCheck (OCSR-based) and Mol-VL (OCSR-free).

Proposes Ring-Free Language (RFL) to hierarchically decouple molecular graphs into skeletons, rings, and branches, solving issues with 1D serialization of complex 2D structures. Introduces the Molecular Skeleton Decoder (MSD) to progressively predict these components, achieving strong results on handwritten and printed chemical structure recognition benchmarks.

A 2024 Nature Machine Intelligence paper providing causal evidence that the ability to generate invalid SMILES improves chemical language model performance, because discarding invalid outputs filters out low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.

An end-to-end data factory for molecular machine learning that transforms raw chemical formulas (e.g., C6H14) into labeled 3D conformer datasets, using MAYGEN for structural isomer enumeration, RDKit for 3D embedding, and physics-based featurization to address data scarcity in computational drug discovery.
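
As a rough illustration of the first step in such a pipeline, the sketch below parses a raw formula like C6H14 into element counts and computes its degrees of unsaturation, a quantity that constrains which structural isomers are possible. This is not the data factory's own code: the function names `parse_formula` and `degrees_of_unsaturation` are hypothetical, only flat formulas without brackets or charges are handled, and MAYGEN and RDKit are not invoked here.

```python
import re
from collections import Counter

# One element symbol (e.g. "C", "Cl") followed by an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H14'.

    Assumption: no brackets, charges, or isotopes (hypothetical helper).
    """
    if not formula or not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for element, digits in TOKEN.findall(formula):
        counts[element] += int(digits) if digits else 1
    return counts


def degrees_of_unsaturation(counts: Counter) -> float:
    """Rings plus pi bonds implied by the formula.

    Uses the standard relation (2C + 2 + N - H - X) / 2, which assumes
    usual valences for C, H, N, O, and halogens X.
    """
    halogens = sum(counts[x] for x in ("F", "Cl", "Br", "I"))
    return (2 * counts["C"] + 2 + counts["N"] - counts["H"] - halogens) / 2


if __name__ == "__main__":
    hexane = parse_formula("C6H14")
    print(dict(hexane), degrees_of_unsaturation(hexane))
```

A value of zero, as for C6H14, means every isomer is acyclic and fully saturated, so an enumerator only needs to consider branched and unbranched carbon skeletons before any 3D embedding step.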