Datasets

Thymol molecular structure diagram for Staker deep learning OCSR

Deep Learning for Molecular Structure Extraction (2019)

This paper presents a two-stage deep learning pipeline to extract chemical structures from documents and convert them to SMILES strings. By training on large-scale synthetic data, the method overcomes the brittleness of rule-based systems and demonstrates high accuracy even on low-resolution and noisy input images.

Computational Chemistry

A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. This addresses the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Computational Chemistry

The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

Discover how OCSR technology bridges the gap between molecular images and machine-readable data, evolving from rule-based systems to modern deep learning models for chemical knowledge extraction.

Computational Chemistry

A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples.

Computational Chemistry

Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Computational Chemistry

ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: A Multi-Billion Scale Database for Ligand Discovery

ZINC-22 is a multi-billion-scale public database containing over 37 billion make-on-demand molecules. It utilizes distributed infrastructure and specialized search algorithms to support modern ultra-large virtual screening campaigns.

Computational Chemistry

MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Conformer Ensemble Learning Benchmark

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Computational Chemistry

GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations Dataset

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.

Computational Chemistry

GDB-11 molecule structure showing FC1C2OC1c3c(F)coc23

GDB-11: Chemical Universe Database (26.4M Molecules)

GDB-11 contains 26.4 million systematically generated small organic molecules with up to 11 atoms, establishing the methodology for exploring drug-like chemical space computationally.

Computational Chemistry

GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

GDB-13 contains nearly 1 billion systematically generated small organic molecules with up to 13 atoms, achieving billion-scale chemical space exploration while maintaining drug-like properties.

Computational Chemistry

GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166.4B Molecules)

GDB-17 contains 166.4 billion systematically generated small organic molecules with up to 17 atoms. It represents the most comprehensive exploration of drug-relevant chemical space achieved through computational enumeration.

Computational Chemistry

3D conformer ensemble of a drug-like molecule from the GEOM dataset

GEOM Dataset: 3D Molecular Conformer Generation

Get a practical overview of the GEOM dataset and learn how it’s advancing 3D molecular machine learning by bridging static graphs and dynamic reality.