Computational Chemistry

SwinOCSR: Vision Transformers for Chemical OCR

Proposes an end-to-end architecture replacing standard CNN backbones with Swin Transformer to capture global image context. Introduces Multi-label Focal Loss to handle severe token imbalance in chemical datasets.

Computational Chemistry

Deep Learning for Molecular Structure Extraction

This paper presents a two-stage deep learning pipeline to extract chemical structures from documents and convert them to SMILES strings. By training on large-scale synthetic data, the method overcomes the brittleness of rule-based systems and demonstrates high accuracy even on low-resolution and noisy input images.

Computational Chemistry
A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. This addresses the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Computational Chemistry
Chemical diagram showing a generalized Grignard reaction

RInChI: Reaction International Chemical Identifier

A 2018 infrastructure paper introducing RInChI (Reaction InChI), the first standardized format for uniquely identifying chemical reactions through algorithmic hashing and layering, enabling reaction database searching and duplicate detection analogous to how InChI works for individual molecules.

Computational Chemistry
The transformation from a 2D chemical structure image to a SMILES representation

What is Optical Chemical Structure Recognition (OCSR)?

Discover how OCSR technology bridges the gap between molecular images and machine-readable data, evolving from rule-based systems to modern deep learning models for chemical knowledge extraction.

Computational Chemistry
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples to significantly enhance model robustness.

Computational Chemistry
Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2024 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Computational Chemistry
ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: A Multi-Billion Scale Database for Ligand Discovery

ZINC-22 is a multi-billion-scale public database containing over 37 billion make-on-demand molecules. It utilizes distributed infrastructure and specialized search algorithms to support modern ultra-large virtual screening campaigns.

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Representation & Conformers

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.

Computational Chemistry
GDB-11 molecule structure showing FC1C2OC1c3c(F)coc23

GDB-11: Chemical Universe Database (26.4M Molecules)

GDB-11 contains 26.4 million systematically generated small organic molecules with up to 11 atoms, establishing the methodology for exploring drug-like chemical space computationally.

Computational Chemistry
GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

GDB-13 contains nearly 1 billion systematically generated small organic molecules with up to 13 atoms, achieving billion-scale chemical space exploration while maintaining drug-like properties.