Optical Chemical Structure Recognition
Overview of the ChemReader pipeline for extracting chemical structures from raster images using Hough transform and OCR

ChemReader: Automated Structure Extraction

This paper presents ChemReader, a fully automated optical structure recognition tool that converts raster images of chemical diagrams into machine-readable formats. It introduces a modified Hough transform for bond detection and a chemical spell checker that improves OCR accuracy from 66% to 87%.

Optical Chemical Structure Recognition
Chemical structure diagram for optical recognition

OSRA: Open Source Optical Structure Recognition

This paper presents OSRA, the first open-source utility for converting graphical chemical structures from documents into machine-readable formats (SMILES/SD). It outlines a pipeline combining existing image processing tools with custom heuristics for bond and atom detection, establishing a foundation for accessible chemical information extraction.

Optical Chemical Structure Recognition
Five-stage pipeline for reconstructing chemical molecules from raster images

Reconstruction of Chemical Molecules from Images

This methodological paper proposes a comprehensive pipeline to digitize chemical structure images. It achieves 97% reconstruction accuracy on benchmarks by combining a topology-preserving vectorizer with a chemical knowledge validation module.

Molecular Representations
D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, compiling 86 tautomeric transformation rules (20 existing, 66 new) and validating them across 400M+ structures to inform algorithmic improvements for InChI V2.

Optical Chemical Structure Recognition
Segment-based chemical structure recognition pipeline for low-quality patent images with touching characters and broken lines

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the challenging low-quality images in Japanese patent applications, using segment-based methods to handle pervasive problems like touching characters, merged atom labels with bonds, and broken lines.

Optical Chemical Structure Recognition
Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Conformer Ensemble Learning Benchmark

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Molecular Simulation
Müller-Brown Potential Energy Surface showing the three minima and two saddle points

The Müller-Brown Potential: A 2D Benchmark Surface

A two-dimensional analytical potential energy surface introduced in 1979 for testing optimization algorithms. It features three minima and curved transition pathways that evaluate an algorithm’s ability to navigate non-trivial topologies.

Molecular Representations
Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40

The Number of Isomeric Hydrocarbons of the Methane Series

A foundational 1931 paper that derives exact recursive formulas for counting alkane structural isomers, correcting historical errors and establishing the first systematic enumeration up to C₄₀.

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations Dataset

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.

Molecular Simulation
Müller-Brown Potential Energy Surface showing the three minima and two saddle points

Implementing the Müller-Brown Potential in PyTorch

Step-by-step implementation of the classic Müller-Brown potential in PyTorch, with performance comparisons between analytical and automatic differentiation approaches for molecular dynamics and machine learning applications.

Molecular Simulation
Müller-Brown Potential Energy Surface showing the three minima and two saddle points

Müller-Brown Potential: A PyTorch ML Testbed

A PyTorch testbed for the Müller-Brown potential energy surface, built as ground truth for ML-in-molecular-dynamics work. It pairs analytical and autograd force kernels (the analytical path compiled with torch.compile) with a BAOAB Langevin sampler validated against the canonical distribution and a two-tier test suite.