Computational Chemistry
A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples.

Computational Chemistry
Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Computational Chemistry
Aspirin molecular structure generated from SMILES string

Converting SMILES and SELFIES to 2D Molecular Images

Build a robust Python CLI tool that converts both SMILES and SELFIES notation into publication-quality 2D molecular images, complete with formulas and legends.

Computational Chemistry
SELFIES representation of 2-Fluoroethenimine molecule

SELFIES: A Robust Molecular String Representation

SELFIES is a molecular string representation where every possible string decodes to a valid molecule, solving the invalid-output problem that limits SMILES in generative machine learning.

Computational Chemistry
MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Conformer Ensemble Learning Benchmark

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Computational Chemistry
Benzene molecule with SMILES notation

SMILES: A Compact Notation for Chemical Structures

Comprehensive overview of SMILES notation for chemical structures, covering syntax for atoms, bonds, branches, rings, and stereochemistry, plus its key limitations for machine learning.

Computational Chemistry
GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations Dataset

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.

Computational Chemistry
Potential energy surface showing molecular conformation space with equilibrium and low energy conformations

DenoiseVAE: Adaptive Noise for Molecular Pre-training

ICLR 2025 paper introducing DenoiseVAE, which learns adaptive, atom-specific noise distributions through a VAE framework to improve denoising-based pre-training for molecular force field prediction, outperforming fixed Gaussian noise approaches on quantum chemistry benchmarks.

Computational Chemistry
Adaptive grid merging visualization for benzene molecule showing multi-resolution spatial discretization

Beyond Atoms: 3D Space Modeling for Molecular Pretraining

ICML 2025 paper introducing SpaceFormer, a Transformer architecture that challenges the atom-centric paradigm by modeling the continuous 3D space surrounding molecules using adaptive multi-resolution grids, ranking first in 10 of 15 molecular property prediction tasks.

Computational Chemistry
3D conformer ensemble of a drug-like molecule from the GEOM dataset

GEOM Dataset: 3D Molecular Conformer Generation

Get a practical overview of the GEOM dataset and learn how it’s advancing 3D molecular machine learning by bridging static graphs and dynamic reality.

Computational Chemistry
Markush structure diagram

SubGrapher: Visual Fingerprinting of Chemical Structures

SubGrapher introduces a visual fingerprinting approach to Optical Chemical Structure Recognition that detects functional groups directly from images, enabling chemical database searches without full structure reconstruction and handling complex patent images including Markush structures.

Computational Chemistry
Diagram showing how Ring-Free Language decouples a molecular graph into skeleton, ring structures, and branch information

RFL: Simplifying Chemical Structure Recognition (AAAI 2025)

Proposes Ring-Free Language (RFL) to hierarchically decouple molecular graphs into skeletons, rings, and branches, solving issues with 1D serialization of complex 2D structures. Introduces the Molecular Skeleton Decoder (MSD) to progressively predict these components, achieving strong results on handwritten and printed chemical structure recognition benchmarks.