Cheminformatics

αExtractor extracts structured chemical information from biomedical literature

αExtractor: Chemical Info from Biomedical Literature

A 2024 deep learning system for optical chemical structure recognition (OCSR), designed specifically for biomedical literature mining. It uses a ResNet-Transformer architecture to handle challenging conditions including low-resolution images, noise, distortions, and even hand-drawn molecular diagrams from scientific documents.

Computational Chemistry

ChemInfty: Chemical Structure Recognition in Patent Images

A 2011 rule-based OCSR system designed specifically for the challenging low-quality images in Japanese patent applications, using segment-based methods to handle pervasive problems like touching characters, merged atom labels with bonds, and broken lines.

Computational Chemistry

Diagram showing MolNexTR's dual-stream architecture: a molecular image feeds into parallel ConvNeXt and Vision Transformer encoders, producing a SMILES string.

MolNexTR: Dual-Stream Molecular Image Recognition

MolNexTR is a dual-stream architecture that combines ConvNeXt and Vision Transformer encoders for optical chemical structure recognition (OCSR). It reaches 81-97% accuracy across diverse benchmarks by extracting local and global features in parallel and by training with specialized image-contamination augmentations.

Computational Chemistry

A colored molecule with annotations, representing the diverse drawing styles found in scientific papers that OCSR models must handle.

MolParser-7M & WildMol: Large-Scale OCSR Datasets

The MolParser project introduces two key datasets: MolParser-7M, the largest training dataset for Optical Chemical Structure Recognition (OCSR) with 7.7M pairs of images and E-SMILES strings, and WildMol, a new 20k-sample benchmark for evaluating models on challenging real-world data. The training data uniquely combines millions of diverse synthetic molecules with 400,000 manually annotated in-the-wild samples.

Computational Chemistry

Optical chemical structure recognition example

MolParser: End-to-End Molecular Structure Recognition

A 2025 end-to-end OCSR system addressing both technical and data challenges, introducing MolParser-7M (7M+ image-text pairs) and MolDet (YOLO-based detector) for extracting and recognizing molecular structures from real-world documents with diverse quality and styles.

Computational Chemistry

ZINC-22 Tranche Browser showing molecular count distribution

ZINC-22: A Multi-Billion Scale Database for Ligand Discovery

ZINC-22 is a public database of over 37 billion make-on-demand molecules. It uses distributed infrastructure and specialized search algorithms to support modern ultra-large virtual screening campaigns.

Computational Chemistry

Aspirin molecular structure generated from SMILES string

Converting SMILES and SELFIES to 2D Molecular Images

Build a robust Python CLI tool that converts both SMILES and SELFIES notation into publication-quality 2D molecular images, complete with formulas and legends.

Computational Chemistry

SELFIES representation of 2-Fluoroethenimine molecule

SELFIES: The 100% Robust Molecular String Representation

An in-depth overview of SELFIES, a molecular string representation designed to overcome the limitations of SMILES in machine learning. Every possible string, even a random one, decodes to a valid molecule, thanks to local decoding operations, customizable valence rules, and a graph-based internal representation.
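The robustness idea can be illustrated with a toy decoder. This is a minimal sketch and not the selfies package: the four-element alphabet, the `VALENCE` table, and the `decode` function are assumptions made for illustration, and branches and rings are left out. Each token is interpreted locally against the free valence of the previous atom, so a requested bond order is lowered whenever it would break a valence rule, and decoding stops once the chain has no free valence left.

```python
# Toy valence-constrained decoder (hypothetical, linear chains only).
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}   # maximum bonds per atom
BOND_ORDER = {"=": 2, "#": 3}                # token prefix -> requested order
BOND_SYMBOL = {1: "", 2: "=", 3: "#"}        # bond order -> SMILES symbol


def decode(tokens):
    """Turn tokens such as '[C]', '[=C]', '[#N]' into a linear SMILES string."""
    smiles = []
    free = None  # unused valence on the previous atom; None before the first atom
    for token in tokens:
        body = token.strip("[]")
        order = BOND_ORDER.get(body[0], 1)
        element = body.lstrip("=#")
        cap = VALENCE[element]
        if free is None:
            # First atom: there is nothing to bond to, so any prefix is ignored.
            smiles.append(element)
            free = cap
        elif free == 0:
            # The chain is saturated; remaining tokens are dropped.
            break
        else:
            # Lower the bond order until both atoms can accommodate it.
            order = min(order, free, cap)
            smiles.append(BOND_SYMBOL[order] + element)
            free = cap - order
    return "".join(smiles)


print(decode(["[F]", "[=C]", "[=C]", "[#N]"]))  # FC=C=N
```

In the example, the double bond requested after fluorine is reduced to a single bond, and the triple bond to nitrogen is reduced to a double bond, so the output respects every valence limit in the table.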

Computational Chemistry

MARCEL dataset Kraken ligand example in 3D conformation

MARCEL: Molecular Conformer Ensemble Learning Benchmark

MARCEL provides a comprehensive benchmark for molecular representation learning with 722K+ conformers across four diverse subsets (Drugs-75K, Kraken, EE, BDE), enabling evaluation of conformer ensemble methods for property prediction in drug discovery and catalysis.

Computational Chemistry

SMILES: A Compact Notation for Chemical Structures

Comprehensive overview of SMILES notation for chemical structures, covering syntax for atoms, bonds, branches, rings, and stereochemistry, plus its key limitations for machine learning.
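The syntax elements listed above map naturally onto token classes. The sketch below splits a SMILES string into atoms, bonds, branch markers, and ring-closure labels with one regular expression. The pattern and the `tokenize` helper are assumptions for illustration: they cover only a common subset of the notation (no wildcard atoms or reaction arrows) and do not check chemical validity.

```python
import re

# One alternative per token class. Order matters: bracket atoms and
# two-letter elements must be tried before single letters.
TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom, e.g. [C@H] or [NH4+]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter organic-subset atoms
    r"|[bcnops]"       # aromatic atoms
    r"|%\d{2}|\d"      # ring-closure labels
    r"|[-=#:/\\]"      # bonds, including the stereo bonds / and \
    r"|[().]"          # branch open, branch close, disconnection
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(tokenize("C[C@H](N)C(=O)O"))
# ['C', '[C@H]', '(', 'N', ')', 'C', '(', '=', 'O', ')', 'O']
```

Stereochemistry at an atom stays inside its bracket token, while branches and ring closures appear as separate tokens that a parser would pair up in a later pass.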

Computational Chemistry

Log-scale plot showing exponential growth of alkane isomer counts from C1 to C40

The Number of Isomeric Hydrocarbons of the Methane Series

A foundational 1931 paper that derives exact recursive formulas for counting alkane structural isomers, correcting historical errors and establishing the first systematic enumeration up to C₄₀.
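The counts themselves can be reproduced with a short program. The sketch below is not a transcription of the paper's formulas: it assumes a Pólya-style formulation in which alkyl radicals are counted first (a carbon with up to three unordered sub-branches), and alkanes are then counted by their center, either a single central carbon whose four branches each hold fewer than half the carbons, or a central bond joining two radicals of equal size. Function names are hypothetical, and only structural isomers are counted (stereoisomers are ignored).

```python
def mul(p, q, deg):
    """Product of two polynomials (coefficient lists), truncated at x**deg."""
    out = [0] * (deg + 1)
    for i, a in enumerate(p[: deg + 1]):
        if a:
            for j, b in enumerate(q[: deg + 1 - i]):
                out[i + j] += a * b
    return out


def stretch(p, k, deg):
    """The polynomial p(x**k), truncated at x**deg."""
    out = [0] * (deg + 1)
    for i, a in enumerate(p):
        if i * k > deg:
            break
        out[i * k] = a
    return out


def radicals(n):
    """rad[k] = number of alkyl radicals with k carbons, for k = 0..n."""
    a = [1] + [0] * n  # the empty radical (a hydrogen) counts once
    for _ in range(n):  # each pass fixes at least one more coefficient
        cube = mul(mul(a, a, n), a, n)
        mixed = mul(a, stretch(a, 2, n), n)
        a3 = stretch(a, 3, n)
        # Unordered triples of sub-branches: cycle index of S3.
        z = [(c + 3 * m + 2 * t) // 6 for c, m, t in zip(cube, mixed, a3)]
        a = [1] + z[:n]  # multiply by x for the root carbon
    return a


def alkane_count(n, rad):
    """Number of structural isomers of the alkane with n carbons."""
    d = n - 1         # carbons outside the central atom
    m = (n - 1) // 2  # largest branch allowed around a central atom
    b = [rad[k] if k <= m else 0 for k in range(d + 1)]
    b2, b3, b4 = (stretch(b, k, d) for k in (2, 3, 4))
    bb = mul(b, b, d)
    # Unordered quadruples of branches: cycle index of S4.
    centered = (
        mul(bb, bb, d)[d] + 6 * mul(bb, b2, d)[d] + 3 * mul(b2, b2, d)[d]
        + 8 * mul(b, b3, d)[d] + 6 * b4[d]
    ) // 24
    half = rad[n // 2]
    bicentered = half * (half + 1) // 2 if n % 2 == 0 else 0
    return centered + bicentered


rad = radicals(20)
print([alkane_count(n, rad) for n in range(1, 11)])
# [1, 1, 1, 2, 3, 5, 9, 18, 35, 75]
```

Python integers are arbitrary precision, so the same functions extend to C₄₀ by calling `radicals(40)` first.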

Computational Chemistry

GEOM dataset example molecule: N-(4-pyrimidin-2-yloxyphenyl)acetamide

GEOM: Energy-Annotated Molecular Conformations Dataset

GEOM contains 450k+ molecules with 37M+ conformations, featuring energy annotations from semi-empirical (GFN2-xTB) and DFT methods for property prediction and molecular generation research.