Computational Chemistry
Automatic chemical image recognition pipeline from raster image to structured file

Automatic Recognition of Chemical Images

This methodological paper presents a system for digitizing chemical images into SDF files. It uses a custom vectorization algorithm and chemical rule validation, achieving 94% accuracy on benchmark datasets versus 50% for commercial tools.

Computational Chemistry
Overview of the ChemReader pipeline for extracting chemical structures from raster images using Hough transform and OCR

ChemReader: Automated Structure Extraction

This paper presents ChemReader, a fully automated optical structure recognition tool that converts raster images of chemical diagrams into machine-readable formats. It introduces a modified Hough transform for bond detection and a chemical spell checker that improves OCR accuracy from 66% to 87%.
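To make the idea of a chemical spell checker concrete, here is a minimal sketch. It is not ChemReader's actual implementation: the vocabulary, the confusion table, and the function name `correct_label` are all assumptions chosen for illustration. The sketch swaps characters that OCR commonly confuses and accepts the first variant that is a known chemical label.

```python
from itertools import product
from typing import Optional

# Hypothetical vocabulary of valid atom and group labels (illustrative only).
VOCABULARY = {"C", "N", "O", "S", "H", "OH", "NH2", "CH3", "Cl", "Br"}

# Hypothetical table of OCR confusions: misread character -> likely intended one.
CONFUSIONS = {"0": "O", "1": "l", "I": "l", "5": "S", "8": "B"}


def correct_label(raw: str) -> Optional[str]:
    """Return a vocabulary label reachable from `raw` by swapping confusable characters."""
    if raw in VOCABULARY:
        return raw
    # For each character, allow either the original or its confusable replacement.
    options = [(ch, CONFUSIONS[ch]) if ch in CONFUSIONS else (ch,) for ch in raw]
    for candidate in product(*options):
        label = "".join(candidate)
        if label in VOCABULARY:
            return label
    # No variant matched the vocabulary; leave the decision to the caller.
    return None
```

With these assumed tables, `correct_label("0H")` yields `"OH"` and `correct_label("C1")` yields `"Cl"`, while a string with no valid variant returns `None`.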

Computational Chemistry

Hand-Drawn Chemical Diagram Recognition (AAAI 2007)

An early method paper (AAAI 2007) proposing a multi-stage sketch recognition pipeline for hand-drawn chemical diagrams. It introduces a domain verification step that uses chemical rules to refine ink parsing, achieving a 27% error reduction over geometric-only baselines.
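One simple kind of chemical rule that a domain verification step could apply is a valence check. The sketch below is an assumption for illustration, not the paper's method: the `MAX_VALENCE` table is simplified, and `overvalent_atoms` is a hypothetical helper that flags atoms in a parsed sketch whose bond orders add up to more than the element typically allows.

```python
# Typical maximum valences for a few neutral atoms (a simplified assumption).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1}


def overvalent_atoms(atoms, bonds):
    """Return indices of atoms whose summed bond order exceeds the typical valence.

    atoms: list of element symbols, e.g. ["C", "O"]
    bonds: list of (i, j, order) tuples indexing into `atoms`
    """
    totals = [0] * len(atoms)
    for i, j, order in bonds:
        totals[i] += order
        totals[j] += order
    # Elements missing from the table are skipped rather than flagged.
    return [
        idx
        for idx, (symbol, total) in enumerate(zip(atoms, totals))
        if symbol in MAX_VALENCE and total > MAX_VALENCE[symbol]
    ]
```

A recognizer could use the flagged indices to revisit ambiguous strokes, for example by reinterpreting a spurious bond as part of a text label.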

Computational Chemistry
Optical chemical structure recognition example

IMG2SMI: Translating Molecular Structure Images to SMILES

A 2021 image-to-text approach treating OCSR as an image captioning task. It uses Transformers with SELFIES representation to convert molecular structure diagrams into SMILES strings, enabling extraction of visual chemical knowledge from scientific literature.

Computational Chemistry

Kekulé: OCR-Optical Chemical Recognition

This 1992 paper introduces Kekulé, one of the first complete Optical Chemical Structure Recognition (OCSR) systems. It details a pipeline integrating raster-to-vector conversion, neural network-based OCR, and rule-based logic to convert printed chemical diagrams into connection tables.

Computational Chemistry

OCSR Methods: A Taxonomy of Approaches

A comprehensive categorization of OCSR methods, organizing techniques by their fundamental approach: deep learning, traditional ML, and rule-based systems.

Computational Chemistry
Early optical recognition system converts scanned chemical diagrams to connection tables

Optical Recognition of Chemical Graphics

This paper describes an early prototype system that digitizes chemical structure diagrams from scanned documents. It employs a multi-stage pipeline involving convex bounding polygon extraction, vectorization, and rule-based heuristics to generate MDL Molfiles.

Computational Chemistry
SELFIES robustness demonstration

Invalid SMILES Benefit Chemical Language Models: A Study

A 2024 Nature Machine Intelligence paper providing causal evidence that the ability to generate invalid SMILES improves chemical language model performance: discarding invalid outputs filters out low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.

Computational Chemistry
SELFIES robustness demonstration

SELFIES and the Future of Molecular String Representations

This 2022 perspective paper reviews 250 years of chemical notation evolution and proposes 16 concrete research projects to extend SELFIES beyond traditional organic chemistry into polymers, crystals, and reactions.

Scientific Computing
Grid of complex molecular structures rendered from SELFIES and SMILES strings

Molecular String Renderer: Robust Visualization Tool

A fault-tolerant RDKit wrapper that treats molecular visualization as a software engineering problem. It implements a strategy pattern for SVG generation with automatic raster fallback, native SELFIES support for generative AI workflows, and strict type safety for reliable batch processing of millions of molecules in training pipelines.
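The strategy pattern with fallback described above can be sketched without RDKit. Everything here is hypothetical: `render_svg` and `render_raster` are placeholder strategies that return fake bytes, and `render_with_fallback` is an illustrative name, not the tool's API. The point is only the control flow, where each strategy is tried in order and the first success wins.

```python
from typing import Callable, Sequence

# A rendering strategy takes a molecular string and returns encoded image bytes.
RenderStrategy = Callable[[str], bytes]


class RenderError(Exception):
    """Raised when a strategy cannot render the given input."""


def render_svg(mol_string: str) -> bytes:
    # Placeholder SVG strategy: pretend very long inputs cannot be drawn as SVG.
    if len(mol_string) > 20:
        raise RenderError("SVG strategy failed")
    return f"<svg><title>{mol_string}</title></svg>".encode()


def render_raster(mol_string: str) -> bytes:
    # Placeholder raster strategy: returns fake bytes tagged as PNG.
    return b"PNG:" + mol_string.encode()


def render_with_fallback(mol_string: str, strategies: Sequence[RenderStrategy]) -> bytes:
    """Try each strategy in order and return the first successful result."""
    errors = []
    for strategy in strategies:
        try:
            return strategy(mol_string)
        except RenderError as exc:
            # Record the failure and move on to the next strategy.
            errors.append(str(exc))
    raise RenderError("all strategies failed: " + "; ".join(errors))
```

Calling `render_with_fallback(s, [render_svg, render_raster])` returns the SVG bytes when the first strategy succeeds and the raster bytes otherwise, so a batch job does not stop on a single difficult molecule.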

Computational Chemistry
D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, compiling 86 tautomeric transformation rules (20 existing, 66 new) and validating them across 400M+ structures to inform algorithmic improvements for InChI V2.

Computational Chemistry
2D molecular structure diagram of tricyclohexylphosphine showing a central phosphorus atom bonded to three cyclohexyl groups

InChI: The Worldwide Chemical Structure Identifier Standard

A comprehensive 2013 review explaining how InChI emerged as the global standard for chemical structure identifiers, covering its history as a response to the Internet’s need for non-proprietary molecular linking, its governance under IUPAC, and the technical layers that ensure uniqueness across diverse chemical databases.
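The layered structure of an InChI string can be seen by splitting it on the `/` separator. The sketch below is a simplified illustration, not an official parser: `split_inchi` and the descriptive names in `LAYER_NAMES` are assumptions, and the function does not handle every case (for instance, it does not group the sublayers that follow a fixed-hydrogen layer).

```python
# Descriptive names for common single-letter layer prefixes (naming is illustrative).
LAYER_NAMES = {
    "c": "connections",
    "h": "hydrogens",
    "q": "charge",
    "p": "protons",
    "b": "double_bond_stereo",
    "t": "tetrahedral_stereo",
    "s": "stereo_type",
    "i": "isotopes",
}


def split_inchi(inchi):
    """Split an InChI string into an ordered list of (layer_name, value) pairs."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *rest = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    for index, segment in enumerate(rest):
        # The formula layer, when present, comes first and has no lowercase prefix.
        if index == 0 and not segment[:1].islower():
            layers.append(("formula", segment))
        else:
            layers.append((LAYER_NAMES.get(segment[:1], "unknown"), segment[1:]))
    return layers
```

For ethanol, `split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")` separates the version (`1S`), the formula (`C2H6O`), the connection layer (`1-2-3`), and the hydrogen layer (`3H,2H2,1H3`).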