Computational Chemistry
ChemGrapher pipeline overview showing segmentation and classification stages

ChemGrapher: Deep Learning for Chemical OCR

ChemGrapher replaces rule-based chemical OCR with a deep learning pipeline: semantic segmentation identifies atom and bond candidates, and specialized classification networks then resolve stereochemistry and bond multiplicity. The approach significantly outperforms OSRA.

Computational Chemistry

DECIMER: Deep Learning for Chemical Image Recognition

DECIMER adapts the “Show, Attend and Tell” image captioning architecture to translate chemical structure images into SMILES strings. By leveraging massive synthetic datasets generated from PubChem, it demonstrates that deep learning can perform optical chemical recognition without complex, hand-engineered rule systems.

Computational Chemistry
Optical chemical structure recognition example

Img2Mol: Accurate SMILES from Molecular Depictions

A 2021 deep learning system that takes a two-stage approach to optical chemical structure recognition (OCSR): it first encodes images into continuous CDDD embeddings, then decodes them to SMILES, with extensive data augmentation (rotations, distortions, and rendering variations) for fast and robust molecular structure recognition.

Computational Chemistry
Optical Chemical Structure Recognition workflow visualization

Research on Chemical Expression Images Recognition

Proposes a new OCSR workflow that improves recognition rates by separating touching ("adhesive") chemical symbols and by handling hashed and solid ("virtual" and "real") wedge bonds explicitly through vectorization, achieving a 90% exact-match rate versus 82.2% for the OSRA baseline.

Computational Chemistry
Optical chemical structure recognition example

IMG2SMI: Translating Molecular Structure Images to SMILES

A 2021 image-to-text approach that treats OCSR as an image captioning task, using Transformers with a SELFIES intermediate representation to convert molecular structure diagrams into SMILES strings. It targets the bottleneck of extracting chemical knowledge that is locked in images in scientific literature and patents.

Scientific Computing
Grid of complex molecular structures rendered from SELFIES and SMILES strings

Molecular String Renderer: Robust Visualization Tool

A fault-tolerant RDKit wrapper that treats molecular visualization as a software engineering problem. It implements the strategy pattern for SVG generation with automatic raster fallback, supports SELFIES natively for generative AI workflows, and enforces strict type safety for reliable batch processing of millions of molecules in training pipelines.
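
The strategy pattern with automatic fallback mentioned above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the tool's actual code: `RenderResult`, `render_svg`, `render_raster`, and `render_with_fallback` are hypothetical names, and the two renderers are stand-ins that return placeholder strings instead of calling RDKit.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class RenderResult:
    strategy: str  # name of the strategy that succeeded
    payload: str   # rendered output (placeholder text in this sketch)


# A strategy takes a molecular string and returns rendered output, or raises.
RenderStrategy = Callable[[str], str]


def render_svg(mol_string: str) -> str:
    """Hypothetical vector strategy that fails on inputs it cannot handle."""
    # Assumption for illustration only: long strings make the SVG path fail.
    if not mol_string or len(mol_string) > 20:
        raise ValueError("vector rendering failed")
    return f"<svg data-mol='{mol_string}'/>"


def render_raster(mol_string: str) -> str:
    """Hypothetical raster fallback that accepts any non-empty string."""
    if not mol_string:
        raise ValueError("empty input")
    return f"PNG:{mol_string}"


def render_with_fallback(
    mol_string: str,
    strategies: Sequence[Tuple[str, RenderStrategy]],
) -> Optional[RenderResult]:
    """Try each strategy in order and return the first success, else None."""
    for name, strategy in strategies:
        try:
            return RenderResult(name, strategy(mol_string))
        except Exception:
            # One failing strategy must not abort a large batch job.
            continue
    return None


# Preferred strategy first, fallback second.
STRATEGIES = (("svg", render_svg), ("raster", render_raster))
```

Returning `None` instead of raising keeps a batch loop simple: the caller can log the failed molecule and move on to the next one.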

Computational Chemistry
D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, introducing 86 new tautomeric transformation rules and proposing algorithmic improvements for InChI V2 so that it can recognize when different representations describe the same molecule in different tautomeric forms.

Computational Chemistry
2D molecular structure diagram of tricyclohexylphosphine showing a central phosphorus atom bonded to three cyclohexyl groups

InChI: The Worldwide Chemical Structure Identifier Standard

A comprehensive 2013 review explaining how InChI emerged as the global standard for chemical structure identifiers, covering its history as a response to the Internet’s need for non-proprietary molecular linking, its governance under IUPAC, and the technical layers that ensure uniqueness across diverse chemical databases.
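
To make the idea of layers concrete, here is a small sketch that splits a standard InChI string into its slash-separated layers. It is an illustration only, not part of the InChI software: `split_inchi_layers` and `LAYER_NAMES` are hypothetical names, the readable layer labels are informal, and the function assumes a simple InChI in which each layer prefix appears once (repeated sublayers, such as those in isotopic or fixed-hydrogen sections, would overwrite earlier ones).

```python
from typing import Dict

# Informal labels for common single-letter layer prefixes (illustrative only).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double_bond_stereo",
    "t": "tetrahedral_stereo",
    "m": "stereo_inversion_flag",
    "s": "stereo_type",
    "i": "isotopic",
}


def split_inchi_layers(inchi: str) -> Dict[str, str]:
    """Split a simple InChI string into a dict of layer name -> content."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    rest = parts[1:]

    # The formula layer has no prefix letter and starts with an element symbol.
    if rest and rest[0][:1].isupper():
        layers["formula"] = rest.pop(0)

    for part in rest:
        if not part:
            continue
        # Unknown prefixes are kept under their raw letter.
        name = LAYER_NAMES.get(part[0], part[0])
        layers[name] = part[1:]
    return layers


ethanol = split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

For the ethanol string above, `ethanol` holds the version `1S`, the formula `C2H6O`, the connectivity layer `1-2-3`, and the hydrogen layer `3H,2H2,1H3`.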

Computational Chemistry
Crystal structure of Na8Si46 clathrate displaying dodecahedral and tetrakaidecahedral coordination polyhedra

Making InChI FAIR and Sustainable for Inorganic Chemistry

A 2025 Faraday Discussions paper describing the major overhaul of InChI v1.07 that fixed thousands of bugs, added robust support for inorganic and organometallic compounds, and modernized the system to align with FAIR data principles for chemistry databases.

Computational Chemistry
A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. This addresses the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Computational Chemistry
Colorized electron microscope image of nanostructured indium phosphide surface showing spatially oriented cubic crystallites

NInChI: Toward a Chemical Identifier for Nanomaterials

Can we create a SMILES-like notation for nanomaterials? A collaborative workshop tackles the challenge of representing complex, multi-component nanomaterials with a proposed extension to the established InChI system.

Computational Chemistry
Benzene in SELFIES notation

Recent Advances in the SELFIES Library (2023)

A 2023 software update paper documenting major improvements to the SELFIES Python library, including an architectural redesign that uses directed molecular graphs for faster performance, expanded chemical feature support, semantic constraints for validity, and user-friendly customization APIs that turn SELFIES from a proof of concept into a production-ready tool.