Molecular-Representation

String Representations for Chemical Image Recognition

This empirical study isolates the impact of chemical string representations on image-to-text translation models. It finds that while SMILES offers the highest overall accuracy, SELFIES provides a guarantee of structural validity, offering a trade-off for OCSR tasks.

Computational Chemistry

ChemGrapher pipeline overview showing segmentation and classification stages

ChemGrapher: Deep Learning for Chemical Graph OCSR

ChemGrapher replaces rule-based chemical OCR with a deep learning pipeline using semantic segmentation to identify atom and bond candidates, followed by specialized classification networks to resolve stereochemistry and bond multiplicity, reducing error rates compared to OSRA across all tested styles.

Computational Chemistry

Encoder-decoder architecture translating a chemical structure bitmap into a SMILES string

DECIMER: Deep Learning for Chemical Image Recognition

DECIMER adapts the “Show, Attend and Tell” image captioning architecture to translate chemical structure images into SMILES strings. By leveraging massive synthetic datasets generated from PubChem, it demonstrates that deep learning can perform optical chemical recognition without complex, hand-engineered rule systems.

Computational Chemistry

Ibuprofen molecular structure diagram for Img2Mol OCSR

Img2Mol: Accurate SMILES Recognition from Depictions

A 2021 deep learning system using a two-stage approach for OCSR, encoding images into continuous CDDD embeddings before decoding to SMILES. It leverages extensive data augmentation to handle rotations, distortions, and rendering variations for fast and robust molecular structure recognition.

Computational Chemistry

Optical Chemical Structure Recognition workflow visualization

Research on Chemical Expression Images Recognition

Proposes a new OCSR workflow that improves recognition rates by separating adhesive chemical symbols and specifically handling virtual/real wedge bonds using vectorization, achieving 90% exact match vs 82.2% for OSRA baseline.

Computational Chemistry

Optical chemical structure recognition example

IMG2SMI: Translating Molecular Structure Images to SMILES

A 2021 image-to-text approach treating OCSR as an image captioning task. It uses Transformers with SELFIES representation to convert molecular structure diagrams into SMILES strings, enabling extraction of visual chemical knowledge from scientific literature.

Scientific Computing

Grid of complex molecular structures rendered from SELFIES and SMILES strings

Molecular String Renderer: Robust Visualization Tool

A fault-tolerant RDKit wrapper treating molecular visualization as a software engineering problem, implementing strategy pattern for SVG generation with automatic raster fallback, native SELFIES support for generative AI workflows, and strict type safety for reliable batch processing of millions of molecules in training pipelines.

Computational Chemistry

D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, compiling 86 tautomeric transformation rules (20 existing, 66 new) and validating them across 400M+ structures to inform algorithmic improvements for InChI V2.

Computational Chemistry

2D molecular structure diagram of tricyclohexylphosphine showing a central phosphorus atom bonded to three cyclohexyl groups

InChI: The Worldwide Chemical Structure Identifier Standard

A comprehensive 2013 review explaining how InChI emerged as the global standard for chemical structure identifiers, covering its history as a response to the Internet’s need for non-proprietary molecular linking, its governance under IUPAC, and the technical layers that ensure uniqueness across diverse chemical databases.

Computational Chemistry

Crystal structure of Na8Si46 clathrate displaying dodecahedral and tetrakaidecahedral coordination polyhedra

Making InChI FAIR and Sustainable for Inorganic Chemistry

A 2025 Faraday Discussions paper describing the major overhaul of InChI v1.07 that fixed more than 3000 bugs, added support for inorganic and organometallic compounds, and modernized the system to align with FAIR data principles for chemistry databases.

Computational Chemistry

A cobalt sulfate and ethylenediamine mixture being prepared

Mixfile & MInChI: Machine-Readable Mixture Formats

A 2019 format specification introducing two complementary standards for chemical mixtures. Mixfile provides comprehensive mixture descriptions and MInChI provides compact canonical identifiers. This addresses the long-standing lack of standardized machine-readable formats for multi-component chemical systems.

Computational Chemistry

Colorized electron microscope image of nanostructured indium phosphide surface showing spatially oriented cubic crystallites

NInChI: Toward a Chemical Identifier for Nanomaterials

Can we create a SMILES-like notation for nanomaterials? A collaborative workshop tackles the challenge of representing complex, multi-component nanomaterials with a proposed extension to the established InChI system.