Paper Information

Citation: Musazade, F., Jamalova, N., & Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. Journal of Cheminformatics, 14(1), 61. https://doi.org/10.1186/s13321-022-00642-3

Publication: Journal of Cheminformatics 2022

Systematization of OCSR Evolution

This is a Systematization paper ($\Psi_{\text{Systematization}}$). It organizes existing literature into two distinct evolutionary phases: Rule-based systems (1990s-2010s) and Machine Learning-based systems (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to “image captioning” (sequence generation).

Justification: The paper focuses on “organizing and synthesizing existing literature” and answers the core question: “What do we know?” The dominant contribution is systematization based on several key indicators:

  1. Survey Structure: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: “Rule-based systems” and “ML-based systems”. It traces the “evolution of approaches from rule-based structure analyses to complex statistical models”, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.

  2. Synthesis of Knowledge: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).

  3. Identification of Gaps: The authors dedicate specific sections to “Gaps of rule-based systems” and “Gaps of ML-based systems”, and conclude with recommendations for future development, such as the need for “standardized datasets” and specific improvements in image augmentation and evaluation metrics.

Motivation for Digitization in Cheminformatics

The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:

  1. Representational Variety: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).
  2. Legacy Data: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.
  3. Lack of Standardization: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.

Key Insights and the Paradigm Shift

The paper provides a structured comparison of the “evolution” of OCSR, specifically identifying the pivot point where the field moved from object detection to NLP-inspired sequence generation.

Key insights include:

  • The Paradigm Shift: Identifying that OCSR has effectively become an “image captioning” problem where the “caption” is a SMILES or InChI string.
  • Metric Critique: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is better than simple accuracy but still fails to capture semantic chemical severity (e.g., mistaking “F” for “S” is worse than a wrong digit).
  • Hybrid Potential: Despite the dominance of ML, the authors argue that rule-based heuristics are still valuable for post-processing validation (e.g., checking element order, sequence structure, and formula correspondence).

Comparative Analysis of Rule-Based vs. ML Systems

As a review paper, it aggregates experimental results from primary sources. It compares:

  • Rule-based systems: OSRA, chemoCR, Imago, Markov Logic OCSR, and various heuristic approaches.
  • ML-based systems: DECIMER (multiple versions), MSE-DUDL, ICMDT (Image Captioning Model based on Deep Transformer-in-Transformer), and other BMS Kaggle competition solutions.

It contrasts these systems using:

  • Datasets: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).
  • Metrics: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).

Outcomes, Critical Gaps, and Recommendations

  1. Transformers are SOTA: Attention-based encoder-decoder models outperform CNN-RNN hybrids. DECIMER 1.0 achieved a perfect Tanimoto similarity ($T = 1.0$) on 96.47% of its test set using an EfficientNet-B3 encoder and a Transformer decoder.
  2. Data Hungry: Modern approaches require massive datasets (millions of images) and significant compute. DECIMER 1.0 trained on 39M images for 14 days on TPU, while the original DECIMER took 27 days on a single GPU. Rule-based systems required neither large data nor heavy compute but hit a performance ceiling.
  3. Critical Gaps:
    • Super-atoms: Current models struggle with abbreviated super-atoms (e.g., “Ph”, “COOH”).
    • Stereochemistry: 3D information (wedges/dashes) is often lost or misinterpreted.
    • Resolution: Models are brittle to resolution changes; some require high-res, others fail if images aren’t downscaled.
  4. Recommendation: Future systems should integrate “smart” pre-processing (denoising without cropping) and use domain-specific distance metrics. The authors also note that post-processing formula validation (checking element order, sequence structure, and formula correspondence) increases accuracy by around 5-6% on average. They suggest exploring Capsule Networks as an alternative to CNNs, since routing-by-agreement preserves the spatial pose information that max-pooling discards.
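The kind of rule-based post-processing the authors recommend can be illustrated with a minimal sketch (my own, not code from the paper): cheap syntactic sanity checks on a predicted SMILES string, such as balanced branch parentheses and paired ring-closure digits, which can reject obviously malformed predictions before any chemistry-aware validation.

```python
import re
from collections import Counter

def validate_smiles(smiles: str) -> bool:
    """Cheap rule-based sanity checks on a predicted SMILES string."""
    # Branch parentheses must balance and never close before opening.
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    if depth != 0:
        return False
    # Each ring-closure digit must appear an even number of times
    # (every ring opening needs a matching closing). Bracket atoms
    # like [13C] are stripped first so isotope digits are ignored.
    counts = Counter(re.sub(r"\[[^\]]*\]", "", smiles))
    return all(counts[d] % 2 == 0 for d in "123456789")

validate_smiles("c1ccccc1O")  # phenol: balanced, ring 1 opens and closes
validate_smiles("c1ccccc(O")  # unbalanced branch: rejected
```

A real system would follow such string-level checks with a full parse (e.g. via a cheminformatics toolkit) and the element-order and formula-correspondence checks the authors describe.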

Reproducibility

As a review paper, this work does not introduce original code, models, or datasets. The paper itself is open access via the Journal of Cheminformatics. This section summarizes the technical details of the systems reviewed.

Data

The review identifies the following key datasets used for training OCSR models:

| Dataset | Type | Size | Notes |
|---|---|---|---|
| BMS (Bristol-Myers Squibb) | Synthetic | ~4M images | 2.4M train / 1.6M test. Used for the Kaggle competition. Test images contain noise (salt & pepper, blur) and rotations absent from the training images. |
| PubChem | Synthetic | ~39M | Generated via the CDK (Chemistry Development Kit). Used by DECIMER 1.0 (90/10 train/test split). |
| U.S. Patents (USPTO) | Scanned | Variable | Real-world noise, often low resolution. One of several training sources for MSE-DUDL (alongside PubChem and Indigo, totaling 50M+ samples). |
| ChemInfty | Scanned | 869 images | Older benchmark used to evaluate rule-based systems (e.g., Markov Logic OCSR). |

Algorithms

The review highlights the progression of algorithms:

  • Rule-Based: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.
  • Sequence Modeling:
    • Image Captioning: Encoder (CNN/ViT) → Decoder (RNN/Transformer).
    • Tokenization: Parsing InChI/SMILES into discrete tokens (e.g., splitting C13 into C, 13).
    • Beam Search: Used at inference (typical beam width $k = 15$–$20$) to find the most likely chemical string.
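The tokenization step can be sketched as follows (an illustrative assumption, not code from the reviewed systems): a regular expression splits a molecular formula such as C13H18O2 into alternating element and count tokens, matching the paper's example of splitting C13 into C and 13.

```python
import re

def tokenize_formula(formula: str) -> list[str]:
    """Split a Hill-order molecular formula into element/count tokens,
    e.g. 'C13H18O2' -> ['C', '13', 'H', '18', 'O', '2']."""
    tokens = []
    # An element is an uppercase letter optionally followed by a lowercase
    # one (C, Cl, Br); the count is an optional run of digits.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        tokens.append(element)
        if count:
            tokens.append(count)
    return tokens

tokenize_formula("C13H18O2")  # ibuprofen's formula
```

SMILES tokenization works analogously, with additional token classes for bonds, branches, ring closures, and bracket atoms.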

Models

Key architectures reviewed:

  • DECIMER 1.0: Uses EfficientNet-B3 (Encoder) and Transformer (Decoder). Predicts SELFIES strings (more robust than SMILES).
  • Swin Transformer: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.
  • Grid LSTM: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.

Evaluation

Metrics standard in the field:

  • Levenshtein Distance (LD): Edit distance between predicted and ground truth strings. Lower is better. Formally, for two sequences $a$ and $b$ (e.g. SMILES strings) of lengths $|a|$ and $|b|$, the recursive distance $LD(a, b)$ is bounded from $0$ to $\max(|a|, |b|)$.
  • Tanimoto Similarity: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. DECIMER 1.0 achieved a Tanimoto of 0.99 on PubChem data (Table 2). Calculated as: $$ \begin{aligned} T(A, B) = \frac{N_c}{N_a + N_b - N_c} \end{aligned} $$ where $N_a$ and $N_b$ are the number of bits set to 1 in fingerprints $A$ and $B$, and $N_c$ is the number of common bits set to 1.
  • 1-1 Match Rate: Exact string matching (accuracy). For DECIMER 1.0, 96.47% of results achieved Tanimoto $= 1.0$.
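Both metrics are straightforward to compute. A minimal sketch (my own, assuming fingerprints are represented as sets of on-bit indices) implementing the definitions above:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming; bounded by max(len(a), len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    """T(A, B) = N_c / (N_a + N_b - N_c) over the on-bit indices of two fingerprints."""
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)

levenshtein("CCO", "CC=O")      # ethanol vs acetaldehyde SMILES: one insertion
tanimoto({1, 2, 3}, {2, 3, 4})  # 2 common bits of 4 total
```

Note that both are purely syntactic, which is exactly the authors' critique: the Levenshtein cost of substituting “F” for “S” equals that of a wrong digit, even though the chemical consequences differ greatly.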

Hardware

  • Training Cost: High for SOTA. DECIMER 1.0 required ~14 days on TPU. The original DECIMER took ~27 days on a single NVIDIA GPU.
  • Inference: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.

Citation

@article{musazadeReviewTechniquesModels2022,
  title = {Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents},
  author = {Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin},
  year = 2022,
  month = sep,
  journal = {Journal of Cheminformatics},
  volume = {14},
  number = {1},
  pages = {61},
  doi = {10.1186/s13321-022-00642-3}
}