Paper Information

Citation: Musazade, F., Jamalova, N., & Hasanov, J. (2022). Review of techniques and models used in optical chemical structure recognition in images and scanned documents. Journal of Cheminformatics, 14(1), 61. https://doi.org/10.1186/s13321-022-00642-3

Publication: Journal of Cheminformatics 2022

What kind of paper is this?

This is a Systematization paper ($\Psi_{\text{Systematization}}$). It does not propose a new architecture or release a new dataset. Instead, it organizes existing literature into two distinct evolutionary phases: Rule-based systems (1990s-2010s) and Machine Learning-based systems (2015-present). It synthesizes performance metrics across these paradigms to highlight the shift from simple classification to “image captioning” (sequence generation).

Justification: The paper focuses on “organizing and synthesizing existing literature” and answers the core question: “What do we know?” The dominant contribution is systematization based on several key indicators:

  1. Survey Structure: The paper explicitly structures content by categorizing the field into two distinct historical and methodological groups: “Rule-based systems” and “ML-based systems”. It traces the “evolution of approaches from rule-based structure analyses to complex statistical models”, moving chronologically from early tools like OROCS and OSRA (1990s-2000s) to modern Deep Learning approaches like DECIMER and Vision Transformers.

  2. Synthesis of Knowledge: The paper aggregates performance metrics from various distinct studies into unified comparison tables (Table 1 for rule-based and Table 2 for ML-based). It synthesizes technical details of different models, explaining how specific architectures (CNNs, LSTMs, Attention mechanisms) are applied to the specific problem of Optical Chemical Structure Recognition (OCSR).

  3. Identification of Gaps: The authors dedicate specific sections to “Gaps of rule-based systems” and “Gaps of ML-based systems”. They conclude with recommendations for future development, such as the need for “standardized datasets” and specific improvements to image augmentation and evaluation metrics.

What is the motivation?

The primary motivation is the need to digitize vast amounts of chemical knowledge locked in non-digital formats (e.g., scanned PDFs, older textbooks). This is challenging because:

  1. Representational Variety: A single chemical formula can be drawn in many visually distinct ways (e.g., different orientations, bond styles, fonts).
  2. Legacy Data: Older documents contain noise, low resolution, and disconnected strokes that confuse standard computer vision models.
  3. Lack of Standardization: There is no centralized database or standardized benchmark for evaluating OCSR performance, making comparison difficult.

What is the novelty here?

The paper provides a structured comparison of the “evolution” of OCSR, specifically identifying the pivot point where the field moved from object detection to NLP-inspired sequence generation.

Key insights include:

  • The Paradigm Shift: Identifying that OCSR has effectively become an “image captioning” problem where the “caption” is a SMILES or InChI string.
  • Metric Critique: It critically analyzes the flaws in current evaluation metrics, noting that Levenshtein Distance (LD) is more informative than plain exact-match accuracy but still fails to weight errors by their chemical severity (e.g., mistaking “F” for “S” changes the molecule far more drastically than a wrong digit).
  • Hybrid Potential: Despite the dominance of ML, the authors argue that rule-based heuristics are still necessary for post-processing validation (e.g., valency checks).
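To make the metric critique concrete, here is a minimal sketch (purely illustrative, not an algorithm from the paper) of a chemically weighted edit distance, where substituting one element symbol for another costs more than a digit mix-up:

```python
# Illustrative only: a chemically weighted variant of Levenshtein distance.
# Standard LD charges every substitution 1; here swapping one element symbol
# for another (e.g. "F" -> "S") is penalized more than a digit error.
# Single-character symbols only; two-letter elements (Cl, Br) would need
# tokenization first.
ELEMENTS = set("BCNOPSFI")

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    if a in ELEMENTS and b in ELEMENTS:
        return 2.0  # element-for-element mistakes change the chemistry outright
    return 1.0

def weighted_levenshtein(pred: str, truth: str) -> float:
    m, n = len(pred), len(truth)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,  # delete from prediction
                d[i][j - 1] + 1,  # insert into prediction
                d[i - 1][j - 1] + sub_cost(pred[i - 1], truth[j - 1]),
            )
    return d[m][n]
```

With this weighting, a single element mistake (“CCF” vs. “CCS”) scores 2.0, while a single digit mistake still scores 1.0 — the kind of domain-specific distance the authors recommend over generic string distance.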

What experiments were performed?

As a review paper, it aggregates experimental results from primary sources rather than running new benchmarks. It compares:

  • Rule-based systems: OSRA, ChemOCR, Imago, and various heuristic approaches.
  • ML-based systems: DECIMER (multiple versions), MSE-DUDL, and winning solutions from the BMS Kaggle competition.

It contrasts these systems using:

  • Datasets: BMS (synthetic, 4M images), PubChem (synthetic), U.S. Patents (real-world scanned).
  • Metrics: Tanimoto similarity (structural overlap) and Levenshtein distance (string edit distance).

What were the outcomes and conclusions drawn?

  1. Transformers are SOTA: Attention-based encoder-decoder models (like Vision Transformers) significantly outperform CNN-RNN hybrids, achieving ~96% accuracy on SMILES prediction.
  2. Data Hungry: Modern approaches require massive datasets (millions of images) and significant compute (weeks on TPUs), whereas rule-based systems required neither but hit a performance ceiling.
  3. Critical Gaps:
    • Super-atoms: Current models struggle with abbreviated super-atoms (e.g., “Ph”, “COOH”).
    • Stereochemistry: 3D information (wedges/dashes) is often lost or misinterpreted.
    • Resolution: Models are brittle to resolution changes; some require high-res, others fail if images aren’t downscaled.
  4. Recommendation: Future systems should integrate “smart” pre-processing (denoising without cropping) and use domain-specific distance metrics rather than generic string distance.

Reproducibility Details

This section summarizes the technical details of the systems reviewed to aid in reproducing the State-of-the-Art (SOTA).

Data

The review identifies the following key datasets used for training OCSR models:

| Dataset | Type | Size | Notes |
|---|---|---|---|
| BMS (Bristol-Myers Squibb) | Synthetic | ~4M images | 2.4M train / 1.6M test. Used for the Kaggle competition. Contains noise (salt & pepper, blur). |
| PubChem | Synthetic | ~50M+ | Can be generated via CDK (Chemistry Development Kit). Used by DECIMER. |
| U.S. Patents (USPTO) | Scanned | Variable | Real-world noise, often low resolution. Used for MSE-DUDL. |
| ChemInfty | Scanned | ~1K images | Older benchmark for rule-based systems. |

Algorithms

The review highlights the progression of algorithms:

  • Rule-Based: Hough transforms for bond detection, vectorization/skeletonization, and OCR for atom labels.
  • Sequence Modeling:
    • Image Captioning: Encoder (CNN/ViT) → Decoder (RNN/Transformer).
    • Tokenization: Parsing InChI/SMILES into discrete tokens (e.g., splitting C13 into C, 13).
    • Beam Search: Used at inference time (typically $k = 15$–$20$) to find the most likely chemical string.
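The tokenization and beam-search steps above can be sketched in plain Python (a minimal illustration: the regex is a common SMILES tokenizer pattern, not the exact one used by any reviewed system, and `log_prob_fn` stands in for the trained decoder’s next-token scores):

```python
import re

# Common regex-style SMILES tokenizer: bracket atoms, two-letter elements,
# single atoms, bonds, branches, and ring-closure digits become one token each.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|[BCNOPSFIbcnops]|@@?|=|#|\\|/|\+|-|\(|\)|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "unrecognized characters in input"
    return tokens

def beam_search(log_prob_fn, vocab, k=5, max_len=50, eos="<eos>"):
    """Keep the k highest-scoring partial token sequences at each step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:  # finished beams carry over unchanged
                candidates.append((seq, score))
                continue
            for tok in vocab:           # expand with every possible next token
                candidates.append((seq + [tok], score + log_prob_fn(seq, tok)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
        if all(seq and seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```

For example, `tokenize_smiles("ClCCl")` yields `["Cl", "C", "Cl"]` rather than splitting the chlorine symbol into two characters, which is exactly why token-level decoding beats character-level decoding for chemical strings.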

Models

Key architectures reviewed:

  • DECIMER 1.0: Uses EfficientNet-B3 (Encoder) and Transformer (Decoder). Predicts SELFIES strings (more robust than SMILES).
  • Swin Transformer: Often used in Kaggle ensembles as the visual encoder due to better handling of variable image sizes.
  • Grid LSTM: Used in older deep learning approaches (MSE-DUDL) to capture spatial dependencies.

Evaluation

Metrics standard in the field:

  • Levenshtein Distance (LD): Edit distance between predicted and ground truth strings. Lower is better.
  • Tanimoto Similarity: Measures overlap of molecular fingerprints ($0.0 - 1.0$). Higher is better. SOTA is $\approx 0.99$ on synthetic data.
  • 1-1 Match Rate: Exact string matching (accuracy). SOTA is $\approx 96\%$ for Transformers.
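For reference, Tanimoto similarity is the Jaccard index over fingerprint bits: $|A \cap B| / |A \cup B|$. A minimal sketch (in practice the bit sets come from a fingerprinting library such as RDKit or CDK; here they are assumed to be plain Python sets of “on” bit indices):

```python
# Tanimoto (Jaccard) similarity between two fingerprint bit sets.
# 1.0 means identical fingerprints; 0.0 means no shared substructure bits.

def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    if not bits_a and not bits_b:
        return 1.0  # two empty fingerprints are conventionally identical
    inter = len(bits_a & bits_b)
    union = len(bits_a | bits_b)
    return inter / union
```

E.g., fingerprints sharing 3 of 5 distinct bits score $3/5 = 0.6$; the SOTA figure of $\approx 0.99$ means predicted and ground-truth fingerprints are nearly bit-identical.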

Hardware

  • Training Cost: High for SOTA. DECIMER required ~2 weeks on TPUs.
  • Inference: Transformer models are heavy; rule-based systems run on standard CPUs but with lower accuracy.

Citation

@article{musazadeReviewTechniquesModels2022,
  title = {Review of Techniques and Models Used in Optical Chemical Structure Recognition in Images and Scanned Documents},
  author = {Musazade, Fidan and Jamalova, Narmin and Hasanov, Jamaladdin},
  year = 2022,
  month = sep,
  journal = {Journal of Cheminformatics},
  volume = {14},
  number = {1},
  pages = {61},
  doi = {10.1186/s13321-022-00642-3}
}