Paper Information

Citation: Rajan, K., Brinkhaus, H. O., Zielesny, A., & Steinbeck, C. (2020). A review of optical chemical structure recognition tools. Journal of Cheminformatics, 12(1), 60. https://doi.org/10.1186/s13321-020-00465-0

Publication: Journal of Cheminformatics 2020

What kind of paper is this?

This is primarily a Systematization paper ($0.7 \Psi_{\text{Systematization}}$) with a significant Resource component ($0.3 \Psi_{\text{Resource}}$).

It serves as a Systematization because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, ChemGrapher). It synthesizes information on roughly 16 distinct tools, many of which are commercial or no longer available.

It acts as a Resource by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.

What is the motivation?

A vast amount of chemical knowledge remains “hidden” in the primary scientific literature (print or PDF), where structures are conveyed as 2D images rather than in machine-readable formats. As a consequence, a “backlog of decades of chemical literature” cannot easily be indexed or searched in open-access databases.

While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.

What is the novelty here?

The primary novelty is the comprehensive aggregation of the field's history, which had not been thoroughly reviewed in recent years. The paper traces the algorithmic evolution from the first published system in 1990 to the deep-learning methods of 2019.

Specific contributions include:

  • Historical Taxonomy: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.
  • Open Source Benchmark: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.
  • Algorithmic Breakdown: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.

What experiments were performed?

The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: MolVec (0.9.7), Imago (2.0), and OSRA (2.1.0).

They tested these tools on four datasets of varying quality and origin:

  1. USPTO: 5,719 images from US patents (high quality).
  2. UOB: 5,740 images from the University of Birmingham (synthetic/generated).
  3. CLEF 2012: 961 images from the CLEF-IP evaluation (well-segmented, clean).
  4. JPO: 450 images from Japanese patents (low quality, noise, Japanese characters).

Evaluation metrics were:

  • Accuracy: Percentage of perfectly recognized structures (verified by converting output to InChI strings and matching against reference InChIs).
  • Speed: Total processing time for the dataset.

What outcomes/conclusions?

Performance Results:

  • MolVec was the fastest tool, processing the datasets significantly faster than its competitors (e.g., 28 min for USPTO vs. 145 min for OSRA).
  • OSRA performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.
  • Imago generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs ~88% for others).
  • JPO Difficulty: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40% to 66%), highlighting issues with noise and non-standard labels.

General Conclusions:

  • No “gold standard” tool currently exists that solves all problems (page segmentation, R-groups, NLP integration).
  • Rule-based approaches dominate the history of the field, but deep-learning methods (MSE-DUDL, ChemGrapher) are emerging, though they were closed-source at the time of writing.
  • There is a critical need for tools that can handle full-page recognition (combining segmentation and recognition) rather than only single, pre-segmented molecule depictions.

Reproducibility Details

The authors provided sufficient detail to replicate the benchmarking study.

Data

The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.

| Dataset   | Size  | Source                 | Characteristics                            |
|-----------|-------|------------------------|--------------------------------------------|
| USPTO     | 5,719 | OSRA validation set    | US patent images, generally clean          |
| UOB       | 5,740 | Univ. of Birmingham    | Synthetic/generated images                 |
| CLEF 2012 | 961   | CLEF-IP 2012           | Well-segmented, high quality               |
| JPO       | 450   | Japanese Patent Office | Low quality, noisy, contains Japanese text |
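
The review only states that images were converted to PNG at 72 dpi and does not name the conversion tool. A minimal sketch of one way to reproduce that normalization step, assuming Pillow and a hypothetical directory layout, could look like this:

```python
# Hypothetical normalization step: the review only states that images were
# converted to PNG at 72 dpi; Pillow is used here as one possible tool, and
# the directory names are placeholders.
from pathlib import Path
from PIL import Image

SRC_DIR = Path("datasets/jpo_raw")      # assumed location of the original images
DST_DIR = Path("datasets/jpo_png")
DST_DIR.mkdir(parents=True, exist_ok=True)

for src in sorted(SRC_DIR.iterdir()):
    if src.suffix.lower() not in {".tif", ".tiff", ".gif", ".bmp", ".png"}:
        continue
    with Image.open(src) as im:
        # Write a PNG with 72 dpi metadata so every tool sees uniform input.
        im.convert("RGB").save(DST_DIR / (src.stem + ".png"), dpi=(72, 72))
```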

Algorithms

The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:

  • Imago: Executed via the command line without installation. ./imago_console -dir /image/directory/path
  • MolVec: Executed as a JAR file. java -cp [dependencies] gov.nih.ncats.molvec.Main [input_dir] -outDir [output_dir]
  • OSRA: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling. osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]
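
Reproducing the speed measurements means running each tool over a whole dataset and timing the run. A minimal harness for the OSRA case, assuming OSRA 2.1.0 is on the PATH and using only the flags listed above (dictionary paths and directories are placeholders), might look like:

```python
# Timing harness sketch for the OSRA run; the flags mirror the command listed
# above, and all paths are placeholders rather than values from the paper.
import subprocess
import time
from pathlib import Path

IMAGE_DIR = Path("datasets/uspto_png")               # assumed input directory
OUT_DIR = Path("results/osra")
OUT_DIR.mkdir(parents=True, exist_ok=True)
SUPERATOM_DICT = "superatom.txt"                     # placeholder dictionaries
SPELLING_DICT = "spelling.txt"

images = sorted(IMAGE_DIR.glob("*.png"))
start = time.perf_counter()
for image in images:
    out_file = OUT_DIR / (image.stem + ".sdf")
    subprocess.run(
        ["osra", "-f", "sdf",
         "-a", SUPERATOM_DICT,
         "-l", SPELLING_DICT,
         "-w", str(out_file),
         str(image)],
        check=False,                                 # keep going if one image fails
    )
elapsed = time.perf_counter() - start
print(f"Processed {len(images)} images in {elapsed / 60:.1f} min")
```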

Models

The specific versions of the open-source software tested were:

| Tool   | Version | Technology       | License     |
|--------|---------|------------------|-------------|
| MolVec | 0.9.7   | Java, rule-based | Open source |
| Imago  | 2.0     | C++, rule-based  | Open source |
| OSRA   | 2.1.0   | C++, rule-based  | Open source |

Evaluation

  • Metric: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.
  • Environment: Linux workstation (Ubuntu 20.04 LTS).
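
The review does not say which toolkit performed the InChI conversion; a sketch of the perfect-match check using RDKit (output directory and reference mapping are placeholders) could look like this:

```python
# Perfect-match accuracy sketch: convert each predicted SDfile to a Standard
# InChI with RDKit (an assumption; the paper does not name the toolkit) and
# compare it to the ground-truth InChI. Any deviation counts as a failure.
from pathlib import Path
from rdkit import Chem

def first_mol_inchi(sdf_path: Path):
    """Return the Standard InChI of the first molecule in an SDfile, or None."""
    supplier = Chem.SDMolSupplier(str(sdf_path), sanitize=True)
    mol = next((m for m in supplier if m is not None), None)
    return Chem.MolToInchi(mol) if mol is not None else None

predicted_dir = Path("results/osra")     # OCSR output SDfiles (placeholder path)
reference = {}                           # placeholder: image stem -> ground-truth InChI

predicted_files = sorted(predicted_dir.glob("*.sdf"))
matches = 0
for sdf in predicted_files:
    inchi = first_mol_inchi(sdf)
    # Count only exact matches of valid, parseable structures.
    if inchi is not None and inchi == reference.get(sdf.stem):
        matches += 1

total = len(predicted_files)
print(f"Accuracy: {matches / total:.2%}" if total else "No predictions found")
```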

Hardware

The benchmark was performed on a high-end workstation to measure processing time.

  • CPUs: 2x Intel Xeon Silver 4114 (40 threads total).
  • RAM: 64 GB.
  • Parallelization: MolVec had pre-implemented parallelization features that contributed to its speed.

Citation

@article{rajanReviewOpticalChemical2020,
  title = {A Review of Optical Chemical Structure Recognition Tools},
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph},
  year = 2020,
  month = oct,
  journal = {Journal of Cheminformatics},
  volume = {12},
  number = {1},
  pages = {60},
  issn = {1758-2946},
  doi = {10.1186/s13321-020-00465-0}
}