Paper Information

Citation: Rajan, K., Brinkhaus, H. O., Zielesny, A., & Steinbeck, C. (2020). A review of optical chemical structure recognition tools. Journal of Cheminformatics, 12(1), 60. https://doi.org/10.1186/s13321-020-00465-0

Publication: Journal of Cheminformatics 2020

What kind of paper is this?

This is primarily a Systematization paper ($0.7 \Psi_{\text{Systematization}}$) with a significant Resource component ($0.3 \Psi_{\text{Resource}}$).

It serves as a Systematization because it organizes nearly three decades of research in Optical Chemical Structure Recognition (OCSR), categorizing methods into rule-based systems (e.g., Kekulé, CLiDE, OSRA) and emerging machine-learning approaches (e.g., MSE-DUDL, ChemGrapher). It synthesizes information on roughly 16 distinct tools, many of which are commercial or no longer available.

It acts as a Resource by defining a benchmark for the field. The authors evaluate the three available open-source tools (Imago, MolVec, OSRA) against four distinct datasets to establish baseline performance metrics for accuracy and speed.

What is the motivation?

A vast amount of chemical knowledge remains “hidden” in the primary scientific literature (print or PDF), where structures are conveyed as 2D images rather than in machine-readable formats. As a consequence, a “backlog of decades of chemical literature” cannot easily be indexed or searched in open-access databases.

While Chemical Named Entity Recognition (NER) exists for text, translating graphical depictions into formats like SMILES or SDfiles requires specialized OCSR tools. The motivation is to enable the automated curation of this legacy data to feed public databases.

What is the novelty here?

The primary novelty is the comprehensive aggregation of the field's history, which had not been thoroughly reviewed in recent years. The paper traces the algorithmic evolution from the first published system in 1990 to the deep-learning methods of 2019.

Specific contributions include:

  • Historical Taxonomy: Classification of tools into rule-based vs. machine-learning, and open-source vs. commercial/unavailable.
  • Open Source Benchmark: A comparative performance analysis of the only three open-source tools available at the time (Imago, MolVec, OSRA) on standardized datasets.
  • Algorithmic Breakdown: Detailed summaries of the workflows for closed-source or lost tools (e.g., Kekulé, OROCS, ChemReader) based on their original publications.

What experiments were performed?

The authors performed a benchmark study to evaluate the accuracy and speed of three open-source OCSR tools: MolVec (0.9.7), Imago (2.0), and OSRA (2.1.0).

They tested these tools on four datasets of varying quality and origin:

  1. USPTO: 5,719 images from US patents (high quality).
  2. UOB: 5,740 images from the University of Birmingham (synthetic/generated).
  3. CLEF 2012: 961 images from the CLEF-IP evaluation (well-segmented, clean).
  4. JPO: 450 images from Japanese patents (low quality, noise, Japanese characters).

Evaluation metrics were:

  • Accuracy: Percentage of perfectly recognized structures (verified by converting output to InChI strings and matching against reference InChIs).
  • Speed: Total processing time for the dataset.

What outcomes/conclusions?

Performance Results:

  • MolVec was the fastest tool, processing the datasets significantly faster than its competitors (e.g., 28 min for USPTO vs. 145 min for OSRA).
  • OSRA performed exceptionally well on clean, well-segmented data (94.90% on CLEF 2012) but was slower.
  • Imago generally lagged in accuracy compared to the other two, particularly on the UOB dataset (63.54% vs ~88% for others).
  • JPO Difficulty: All tools struggled with the noisy Japanese Patent Office dataset (accuracies ranged from 40% to 66%), highlighting issues with noise and non-standard labels.

General Conclusions:

  • No “gold standard” tool currently exists that solves all problems (page segmentation, R-groups, NLP integration).
  • Rule-based approaches dominate the history of the field, but deep-learning methods (MSE-DUDL, ChemGrapher) are emerging, though they were closed-source at the time of writing.
  • There is a critical need for tools that can handle full-page recognition (combining segmentation and recognition) rather than only single, pre-segmented molecule depictions.

Reproducibility Details

The authors provided sufficient detail to replicate the benchmarking study.

Data

The study used four public datasets. Images were converted to PNG (72 dpi) to ensure compatibility across all tools.

| Dataset   | Size  | Source                 | Characteristics                            |
|-----------|-------|------------------------|--------------------------------------------|
| USPTO     | 5,719 | OSRA validation set    | US patent images, generally clean          |
| UOB       | 5,740 | Univ. of Birmingham    | Synthetic/generated images                 |
| CLEF 2012 | 961   | CLEF-IP 2012           | Well-segmented, high quality               |
| JPO       | 450   | Japanese Patent Office | Low quality, noisy, contains Japanese text |
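
The review only states that images were converted to PNG at 72 dpi and does not name the conversion tool. A minimal sketch of one way to reproduce that normalization step, assuming Pillow and a hypothetical directory layout, could look like this:

```python
# Hypothetical normalization step: the review only states that images were
# converted to PNG at 72 dpi; Pillow is used here as one possible tool, and
# the directory names are placeholders.
from pathlib import Path
from PIL import Image

SRC_DIR = Path("datasets/jpo_raw")      # assumed location of the original images
DST_DIR = Path("datasets/jpo_png")
DST_DIR.mkdir(parents=True, exist_ok=True)

for src in sorted(SRC_DIR.iterdir()):
    if src.suffix.lower() not in {".tif", ".tiff", ".gif", ".bmp", ".png"}:
        continue
    with Image.open(src) as im:
        # Write a PNG with 72 dpi metadata so every tool sees uniform input.
        im.convert("RGB").save(DST_DIR / (src.stem + ".png"), dpi=(72, 72))
```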

Algorithms

The paper does not propose a new algorithm but benchmarks existing ones. The execution commands for reproducibility were:

  • Imago: Executed via the command line without installation. ./imago_console -dir /image/directory/path
  • MolVec: Executed as a JAR file. java -cp [dependencies] gov.nih.ncats.molvec.Main [input_dir] -outDir [output_dir]
  • OSRA: Installed via Conda (PyOSRA) due to compilation complexity. Required dictionaries for superatoms and spelling. osra -f sdf -a [superatom_dict] -l [spelling_dict] -w [output_file] [input_file]
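
Reproducing the speed measurements means running each tool over a whole dataset and timing the run. A minimal harness for the OSRA case, assuming OSRA 2.1.0 is on the PATH and using only the flags listed above (dictionary paths and directories are placeholders), might look like:

```python
# Timing harness sketch for the OSRA run; the flags mirror the command listed
# above, and all paths are placeholders rather than values from the paper.
import subprocess
import time
from pathlib import Path

IMAGE_DIR = Path("datasets/uspto_png")               # assumed input directory
OUT_DIR = Path("results/osra")
OUT_DIR.mkdir(parents=True, exist_ok=True)
SUPERATOM_DICT = "superatom.txt"                     # placeholder dictionaries
SPELLING_DICT = "spelling.txt"

images = sorted(IMAGE_DIR.glob("*.png"))
start = time.perf_counter()
for image in images:
    out_file = OUT_DIR / (image.stem + ".sdf")
    subprocess.run(
        ["osra", "-f", "sdf",
         "-a", SUPERATOM_DICT,
         "-l", SPELLING_DICT,
         "-w", str(out_file),
         str(image)],
        check=False,                                 # keep going if one image fails
    )
elapsed = time.perf_counter() - start
print(f"Processed {len(images)} images in {elapsed / 60:.1f} min")
```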

Models

The specific versions of the open-source software tested were:

| Tool   | Version | Technology       | License     |
|--------|---------|------------------|-------------|
| MolVec | 0.9.7   | Java, rule-based | Open source |
| Imago  | 2.0     | C++, rule-based  | Open source |
| OSRA   | 2.1.0   | C++, rule-based  | Open source |

Evaluation

  • Metric: Perfect structural match. The output SDfile/SMILES was converted to a Standard InChI string and compared to the ground truth InChI. Any deviation counted as a failure.
  • Environment: Linux workstation (Ubuntu 20.04 LTS).
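
The review does not say which toolkit performed the InChI conversion; a sketch of the perfect-match check using RDKit (output directory and reference mapping are placeholders) could look like this:

```python
# Perfect-match accuracy sketch: convert each predicted SDfile to a Standard
# InChI with RDKit (an assumption; the paper does not name the toolkit) and
# compare it to the ground-truth InChI. Any deviation counts as a failure.
from pathlib import Path
from rdkit import Chem

def first_mol_inchi(sdf_path: Path):
    """Return the Standard InChI of the first molecule in an SDfile, or None."""
    supplier = Chem.SDMolSupplier(str(sdf_path), sanitize=True)
    mol = next((m for m in supplier if m is not None), None)
    return Chem.MolToInchi(mol) if mol is not None else None

predicted_dir = Path("results/osra")     # OCSR output SDfiles (placeholder path)
reference = {}                           # placeholder: image stem -> ground-truth InChI

predicted_files = sorted(predicted_dir.glob("*.sdf"))
matches = 0
for sdf in predicted_files:
    inchi = first_mol_inchi(sdf)
    # Count only exact matches of valid, parseable structures.
    if inchi is not None and inchi == reference.get(sdf.stem):
        matches += 1

total = len(predicted_files)
print(f"Accuracy: {matches / total:.2%}" if total else "No predictions found")
```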

Hardware

The benchmark was performed on a high-end workstation to measure processing time.

  • CPUs: 2x Intel Xeon Silver 4114 (40 threads total).
  • RAM: 64 GB.
  • Parallelization: MolVec had pre-implemented parallelization features that contributed to its speed.

Citation

@article{rajanReviewOpticalChemical2020,
  title = {A Review of Optical Chemical Structure Recognition Tools},
  author = {Rajan, Kohulan and Brinkhaus, Henning Otto and Zielesny, Achim and Steinbeck, Christoph},
  year = 2020,
  month = oct,
  journal = {Journal of Cheminformatics},
  volume = {12},
  number = {1},
  pages = {60},
  issn = {1758-2946},
  doi = {10.1186/s13321-020-00465-0}
}