OSRA: Open Source Optical Structure Recognition

Paper Information

Citation: Filippov, I. V., & Nicklaus, M. C. (2009). Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. Journal of Chemical Information and Modeling, 49(3), 740-743. https://doi.org/10.1021/ci800067r

Publication: J. Chem. Inf. Model. 2009

Overview and Motivation

Resource

This paper is primarily an infrastructure contribution ($\Psi_{\text{Resource}}$). It contains significant algorithmic detail, but its rhetorical structure and primary goal mark it as an infrastructure paper: the dominant contribution is the creation, release, and documentation of a software tool (OSRA).

A vast amount of chemical knowledge is locked in scientific and patent documents as graphical images (Kekulé structures). This is the classic chemical informatics challenge: decades of chemical knowledge are trapped in visual form.

Legacy Data Gap: Scientific papers and patents have historically depicted molecules as 2D structural diagrams instead of computer-parsable formats, so automated processing of millions of documents is impossible without optical recognition.
Need for Automation: Manual transcription does not scale to the hundreds of thousands of documents available. Although modern standards such as InChI and CML exist, the vast majority of chemical literature remains inaccessible to computational analysis.
Open Source Gap: Before OSRA, only commercial software such as CLiDE existed for this task, which created a barrier for academic researchers and limited reproducibility. No universal or open-source solution was available prior to this work.

Core Innovations and Pipeline

OSRA is presented as the first open-source optical structure recognition (OSR) program; the task is also known as optical chemical structure recognition (OCSR). The novelty lies in an accessible system with a practical, multi-stage pipeline that combines classical image processing techniques with chemical knowledge.

Key contributions:

Integrated Pipeline: It uniquely combines existing open-source image processing tools (ImageMagick for formats, Potrace for vectorization, GOCR/OCRAD for text) into a chemical recognition workflow. The value is in the assembly and integration.
Vectorization-Based Approach: OSRA uses the Potrace library to convert bitmap images into vector graphics (Bezier curves), then analyzes the geometry of these curves to identify bonds and atoms. This is more robust than angle-based detection methods because it leverages continuous mathematical properties of curves.
Multi-Resolution Processing with Confidence Estimation: The system automatically processes each image at three different resolutions (72, 150, and 300 dpi), generating up to three candidate structures. A learned confidence function trained via linear regression on chemical features (heteroatom count, ring patterns, fragment count) selects the most chemically sensible result.
Resolution Independence: Unlike some predecessors, it is designed to handle over 90 image formats and works independently of specific resolutions or fonts.
Comprehensive Chemical Rules: OSRA implements sophisticated heuristics for chemical structure interpretation:
- Distinguishes bridge bond crossings from tetrahedral carbon centers using graph connectivity rules
- Recognizes stereochemistry from wedge bonds (detected via line thickness gradients)
- Handles old-style aromatic notation (circles inside rings)
- Expands common chemical abbreviations (superatoms like “COOH” or “CF₃”)
- Uses the 75th percentile of bond lengths as the reference length to avoid outlier bias

Methodology and Validation

The authors validated OSRA against both commercial software and manual curation:

Commercial Comparison: They compared OSRA against CLiDE (a commercial OSR tool) using a “small test set” of 11 files provided by Simbiosys containing 42 structures. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.
Internal Validation: They tested on an internal set of 66 images containing 215 structures, covering various resolutions, color depths, and drawing styles to assess performance at scale and characterize typical error patterns.
Metric Definition: They defined recognition success using both exact matches (“Perfect by InChI”) and Tanimoto similarity (using CACTVS fingerprints). The authors explicitly argued for using Tanimoto similarity as the primary evaluation metric, reasoning that partial recognition (e.g., missing a methyl group) still provides useful chemical information, which binary “correct/incorrect” judgments fail to capture.

Results and Conclusions

Competitive Accuracy: On the small comparative set, OSRA recognized 26 structures perfectly (by InChI) versus CLiDE’s 11, demonstrating that an open-source, rule-based approach could outperform established commercial systems.
Robustness: On the internal diverse set (215 structures), OSRA achieved a 93% average Tanimoto similarity and perfectly recognized 107 structures (50%). Tanimoto similarity above 85% was achieved for 182 structures (85%). This established OSRA as a competitive tool for practical use.
Multi-Resolution Success: The multi-resolution strategy allowed OSRA to handle images with varying quality and formats. The confidence function (with correlation coefficient $r=0.89$) successfully identified which resolution produced the most chemically plausible structure.
Limitations: The authors acknowledge issues with:
- “Imperfect segmentation” leading to missed structures (3 missed in internal set) and false positives (7 in internal set)
- Novel drawing conventions not covered by the implemented heuristics
- Highly degraded or noisy images where vectorization fails
- Hand-drawn structures that deviate significantly from standard chemical drawing practices
- Complex reaction schemes with multiple molecules and arrows
Open-Source Impact: By releasing OSRA as open-source software, the authors enabled widespread adoption and community contribution. This established a foundation for future OCSR research and made the technology accessible to researchers without commercial software budgets.

The work established that rule-based OCSR systems could achieve competitive performance when carefully engineered with chemical knowledge. OSRA became a standard baseline for the field and remained the dominant open-source solution until the rise of deep learning methods over a decade later. The vectorization-based approach and the emphasis on Tanimoto similarity as an evaluation metric influenced subsequent work in the area.

Technical Details

Grayscale Conversion

OSRA uses a non-standard grayscale conversion to preserve light-colored atoms (e.g., yellow sulfur):

$$\text{Gray} = \min(R, G, B)$$

This prevents light colors from being washed out during binarization, unlike the standard weighted formula ($0.3R + 0.59G + 0.11B$).
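The difference between the two conversions can be seen on a single pixel. The sketch below is illustrative only: the function names are made up for this example, and pure yellow `(255, 255, 0)` is an assumed stand-in for a light-colored atom label.

```python
def min_channel_gray(r, g, b):
    """Gray = min(R, G, B), the conversion described above."""
    return min(r, g, b)


def weighted_gray(r, g, b):
    """Standard weighted formula quoted above, for comparison."""
    return 0.3 * r + 0.59 * g + 0.11 * b


# Assumed example pixel: pure yellow on an 8-bit scale (0 = black, 255 = white).
yellow = (255, 255, 0)

print(min_channel_gray(*yellow))      # 0, i.e. as dark as black ink
print(round(weighted_gray(*yellow)))  # roughly 227, i.e. close to white
```

With the minimum-channel rule the yellow pixel ends up as dark as black ink, so a later threshold keeps it; with the weighted formula it lands near white and would likely be discarded.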

Image Segmentation

Chemical structures are identified within a page using specific bounding box criteria:

Black pixel density: Must be between 0.0 and 0.2
Aspect ratio: Height-to-width ratio must be between 0.2 and 5.0
Minimum size: Width and height must be >50 pixels at resolutions >150 dpi
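The three criteria above can be combined into a single filter. This is a minimal sketch and not OSRA's actual code: the function name and arguments are hypothetical, the treatment of boundary values is an assumption, and because the summary does not state a size rule for 150 dpi and below, none is applied there.

```python
def looks_like_structure(width, height, black_pixels, dpi):
    """Check a candidate bounding box against the criteria listed above."""
    if width <= 0 or height <= 0:
        return False

    # Black pixel density must lie between 0.0 and 0.2 (bounds assumed inclusive).
    density = black_pixels / (width * height)
    if not 0.0 <= density <= 0.2:
        return False

    # Height-to-width ratio must lie between 0.2 and 5.0 (bounds assumed inclusive).
    aspect = height / width
    if not 0.2 <= aspect <= 5.0:
        return False

    # Minimum size applies only at resolutions above 150 dpi.
    if dpi > 150 and (width <= 50 or height <= 50):
        return False

    return True


print(looks_like_structure(200, 150, 3000, 300))  # density 0.1, aspect 0.75
print(looks_like_structure(400, 20, 800, 300))    # aspect 0.05, too flat
```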

Noise Detection and Smoothing

A “noise factor” determines whether anisotropic smoothing is applied:

$$\text{Noise Factor} = \frac{\text{Count of 2-pixel line segments}}{\text{Count of 3-pixel line segments}}$$

Smoothing is applied only if this ratio is between 0.5 and 1.0.
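As a rough illustration of the decision rule, the sketch below takes a list of line-segment lengths in pixels and applies the ratio test. How OSRA extracts those segments from the image is not described here, so the input format is an assumption, as are the function names and the inclusive bounds.

```python
def noise_factor(segment_lengths):
    """Ratio of 2-pixel segments to 3-pixel segments, or None if undefined."""
    two = sum(1 for n in segment_lengths if n == 2)
    three = sum(1 for n in segment_lengths if n == 3)
    if three == 0:
        return None  # ratio undefined; handling this case is an assumption
    return two / three


def needs_smoothing(segment_lengths):
    """Apply smoothing only when the noise factor is between 0.5 and 1.0."""
    factor = noise_factor(segment_lengths)
    return factor is not None and 0.5 <= factor <= 1.0


print(needs_smoothing([2, 2, 3, 3, 3, 5]))  # factor 2/3, inside the window
print(needs_smoothing([2, 2, 2, 3]))        # factor 3.0, outside the window
```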

Atom Detection from Bezier Curves

Potrace Bezier control points are flagged as potential atoms if:

The point is classified as a “corner” by Potrace
The vector direction change has a normal component of at least 2 pixels

The normal component criterion is more robust than angle-based detection because angles are difficult to measure accurately in pixelated environments where line thickness is non-zero.
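One possible reading of the normal-component test is sketched below. It is an interpretation, not OSRA's documented computation: given three consecutive control points, it measures how far the outgoing vector deviates perpendicular to the incoming direction. The function names and the `is_corner` flag (standing in for Potrace's corner classification) are hypothetical.

```python
import math


def normal_component(p0, p1, p2):
    """Component of (p2 - p1) perpendicular to the direction of (p1 - p0), in pixels."""
    ax, ay = p1[0] - p0[0], p1[1] - p0[1]
    bx, by = p2[0] - p1[0], p2[1] - p1[1]
    length_a = math.hypot(ax, ay)
    if length_a == 0:
        return 0.0
    # |a x b| / |a| is the length of b's projection onto the normal of a.
    return abs(ax * by - ay * bx) / length_a


def is_atom_candidate(p0, p1, p2, is_corner, min_normal=2.0):
    """Flag p1 when it is a corner and the direction change is large enough."""
    return is_corner and normal_component(p0, p1, p2) >= min_normal


print(is_atom_candidate((0, 0), (10, 0), (15, 5), is_corner=True))  # clear bend
print(is_atom_candidate((0, 0), (10, 0), (20, 1), is_corner=True))  # nearly straight
```

Working in pixel offsets avoids computing an angle from short, thick strokes, which is the robustness argument made above.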

Bond Length Estimation

The reference bond length is computed as the 75th percentile of all detected bond lengths. This avoids bias from outlier bonds (e.g., extremely short or long bonds from recognition errors).
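A small sketch of the percentile idea follows. The nearest-rank definition of the 75th percentile is an assumption (the summary does not say which percentile convention OSRA uses), and the bond lengths are made-up values.

```python
import math


def reference_bond_length(lengths):
    """75th percentile of the bond lengths, using the nearest-rank method."""
    if not lengths:
        raise ValueError("no bonds detected")
    ordered = sorted(lengths)
    rank = math.ceil(0.75 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


# Hypothetical bond lengths in pixels, with one spurious short and one long bond.
lengths = [10, 10, 11, 12, 2, 40, 11, 10]

print(reference_bond_length(lengths))  # 11
print(sum(lengths) / len(lengths))     # 13.25, the mean is pulled up by the 40
```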

Confidence Function

A linear regression function selects the best result from the multi-scale processing:

$$\text{Confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + \ldots + 0.330N_{\text{rings5}} + \ldots$$

where $N_C$, $N_N$, and $N_O$ are the counts of carbon, nitrogen, and oxygen atoms, respectively. The function favors structures with more recognized heteroatoms and rings and penalizes fragment counts. Additional terms account for ring patterns.

Test Sets:

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Comparison | “Small test set” (Simbiosys) | 11 files (42 structures) | Used to compare vs. CLiDE |
| Validation | Internal Test Set | 66 images (215 structures) | Various resolutions, color depths, styles |

Evaluation

Metrics used to define “Success”:

Metric	Definition
Perfect by InChI	Exact match of the InChI string to the human-curated structure.
Average Tanimoto	Tanimoto similarity (CACTVS fingerprints) between OSRA output and ground truth.
uuuuu	NCI CADD identifier match (topology only; ignores stereochem/charge/tautomers).
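For readers unfamiliar with the Tanimoto metric, the sketch below computes it on two fingerprints represented as sets of "on" bit indices. CACTVS fingerprints are not reproduced here; the bit sets are invented for illustration, and the function name is hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits divided by all on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical (assumption)
    return len(a & b) / len(union)


# Hypothetical fingerprints that differ in one bit each.
reference = {1, 2, 3, 4}
recognized = {1, 2, 3, 5}

print(tanimoto(reference, recognized))  # 0.6
print(tanimoto(reference, reference))   # 1.0
```

A partially correct structure therefore earns a score between 0 and 1, whereas an exact-match criterion such as "Perfect by InChI" would count it simply as a failure.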

Results Table (Comparison):

Tool	Perfect (InChI)	T > 85%	uuuuu Match
OSRA	26 / 42	39 / 42	28 / 42
CLiDE	11 / 42	26 / 42	12 / 42

Software/Dependencies

The system relies on external libraries:

ImageMagick: Image format parsing (supports 90+ formats)
Ghostscript: PDF/PS interpretation
Potrace: Vectorization (converts bitmap to Bezier curves)
GOCR / OCRAD: Optical Character Recognition (heteroatom label recognition)
OpenBabel / RDKit: Chemical backends for connection table compilation
Output Formats: SMILES strings and SD files

Citation

@article{filippovOpticalStructureRecognition2009,
  title = {Optical {{Structure Recognition Software To Recover Chemical Information}}: {{OSRA}}, {{An Open Source Solution}}},
  shorttitle = {Optical {{Structure Recognition Software To Recover Chemical Information}}},
  author = {Filippov, Igor V. and Nicklaus, Marc C.},
  year = {2009},
  month = mar,
  journal = {Journal of Chemical Information and Modeling},
  volume = {49},
  number = {3},
  pages = {740--743},
  doi = {10.1021/ci800067r}
}

The confidence function is a linear regression model trained on chemical features:

$$\text{Confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + 0.036N_F + \ldots$$

where $N_C$, $N_N$, $N_O$, $N_F$ represent counts of carbon, nitrogen, oxygen, and fluorine atoms, respectively. Additional terms account for ring counts and fragment counts. The model achieves a correlation coefficient of $r=0.89$.

This function scores the three resolution candidates (72, 150, and 300 dpi), and the highest-scoring structure is selected as the final output.
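The selection step can be sketched as follows. Only the intercept and the four atom-count coefficients shown in the formula are used; the elided terms (ring and fragment counts) are left out, so the score is a partial one and not the published model. The candidate atom counts and all names are invented for illustration.

```python
# Coefficients visible in the formula above; the elided terms are NOT included.
INTERCEPT = 0.316
PARTIAL_COEFFS = {"C": -0.016, "N": 0.034, "O": 0.067, "F": 0.036}


def partial_confidence(atom_counts):
    """Score a candidate using only the terms shown in the formula."""
    return INTERCEPT + sum(
        coeff * atom_counts.get(element, 0)
        for element, coeff in PARTIAL_COEFFS.items()
    )


def pick_best(candidates):
    """Return the resolution (dpi) whose candidate has the highest score."""
    return max(candidates, key=lambda dpi: partial_confidence(candidates[dpi]))


# Hypothetical atom counts recognized at each of the three resolutions.
candidates = {
    72: {"C": 10},
    150: {"C": 10, "N": 1, "O": 2},
    300: {"C": 10, "O": 1},
}

print(pick_best(candidates))  # 150
```

In this made-up case the 150 dpi candidate wins because it recovered the most heteroatoms, which matches the stated preference of the confidence function.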

Data

Test Sets:

CLiDE Comparison: 42 structures from 11 files (Simbiosys small test set)
Internal Validation: 215 structures

Evaluation Metrics:

Exact match accuracy (binary correct/incorrect)
Tanimoto similarity using molecular fingerprints (preferred metric for partial recognition credit)

Models

Pipeline Components:

Image Preprocessing: ImageMagick (supports 90+ formats)
Vectorization: Potrace library (converts bitmap to Bezier curves)
OCR: GOCR and OCRAD (heteroatom label recognition)
Output Formats: SMILES strings and SD files
