Paper Information
Citation: Filippov, I. V., & Nicklaus, M. C. (2009). Optical Structure Recognition Software To Recover Chemical Information: OSRA, An Open Source Solution. Journal of Chemical Information and Modeling, 49(3), 740-743. https://doi.org/10.1021/ci800067r
Publication: J. Chem. Inf. Model. 2009
What kind of paper is this?
Resource
This paper is a quintessential Infrastructure contribution ($\Psi_{\text{Resource}}$). While it contains significant algorithmic detail, the rhetorical structure and primary goal place it squarely as an infrastructure paper. The dominant contribution is the creation, release, and documentation of a software tool (OSRA) rather than a novel theoretical derivation or scientific discovery.
A vast amount of chemical knowledge is locked in scientific and patent documents as graphical images (Kekulé structures) rather than machine-readable formats like InChI or SMILES. This is the classic chemical informatics challenge: decades of chemical knowledge are trapped in visual form.
- Legacy Data Gap: Historical literature does not use computer-parsable formats, making automated processing of millions of documents impossible without optical recognition. Scientific papers and patents have historically depicted molecules as 2D structural diagrams rather than machine-readable representations.
- Need for Automation: Manual transcription is not scalable for the hundreds of thousands of documents available. While modern standards like InChI and CML exist, the vast majority of chemical literature remains inaccessible for computational analysis.
- Open Source Gap: Before OSRA, only commercial software like CLiDE existed for this task, creating a barrier for academic researchers and limiting reproducibility. No universal or open-source solution was available prior to this work.
OSRA is claimed to be the first open-source optical structure recognition (OSR) program. The novelty lies in creating an accessible OCSR system with a practical, multi-stage pipeline that combines classical image processing techniques with chemical knowledge.
Key contributions:
Integrated Pipeline: It uniquely combines existing open-source image processing tools (ImageMagick for formats, Potrace for vectorization, GOCR/OCRAD for text) into a chemical recognition workflow. The value is in the assembly and integration rather than novel algorithm invention.
Vectorization-Based Approach: OSRA uses the Potrace library to convert bitmap images into vector graphics (Bezier curves), then analyzes the geometry of these curves to identify bonds and atoms. This is more robust than angle-based detection methods because it leverages continuous mathematical properties of curves rather than discrete pixel patterns.
Multi-Resolution Processing with Confidence Estimation: The system automatically processes each image at three different resolutions (72, 150, and 300 dpi), generating up to three candidate structures. A learned confidence function trained via linear regression on chemical features (heteroatom count, ring patterns, fragment count) selects the most chemically sensible result.
Resolution Independence: Unlike some predecessors, it is designed to handle over 90 image formats and works independently of specific resolutions or fonts.
Comprehensive Chemical Rules: OSRA implements sophisticated heuristics for chemical structure interpretation:
- Distinguishes bridge bond crossings from tetrahedral carbon centers using graph connectivity rules
- Recognizes stereochemistry from wedge bonds (detected via line thickness gradients)
- Handles old-style aromatic notation (circles inside rings)
- Expands common chemical abbreviations (superatoms like “COOH” or “CF₃”)
- Uses the 75th percentile of bond lengths as the reference length to avoid outlier bias
The authors validated OSRA against both commercial software and manual curation:
Commercial Comparison: They compared OSRA against CLiDE (a commercial OSR tool) using a “small test set” of 11 files provided by Simbiosys containing 42 structures. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.
Internal Validation: They tested on an internal set of 66 images containing 215 structures, covering various resolutions, color depths, and drawing styles to assess performance at scale and characterize typical error patterns.
Metric Definition: They defined recognition success using both exact matches (“Perfect by InChI”) and Tanimoto similarity (using CACTVS fingerprints). The authors explicitly argued for using Tanimoto similarity as the primary evaluation metric, reasoning that partial recognition (e.g., missing a methyl group) still provides useful chemical information, which binary “correct/incorrect” judgments fail to capture.
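The argument for partial credit can be illustrated with a minimal Tanimoto sketch. It assumes fingerprints are represented as sets of "on" bit indices; the CACTVS fingerprints used in the paper are not reproduced here, and the bit sets below are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        # Convention chosen for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical bit sets: the recognized structure misses one feature of the reference.
reference_bits = {1, 4, 7, 9, 12}
recognized_bits = {1, 4, 7, 9}

similarity = tanimoto(reference_bits, recognized_bits)
exact_match = reference_bits == recognized_bits

print(similarity)   # 4 shared bits out of 5 in the union
print(exact_match)  # a binary judgment would score this as a plain failure
```

A binary metric reports only the failure, while the similarity value records that most of the structure was recovered.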
The evaluation focused on demonstrating that OSRA could match or exceed commercial alternatives while being fully open-source:
Benchmark Against CLiDE: OSRA was compared to CLiDE, the dominant commercial OCSR tool at the time, on a test set of 42 structures from 11 files selected from the small test set provided by Simbiosys. Performance was measured both as exact match accuracy and as Tanimoto similarity using molecular fingerprints.
Large-Scale Internal Validation: The system was evaluated on an internal dataset of 215 structures to assess performance at scale and characterize typical error patterns.
What were the outcomes and conclusions drawn?
Competitive Accuracy: On the small comparative set, OSRA recognized 26 structures perfectly (by InChI) versus CLiDE’s 11, demonstrating that an open-source, rule-based approach could outperform established commercial systems.
Robustness: On the internal diverse set (215 structures), OSRA achieved a 93% average Tanimoto similarity and perfectly recognized 107 structures (50%). Tanimoto similarity above 85% was achieved for 182 structures (85%). This established OSRA as a competitive tool for practical use.
Multi-Resolution Success: The multi-resolution strategy allowed OSRA to handle images with varying quality and formats. The confidence function (with correlation coefficient $r=0.89$) successfully identified which resolution produced the most chemically plausible structure.
Limitations: The authors acknowledge issues with:
- “Imperfect segmentation” leading to missed structures (3 missed in internal set) and false positives (7 in internal set)
- Novel drawing conventions not covered by the implemented heuristics
- Highly degraded or noisy images where vectorization fails
- Hand-drawn structures that deviate significantly from standard chemical drawing practices
- Complex reaction schemes with multiple molecules and arrows
Impact: The tool provides a scalable way to index chemical databases from images, supported by a released command-line utility and web interface.
Open-Source Impact: By releasing OSRA as open-source software, the authors enabled widespread adoption and community contribution. This established a foundation for future OCSR research and made the technology accessible to researchers without commercial software budgets.
The work established that rule-based OCSR systems could achieve competitive performance when carefully engineered with chemical knowledge. OSRA became a standard baseline for the field and remained the dominant open-source solution until the rise of deep learning methods over a decade later. The vectorization-based approach and the emphasis on Tanimoto similarity as an evaluation metric influenced subsequent work in the area.
Grayscale Conversion
OSRA uses a non-standard grayscale conversion to preserve light-colored atoms (e.g., yellow sulfur):
$$\text{Gray} = \min(R, G, B)$$
This prevents light colors from being washed out during binarization, unlike the standard weighted formula ($0.3R + 0.59G + 0.11B$).
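A small sketch of the difference between the two conversions, assuming 8-bit RGB pixels. The function names and the threshold of 128 are illustrative choices, not values from the paper.

```python
def min_channel_gray(pixel):
    """Gray value as the minimum of the R, G, B channels (each 0-255)."""
    r, g, b = pixel
    return min(r, g, b)


def weighted_gray(pixel):
    """Standard weighted conversion, shown for comparison."""
    r, g, b = pixel
    return 0.3 * r + 0.59 * g + 0.11 * b


def is_ink(gray, threshold=128):
    """Hypothetical binarization: anything darker than the threshold counts as ink."""
    return gray < threshold


yellow = (255, 255, 0)  # e.g. a yellow sulfur label on a white background

print(min_channel_gray(yellow), is_ink(min_channel_gray(yellow)))  # 0 True
print(weighted_gray(yellow), is_ink(weighted_gray(yellow)))        # roughly 227, False
```

With the minimum-channel rule the yellow pixel is treated as dark and survives binarization; with the weighted formula it is nearly as bright as the background.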
Image Segmentation
Chemical structures are identified within a page using specific bounding box criteria:
- Black pixel density: Must be between 0.0 and 0.2
- Aspect ratio: Height-to-width ratio must be between 0.2 and 5.0
- Minimum size: Width and height must be >50 pixels at resolutions >150 dpi
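The criteria above can be sketched as a single filter over candidate bounding boxes. This is an illustration only: the function name, the argument layout, and the treatment of the exact boundary values are assumptions, and the text does not say what size rule applies at 150 dpi or below.

```python
def looks_like_structure(width, height, black_pixels, dpi):
    """Apply the listed bounding-box criteria to one candidate region.

    width, height: box size in pixels; black_pixels: count of black pixels
    inside the box; dpi: resolution of the page image.
    """
    if width <= 0 or height <= 0:
        return False

    density = black_pixels / (width * height)
    aspect = height / width

    # Black pixel density between 0.0 and 0.2 (open interval assumed here).
    if not (0.0 < density < 0.2):
        return False
    # Height-to-width ratio between 0.2 and 5.0 (inclusive bounds assumed here).
    if not (0.2 <= aspect <= 5.0):
        return False
    # Size rule as stated: more than 50 px in each dimension above 150 dpi.
    # No size check is applied at lower resolutions, since none is given.
    if dpi > 150 and (width <= 50 or height <= 50):
        return False
    return True


print(looks_like_structure(200, 100, 1000, 300))   # sparse, well-proportioned box
print(looks_like_structure(200, 100, 10000, 300))  # too dense, likely text or a figure
print(looks_like_structure(40, 100, 200, 300))     # too narrow at this resolution
```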
Noise Detection and Smoothing
A “noise factor” determines whether anisotropic smoothing is applied:
$$\text{Noise Factor} = \frac{\text{Count of 2-pixel line segments}}{\text{Count of 3-pixel line segments}}$$
Smoothing is applied only if this ratio is between 0.5 and 1.0.
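One way to read this rule is sketched below. It assumes a binarized image stored as rows of 0/1 values and interprets "line segments" as horizontal runs of consecutive black pixels; that interpretation, the inclusive bounds, and the function names are assumptions rather than details from the paper.

```python
from itertools import groupby


def black_run_lengths(row):
    """Lengths of consecutive runs of black (1) pixels in one image row."""
    return [len(list(group)) for value, group in groupby(row) if value == 1]


def noise_factor(bitmap):
    """Ratio of 2-pixel runs to 3-pixel runs, or None if there are no 3-pixel runs."""
    twos = threes = 0
    for row in bitmap:
        for length in black_run_lengths(row):
            if length == 2:
                twos += 1
            elif length == 3:
                threes += 1
    if threes == 0:
        return None
    return twos / threes


def needs_smoothing(bitmap):
    """Smoothing is applied only when the noise factor lies between 0.5 and 1.0."""
    factor = noise_factor(bitmap)
    return factor is not None and 0.5 <= factor <= 1.0


bitmap = [
    [1, 1, 0, 1, 1, 1, 0, 1, 1],  # runs of length 2, 3, 2
    [1, 1, 1, 0, 1, 1, 1, 0, 0],  # runs of length 3, 3
]
print(noise_factor(bitmap))     # two 2-pixel runs over three 3-pixel runs
print(needs_smoothing(bitmap))
```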
Atom Detection from Bezier Curves
Potrace Bezier control points are flagged as potential atoms if:
- The point is classified as a “corner” by Potrace
- The vector direction change has a normal component of at least 2 pixels
The normal component criterion is more robust than angle-based detection because angles are difficult to measure accurately in pixelated environments where line thickness is non-zero.
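A geometric sketch of the normal-component test, under one plausible reading: the incoming direction is taken from the previous control point to the current one, and the normal component is the perpendicular offset of the next control point from that direction. The function names and the three-point formulation are assumptions, and the corner flag is simply passed in rather than obtained from Potrace.

```python
import math


def normal_component(prev_pt, pt, next_pt):
    """Perpendicular offset (in pixels) of the outgoing vector pt->next_pt
    relative to the incoming direction prev_pt->pt."""
    in_x, in_y = pt[0] - prev_pt[0], pt[1] - prev_pt[1]
    in_len = math.hypot(in_x, in_y)
    if in_len == 0:
        return 0.0
    out_x, out_y = next_pt[0] - pt[0], next_pt[1] - pt[1]
    # |cross product| / |incoming| is the component of the outgoing vector
    # that is normal to the incoming direction.
    return abs(in_x * out_y - in_y * out_x) / in_len


def is_atom_candidate(prev_pt, pt, next_pt, is_corner, min_normal=2.0):
    """Flag a control point if it is a corner and the direction change
    has a normal component of at least min_normal pixels."""
    return is_corner and normal_component(prev_pt, pt, next_pt) >= min_normal


# A clear bend: the path turns 5 px away from its previous direction.
print(is_atom_candidate((0, 0), (10, 0), (15, 5), is_corner=True))
# A 1 px wobble along a nearly straight line is not flagged.
print(is_atom_candidate((0, 0), (10, 0), (20, 1), is_corner=True))
```

Working with a pixel offset instead of an angle means a one-pixel jitter on a long, thick line falls below the threshold regardless of how the local angle happens to be estimated.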
Bond Length Estimation
The reference bond length is computed as the 75th percentile of all detected bond lengths. This avoids bias from outlier bonds (e.g., extremely short or long bonds from recognition errors).
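A brief sketch of why a high percentile is less sensitive to outliers than the mean. The bond lengths are made-up values, and the percentile interpolation method (Python's default in `statistics.quantiles`) is an assumption, since the text does not specify one.

```python
import statistics


def reference_bond_length(bond_lengths):
    """75th percentile of the detected bond lengths (needs at least two values)."""
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the 75th percentile.
    return statistics.quantiles(bond_lengths, n=4)[2]


# Hypothetical detected lengths in pixels: mostly ~30 px, plus one spuriously
# short fragment and one spuriously long line.
lengths = [5, 28, 29, 30, 30, 31, 32, 90]

print(reference_bond_length(lengths))  # stays close to the typical ~30 px bond
print(statistics.mean(lengths))        # pulled upward by the 90 px outlier
```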
Confidence Function
A linear regression function selects the best result from the multi-scale processing:
$$\text{Confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + \ldots + 0.330N_{\text{rings5}} + \ldots$$
where $N_C$, $N_N$, and $N_O$ are the counts of carbon, nitrogen, and oxygen atoms, respectively. The function favors structures with more recognized heteroatoms and rings, while penalizing fragment counts. Additional terms account for ring patterns, such as the $N_{\text{rings5}}$ term shown.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Comparison | “Small test set” (Simbiosys) | 11 files (42 structures) | Used to compare vs. CLiDE |
| Validation | Internal test set | 66 images (215 structures) | Various resolutions, color depths, styles |
Evaluation
Metrics used to define “Success”:
| Metric | Definition |
|---|---|
| Perfect by InChI | Exact match of the InChI string to the human-curated structure. |
| Average Tanimoto | Tanimoto similarity (CACTVS fingerprints) between OSRA output and ground truth. |
| uuuuu | NCI CADD identifier match (topology only; ignores stereochem/charge/tautomers). |
Results Table (Comparison):
| Tool | Perfect (InChI) | T > 85% | uuuuu Match |
|---|---|---|---|
| OSRA | 26 / 42 | 39 / 42 | 28 / 42 |
| CLiDE | 11 / 42 | 26 / 42 | 12 / 42 |
Software/Dependencies
The system relies on external libraries:
- ImageMagick: Image format parsing (supports 90+ formats)
- Ghostscript: PDF/PS interpretation
- Potrace: Vectorization (converts bitmap to Bezier curves)
- GOCR / OCRAD: Optical Character Recognition (heteroatom label recognition)
- OpenBabel / RDKit: Chemical backends for connection table compilation
- Output Formats: SMILES strings and SD files
Citation
@article{filippovOpticalStructureRecognition2009,
title = {Optical {{Structure Recognition Software To Recover Chemical Information}}: {{OSRA}}, {{An Open Source Solution}}},
shorttitle = {Optical {{Structure Recognition Software To Recover Chemical Information}}},
author = {Filippov, Igor V. and Nicklaus, Marc C.},
year = {2009},
month = mar,
journal = {Journal of Chemical Information and Modeling},
volume = {49},
number = {3},
pages = {740--743},
doi = {10.1021/ci800067r}
}
The confidence function is a linear regression model trained on chemical features:
$$\text{Confidence} = 0.316 - 0.016N_C + 0.034N_N + 0.067N_O + 0.036N_F + \ldots$$
where $N_C$, $N_N$, $N_O$, $N_F$ represent counts of carbon, nitrogen, oxygen, and fluorine atoms, respectively. Additional terms account for ring counts and fragment counts. The model achieves a correlation coefficient of $r=0.89$.
This function scores the three resolution candidates (72, 150, and 300 dpi), and the highest-scoring structure is selected as the final output.
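The selection step can be sketched as follows. Only the intercept and the four atom-count coefficients quoted above are used; the remaining terms are elided in the text and are deliberately left out, so the scores here are partial and illustrative. The candidate atom counts and the function names are hypothetical.

```python
# Coefficients quoted in the text; the ring and fragment terms are elided
# there and therefore omitted from this partial score.
INTERCEPT = 0.316
ATOM_COEFFS = {"C": -0.016, "N": 0.034, "O": 0.067, "F": 0.036}


def partial_confidence(atom_counts):
    """Partial confidence score from atom counts only (dict of element -> count)."""
    return INTERCEPT + sum(
        coeff * atom_counts.get(element, 0) for element, coeff in ATOM_COEFFS.items()
    )


def pick_best_resolution(candidates):
    """Return the dpi whose candidate structure has the highest partial score."""
    return max(candidates, key=lambda dpi: partial_confidence(candidates[dpi]))


# Hypothetical atom counts for the structures recovered at each resolution.
candidates = {
    72: {"C": 10},                   # heteroatom labels misread as carbon
    150: {"C": 8, "N": 1, "O": 2},
    300: {"C": 8, "N": 1, "O": 1},
}

for dpi, counts in candidates.items():
    print(dpi, round(partial_confidence(counts), 3))
print(pick_best_resolution(candidates))
```

In this example the 150 dpi candidate wins because it retains the most recognized heteroatoms, which matches the stated intent of the function.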
Data
Test Sets:
- CLiDE Comparison: 42 structures from 11 files (Simbiosys small test set)
- Internal Validation: 215 structures
Evaluation Metrics:
- Exact match accuracy (binary correct/incorrect)
- Tanimoto similarity using molecular fingerprints (preferred metric for partial recognition credit)
Models
Pipeline Components:
- Image Preprocessing: ImageMagick (supports 90+ formats)
- Vectorization: Potrace library (converts bitmap to Bezier curves)
- OCR: GOCR and OCRAD (heteroatom label recognition)
- Output Formats: SMILES strings and SD files
