Paper Information

Citation: McDaniel, J. R., & Balmuth, J. R. (1992). Kekulé: OCR-Optical Chemical (Structure) Recognition. Journal of Chemical Information and Computer Sciences, 32(4), 373-378. https://doi.org/10.1021/ci00008a018

Publication: Journal of Chemical Information and Computer Sciences, 1992

What kind of paper is this?

This is a Methodological Paper ($\Psi_{\text{Method}}$). It proposes a novel software architecture (“Kekulé”) designed to solve a specific technical problem: the automatic conversion of printed chemical structure diagrams into computer-readable connection tables. The paper focuses on the “how” of the system, detailing the seven-step pipeline from scanning to graph compilation, and validates the method through performance testing on a specific dataset.

What is the motivation?

The primary motivation is to bridge the gap between how chemists communicate (structural diagrams) and how chemical databases store information (connection tables like MOLfiles).

  • Inefficiency of Manual Entry: Manual compilation of structural descriptions is “tedious and highly prone to error”.
  • Redrawing Costs: Even using drawing programs (such as ChemDraw’s predecessors) to capture connectivity is inefficient; redrawing a complex molecule like vitamin $B_{12}$ takes about 20 minutes.
  • Lack of Existing Solutions: Existing OCR systems at the time failed on chemical diagrams because they could not handle the mix of graphics (bonds) and text (atom labels), and struggled with small, mixed fonts.

What is the novelty here?

Kekulé represents the first “successful attempt to integrate all of the required elements” of image processing, OCR, structure editing, and database communication into a complete system.

  • Hybrid OCR Approach: Unlike commercial OCR of the time, it used a custom implementation combining rotation correction (for skew) with a multilayer perceptron neural network trained specifically on small fonts (down to 3.2 points).
  • Heuristic Feature Extraction: The authors developed specific heuristics to handle chemical artifacts, such as an “exhaustive search” for dashed lines (rejecting Hough transforms as unreliable for short segments).
  • Contextual “Spell Checking”: The system uses chemical context to verify OCR results, such as checking atom symbols against a list of valid symbols and using bond connections to disambiguate characters (e.g., telling the letter ‘l’ apart from a bond line).
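
The exhaustive dashed-line search named in the list above can be illustrated with a small sketch. The paper's summary here gives no algorithmic detail, so everything below is a hypothetical reconstruction: the segment representation, the thresholds `max_len`, `max_gap`, `max_angle_deg`, and `min_dashes`, and the function names are all assumptions. The sketch tests every pair of short segments for collinearity and reports chains of at least three dashes. It assumes the dashes run along the line direction, so it would not catch hashed wedges whose strokes are perpendicular to the bond.

```python
import math

def _length(seg):
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)

def _midpoint(seg):
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def _undirected_angle(dx1, dy1, dx2, dy2):
    """Angle in degrees (0..90) between two directions, ignoring orientation."""
    a = math.degrees(math.atan2(dy1, dx1) - math.atan2(dy2, dx2)) % 180.0
    return min(a, 180.0 - a)

def _collinear(a, b, max_gap, max_angle_deg):
    """True if two short segments are close and both point along the line joining them."""
    (ax, ay), (bx, by) = _midpoint(a), _midpoint(b)
    gap = math.hypot(bx - ax, by - ay)
    if gap == 0 or gap > max_gap:
        return False
    for (x1, y1), (x2, y2) in (a, b):
        if _undirected_angle(x2 - x1, y2 - y1, bx - ax, by - ay) > max_angle_deg:
            return False
    return True

def find_dashed_lines(segments, max_len=6.0, max_gap=10.0,
                      max_angle_deg=15.0, min_dashes=3):
    """Exhaustively test all pairs of short segments and group collinear chains."""
    short = [s for s in segments if _length(s) <= max_len]  # candidate dashes only
    parent = list(range(len(short)))  # union-find forest over candidates

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(short)):
        for j in range(i + 1, len(short)):
            if _collinear(short[i], short[j], max_gap, max_angle_deg):
                parent[find(i)] = find(j)

    groups = {}
    for i, seg in enumerate(short):
        groups.setdefault(find(i), []).append(seg)
    return [g for g in groups.values() if len(g) >= min_dashes]

# Made-up input: three dashes on a line, one long solid bond, one stray mark.
segments = [
    ((0, 0), (4, 0)), ((8, 0), (12, 0)), ((16, 0), (20, 0)),
    ((0, 10), (30, 10)),
    ((50, 50), (52, 53)),
]
dashed = find_dashed_lines(segments)
```

The pairwise loop is quadratic in the number of candidates, which is the price of an exhaustive test; restricting it to short segments (the "subset of features") keeps that cost small on a typical diagram.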

What experiments were performed?

The authors performed a validation study on a diverse set of chemical structures to stress-test the system:

  • Dataset: 444 chemical structures were selected from a wide variety of sources, including the Merck Index, Aldrich Handbook, and ACS Nomenclature Guide. The set was chosen to “test Kekulé’s limits” rather than to be purely typical.
  • Metrics:
    • Processing Success: Percentage of structures processed.
    • User Intervention: Average number of prompts per structure for verification.
    • Editing Time: Time required to correct interpretation errors (arbitrary “good” limit set at 30 seconds).
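
As a rough illustration of how these three metrics could be tallied, here is a minimal sketch. The `Result` record, the `summarize` function, and the sample values are all hypothetical and are not data from the paper; whether the authors averaged prompts over all structures or only over processed ones is not stated, so averaging over processed structures is an assumption.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Result:
    processed: bool      # did the system produce a connection table?
    prompts: int         # verification prompts shown to the user
    edit_seconds: float  # time spent correcting the interpretation

def summarize(results, good_limit=30.0):
    """Compute the three metrics; assumes at least one structure was processed."""
    done = [r for r in results if r.processed]
    within = sum(r.edit_seconds <= good_limit for r in done)
    return {
        "success_pct": 100.0 * len(done) / len(results),
        "mean_prompts": mean(r.prompts for r in done),
        "within_limit_pct": 100.0 * within / len(done),
    }

# Made-up records, for illustration only.
sample = [
    Result(True, 2, 10.0),
    Result(True, 0, 45.0),
    Result(True, 1, 20.0),
    Result(False, 0, 0.0),
]
summary = summarize(sample)
```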

What were the outcomes and conclusions drawn?

  • High Success Rate: 98.9% of the 444 structures were processed successfully.
  • Performance Speed: The average processing time was 9 seconds per structure on an 80486 (33 MHz) processor.
  • Error Modes: The primary bottleneck was broken characters in scanned images (e.g., breaks in ‘H’ or ‘N’ crossbars), which slowed down the OCR significantly.
  • Impact: The system demonstrated that automated interpretation was faster and less error-prone than manual redrawing.

Reproducibility Details

The following details outline the specific technical implementation described in the 1992 paper.

Data

The authors did not release a public dataset but described their test set sources in detail.

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Evaluation | Mixed Chemical Sources | 444 structures | Sourced from Merck Index, Aldrich Handbook, ACS Nomenclature Guide, etc. |
| Training (OCR) | Font Exemplars | Unknown | “Exemplars of characters from numerous serif and sanserif fonts”. |

Algorithms

The paper details a 7-step pipeline. Key algorithmic choices include:

  • Vectorization:

    • Line strokes are thinned to 1-pixel width, then converted from raster to vector form.
    • An adaptive smoothing algorithm is applied to remove pixel-level jitter.
  • Feature Extraction (Dashed Lines):

    • Hough Transforms were rejected due to poor performance on short line segments.
    • Slope sorting was rejected due to variance in short dashes.
    • Chosen Method: An exhaustive search that tests every feature that might belong to a dashed line (a subset of all extracted features).
  • Graph Compilation:

    • Character Grouping: Characters are assembled into strings based on XY adjacency.
    • Node Creation: Character strings become nodes. Vectors with endpoints “too far” from strings create new nodes.
    • Heuristics: Circles are converted to alternating single-double bonds; “thick” bonds between wedges are automatically generated.
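
The character-grouping and node-creation steps of graph compilation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Char` record, the thresholds `max_dx`, `max_dy`, and `snap_dist`, the use of a string's centroid as its attachment point, and the default label `"C"` for unlabeled nodes are all hypothetical choices.

```python
import math
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x: float  # centre of the recognized character
    y: float

def group_characters(chars, max_dx=8.0, max_dy=3.0):
    """Join characters into strings when they sit side by side on one line."""
    groups = []
    for ch in sorted(chars, key=lambda c: c.x):  # left to right
        for g in groups:
            last = g[-1]
            if 0 <= ch.x - last.x <= max_dx and abs(ch.y - last.y) <= max_dy:
                g.append(ch)
                break
        else:
            groups.append([ch])  # no adjacent string found: start a new one
    return groups

def compile_graph(chars, vectors, snap_dist=10.0):
    """Build (nodes, bonds) from recognized characters and bond vectors."""
    nodes = []
    for g in group_characters(chars):
        nodes.append({
            "label": "".join(c.text for c in g),
            "x": sum(c.x for c in g) / len(g),
            "y": sum(c.y for c in g) / len(g),
        })

    def node_for(pt):
        """Snap an endpoint to the nearest node, or create a new one if too far."""
        best, best_d = None, snap_dist
        for i, n in enumerate(nodes):
            d = math.hypot(n["x"] - pt[0], n["y"] - pt[1])
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            # Assumption: an unlabeled vertex stands for an implicit carbon.
            nodes.append({"label": "C", "x": pt[0], "y": pt[1]})
            best = len(nodes) - 1
        return best

    bonds = [(node_for(p), node_for(q)) for p, q in vectors]
    return nodes, bonds

# Made-up input: a two-bond chain ending at an "OH" label.
chars = [Char("O", 40, 0), Char("H", 46, 0)]
vectors = [((0, 0), (15, 8)), ((15, 8), (34, 0))]
nodes, bonds = compile_graph(chars, vectors)
```

Because newly created nodes join the same list that later endpoints are snapped against, two vectors that meet at a vertex end up sharing one node, which is what makes the result a connected graph.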

Models

The core machine learning component is the OCR engine.

  • Architecture: A multilayer perceptron neural network (fully connected).
  • Input: Normalized characters. Normalization involves rotation (for skew), scaling, under-sampling, and contrast/density adjustments.
  • Output: Ranked probability matches. Outputs above an experimentally determined threshold are retained. If a character is ambiguous (e.g., ‘5’ vs ‘S’), both candidates are kept and the ambiguity is resolved via chemical context.
  • Performance: Raw accuracy ~96% on small fonts (compared to ~85% for commercial OCR of the era).
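
The thresholding and context-based resolution described above can be sketched in a few lines. This is a hedged illustration only: the score dictionaries, the `threshold` value, the truncated `VALID_SYMBOLS` set, and both function names are assumptions, and the sketch handles bare element symbols rather than full labels such as "CH3".

```python
from itertools import product

# Illustrative, deliberately incomplete list of valid atom symbols.
VALID_SYMBOLS = {"H", "C", "N", "O", "S", "P", "F", "Cl", "Br", "I", "Si"}

def retain_candidates(scores, threshold=0.3):
    """Keep every class whose score clears the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [ch for ch, s in ranked if s >= threshold]

def resolve_label(per_char_candidates, valid=VALID_SYMBOLS):
    """Return the first candidate combination that forms a valid atom symbol."""
    for combo in product(*per_char_candidates):
        label = "".join(combo)
        if label in valid:
            return label
    return None  # nothing plausible: a real system might prompt the user here

# Made-up network outputs for a single ambiguous glyph ('5' vs 'S').
glyph = {"5": 0.48, "S": 0.45, "8": 0.05}
single = resolve_label([retain_candidates(glyph)])

# Made-up outputs for a two-glyph label where the second glyph is ambiguous.
first = {"C": 0.90, "G": 0.06}
second = {"l": 0.50, "1": 0.45, "I": 0.40}
double = resolve_label([retain_candidates(first), retain_candidates(second)])
```

Keeping every candidate above the threshold, instead of only the top one, is what lets the later context check recover from a near-tie such as ‘5’ narrowly outscoring ‘S’.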

Hardware

The system was developed and tested on hardware typical of the early 1990s.

  • Processor: Intel 80486 at 33 MHz.
  • Scanners: Hewlett-Packard ScanJet (300 dpi) and Logitech ScanMan (400 dpi hand-held).
  • Platform: Microsoft Windows.

Citation

@article{mcdanielKekuleOCRopticalChemical1992,
  title = {Kekulé: {{OCR-optical}} Chemical (Structure) Recognition},
  shorttitle = {Kekulé},
  author = {McDaniel, Joe R. and Balmuth, Jason R.},
  year = 1992,
  month = jul,
  journal = {Journal of Chemical Information and Computer Sciences},
  volume = {32},
  number = {4},
  pages = {373--378},
  issn = {0095-2338, 1520-5142},
  doi = {10.1021/ci00008a018},
  urldate = {2025-12-15},
  langid = {english}
}