Graph Perception for Chemical Structure OCR

Paper Information

Citation: Contreras, M. L., Allendes, C., Alvarez, L. T., & Rozas, R. (1990). Computational perception and recognition of digitized molecular structures. Journal of Chemical Information and Computer Sciences, 30(3), 302-307. https://doi.org/10.1021/ci00067a014

Publication: Journal of Chemical Information and Computer Sciences, 1990

Contribution: Graph Perception and Character Recognition

This is a Methodological Paper ($\Psi_{\text{Method}}$).

It proposes a specific algorithmic pipeline (“graph perception and character recognition”) to solve the technical problem of converting pixelated images of molecules into machine-readable connectivity tables. The dominant contribution is the novel set of algorithms (contour search, circular inspection, matrix parametrization).

Motivation: Automating Chemical Database Entry

The primary motivation is to automate the input of chemical structures into databases.

Problem: Manual input of structures (especially large ones with stereochemistry) is time-consuming and prone to human error.
Gap: Existing methods required significant human intervention. The authors created a system that handles the “graph/skeleton” and the “alphanumeric characters” effectively to speed up entry into systems like ARIUSA or CAD tools.

Algorithmic Novelty: Circular Inspection Processing

The paper introduces a unified “capture-to-recognition” system written in C that handles both type-printed and hand-printed structures. Key novelties include:

Circular Inspection Algorithm: A specific technique for detecting internal rings and multiple bonds by sweeping a radius of 0.3 bond lengths around atoms.
Hybrid Recognition: Combining “graph perception” (vectorizing the lines) with “character recognition” (OCR for atom labels) in a single pipeline.
Matrix Parametrization for OCR: A feature extraction method that assigns hexadecimal IDs to character matrices based on pixel gradients and “semibytes”.

Methodology: Validation via Custom Structure Dataset

The authors validated the system by digitizing and recognizing a set of test structures:

Dataset: 200 type-printed structures and 50 hand-printed structures.
Metric: “Reliability” percentage (correct recognition of the connectivity table).
Speed Comparison: Measured processing time against a “qualified person” performing manual input for an average 20-atom molecule.

Results: Speed and File Size Efficiency

Accuracy: The system achieved 94% reliability for both type- and hand-printed graphs.
Character Recognition: Isolated character recognition achieved >99% reliability.
Speed: The system was 3-5 times faster than manual human input.
Efficiency: The storage required for a recognized molecule (e.g., $C_{19}H_{31}N$) was significantly smaller (4.1 kb) than the raw image bitmap.

Reproducibility Details

Data

The paper does not use a standard external dataset but rather a custom set of structures for validation.

Purpose	Dataset	Size	Notes
Validation	Type-printed structures	200 images	Used to test reliability
Validation	Hand-printed structures	50 images	“Straight enough” drawings required

Algorithms

The paper details three specific algorithmic components crucial for replication:

Graph Perception (Contour Search):
- Sweep: Left-to-right horizontal sweep to find the first pixel.
- Contour Follow: Counter-clockwise algorithm used to trace borders.
- Vertex Detection: A vertex is flagged if the linear trajectory deflection angle is $>18^\circ$.
- Atom Localization: Two or more vertices in a small space indicate an atom position.
Circular Inspection (Branching/Rings):
- Radius: A circle is inspected around each atom with $r = 0.3 \times \text{single bond length}$.
- Branch Detection: “Unknown border pixels” found on this circle trigger new contour searches to find attached bonds or rings.
Character Recognition (Matrix Feature Extraction):
- Separation: Characters are separated into isolated matrices and “relocated” to the top-left corner.
- Parametrization: The matrix is divided into zones. A “semibyte” (4-bit code) is generated by checking for pixel density in specific directions.
- ID Assignment: Matrices are assigned a Hex ID (e.g., 8, 1, 0, 6) based on these semibytes.
- Differentiation: Secondary parameters (concavities, vertical lines) resolve conflicts (e.g., between ‘b’ and ‘h’).

Models

The system does not use learned weights (neural networks). It relies on rule-based topological recognition.

Representation: The final output is a Prolog data structure converted into a connectivity table.
Atom Recognition: Terminal atoms are identified by linear projection; if no pixels are found, it defaults to Carbon.

Hardware

The performance metrics reflect 1990s hardware, useful for historical context or low-resource reimplementation.

Capture: PC-AT microcomputer with HP-Scanjet.
Processing: MicroVax II (8 MB real memory, 159 MB hard disc) running Ultrix-32.
Memory Usage: A $300 \times 300$ dpi image required ~175 kb; a recognized graph required ~1.6 kb.
Time: Processing time per molecule was 0.7 - 1.0 minutes.

Citation

@article{contrerasComputationalPerceptionRecognition1990,
  title = {Computational Perception and Recognition of Digitized Molecular Structures},
  author = {Contreras, M. Leonor and Allendes, Carlos and Alvarez, L. Tomas and Rozas, Roberto},
  year = 1990,
  month = aug,
  journal = {Journal of Chemical Information and Computer Sciences},
  volume = {30},
  number = {3},
  pages = {302--307},
  issn = {0095-2338, 1520-5142},
  doi = {10.1021/ci00067a014}
}

Paper Information#

Contribution: Graph Perception and Character Recognition#

Motivation: Automating Chemical Database Entry#

Algorithmic Novelty: Circular Inspection Processing#

Methodology: Validation via Custom Structure Dataset#

Results: Speed and File Size Efficiency#

Reproducibility Details#

Data#

Algorithms#

Models#

Hardware#

Citation#