Paper Information

Citation: Frasconi, P., Gabbrielli, F., Lippi, M., & Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. Journal of Chemical Information and Modeling, 54(8), 2380-2390. https://doi.org/10.1021/ci5002197

Publication: Journal of Chemical Information and Modeling 2014

Contribution: Probabilistic Method for OCSR

This is a Method paper ($\Psi_{\text{Method}}$).

It proposes a novel algorithmic architecture (MLOCSR) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.

Motivation: Overcoming Brittle Rule-Based Systems

Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.

  • Limitation of Prior Work: Existing systems (OSRA, CLiDE, ChemReader) rely on “empirical hard-coded geometrical rules” to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.
  • Gap: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).
  • Goal: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.

Core Innovation: Markov Logic Networks for Diagram Interpretation

The core novelty is the application of Markov Logic Networks (MLNs) to the problem of diagram interpretation.

  • Probabilistic Reasoning: The system treats extracted visual elements (lines, text boxes) as “evidence” and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model: $$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$ where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.
  • Unified Knowledge Representation: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.
  • Methodology and Experimental Setupe low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.

What experiments were performed?

The authors evaluated the system on recognition accuracy against the leading open-source baseline, OSRA (v1.4.0).

  • Datasets:
    • USPTO Clustered: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.
    • ChemInfty: 869 images from Japanese patents.
    • Degraded Images: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.
  • Metrics:
    • Geometric: Precision, Recall, and $F_1$ scores for individual atoms and bonds.
    • Chemical: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).

Results and Conclusions

  • Superior Robustness: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA’s 76.0%.
  • Geometric Accuracy: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).
  • Chemical Fidelity: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).
  • Limitation: OSRA slightly outperformed MLOCSR on “Full InChI” matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.

Reproducibility Details

Data

The study utilized public datasets, with specific preprocessing to ensure non-redundancy.

PurposeDatasetSizeNotes
EvaluationUSPTO Clustered937 imagesSelected via spectral clustering from 5,719 raw images to remove near-duplicates.
EvaluationChemInfty869 imagesGround-truthed dataset from Japanese patent applications (2008).

Algorithms

The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.

1. Low-Level Extraction (Image Processing)

  • Binarization: Global thresholding followed by morphological closing.
  • Text/Stroke Estimation:
    • Finds text height ($T$) by looking for “N” or “H” characters via OCR, or averaging compatible connected components.
    • Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.
  • Vectorization:
    • Canny Edge Detection + Hough Transform to find lines.
    • Douglas-Peucker algorithm for polygonal approximation of contours.
    • Circle Detection: Finds aromatic rings by checking for circular arrangements of carbon candidates.

2. High-Level Inference (Markov Logic)

  • Evidence Generation: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., LineBetweenCpoints(c1, c2)).
  • Inference Engine: Uses MaxWalkSAT for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., DoubleBond(c1, c2)).
  • Parameters: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.

Models

  • Markov Logic Network (MLN):
    • Contains 128 first-order logic formulas.
    • Geometric Rules: Example: VeryCloseCpoints(c1, c2) => SameCarbon(c1, c2) (weighted rule to merge close nodes).
    • Chemical Rules: Example: IsHydroxyl(t) ^ Connected(c,t) => SingleBond(c,t) (imposes valency constraints).
  • OCR Engine: Tesseract is used for character recognition on text connected components.

Evaluation

The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., “COOH”) are not expanded.

MetricDetails
Atom/Bond $F_1$Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.
InChIStandard unique identifier string. “Basic” ignores stereochemistry; “Full” includes it.
TanimotoJaccard index of path fingerprints between predicted and ground truth molecules.

Hardware

  • Software: Logic inference performed using the Alchemy software package (University of Washington).
  • Web Server: The system was made available at http://mlocsr.dinfo.unifi.it (Note: URL likely inactive).

Citation

@article{frasconiMarkovLogicNetworks2014,
  title = {Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}},
  author = {Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone},
  year = 2014,
  month = aug,
  journal = {Journal of Chemical Information and Modeling},
  volume = {54},
  number = {8},
  pages = {2380--2390},
  issn = {1549-9596, 1549-960X},
  doi = {10.1021/ci5002197},
  urldate = {2025-10-13},
  langid = {english}
}