Probabilistic OCSR with Markov Logic Networks

Paper Information

Citation: Frasconi, P., Gabbrielli, F., Lippi, M., & Marinai, S. (2014). Markov Logic Networks for Optical Chemical Structure Recognition. Journal of Chemical Information and Modeling, 54(8), 2380-2390. https://doi.org/10.1021/ci5002197

Publication: Journal of Chemical Information and Modeling 2014

Contribution: Probabilistic Method for OCSR

This is a Method paper ($\Psi_{\text{Method}}$).

It proposes a novel algorithmic architecture (MLOCSR) that integrates low-level pattern recognition with a high-level probabilistic reasoning engine based on Markov Logic Networks (MLNs). While it contributes to resources by creating a clustered dataset for evaluation, the primary focus is on demonstrating that probabilistic inference offers a superior methodology to the deterministic, rule-based heuristics employed by previous state-of-the-art systems like OSRA and CLiDE.

Motivation: Overcoming Brittle Rule-Based Systems

Optical Chemical Structure Recognition (OCSR) is critical for converting the vast archive of chemical literature (bitmap images in patents and papers) into machine-readable formats.

Limitation of Prior Work: Existing systems (OSRA, CLiDE, ChemReader) rely on “empirical hard-coded geometrical rules” to assemble atoms and bonds. These heuristics are brittle, requiring manual tuning of parameters for different image resolutions and failing when images are degraded or noisy.
Gap: Chemical knowledge is typically used only in post-processing (e.g., to fix valency errors).
Goal: To create a resolution-independent system that uses probabilistic reasoning to handle noise and ambiguity in graphical primitives.

Core Innovation: Markov Logic Networks for Diagram Interpretation

The core novelty is the application of Markov Logic Networks (MLNs) to the problem of diagram interpretation.

Probabilistic Reasoning: The system treats extracted visual elements (lines, text boxes) as “evidence” and uses weighted first-order logic formulas to infer the most likely molecular graph (Maximum A Posteriori inference). The probability of a state $x$ is defined by the MLN log-linear model: $$ P(X=x) = \frac{1}{Z} \exp\left(\sum_{i} w_i n_i(x)\right) $$ where $w_i$ is the weight of the $i$-th formula and $n_i(x)$ is the number of true groundings in $x$.
Unified Knowledge Representation: Geometric constraints (e.g., collinearity) and chemical rules (e.g., valency) are encoded in the same logic framework.
Methodology and Experimental Setupe low-level extraction module dynamically estimates character size ($T$) and stroke width ($S$) to normalize parameters, removing the dependence on image DPI metadata.

What experiments were performed?

The authors evaluated the system on recognition accuracy against the leading open-source baseline, OSRA (v1.4.0).

Datasets:
- USPTO Clustered: A non-redundant subset of 937 images derived from a larger set of 5,719 US Patent Office images.
- ChemInfty: 869 images from Japanese patents.
- Degraded Images: The USPTO set was synthetically degraded at three resampling levels (Low, Medium, High degradation) to test robustness.
Metrics:
- Geometric: Precision, Recall, and $F_1$ scores for individual atoms and bonds.
- Chemical: Tanimoto similarity (using path fingerprints) and InChI string matching (basic and full stereochemistry).

Results and Conclusions

Superior Robustness: MLOCSR significantly outperformed OSRA on degraded images. On high-degradation images, MLOCSR achieved an atom $F_1$ of 80.3% compared to OSRA’s 76.0%.
Geometric Accuracy: In clean datasets (USPTO cluster), MLOCSR achieved higher $F_1$ scores for atoms (99.1% vs 97.5%) and bonds (98.8% vs 97.8%).
Chemical Fidelity: The system achieved comparable Tanimoto similarity scores (0.948 vs 0.940 for OSRA).
Limitation: OSRA slightly outperformed MLOCSR on “Full InChI” matching (81.4% vs 79.4%), indicating the probabilistic model still needs improvement in handling complex stereochemistry.

Reproducibility Details

Data

The study utilized public datasets, with specific preprocessing to ensure non-redundancy.

Purpose	Dataset	Size	Notes
Evaluation	USPTO Clustered	937 images	Selected via spectral clustering from 5,719 raw images to remove near-duplicates.
Evaluation	ChemInfty	869 images	Ground-truthed dataset from Japanese patent applications (2008).

Algorithms

The pipeline consists of two distinct phases: Low-Level Vectorization and High-Level Inference.

1. Low-Level Extraction (Image Processing)

Binarization: Global thresholding followed by morphological closing.
Text/Stroke Estimation:
- Finds text height ($T$) by looking for “N” or “H” characters via OCR, or averaging compatible connected components.
- Estimates stroke width ($S$) by inspecting pixel density on potential segments identified by Hough transform.
Vectorization:
- Canny Edge Detection + Hough Transform to find lines.
- Douglas-Peucker algorithm for polygonal approximation of contours.
- Circle Detection: Finds aromatic rings by checking for circular arrangements of carbon candidates.

2. High-Level Inference (Markov Logic)

Evidence Generation: Visual primitives (lines, text boxes, circles) are converted into logical ground atoms (e.g., LineBetweenCpoints(c1, c2)).
Inference Engine: Uses MaxWalkSAT for Maximum A Posteriori (MAP) inference to determine the most probable state of query predicates (e.g., DoubleBond(c1, c2)).
Parameters: MaxWalkSAT run with 3 tries and 1,000,000 steps per try.

Models

Markov Logic Network (MLN):
- Contains 128 first-order logic formulas.
- Geometric Rules: Example: VeryCloseCpoints(c1, c2) => SameCarbon(c1, c2) (weighted rule to merge close nodes).
- Chemical Rules: Example: IsHydroxyl(t) ^ Connected(c,t) => SingleBond(c,t) (imposes valency constraints).
OCR Engine: Tesseract is used for character recognition on text connected components.

Evaluation

The authors introduced a bipartite graph matching method to evaluate geometric accuracy when superatoms (e.g., “COOH”) are not expanded.

Metric	Details
Atom/Bond $F_1$	Calculated via minimum-weight bipartite matching between predicted graph and ground truth, weighted by Euclidean distance.
InChI	Standard unique identifier string. “Basic” ignores stereochemistry; “Full” includes it.
Tanimoto	Jaccard index of path fingerprints between predicted and ground truth molecules.

Hardware

Software: Logic inference performed using the Alchemy software package (University of Washington).
Web Server: The system was made available at http://mlocsr.dinfo.unifi.it (Note: URL likely inactive).

Citation

@article{frasconiMarkovLogicNetworks2014,
  title = {Markov {{Logic Networks}} for {{Optical Chemical Structure Recognition}}},
  author = {Frasconi, Paolo and Gabbrielli, Francesco and Lippi, Marco and Marinai, Simone},
  year = 2014,
  month = aug,
  journal = {Journal of Chemical Information and Modeling},
  volume = {54},
  number = {8},
  pages = {2380--2390},
  issn = {1549-9596, 1549-960X},
  doi = {10.1021/ci5002197},
  urldate = {2025-10-13},
  langid = {english}
}

Paper Information#

Contribution: Probabilistic Method for OCSR#

Motivation: Overcoming Brittle Rule-Based Systems#

Core Innovation: Markov Logic Networks for Diagram Interpretation#

What experiments were performed?#

Results and Conclusions#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Citation#