Paper Information

Citation: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., & Akhondi, S. (2023). One Strike, You’re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. arXiv preprint arXiv:2311.14633. https://doi.org/10.48550/arXiv.2311.14633

Publication: arXiv 2023

Methodology and Classification

This is a Method paper (Classification: $\Psi_{\text{Method}}$).

It proposes a patch-based classification pipeline to address a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinctive method-paper indicators include a strong baseline comparison (CNN vs. traditional ORB), ablation studies (architecture, pretraining source, fine-tuning depth), and a focus on evaluating filtering efficacy against a known failure mode.

The Markush Structure Challenge

The Problem: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with “Markush structures,” generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).

The Gap: Markush structures are difficult to detect because they often appear as small indicators (a single “R” or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.

The Goal: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.

Patch-Based Classification Pipeline

The core technical contribution is an end-to-end deep learning pipeline explicitly tailored for low-SNR chemical images where standard global resizing or cropping fails due to massive variations in image resolution and pixel scales.

  • Patch Generation: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.
  • Targeted Annotation: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.
  • Inference Strategy: During inference, the query image is broken into patches, each patch is classified individually, and the patch scores are aggregated with a maximum rule, $X = \max_{i=1}^{n} \{ x_i \}$, so a single positive patch flags the whole image.
  • Evaluation: Provides the first rigorous comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.
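
The two-offset-grid patching step can be sketched as follows (function names are ours, not the authors'). It assumes a grayscale image as a 2D NumPy array and, for simplicity, drops partial patches at the right/bottom border:

```python
import numpy as np

def patch_coords(h, w, size):
    """Top-left corners for two grids offset by half the patch size, so an
    indicator cut by one grid's boundary lies whole inside the other grid."""
    coords = []
    for off in (0, size // 2):                     # grid 1, then half-offset grid 2
        for y in range(off, max(h - size, 0) + 1, size):
            for x in range(off, max(w - size, 0) + 1, size):
                coords.append((y, x))
    return coords

def make_patches(img, size=224):
    """Slice a 2D array into overlapping size x size patches."""
    h, w = img.shape
    return [img[y:y + size, x:x + size] for y, x in patch_coords(h, w, size)]
```

For a 448x448 image with 224x224 patches this yields the four grid-1 tiles plus one centered tile from the offset grid, which covers the tile boundaries of grid 1.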

Experimental Setup and Baselines

The authors compared two distinct paradigms on a manually annotated dataset:

  1. Fixed-Feature Baseline: Used ORB (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an XGBoost model.

  2. Deep Learning Method: Fine-tuned ResNet18 and Inception V3 models on the generated image patches.

    • Ablations: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).
    • Fine-tuning: Compared full-network fine-tuning against freezing all but the fully connected layers.
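
The two fine-tuning regimes differ only in which parameters keep gradients. A minimal PyTorch sketch (our naming; `TinyNet` is a stand-in for ResNet18, whose classifier head is the module named `fc`):

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """Stand-in for ResNet18: a 'backbone' layer plus a final `fc` head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.fc = nn.Linear(8, 2)

def set_finetune_mode(model, full=True, head_prefix="fc"):
    """full=True trains the whole network; otherwise only parameters whose
    names start with head_prefix (the classifier head) receive gradients."""
    for name, p in model.named_parameters():
        p.requires_grad = full or name.startswith(head_prefix)
```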

To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:

$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$
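
The formula maps directly to code. A dependency-free sketch (in practice scikit-learn's `f1_score(average='macro')` computes the same quantity):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, as in the formula above."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class contributes equally to the average, the rare Markush class cannot be drowned out by the majority class, which is exactly why it suits this imbalanced setting.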

Performance Outcomes and Model Superiority

  • CNN Superiority: Deep learning architectures decisively outperformed the fixed-feature baseline. The best model (Inception V3 pretrained on ImageNet) achieved a patch-level Macro F1 of 0.928, versus 0.701 (image-level) for the ORB baseline.

  • The Pretraining Surprise: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. This suggests that the robust, varied features learned across millions of general images transfer better to low-level stroke recognition than features trained on specialized, but smaller, chemical datasets.

  • Full Model Tuning: Unfreezing the entire network yielded significantly higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.

  • Limitations and Edge Cases: While the ROC AUC of 0.97 implies high reliability, the aggregation rule ($X = \max \{ x_i \}$) is naive: a single false-positive patch flags the whole image. Furthermore, the patching approach creates inherent label noise when a Markush indicator is bisected by a patch edge, potentially forcing the network to learn from incomplete visual features.
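
The "one strike" aggregation is a one-liner; the 0.5 threshold here is illustrative, not a value reported in the paper:

```python
def aggregate(patch_scores, threshold=0.5):
    """Image-level decision: X = max_i x_i; any positive patch flags the image."""
    x = max(patch_scores)
    return x, x >= threshold
```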


Reproducibility Details

Data

The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training/Val | Primary Dataset | 272 images | Manually annotated with bounding boxes for Markush indicators. Split 60/20/20. |
| Evaluation | Auxiliary Dataset | ~5.4k images | 5117 complete structures, 317 Markush. Used for image-level testing only (no bounding boxes). |

Patch Generation:

  • Images are cropped into patches of size 224x224 (ResNet) or 299x299 (Inception).
  • Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren’t lost on edges.
  • Labeling Rule: A patch is labeled “Markush” if >50% of an annotation’s pixels fall inside it.
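
The >50% labeling rule can be sketched with axis-aligned boxes as `(y0, x0, y1, x1)` tuples, using bounding-box area as a proxy for the annotation's pixels (naming is ours):

```python
def overlap_fraction(box, patch):
    """Fraction of `box`'s area that falls inside `patch`."""
    y0, x0 = max(box[0], patch[0]), max(box[1], patch[1])
    y1, x1 = min(box[2], patch[2]), min(box[3], patch[3])
    inter = max(0, y1 - y0) * max(0, x1 - x0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area else 0.0

def patch_label(patch, markush_boxes):
    """1 ('Markush') if any indicator lies >50% inside the patch, else 0."""
    return int(any(overlap_fraction(b, patch) > 0.5 for b in markush_boxes))
```

Note that an indicator split exactly 50/50 by a grid line would satisfy neither patch of that grid, which is precisely the case the half-offset second grid is meant to catch.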

Algorithms

ORB (Baseline):

  • Matches query images against a bank of template patches containing Markush indicators.
  • Features: Number of keypoints, number of matches, Hamming distance of best 5 matches.
  • Classifier: XGBoost trained on these features.
  • Hyperparameters: Search over number of features (500-2000) and template patches (50-250).
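
The feature vector fed to XGBoost can be sketched with plain NumPy, using random 32-byte arrays in place of real descriptors (in practice OpenCV's ORB supplies the 256-bit binary descriptors); `match_thresh` is an illustrative parameter of ours, not a value from the paper:

```python
import numpy as np

def hamming(a, b):
    """Pairwise Hamming distances between two sets of 32-byte descriptors."""
    # XOR, then count differing bits across the 32 bytes of each pair
    return np.unpackbits(a[:, None, :] ^ b[None, :, :], axis=2).sum(axis=2)

def orb_features(query_desc, template_desc, match_thresh=64):
    """Fixed-length feature vector: keypoint count, match count, best-5 distances."""
    d = hamming(query_desc, template_desc)
    best = d.min(axis=1)                      # best template match per query keypoint
    return [
        len(query_desc),                      # number of keypoints
        int((best < match_thresh).sum()),     # number of matches
        *np.sort(best)[:5].tolist(),          # Hamming distance of best 5 matches
    ]
```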

Training Configuration:

  • Framework: PyTorch with Optuna for optimization.
  • Optimization: 25 trials per configuration.
  • Augmentations: Random perspective shift, posterization, sharpness/blur.
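
In practice torchvision provides these augmentations (`RandomPerspective`, `RandomPosterize`, `RandomAdjustSharpness`, `GaussianBlur`). To illustrate what posterization does to an 8-bit patch, a NumPy sketch:

```python
import numpy as np

def posterize(img, bits=4):
    """Keep only the top `bits` bits of each pixel, reducing tonal levels."""
    mask = 256 - (1 << (8 - bits))
    return img & mask
```

With `bits=4`, the 256 gray levels collapse to 16, simulating the crude scan quality common in patent images.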

Models

Two main architectures were compared.

| Model | Input Size | Parameters | Pretraining Source |
| --- | --- | --- | --- |
| ResNet18 | 224x224 | 11.5M | ImageNet |
| Inception V3 | 299x299 | 23.8M | ImageNet & USPTO |

Best Configuration: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).

Evaluation

Primary metric was Macro F1 due to class imbalance.

| Metric | Best CNN (Inception V3) | Baseline (ORB) | Notes |
| --- | --- | --- | --- |
| Patch Test F1 | $0.928 \pm 0.035$ | N/A | ORB does not support patch-level evaluation |
| Image Test F1 | $0.917 \pm 0.014$ | $0.701 \pm 0.052$ | CNN aggregates patch predictions |
| Aux Test F1 | 0.914 | 0.533 | Evaluation on large secondary dataset |
| ROC AUC | 0.97 | 0.81 | |

Hardware

  • GPU: Tesla V100-SXM2-16GB
  • CPU: Intel Xeon E5-2686 @ 2.30GHz
  • RAM: 64 GB

Citation

@misc{jurriaansOneStrikeYoure2023,
  title = {One {{Strike}}, {{You}}'re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}},
  shorttitle = {One {{Strike}}, {{You}}'re {{Out}}},
  author = {Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber},
  year = 2023,
  month = nov,
  number = {arXiv:2311.14633},
  eprint = {2311.14633},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2311.14633},
  archiveprefix = {arXiv}
}