Paper Information

Citation: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., & Akhondi, S. (2023). One Strike, You’re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. arXiv preprint arXiv:2311.14633. https://doi.org/10.48550/arXiv.2311.14633

Publication: arXiv 2023

What kind of paper is this?

This is a Method paper (Classification: $\Psi_{\text{Method}}$).

It proposes a specific image processing pipeline (patch-based CNN) to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). The rhetorical indicators include a strong baseline comparison (CNN vs. ORB), ablation studies on architecture and pretraining, and a focus on “how well” the proposed mechanism filters Markush structures compared to traditional computer vision techniques.

What is the motivation?

The Problem: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. However, they fail catastrophically on “Markush structures”: generic patent templates that contain variables (e.g., $R$, $X$, $Y$) in place of specific atoms.

The Gap: Markush structures are difficult to detect because they often appear as small indicators (a single “R” or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing work often manually excludes them rather than classifying them automatically.

The Goal: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality.

What is the novelty here?

The core contribution is a patch-based classification pipeline designed specifically for low-SNR chemical images.

  • Patch Generation: Instead of resizing or warping large images (which distorts small chemical symbols), the method slices images into overlapping patches.
  • Annotation Strategy: Leverages pixel-level annotations to train on specific “regions of interest” (the Markush indicators) rather than just weak image-level labels.
  • Evaluation: Provides the first rigorous comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning (ResNet/Inception) for this specific domain.

What experiments were performed?

The authors compared two distinct approaches using a manually annotated dataset:

  1. Fixed-Feature Baseline: Used ORB (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a “template bank” of known Markush symbols. Features (match counts, Hamming distances) were fed into an XGBoost classifier.

  2. Deep Learning Method: Fine-tuned ResNet18 and Inception V3 models on image patches.

    • Ablations: Compared pretraining on ImageNet (general photos) vs. USPTO (chemical structures).
    • Fine-tuning: Compared training only the fully connected (FC) classification head against training the full model.
  3. Hyperparameter Optimization: Used Optuna to optimize learning rates, augmentation probabilities, and optimizer choices for all configurations.

What were the outcomes and conclusions drawn?

  • CNN Superiority: End-to-end deep learning significantly outperformed the fixed-feature baseline. The best model (Inception V3 pretrained on ImageNet) reached a Macro F1 of $0.928 \pm 0.035$ on patches and $0.917 \pm 0.014$ on whole images, versus $0.701 \pm 0.052$ for the ORB baseline (which is only evaluated at the image level).

  • Pretraining Surprise: Models pretrained on ImageNet outperformed those pretrained on the domain-specific USPTO dataset, likely due to the robustness of features learned from the massive ImageNet corpus.

  • Full Model Training: Fine-tuning the full model was strictly better than just training the fully connected classification head, confirming that low-level filters needed adaptation for chemical line drawings.

  • Impact: The method filters Markush structures effectively (ROC AUC of 0.97), suggesting it is suitable as an automated pre-filter in production OCSR pipelines.


Reproducibility Details

Data

The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training/Val | Primary dataset | 272 images | Manually annotated with bounding boxes for Markush indicators; split 60/20/20. |
| Evaluation | Auxiliary dataset | ~5.4k images | 5117 complete structures, 317 Markush; image-level testing only (no bounding boxes). |
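
The paper does not publish its splitting code; below is a minimal Python sketch of a stratified 60/20/20 split over the primary dataset (variable names and the class balance are hypothetical).

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the 272 primary-dataset images and labels
# (1 = Markush, 0 = complete structure); the real class balance differs.
image_ids = list(range(272))
labels = [1] * 68 + [0] * 204

# 60% train, then split the remaining 40% evenly into 20% val / 20% test,
# stratifying so the Markush ratio is preserved in every split.
train_ids, rest_ids, y_train, y_rest = train_test_split(
    image_ids, labels, train_size=0.6, stratify=labels, random_state=0)
val_ids, test_ids, y_val, y_test = train_test_split(
    rest_ids, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```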

Patch Generation:

  • Images are cropped into patches of size 224x224 (ResNet) or 299x299 (Inception).
  • Patches are generated from two grids, the second offset by half the patch width/height, so that a symbol cut by a patch border in one grid falls whole inside a patch of the other.
  • Labeling Rule: A patch is labeled “Markush” if >50% of an annotation’s pixels fall inside it.
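
A minimal sketch of the two-grid patching and the >50% labeling rule described above (function names and edge handling are ours, not the paper's):

```python
import numpy as np

def make_patches(img: np.ndarray, size: int):
    """Yield (y, x, patch) from two non-overlapping grids; the second grid is
    offset by half the patch size, so a symbol cut by a border in one grid
    lands whole inside a patch of the other."""
    h, w = img.shape[:2]
    for off in (0, size // 2):
        for y in range(off, h - size + 1, size):
            for x in range(off, w - size + 1, size):
                yield y, x, img[y:y + size, x:x + size]

def patch_label(y: int, x: int, size: int, ann_mask: np.ndarray) -> int:
    """Label a patch Markush (1) if more than 50% of a single annotation's
    pixels fall inside it; ann_mask is that annotation's binary pixel mask."""
    total = ann_mask.sum()
    inside = ann_mask[y:y + size, x:x + size].sum()
    return int(total > 0 and inside / total > 0.5)
```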

Algorithms

ORB (Baseline):

  • Matches query images against a bank of template patches containing Markush indicators.
  • Features: number of keypoints, number of matches, and the Hamming distances of the 5 best matches.
  • Classifier: XGBoost trained on these features.
  • Hyperparameters: searched over the number of ORB features (500-2000) and the number of template patches (50-250).
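
A hedged sketch of this fixed-feature pipeline with OpenCV and XGBoost; the matcher settings, distance padding, and XGBoost hyperparameters below are our assumptions, not the paper's.

```python
import cv2
import numpy as np
from xgboost import XGBClassifier

orb = cv2.ORB_create(nfeatures=500)  # the paper searches 500-2000
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def orb_features(query_img, template_descriptors):
    """Per-image features as described above: keypoint count, match count,
    and the Hamming distances of the 5 best matches to the template bank."""
    keypoints, desc = orb.detectAndCompute(query_img, None)
    if desc is None:  # blank or featureless image
        return [0, 0] + [256.0] * 5
    matches = []
    for template_desc in template_descriptors:  # one array per template patch
        matches.extend(matcher.match(desc, template_desc))
    best = sorted(m.distance for m in matches)[:5]
    best += [256.0] * (5 - len(best))  # pad with max distance for 256-bit ORB
    return [len(keypoints), len(matches)] + best

# X = np.array([orb_features(img, bank) for img in images]); y = labels
clf = XGBClassifier(n_estimators=200, max_depth=4)  # hypothetical settings
# clf.fit(X, y)
```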

Training Configuration:

  • Framework: PyTorch with Optuna for optimization.
  • Optimization: 25 trials per configuration.
  • Augmentations: Random perspective shift, posterization, sharpness/blur.
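
A minimal Optuna objective under the stated search (25 trials; learning rate, optimizer choice, augmentation probabilities); `build_model`, `train`, and `evaluate` are placeholders, and the exact ranges are assumptions:

```python
import optuna
import torch
from torchvision import transforms

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)       # assumed range
    opt_name = trial.suggest_categorical("optimizer", ["Adam", "SGD"])
    aug_p = trial.suggest_float("aug_p", 0.0, 0.5)              # assumed range

    # The augmentations named in the paper, with Optuna-tuned probability.
    train_tf = transforms.Compose([
        transforms.RandomPerspective(distortion_scale=0.2, p=aug_p),
        transforms.RandomPosterize(bits=4, p=aug_p),            # PIL/uint8 input
        transforms.RandomAdjustSharpness(sharpness_factor=2.0, p=aug_p),
        transforms.ToTensor(),
    ])

    model = build_model()  # placeholder: one of the configurations above
    optimizer = getattr(torch.optim, opt_name)(model.parameters(), lr=lr)
    return evaluate(train(model, optimizer, train_tf))  # placeholder: Macro F1

study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=25)  # 25 trials per configuration
```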

Models

Two main architectures were compared.

| Model | Input Size | Parameters | Pretraining Source |
| --- | --- | --- | --- |
| ResNet18 | 224x224 | 11.5M | ImageNet |
| Inception V3 | 299x299 | 23.8M | ImageNet & USPTO |

Best Configuration: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).
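
A sketch of the two fine-tuning regimes using torchvision's Inception V3 (the binary head and the aux-logits handling are our choices):

```python
import torch.nn as nn
from torchvision import models

def build_inception(full_finetune: bool) -> nn.Module:
    # ImageNet-pretrained Inception V3; expects 299x299 inputs.
    model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
    model.aux_logits = False   # drop the auxiliary classifier for fine-tuning
    model.AuxLogits = None

    if not full_finetune:
        # "FC only" regime: freeze the backbone, train just the new head.
        for p in model.parameters():
            p.requires_grad = False

    # Replace the 1000-class ImageNet head with a Markush / not-Markush head.
    model.fc = nn.Linear(model.fc.in_features, 2)  # new layer is trainable
    return model

model = build_inception(full_finetune=True)  # the paper's best configuration
```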

Evaluation

The primary metric was Macro F1, chosen due to class imbalance.

| Metric | Best CNN (Inception V3) | Baseline (ORB) | Notes |
| --- | --- | --- | --- |
| Patch Test F1 | $0.928 \pm 0.035$ | N/A | ORB is not evaluated at the patch level |
| Image Test F1 | $0.917 \pm 0.014$ | $0.701 \pm 0.052$ | CNN aggregates patch predictions |
| Aux Test F1 | 0.914 | 0.533 | Evaluation on the large auxiliary dataset |
| ROC AUC | 0.97 | 0.81 | |
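
The title's “one strike, you're out” suggests the image-level aggregation rule: one positive patch flags the whole image. A minimal sketch assuming that rule (the probabilities below are made up):

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def image_scores(patch_probs_per_image):
    """One strike, you're out: an image's Markush score is the maximum
    Markush probability over its patches."""
    return np.array([max(probs) for probs in patch_probs_per_image])

# Made-up example: 3 images with variable patch counts.
patch_probs = [[0.10, 0.05, 0.20], [0.90, 0.10], [0.40, 0.60, 0.30]]
y_true = np.array([0, 1, 1])

scores = image_scores(patch_probs)
preds = (scores >= 0.5).astype(int)
print(f1_score(y_true, preds, average="macro"), roc_auc_score(y_true, scores))
```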

Hardware

  • GPU: Tesla V100-SXM2-16GB
  • CPU: Intel Xeon E5-2686 @ 2.30GHz
  • RAM: 64 GB

Citation

@misc{jurriaansOneStrikeYoure2023,
  title = {One {{Strike}}, {{You}}'re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}},
  shorttitle = {One {{Strike}}, {{You}}'re {{Out}}},
  author = {Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber},
  year = 2023,
  month = nov,
  number = {arXiv:2311.14633},
  eprint = {2311.14633},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2311.14633},
  archiveprefix = {arXiv}
}