One Strike, You're Out: Detecting Markush Structures

Paper Information

Citation: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., & Akhondi, S. (2023). One Strike, You’re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. arXiv preprint arXiv:2311.14633. https://doi.org/10.48550/arXiv.2311.14633

Publication: arXiv 2023

Additional Resources:

GitHub Repository

Methodology and Classification

This is a Method paper (Classification: $\Psi_{\text{Method}}$).

It proposes a patch-based classification pipeline to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinct rhetorical indicators include a baseline comparison (CNN vs. traditional ORB), ablation studies (architecture, pretraining), and a focus on evaluating the filtering efficacy against a known failure mode.

The Markush Structure Challenge

The Problem: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with “Markush structures,” generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).

The Gap: Markush structures are difficult to detect because they often appear as small indicators (a single “R” or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.

The Goal: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.

Patch-Based Classification Pipeline

The core technical contribution is an end-to-end deep learning pipeline tailored for low-SNR chemical images where standard global resizing or cropping fails due to large variations in image resolution and pixel scales.

Patch Generation: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.
Targeted Annotation: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.
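Inference Strategy: During inference, the query image is broken into patches, each patch is classified individually, and the patch scores are aggregated with a maximum rule, $X = \max_{i=1}^{n} \{ x_i \}$, where $x_i$ is the score of patch $i$ and $X$ is the image-level score. A single high-scoring patch is therefore enough to flag the whole image.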
Evaluation: Provides the first systematic comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.
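The maximum aggregation rule from the inference step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the 0.5 decision threshold are assumptions made for the example.

```python
from typing import Sequence


def aggregate_image_score(patch_scores: Sequence[float]) -> float:
    """Image-level score X = max_i x_i over the per-patch Markush scores."""
    if not patch_scores:
        raise ValueError("need at least one patch score")
    return max(patch_scores)


def is_markush(patch_scores: Sequence[float], threshold: float = 0.5) -> bool:
    """Flag the image if any single patch crosses the (assumed) threshold."""
    return aggregate_image_score(patch_scores) >= threshold


# Hypothetical patch scores for one image: only one patch looks like Markush.
scores = [0.02, 0.01, 0.93, 0.04]
print(aggregate_image_score(scores), is_markush(scores))
```

Because only the maximum matters, one confident false-positive patch is also enough to reject a valid structure, which is part of why the summary later calls this rule naive.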

Experimental Setup and Baselines

The authors compared two distinct paradigms on a manually annotated dataset:

Fixed-Feature Baseline: Used ORB (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an XGBoost model.
Deep Learning Method: Fine-tuned ResNet18 and Inception V3 models on the generated image patches.
- Ablations: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).
- Fine-tuning: Compared full-network fine-tuning against freezing all but the fully connected layers.

To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:

$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$
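To make the formula concrete, the sketch below computes Macro F1 from scratch on toy labels, with $N$ taken as the number of classes. The labels are invented for illustration, and returning an F1 of 0 when a class has no true positives is a convention assumed here, not something stated in the paper.

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], cls: Hashable) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy imbalanced data: 95 complete structures (0), 5 Markush (1).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a classifier that never predicts Markush
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred))
```

On this toy data the always-negative classifier reaches 95% accuracy but a Macro F1 below 0.5, since the minority class contributes an F1 of 0 with equal weight. That is the reason an unweighted per-class average suits an imbalanced problem.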

Performance Outcomes

CNN vs. ORB: Deep learning architectures outperformed the fixed-feature baseline. The best model (Inception V3 pretrained on ImageNet) achieved an image-level Macro F1 of 0.928, compared with 0.701 for the ORB baseline at the image level. Its patch-level Macro F1 was 0.917; the ORB baseline has no patch-level result.
The Pretraining Surprise: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. The authors hypothesize that the filters learned from ImageNet pretraining generalize well outside the ImageNet domain, though why the USPTO-pretrained filters underperform remains unclear.
Full Model Tuning: Unfreezing the entire network yielded higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.
Limitations and Edge Cases: The best CNN achieved an ROC AUC of 0.97 on the primary patch test set, while the ORB baseline scored 0.81 on the auxiliary dataset; the paper notes that these ROC curves are not directly comparable because they come from different evaluation sets. The aggregation rule ($X = \max \{ x_i \}$) is naive and has not been optimized. Furthermore, the patching approach introduces label noise when a Markush indicator is bisected by a patch edge, which can force the network to learn from incomplete visual features.

Reproducibility Details

Data

The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.

Purpose	Dataset	Size	Notes
Training/Val/Test	Primary Dataset	272 Images	Manually annotated with bounding boxes for Markush indicators. Split 60/20/20 (train/val/test).
Evaluation	Auxiliary Dataset	~5.4k Images	5117 complete structures, 317 Markush. Used for image-level testing only (no bbox).

Patch Generation:

Images are cropped into patches of size 224x224 (ResNet) or 299x299 (Inception).
Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren’t lost on edges.
Labeling Rule: A patch is labeled “Markush” if >50% of an annotation’s pixels fall inside it.
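The two-grid patching and the labeling rule above can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the authors' implementation: the helper names are hypothetical, patches that extend past the image border are assumed to be padded, and the bounding-box area stands in for "an annotation's pixels".

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive


def grid_patches(width: int, height: int, patch: int, offset: int) -> List[Box]:
    """Tile the image with patch x patch boxes, starting at (offset, offset)."""
    return [
        (left, top, left + patch, top + patch)
        for top in range(offset, height, patch)
        for left in range(offset, width, patch)
    ]


def two_grid_patches(width: int, height: int, patch: int) -> List[Box]:
    """Base grid plus a second grid shifted by half a patch in x and y."""
    return grid_patches(width, height, patch, 0) + grid_patches(width, height, patch, patch // 2)


def overlap_fraction(annotation: Box, patch_box: Box) -> float:
    """Fraction of the annotation's area that lies inside the patch."""
    al, at, ar, ab = annotation
    pl, pt, pr, pb = patch_box
    inter_w = max(0, min(ar, pr) - max(al, pl))
    inter_h = max(0, min(ab, pb) - max(at, pt))
    area = (ar - al) * (ab - at)
    return (inter_w * inter_h) / area if area > 0 else 0.0


def label_patch(patch_box: Box, annotations: Sequence[Box]) -> int:
    """1 ("Markush") if more than 50% of any annotation falls inside the patch."""
    return int(any(overlap_fraction(a, patch_box) > 0.5 for a in annotations))


# A 20x20 annotation straddling the x = 224 boundary of the base grid.
annotation = (214, 150, 234, 170)
patches = two_grid_patches(448, 448, 224)
labels = [label_patch(p, [annotation]) for p in patches]
print(labels)
```

In this example the annotation is split exactly in half by the base grid, so neither base patch exceeds the 50% threshold, while the offset patch starting at (112, 112) contains it completely and receives the positive label.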

Algorithms

ORB (Baseline):

Matches query images against a bank of template patches containing Markush indicators.
Features: Number of keypoints, number of matches, and the Hamming distances of the 5 best matches.
Classifier: XGBoost trained on these features.
Hyperparameters: Search over number of features (500-2000) and template patches (50-250).
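The feature vector described above can be illustrated with a small pure-Python sketch. It is not the authors' code: real ORB descriptors are 256-bit binary strings produced by a library such as OpenCV, whereas here short integers stand in for them, and the `max_distance` cutoff and the padding value for missing matches are assumptions made for the example.

```python
from typing import List, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_features(
    query_desc: Sequence[int],
    template_desc: Sequence[int],
    max_distance: int = 64,  # assumed cutoff for counting a match
    top_k: int = 5,
    pad: int = 256,          # assumed filler when fewer than top_k matches exist
) -> List[int]:
    """Return [n_keypoints, n_matches, d_1, ..., d_top_k] for one query image."""
    # Brute-force nearest template descriptor for every query descriptor.
    best = [min(hamming(q, t) for t in template_desc) for q in query_desc] if template_desc else []
    matches = sorted(d for d in best if d <= max_distance)
    top = matches[:top_k]
    top += [pad] * (top_k - len(top))
    return [len(query_desc), len(matches)] + top


# Toy descriptors: two query keypoints sit close to a template, one does not.
features = match_features([0b1111, 0b0000, 0b10101010], [0b1110, 0b0001], max_distance=2)
print(features)
```

A fixed-length vector like this is what a gradient-boosted classifier such as XGBoost needs as input, which is why the number of retained distances is capped and short lists are padded.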

Training Configuration:

Framework: PyTorch with Optuna for optimization.
Optimization: 25 trials per configuration.
Augmentations: Random perspective shift, posterization, sharpness/blur.

Models

Two main architectures were compared.

Model	Input Size	Parameters	Pretraining Source
ResNet18	224x224	11.5M	ImageNet
Inception V3	299x299	23.8M	ImageNet & USPTO

Best Configuration: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).

Evaluation

The primary metric was Macro F1, chosen because of the class imbalance.

Metric	Best CNN (Inception V3)	Baseline (ORB)	Notes
Patch Test F1	$0.917 \pm 0.014$	N/A	ORB does not support patch-level
Image Test F1	$0.928 \pm 0.035$	$0.701 \pm 0.052$	CNN aggregates patch predictions
Aux Test F1	0.914	0.533	Evaluation on large secondary dataset
ROC AUC	0.97	0.81	CNN on primary patch test set, ORB on auxiliary dataset; not directly comparable

Hardware

GPU: Tesla V100-SXM2-16GB
CPU: Intel Xeon E5-2686 @ 2.30GHz
RAM: 64 GB

Artifacts

Artifact	Type	License	Notes
GitHub Repository	Code	Apache-2.0	MSc thesis code: CNN training, ORB baseline, evaluation scripts

The primary dataset was manually annotated by Elsevier domain experts and is not publicly available. The auxiliary dataset (from Elsevier) is also not public. Pre-trained model weights are not released in the repository.

Citation

@misc{jurriaansOneStrikeYoure2023,
  title = {One {{Strike}}, {{You}}'re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}},
  shorttitle = {One {{Strike}}, {{You}}'re {{Out}}},
  author = {Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber},
  year = 2023,
  month = nov,
  number = {arXiv:2311.14633},
  eprint = {2311.14633},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2311.14633},
  archiveprefix = {arXiv}
}
