Paper Information
Citation: Jurriaans, T., Szarkowska, K., Nalisnick, E., Schwörer, M., Thorne, C., & Akhondi, S. (2023). One Strike, You’re Out: Detecting Markush Structures in Low Signal-to-Noise Ratio Images. arXiv preprint arXiv:2311.14633. https://doi.org/10.48550/arXiv.2311.14633
Publication: arXiv 2023
Additional Resources:
Methodology and Classification
This is a Method paper (Classification: $\Psi_{\text{Method}}$).
It proposes a patch-based classification pipeline to solve a technical failure mode in Optical Chemical Structure Recognition (OCSR). Distinct rhetorical indicators include a strong baseline comparison (CNN vs. traditional ORB), rigorous ablation studies (architecture, pretraining), and a focus on evaluating the filtering efficacy against a known failure mode.
The Markush Structure Challenge
The Problem: Optical Chemical Structure Recognition (OCSR) tools convert 2D images of molecules into machine-readable formats. These tools struggle with “Markush structures,” generic structural templates used frequently in patents that contain variables rather than specific atoms (e.g., $R$, $X$, $Y$).
The Gap: Markush structures are difficult to detect because they often appear as small indicators (a single “R” or variable) within a large image, resulting in a very low Signal-to-Noise Ratio (SNR). Existing OCSR research pipelines typically bypass this by manually excluding these structures from their datasets.
The Goal: To build an automated filter that can identify images containing Markush structures so they can be removed from OCSR pipelines, improving overall database quality without requiring manual data curation.
Patch-Based Classification Pipeline
The core technical contribution is an end-to-end deep learning pipeline explicitly tailored for low-SNR chemical images where standard global resizing or cropping fails due to massive variations in image resolution and pixel scales.
- Patch Generation: The system slices input images into overlapping patches generated from two offset grids, ensuring that variables falling on boundaries are fully captured in at least one crop.
- Targeted Annotation: The labels rely on pixel-level bounding boxes around Markush indicators, minimizing the noise that would otherwise overwhelm a full-image classification attempt.
- Inference Strategy: During inference, the query image is broken into patches, individually classified, and aggregated entirely using a maximum pooling rule where $X = \max_{i=1}^{n} \{ x_i \}$.
- Evaluation: Provides the first rigorous comparison between fixed-feature extraction (ORB + XGBoost) and end-to-end deep learning for this specific domain.
Experimental Setup and Baselines
The authors compared two distinct paradigms on a manually annotated dataset:
Fixed-Feature Baseline: Used ORB (Oriented FAST and Rotated BRIEF) to detect keypoints and match them against a template bank of known Markush symbols. Features (match counts, Hamming distances) were fed into an XGBoost model.
Deep Learning Method: Fine-tuned ResNet18 and Inception V3 models on the generated image patches.
- Ablations: Contrasted pretraining sources, evaluating general domain (ImageNet) against chemistry-specific domain (USPTO images).
- Fine-tuning: Compared full-network fine-tuning against freezing all but the fully connected layers.
To handle significant class imbalance, the primary evaluation metric was the Macro F1 score, defined as:
$$ \text{Macro F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \cdot \text{precision}_i \cdot \text{recall}_i}{\text{precision}_i + \text{recall}_i} $$
Performance Outcomes and Model Superiority
CNN Superiority: Deep learning architectures fundamentally outperformed the structured fixed-feature baseline. The best model (Inception V3 pretrained on ImageNet) achieved a patch-level Macro F1 of 0.928, compared to 0.701 (image-level) for the ORB baseline.
The Pretraining Surprise: Counterintuitively, ImageNet pretraining consistently outperformed the domain-specific USPTO pretraining. This suggests that the robust, varied features learned across millions of general images transfer better to low-level stroke recognition than features trained on specialized, but smaller, chemical datasets.
Full Model Tuning: Unfreezing the entire network yielded significantly higher performance than tuning only the classifier head, indicating that standard low-level visual filters require substantial adaptation to reliably distinguish chemical line drawings.
Limitations and Edge Cases: While the ROC AUC of 0.97 implies high reliability, the aggregation metric ($X = \max \{ x_i \}$) is naive. Furthermore, the patching approach creates inherent label noise when a Markush indicator is cleanly bisected by a patch edge, potentially forcing the network to learn incomplete visual features.
Reproducibility Details
Data
The study used a primary dataset labeled by domain experts and a larger auxiliary dataset for evaluation.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Training/Val | Primary Dataset | 272 Images | Manually annotated with bounding boxes for Markush indicators. Split 60/20/20. |
| Evaluation | Auxiliary Dataset | ~5.4k Images | 5117 complete structures, 317 Markush. Used for image-level testing only (no bbox). |
Patch Generation:
- Images are cropped into patches of size 224x224 (ResNet) or 299x299 (Inception).
- Patches are generated from 2 grids offset by half the patch width/height to ensure annotations aren’t lost on edges.
- Labeling Rule: A patch is labeled “Markush” if >50% of an annotation’s pixels fall inside it.
Algorithms
ORB (Baseline):
- Matches query images against a bank of template patches containing Markush indicators.
- Features: Number of keypoints, number of matches, Hamming distance of best 5 matches.
- Classifier: XGBoost trained on these features.
- Hyperparameters: Search over number of features (500-2000) and template patches (50-250).
Training Configuration:
- Framework: PyTorch with Optuna for optimization.
- Optimization: 25 trials per configuration.
- Augmentations: Random perspective shift, posterization, sharpness/blur.
Models
Two main architectures were compared.
| Model | Input Size | Parameters | Pretraining Source |
|---|---|---|---|
| ResNet18 | 224x224 | 11.5M | ImageNet |
| Inception V3 | 299x299 | 23.8M | ImageNet & USPTO |
Best Configuration: Inception V3, ImageNet weights, Full Model fine-tuning (all layers unfrozen).
Evaluation
Primary metric was Macro F1 due to class imbalance.
| Metric | Best CNN (Inception V3) | Baseline (ORB) | Notes |
|---|---|---|---|
| Patch Test F1 | $0.928 \pm 0.035$ | N/A | ORB does not support patch-level |
| Image Test F1 | $0.917 \pm 0.014$ | $0.701 \pm 0.052$ | CNN aggregates patch predictions |
| Aux Test F1 | 0.914 | 0.533 | Evaluation on large secondary dataset |
| ROC AUC | 0.97 | 0.81 |
Hardware
- GPU: Tesla V100-SXM2-16GB
- CPU: Intel Xeon E5-2686 @ 2.30GHz
- RAM: 64 GB
Citation
@misc{jurriaansOneStrikeYoure2023,
title = {One {{Strike}}, {{You}}'re {{Out}}: {{Detecting Markush Structures}} in {{Low Signal-to-Noise Ratio Images}}},
shorttitle = {One {{Strike}}, {{You}}'re {{Out}}},
author = {Jurriaans, Thomas and Szarkowska, Kinga and Nalisnick, Eric and Schwoerer, Markus and Thorne, Camilo and Akhondi, Saber},
year = 2023,
month = nov,
number = {arXiv:2311.14633},
eprint = {2311.14633},
primaryclass = {cs},
publisher = {arXiv},
doi = {10.48550/arXiv.2311.14633},
archiveprefix = {arXiv}
}