Dual Contribution: Method and Data Resource
The paper proposes an architecture (AtomLenz) and a training framework (ProbKT* + Edit-Correction) for Optical Chemical Structure Recognition (OCSR) in data-sparse domains. It also releases a curated, relabeled dataset of hand-drawn molecules with atom-level bounding box annotations.
Overcoming Annotation Bottlenecks in OCSR
Optical Chemical Structure Recognition (OCSR) is critical for digitizing chemical literature and lab notes. However, existing methods face three main limitations:
- Generalization Limits: They struggle with sparse or stylistically unique domains, such as hand-drawn images, where massive datasets for pretraining are unavailable.
- Annotation Cost: “Atom-level” methods (which detect individual atoms and bonds) require expensive bounding box annotations, which are rarely available for real-world sketch data.
- Lack of Interpretability/Localization: Pure “Image-to-SMILES” models (like DECIMER) can be accurate but do not localize the atoms or bonds in the original image, which limits human-in-the-loop review and interpretability.
AtomLenz, ProbKT*, and Graph Edit-Correction
The core contribution is AtomLenz, an OCSR framework that achieves atom-level entity detection using only SMILES supervision on target domains. The authors build an explicit object detection pipeline around Faster R-CNN, trained with a composite multi-task loss. The loss sums a multi-class log loss $L_{cls}$ between the true class $c$ and the predicted class $\hat{c}$ and a regression loss $L_{reg}$ between the true bounding box coordinates $b$ and the predicted coordinates $\hat{b}$:
$$ \mathcal{L} = L_{cls}(c, \hat{c}) + L_{reg}(b, \hat{b}) $$
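As a worked illustration of the loss above, the following minimal sketch evaluates it for a single detection. The source only names a "regression loss", so the smooth L1 form (the usual Faster R-CNN choice), the `beta` parameter, the raw-coordinate box format, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math


def log_loss(true_class, class_probs):
    """Multi-class log loss for one detection: -log p(true class)."""
    return -math.log(class_probs[true_class])


def smooth_l1(true_box, pred_box, beta=1.0):
    """Smooth L1 loss summed over the four box coordinates (assumed form of L_reg)."""
    total = 0.0
    for t, p in zip(true_box, pred_box):
        diff = abs(t - p)
        # Quadratic near zero, linear for large errors.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total


def detection_loss(true_class, class_probs, true_box, pred_box):
    """L = L_cls(c, c_hat) + L_reg(b, b_hat) for a single object."""
    return log_loss(true_class, class_probs) + smooth_l1(true_box, pred_box)


# Hypothetical detection: two classes, boxes given as (x1, y1, x2, y2).
loss = detection_loss(0, [0.5, 0.5], (0, 0, 1, 1), (0.5, 0, 1, 3))
print(round(loss, 4))
```

In a full detector the two terms are averaged over all proposals in a batch; the single-object form is used here only to keep the arithmetic easy to follow.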
To bridge the gap between image inputs and the weakly supervised SMILES labels, the system leverages:
- ProbKT* (Probabilistic Knowledge Transfer): Uses probabilistic logic and Hungarian matching to align predicted objects with the “ground truth” derived from the SMILES strings, enabling backpropagation without explicit bounding boxes.
- Graph Edit-Correction: Generates pseudo-labels by solving an optimization problem that finds the smallest edit on the predicted graph such that the corrected graph and the ground-truth SMILES graph become isomorphic. Fine-tuning on these pseudo-labels pushes the model to learn less frequent atom types. The combination of ProbKT* and Edit-Correction is abbreviated as EditKT*.
- ChemExpert: A chemically sound ensemble strategy that cascades predictions from multiple models (e.g., passing through DECIMER, then AtomLenz), halting at the first output that clears basic RDKit chemical validity checks.
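The matching step in ProbKT* can be illustrated with a small sketch that assigns the atom labels implied by a SMILES string to predicted objects so that the total negative log-likelihood is minimal. This is a simplification under stated assumptions: it uses brute-force search over permutations rather than the Hungarian algorithm (both return a minimum-cost assignment, but the Hungarian algorithm does so in polynomial time), it covers only the matching and not the probabilistic-logic part, and the function name and data layout are hypothetical.

```python
import math
from itertools import permutations


def match_predictions(pred_probs, target_labels):
    """Assign one target label to each prediction, minimizing total -log p.

    pred_probs: list of dicts mapping class label -> predicted probability.
    target_labels: atom labels derived from the SMILES string (a multiset).
    Returns (assigned_labels, total_cost).
    """
    if len(pred_probs) != len(target_labels):
        raise ValueError("sketch assumes equal numbers of predictions and targets")
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(target_labels))):
        cost = sum(
            # Clamp probabilities so that a missing class yields a large finite cost.
            -math.log(max(pred_probs[i].get(target_labels[j], 0.0), 1e-9))
            for i, j in enumerate(perm)
        )
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return [target_labels[j] for j in best_perm], best_cost


# Hypothetical detector output for an image of ethanol (SMILES "CCO").
preds = [{"C": 0.9, "O": 0.1}, {"C": 0.2, "O": 0.8}, {"C": 0.6, "O": 0.4}]
labels, cost = match_predictions(preds, ["C", "C", "O"])
print(labels)  # -> ['C', 'O', 'C']
```

The assigned labels can then act as per-object classification targets, which is what lets a SMILES string supervise a detector that was never given boxes for the target image.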
Data Efficiency and Domain Adaptation Experiments
The authors evaluated the model specifically on domain adaptation and sample efficiency, treating hand-drawn molecules as the primary low-data target distribution:
- Pretraining: Initially trained on ~214k synthetic images from ChEMBL explicitly labeled with bounding boxes (generated via RDKit).
- Target Domain Adaptation: Fine-tuned on the Brinkhaus hand-drawn dataset (4,070 images) using purely SMILES supervision.
- Evaluation Sets:
  - Hand-drawn test set: 1,018 images.
  - ChemPix: 614 out-of-domain hand-drawn images.
  - Atom Localization set: 1,000 synthetic images to evaluate precise bounding box capabilities.
- Baselines: Compared against leading OCSR methods, including DECIMER (v2.2.0), Img2Mol, MolScribe, ChemGrapher, and OSRA.
State-of-the-Art Ensembles vs. Standalone Limitations
- SOTA Ensemble Performance: The ChemExpert module (combining AtomLenz and DECIMER) achieved state-of-the-art accuracy on both hand-drawn (63.5%) and ChemPix (51.8%) test sets.
- Data Efficiency under Bottleneck Regimes: AtomLenz does not depend on the large training sets that competing models require. When all methods were retrained from scratch on the same 4,070-sample hand-drawn training set (enriched with atom-level annotations from EditKT*), AtomLenz achieved 33.8% exact-match accuracy, compared with 0.0% for Img2Mol, 1.3% for MolScribe, and 0.1% for DECIMER, illustrating its sample efficiency.
- Localization Success: The base framework achieved strong localization (mAP 0.801), a capability not provided by end-to-end transformers like DECIMER.
- Methodological Tradeoffs: While AtomLenz is highly sample efficient, its standalone accuracy when fine-tuned on the target domain (33.8%) falls short of fine-tuned models trained on larger datasets, such as DECIMER (62.2%). AtomLenz achieves state-of-the-art results primarily when deployed as part of the ChemExpert ensemble alongside DECIMER: errors from the two approaches tend to occur on different samples, so the models complement each other.
Reproducibility Details
Artifacts
| Artifact | Type | License | Notes |
|---|---|---|---|
| Official Repository (AtomLenz) | Code | MIT | Complete pipeline for AtomLenz, ProbKT*, and Graph Edit-Correction. |
| Pre-trained Models | Model | MIT | Downloadable weights for Faster R-CNN detection backbones. |
| Hand-drawn Dataset (Brinkhaus) | Dataset | Unknown | Images and SMILES used for target domain fine-tuning and evaluation. |
| Relabeled Hand-drawn Dataset | Dataset | Unknown | 1,417 images with bounding box annotations generated via EditKT*. |
| AtomLenz Web Demo | Other | Unknown | Interactive Hugging Face space for testing model inference. |
Data
The study utilizes a mix of large synthetic datasets and smaller curated hand-drawn datasets.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Pretraining | Synthetic ChEMBL | ~214,000 | Generated via RDKit/Indigo. Annotated with atoms, bonds, charges, stereocenters. |
| Fine-tuning | Hand-drawn (Brinkhaus) | 4,070 | Used for weakly supervised adaptation (SMILES only). |
| Evaluation | Hand-drawn Test | 1,018 | |
| Evaluation | ChemPix | 614 | Out-of-distribution hand-drawn images. |
| Evaluation | Atom Localization | 1,000 | Synthetic images with ground truth bounding boxes. |
Algorithms
- Molecular Graph Constructor (Algorithm 1): A rule-based system to assemble the graph from detected objects:
  - Filtering: Removes overlapping atom boxes (IoU threshold).
  - Node Creation: Merges overlapping charge and stereocenter objects with their corresponding atom objects.
  - Edge Creation: Iterates over bond objects; if a bond overlaps with exactly two atoms, an edge is added. If >2, it selects the most probable pair.
  - Validation: Checks valency constraints; removes bonds iteratively if constraints are violated.
- Weakly Supervised Training:
  - ProbKT*: Uses Hungarian matching to align predicted objects with the “ground truth” implied by the SMILES string, allowing backpropagation without explicit boxes.
  - Graph Edit-Correction: Finds the smallest edit on the predicted graph such that the corrected and true SMILES graphs become isomorphic, then uses the correction to generate pseudo-labels for retraining.
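The edge-creation rule of the graph constructor can be sketched as follows. This is an illustrative reading, not the paper's Algorithm 1: the box format `(x1, y1, x2, y2)`, the use of any nonzero IoU as "overlap", and the interpretation of "most probable pair" as the two overlapping atoms with the highest detection scores are all assumptions, as are the function names.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_edges(atoms, bonds):
    """Turn bond detections into graph edges.

    atoms: list of (box, score); bonds: list of (box, bond_type).
    Returns a list of (atom_index_i, atom_index_j, bond_type).
    """
    edges = []
    for bond_box, bond_type in bonds:
        touching = [i for i, (box, _) in enumerate(atoms) if iou(bond_box, box) > 0.0]
        if len(touching) < 2:
            continue  # a bond needs two endpoints; dropped in this sketch
        if len(touching) > 2:
            # Assumption: keep the two atoms with the highest detection scores.
            touching = sorted(touching, key=lambda i: atoms[i][1], reverse=True)[:2]
        i, j = sorted(touching)
        edges.append((i, j, bond_type))
    return edges


# Hypothetical detections: three atoms in a row joined by two bonds.
atoms = [((0, 0, 10, 10), 0.9), ((30, 0, 40, 10), 0.5), ((60, 0, 70, 10), 0.8)]
bonds = [((8, 3, 32, 7), "single"), ((38, 3, 62, 7), "double")]
print(build_edges(atoms, bonds))  # -> [(0, 1, 'single'), (1, 2, 'double')]
```

The filtering, node-merging, and valency-validation steps are omitted; each would operate on the same box and score records before or after this step.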
Models
- Object Detection Backbone: Faster R-CNN.
  - Four distinct models are trained for different entity types: Atoms ($O^a$), Bonds ($O^b$), Charges ($O^c$), and Stereocenters ($O^s$).
- Loss Function: Multi-task loss combining Multi-class Log Loss ($L_{cls}$) and Regression Loss ($L_{reg}$).
- ChemExpert: An ensemble wrapper that prioritizes models based on user preference (e.g., DECIMER first, then AtomLenz). It accepts the first prediction that passes RDKit chemical validity checks.
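The ChemExpert cascade amounts to a first-valid-wins loop over an ordered list of models. The sketch below shows that control flow only; the predictor stubs and the `toy_is_valid` check (balanced parentheses) are hypothetical stand-ins. A real check would rely on RDKit, for example by testing whether `Chem.MolFromSmiles` returns `None`, but the source does not say which validity checks are applied.

```python
def toy_is_valid(smiles):
    """Stand-in validity check: non-empty string with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(smiles) > 0


def chem_expert(image, predictors, is_valid):
    """Return (model_name, smiles) from the first predictor whose output is valid.

    predictors: ordered list of (name, callable) pairs; order encodes user preference.
    Returns (None, None) if no prediction passes the check.
    """
    for name, predict in predictors:
        smiles = predict(image)
        if smiles is not None and is_valid(smiles):
            return name, smiles
    return None, None


# Hypothetical stubs: the first model emits a malformed string, the second a valid one.
predictors = [
    ("decimer_stub", lambda img: "CC(O"),
    ("atomlenz_stub", lambda img: "CC(O)C"),
]
print(chem_expert(None, predictors, toy_is_valid))  # -> ('atomlenz_stub', 'CC(O)C')
```

Because the loop stops at the first valid output, later models only contribute on samples where earlier ones fail the check, which is why the ordering matters.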
Evaluation
Primary metrics focused on structural correctness and localization accuracy.
| Metric | Value (Hand-drawn) | Baseline (DECIMER FT) | Notes |
|---|---|---|---|
| Accuracy (T=1) | 33.8% (AtomLenz+EditKT*) | 62.2% | Exact match: Tanimoto similarity of 1 on ECFP6 fingerprints. |
| Tanimoto Sim. | 0.484 | 0.727 | Average similarity. |
| mAP | 0.801 | N/A | Localization accuracy (IoU 0.05-0.35). |
| Ensemble Acc. | 63.5% | 62.2% | ChemExpert (DECIMER + AtomLenz). |
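The two structural metrics in the table are related: accuracy counts predictions whose Tanimoto similarity to the reference is exactly 1, while the second row averages the similarity itself. A minimal sketch, assuming fingerprints are represented as sets of on-bit indices (real ECFP6 fingerprints would come from a cheminformatics toolkit) and that two empty fingerprints count as identical:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity |A & B| / |A | B| between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # assumption: two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def evaluate(pairs):
    """Return (exact-match accuracy, mean Tanimoto) over (predicted, reference) pairs."""
    sims = [tanimoto(pred, ref) for pred, ref in pairs]
    accuracy = sum(s == 1.0 for s in sims) / len(sims)
    return accuracy, sum(sims) / len(sims)


# Hypothetical fingerprints: one exact match, one partial overlap, one miss.
pairs = [({1, 2, 3}, {1, 2, 3}), ({1, 2}, {2, 3}), ({1}, {2})]
accuracy, mean_sim = evaluate(pairs)
print(accuracy, mean_sim)
```

The mean similarity gives partial credit to near misses, which is why it is higher than exact-match accuracy in the table.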
Hardware
- Compute: Experiments utilized the Flemish Supercomputer Center (VSC) resources.
- Note: The text does not specify GPU models (e.g., A100/V100). Faster R-CNN can generally be trained on either consumer or enterprise GPUs.
Paper Information
Citation: Oldenhof, M., De Brouwer, E., Arany, A., & Moreau, Y. (2024). Atom-Level Optical Chemical Structure Recognition with Limited Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Publication venue/year: CVPR 2024
BibTeX:
@inproceedings{oldenhofAtomLevelOpticalChemical2024,
title = {Atom-Level Optical Chemical Structure Recognition with Limited Supervision},
author = {Oldenhof, Martijn and De Brouwer, Edward and Arany, Adam and Moreau, Yves},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2024},
eprint = {2404.01743},
archiveprefix = {arXiv},
primaryclass = {cs.CV}
}
