Paper Information
Citation: Khokhlov, I., Krasnov, L., Fedorov, M. V., & Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. Chemistry-Methods, 2(1), e202100069. https://doi.org/10.1002/cmtd.202100069
Publication: Chemistry-Methods 2022
What kind of paper is this?
This is primarily a Method paper with a significant Resource component.
- Method: It proposes a specific neural architecture (ResNet backbone + Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering “How well does this work?” with extensive benchmarks against rule-based systems like OSRA.
- Resource: A core contribution is the “Generate and Train!” paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.
What is the motivation?
Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.
- Problem: Chemical structures are often “trapped” in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, “Markush” structures (templates), or visual contamination.
- Gap: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.
- Goal: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).
What is the novelty here?
- “Generate and Train!” Paradigm: The authors argue that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like “Markush” variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual “contamination” (stray text, arrows).
- FG-SMILES: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.
- Encoder-Free Architecture: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.
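The FG-SMILES idea above hinges on treating bracketed pseudo-atoms ([Ph], [Me], [R1], …) as single vocabulary tokens. A minimal sketch of such a tokenizer (the regex and function name are illustrative, not from the paper):

```python
import re

# Hypothetical tokenizer sketch: FG-SMILES keeps bracketed pseudo-atoms
# (functional-group abbreviations like [Ph], [Me] and R-group labels like
# [R1]) as single tokens, alongside ordinary SMILES symbols.
TOKEN_RE = re.compile(r"(\[[^\]]+\]|Br|Cl|[A-Za-z]|[0-9]|[=#()+\-@/\\%])")

def tokenize_fg_smiles(s: str) -> list[str]:
    """Split an FG-SMILES string into model tokens."""
    return TOKEN_RE.findall(s)

print(tokenize_fg_smiles("[Ph]C(=O)N([Me])[R1]"))
# ['[Ph]', 'C', '(', '=', 'O', ')', 'N', '(', '[Me]', ')', '[R1]']
```

Because each pseudo-atom is one token, the decoder can emit a generalized (Markush) scaffold directly rather than spelling out every substituent atom-by-atom.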
What experiments were performed?
- Training: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.
- Validation (Synthetic): Evaluated on a hold-out set of 1M synthetic images.
- Validation (Real World):
- Dataset A: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.
- Dataset B: 296 structures systematically extracted from Journal of Organic Chemistry (one paper per issue from 2020) to reduce selection bias.
- Comparison: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.
What outcomes/conclusions?
- Performance:
- Synthetic: 90.7% exact match accuracy.
- Real Data (Dataset A): Image2SMILES achieved 79.2% accuracy compared to OSRA’s 62.1%.
- Real Data (Dataset B): Image2SMILES achieved 62.5% accuracy compared to OSRA’s 24.0%.
- Confidence Correlation: There is a strong correlation between the model’s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22% of data, enabling high-precision automated pipelines.
- Key Failures: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confuses R-group indices ($R'$ vs $R_1$), and mishandles explicit hydrogens rendered as groups.
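The confidence-thresholding result above is a simple coverage/accuracy trade-off. A sketch of how such a report could be computed (function name and toy data are illustrative; the 0.995/99.85%/22% figures come from the paper's own evaluation, not from this snippet):

```python
def threshold_report(preds, threshold=0.995):
    """preds: list of (confidence, is_correct) pairs.
    Returns (accuracy on kept predictions, fraction discarded)."""
    kept = [ok for conf, ok in preds if conf >= threshold]
    if not kept:
        return None, 1.0
    accuracy = sum(kept) / len(kept)
    discarded = 1 - len(kept) / len(preds)
    return accuracy, discarded

# Toy data: two confident correct predictions, two low-confidence ones.
toy = [(0.999, True), (0.998, True), (0.80, False), (0.70, True)]
print(threshold_report(toy))  # (1.0, 0.5)
```

Sweeping the threshold over a validation set yields the precision/coverage curve that motivates the 0.995 operating point.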
Reproducibility Details
Data
- Source: A subset of 10 million molecules sampled from PubChem.
- Selection Logic: Bias towards complex/rare structures using a “Full Coefficient” (FC) probability metric based on molecule size and ring/atom rarity.
- Formula: $FC = 0.1 + 1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$, where $n_{\max}=60$.
- Generation: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).
- Contamination: “Visual noise” is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.
- Target Format: FG-SMILES (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions via a dedicated token.
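The size-dependent part of the FC formula in the text can be computed directly. A sketch (the ring/atom-rarity factor mentioned above is omitted, and the acceptance-sampling helper is an assumption about how the weight would be used):

```python
import random

N_MAX = 60  # maximum heavy-atom count used in the FC formula above

def full_coefficient(n_atoms: int) -> float:
    """Size term of the FC sampling weight: FC = 0.1 + 1.2 * ((n_max - n)/n_max)^3.
    The paper also folds in ring/atom rarity, which is omitted in this sketch."""
    n = min(n_atoms, N_MAX)
    return 0.1 + 1.2 * ((N_MAX - n) / N_MAX) ** 3

def keep(n_atoms: int, rng=random) -> bool:
    """Illustrative acceptance sampling: keep a molecule with probability
    proportional to its FC (1.3 is the maximum possible FC)."""
    return rng.random() < full_coefficient(n_atoms) / 1.3

print(round(full_coefficient(10), 3))  # 0.794
```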
Algorithms
- Contamination Augmentation: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.
- Functional Group Resolution: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).
- Markush Support: Stochastic replacement of substituents with R-group labels ($R_1$, $R'$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).
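The stochastic Markush replacement above can be sketched as a table-driven draw. Only the two probabilities given in the text are used; the function shape and any further table entries are assumptions:

```python
import random

# Probability table: only P(R)=0.2 and P(R1)=0.15 are stated in the text;
# a fuller table would extend this dict.
R_GROUP_PROBS = {"R": 0.20, "R1": 0.15}

def maybe_markush(substituent_smiles: str, rng=random):
    """Stochastically replace a substituent with an R-group pseudo-atom,
    or return it unchanged if no replacement fires."""
    for label, p in R_GROUP_PROBS.items():
        if rng.random() < p:
            return f"[{label}]"
    return substituent_smiles
```

Applied during image generation, this teaches the model to emit R-group tokens for scaffolds it has never seen fully enumerated.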
Models
- Architecture: “Image-to-Sequence” hybrid model.
- Backbone: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.
- Neck: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.
- Decoder: Standard Transformer Decoder (6 layers).
- Input: Images resized to $384 \times 384 \times 3$.
- Output: Sequence of FG-SMILES tokens.
Evaluation
- Metric: Binary “Exact Match” (correct/incorrect).
- Strict criteria: Stereochemistry and R-group indices must match exactly (e.g., predicting $R'$ for $R_1$ counts as a failure).
- Datasets:
- Internal: 5% random split of generated data (500k samples).
- External (Dataset A & B): Manually cropped real-world images from specified journals.
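An exact-match check of this kind is typically done by canonicalizing both SMILES and comparing strings. A sketch using RDKit, assuming pseudo-atoms have already been expanded back to plain SMILES (the function name is illustrative, not the paper's code):

```python
from rdkit import Chem

def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    """Binary exact-match: canonicalize both SMILES with RDKit and compare.
    An unparseable prediction counts as a failure."""
    pred = Chem.MolFromSmiles(pred_smiles)
    true = Chem.MolFromSmiles(true_smiles)
    if pred is None or true is None:
        return False
    return Chem.MolToSmiles(pred) == Chem.MolToSmiles(true)

print(exact_match("C1=CC=CC=C1", "c1ccccc1"))  # True: both canonicalize to benzene
```

Canonicalization makes the comparison invariant to atom ordering and kekulization, so only genuine structural differences (including stereochemistry) register as mismatches.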
Hardware
- Training: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.
- Duration: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.
- Optimizer: RAdam with learning rate $3 \cdot 10^{-4}$.
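The optimizer setup above maps directly onto PyTorch's built-in RAdam; a minimal sketch, with `nn.Linear` standing in for the full network:

```python
import torch
from torch import nn, optim

model = nn.Linear(8, 8)  # placeholder for the ResNet + decoder model
opt = optim.RAdam(model.parameters(), lr=3e-4)  # RAdam, lr = 3e-4 as stated
```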
Citation
@article{khokhlovImage2SMILESTransformerBasedMolecular2022,
title = {Image2SMILES: Transformer-Based Molecular Optical Recognition Engine},
shorttitle = {Image2SMILES},
author = {Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey},
year = {2022},
journal = {Chemistry-Methods},
volume = {2},
number = {1},
pages = {e202100069},
issn = {2628-9725},
doi = {10.1002/cmtd.202100069},
url = {https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}
}