Paper Information

Citation: Khokhlov, I., Krasnov, L., Fedorov, M. V., & Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. Chemistry-Methods, 2(1), e202100069. https://doi.org/10.1002/cmtd.202100069

Publication: Chemistry-Methods 2022

What kind of paper is this?

This is primarily a Method paper with a significant Resource component.

  • Method: It proposes a specific neural architecture (ResNet backbone + Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering “How well does this work?” with extensive benchmarks against rule-based systems like OSRA.
  • Resource: A core contribution is the “Generate and Train!” paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.

What is the motivation?

Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.

  • Problem: Chemical structures are often “trapped” in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, “Markush” structures (templates), or visual contamination.
  • Gap: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.
  • Goal: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).

What is the novelty here?

  • “Generate and Train!” Paradigm: The authors argue that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates not just geometry (rotation, bonds), but also specific chemical drawing artifacts like “Markush” variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual “contamination” (stray text, arrows).
  • FG-SMILES: A modified SMILES syntax that represents functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds rather than only explicit atoms; see the tokenizer sketch after this list.
  • Encoder-Free Architecture: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.
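
To make FG-SMILES concrete, below is a minimal tokenizer sketch in Python. The regex, the token set, and the example strings are illustrative assumptions rather than the paper's actual dictionary; the point is that a pseudo-atom such as [NO2] or [R1] is consumed as a single token.

import re

# Minimal FG-SMILES tokenizer sketch (hypothetical token set). Any bracket
# expression -- ordinary atoms like [nH], pseudo-atom functional groups like
# [OMe] or [NO2], and R-group labels like [R1] -- is kept as one token.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]"            # bracket expressions: [NO2], [OMe], [R1], [nH], ...
    r"|Br|Cl"                 # two-letter organic-subset atoms
    r"|[BCNOSPFI]|[bcnos]"    # one-letter and aromatic atoms
    r"|%[0-9]{2}|[0-9]"       # ring-closure digits
    r"|[()=#+/\\-])"          # bonds, branches, charges
)

def tokenize(fg_smiles: str):
    return TOKEN_RE.findall(fg_smiles)

# Nitrobenzene: the explicit nitro group collapses into one pseudo-atom.
print(tokenize("O=[N+]([O-])c1ccccc1"))  # explicit SMILES: 14 tokens
print(tokenize("[NO2]c1ccccc1"))         # FG-SMILES: 9 tokens
print(tokenize("[R1]c1ccc([OMe])cc1"))   # Markush scaffold with an R-group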

What experiments were performed?

  • Training: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.
  • Validation (Synthetic): Evaluated on a held-out 5% split of the generated data (~500k images; see Evaluation below).
  • Validation (Real World):
    • Dataset A: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.
    • Dataset B: 296 structures systematically extracted from Journal of Organic Chemistry (one paper per issue from 2020) to reduce selection bias.
  • Comparison: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.

What outcomes/conclusions?

  • Performance:
    • Synthetic: 90.7% exact match accuracy.
    • Real Data (Dataset A): Image2SMILES achieved 79.2% accuracy compared to OSRA’s 62.1%.
    • Real Data (Dataset B): Image2SMILES achieved 62.5% accuracy compared to OSRA’s 24.0%.
  • Confidence Correlation: The model’s confidence score correlates strongly with prediction correctness: thresholding at 0.995 yields 99.85% accuracy while rejecting 22% of the inputs, enabling high-precision automated pipelines (see the sketch after this list).
  • Key Failures: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confuses R-group indices ($R’$ vs $R_1$), and mishandles explicit hydrogens drawn as standalone groups.

Reproducibility Details

Data

  • Source: A subset of 10 million molecules sampled from PubChem.
  • Selection Logic: Bias towards complex/rare structures using a per-molecule sampling probability (“full coefficient”, FC) built from molecule size and ring/atom rarity; its size term is sketched after this list.
    • Size term: $BC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$, where $n$ is the atom count and $n_{\max}=60$.
  • Generation: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).
  • Contamination: “Visual noise” is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.
  • Target Format: FG-SMILES (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a v token.
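
A minimal sketch of the size term used as a rejection-sampling acceptance probability. The clamping to [0, 1] and the rejection loop are assumptions for illustration; the ring/atom-rarity terms the paper combines with this are omitted.

import random

N_MAX = 60  # size cap from the paper

def bias_coefficient(n_atoms: int) -> float:
    # Size term as reported: BC = 0.1 + 1.2 * ((n_max - n) / n_max) ** 3.
    # Decays from 1.3 for tiny molecules to 0.1 at the 60-atom cap.
    n = min(n_atoms, N_MAX)
    return 0.1 + 1.2 * ((N_MAX - n) / N_MAX) ** 3

def accept(n_atoms: int) -> bool:
    # Rejection-sampling step (an assumption): treat the coefficient,
    # clamped to [0, 1], as the probability of keeping the molecule.
    return random.random() < min(bias_coefficient(n_atoms), 1.0)

print(round(bias_coefficient(10), 3), round(bias_coefficient(55), 3))  # 0.794 0.101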

Algorithms

  • Contamination Augmentation: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.
  • Functional Group Resolution: An algorithm matches functional-group templates (SMARTS) against the molecule and resolves overlapping matches to prevent nested-group conflicts (e.g., the CH$_3$ of a methoxy group must not also be tagged as a methyl).
  • Markush Support: Substituents are stochastically replaced with R-group labels ($R_1$, $R’$, etc.) according to a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$). Both steps are sketched below.
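
A minimal sketch of both steps, assuming RDKit. The two templates, the largest-first resolution rule, and the truncated probability table are illustrative stand-ins for the paper's full dictionary and algorithm.

import random
from rdkit import Chem

# Illustrative templates, ordered so larger groups claim atoms first.
FG_TEMPLATES = [
    ("[OX2][CH3]", "[OMe]"),  # methoxy
    ("[CH3]", "[Me]"),        # methyl
]

def resolve_functional_groups(mol):
    # Match templates and drop any match that overlaps atoms already
    # claimed, so e.g. the CH3 inside a methoxy is not also tagged [Me].
    claimed, resolved = set(), []
    for smarts, label in FG_TEMPLATES:
        patt = Chem.MolFromSmarts(smarts)
        for match in mol.GetSubstructMatches(patt):
            if claimed.isdisjoint(match):
                claimed.update(match)
                resolved.append((label, match))
    return resolved

# Stochastic Markush substitution with the reported probabilities
# (table truncated here).
R_LABELS = {"R": 0.20, "R1": 0.15}

def sample_r_label():
    r = random.random()
    for label, p in R_LABELS.items():
        if r < p:
            return label  # replace the substituent with this R-group
        r -= p
    return None  # keep the substituent as drawn

mol = Chem.MolFromSmiles("COc1ccccc1")  # anisole
print(resolve_functional_groups(mol))   # [('[OMe]', (1, 0))]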

Models

  • Architecture: “Image-to-Sequence” hybrid model (a code sketch follows this list).
    • Backbone: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.
    • Neck: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.
    • Decoder: Standard Transformer Decoder (6 layers).
  • Input: Images resized to $384 \times 384 \times 3$.
  • Output: Sequence of FG-SMILES tokens.
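
A minimal PyTorch sketch of the encoder-free layout, assuming torchvision's ResNet-50. Positional encodings are omitted, and hyperparameters not stated above (head count, vocabulary size, feed-forward width) are illustrative.

import torch
import torch.nn as nn
import torchvision

class Image2SMILESSketch(nn.Module):
    # Truncated ResNet-50 features feed a 6-layer Transformer decoder
    # directly -- there is no Transformer encoder in between.
    def __init__(self, vocab_size=300, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep conv1..layer2 only: dropping the last two residual stages
        # and the classifier head yields 512-channel, 48x48 feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:6])
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.backbone(images)              # (B, 512, 48, 48)
        memory = feats.flatten(2).transpose(1, 2)  # (B, 2304, 512), no encoder
        L = tokens.size(1)                         # causal mask for decoding
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.head(out)                      # (B, L, vocab_size)

model = Image2SMILESSketch()
logits = model(torch.randn(1, 3, 384, 384), torch.zeros(1, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 10, 300])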

Evaluation

  • Metric: Binary “Exact Match”: each prediction is either fully correct or counted as a failure (a sketch follows this list).
    • Strict criteria: Stereochemistry and R-group indices must match exactly (e.g., predicting $R’$ for $R_1$ is a failure).
  • Datasets:
    • Internal: 5% random split of generated data (500k samples).
    • External (Dataset A & B): Manually cropped real-world images from specified journals.
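
A sketch of the exact-match check as canonical-SMILES comparison, assuming RDKit. Expanding FG-SMILES pseudo-atoms back to explicit atoms before canonicalization is a required step that is omitted here.

from rdkit import Chem

def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    # Canonicalize both strings and compare. RDKit keeps stereochemistry
    # by default, so a missed or wrong stereocenter fails the check, in
    # line with the paper's strict criteria.
    pred = Chem.MolFromSmiles(pred_smiles)
    if pred is None:  # an unparseable prediction counts as a failure
        return False
    true = Chem.MolFromSmiles(true_smiles)
    return Chem.MolToSmiles(pred) == Chem.MolToSmiles(true)

print(exact_match("OC(=O)C", "CC(O)=O"))    # True: same molecule
print(exact_match("C[C@H](N)O", "CC(N)O"))  # False: stereo must match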

Hardware

  • Training: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.
  • Duration: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.
  • Optimizer: RAdam with learning rate $3 \cdot 10^{-4}$ (a minimal setup sketch follows).
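
The reported optimizer configuration as a short PyTorch sketch; torch.optim.RAdam is assumed as the RAdam implementation (the authors may have used a standalone package).

import torch

model = torch.nn.Linear(512, 300)  # stand-in for the full model above
optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)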

Citation

@article{khokhlovImage2SMILESTransformerBasedMolecular2022,
  title = {Image2SMILES: Transformer-Based Molecular Optical Recognition Engine},
  shorttitle = {Image2SMILES},
  author = {Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey},
  year = {2022},
  journal = {Chemistry-Methods},
  volume = {2},
  number = {1},
  pages = {e202100069},
  issn = {2628-9725},
  doi = {10.1002/cmtd.202100069},
  url = {https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}
}