Paper Information

Citation: Khokhlov, I., Krasnov, L., Fedorov, M. V., & Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. Chemistry-Methods, 2(1), e202100069. https://doi.org/10.1002/cmtd.202100069

Publication: Chemistry-Methods 2022

What kind of paper is this?

This is primarily a Method paper with a significant Resource component.

  • Method: It proposes a specific neural architecture (ResNet backbone + Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering “How well does this work?” with extensive benchmarks against rule-based systems like OSRA.
  • Resource: A core contribution is the “Generate and Train!” paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.

What is the motivation?

Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.

  • Problem: Chemical structures are often “trapped” in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, “Markush” structures (templates), or visual contamination.
  • Gap: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.
  • Goal: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).

What is the novelty here?

  • “Generate and Train!” Paradigm: The authors argue that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates geometry (rotation, bonds) alongside specific chemical drawing artifacts like “Markush” variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual “contamination” (stray text, arrows).
  • FG-SMILES: A modified SMILES syntax designed to handle functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds.
  • Encoder-Free Architecture: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.

What experiments were performed?

  • Training: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.
  • Validation (Synthetic): Evaluated on a hold-out set of 1M synthetic images.
  • Validation (Real World):
    • Dataset A: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.
    • Dataset B: 296 structures systematically extracted from Journal of Organic Chemistry (one paper per issue from 2020) to reduce selection bias.
  • Comparison: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.

What outcomes/conclusions?

  • Performance:
    • Synthetic: 90.7% exact match accuracy.
    • Real Data (Dataset A): Image2SMILES achieved 79.2% accuracy compared to OSRA’s 62.1%.
    • Real Data (Dataset B): Image2SMILES achieved 62.5% accuracy compared to OSRA’s 24.0%.
  • Confidence Correlation: There is a strong correlation between the model’s confidence score and prediction validity. Thresholding at 0.995 yields 99.85% accuracy while ignoring 22% of data, enabling high-precision automated pipelines.
  • Key Failures: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confusion of R-group indices (R’ vs $R_1$), and explicit hydrogens rendered as groups.
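
The reported confidence-thresholding trade-off can be sketched as a simple coverage/accuracy filter (the function and data below are illustrative, not from the paper's code):

```python
def coverage_accuracy(predictions, threshold):
    """Keep only predictions whose confidence is >= threshold and
    return (coverage, accuracy) over the kept subset.

    predictions: list of (confidence, is_correct) pairs.
    """
    kept = [correct for conf, correct in predictions if conf >= threshold]
    if not kept:
        return 0.0, 0.0
    coverage = len(kept) / len(predictions)
    accuracy = sum(kept) / len(kept)
    return coverage, accuracy

# Illustrative run: a high threshold trades coverage for accuracy,
# mirroring the paper's 0.995-threshold operating point.
preds = [(0.999, True), (0.998, True), (0.997, False),
         (0.90, True), (0.80, False), (0.50, False)]
cov, acc = coverage_accuracy(preds, 0.995)
```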

Reproducibility Details

Data

  • Source: A subset of 10 million molecules sampled from PubChem.
  • Selection Logic: Bias towards complex/rare structures using a “Full Coefficient” (FC) probability metric based on molecule size and ring/atom rarity.
    • Formula: $FC = 0.1 + 1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$, where $n$ is the molecule size and $n_{\max}=60$.
  • Generation: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).
  • Contamination: “Visual noise” is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.
  • Target Format: FG-SMILES (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a v token.
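
The complexity-biased sampling can be sketched as acceptance sampling driven by the size term of the FC metric (a minimal sketch: the ring/atom-rarity terms are omitted, and treating FC as an acceptance probability is an assumption):

```python
import random

N_MAX = 60  # molecule-size cap from the FC formula

def full_coefficient(n, n_max=N_MAX):
    """Size-dependent FC value: 0.1 + 1.2 * ((n_max - n) / n_max) ** 3."""
    return 0.1 + 1.2 * ((n_max - n) / n_max) ** 3

def accept_molecule(n, rng=random):
    """Accept a molecule of size n with probability min(FC, 1)."""
    return rng.random() < min(full_coefficient(n), 1.0)
```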

Algorithms

  • Contamination Augmentation: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.
  • Functional Group Resolution: An algorithm identifies overlapping functional group templates (SMARTS) and resolves them to prevent nested group conflicts (e.g., resolving Methyl vs Methoxy).
  • Markush Support: Stochastic replacement of substituents with R-group labels ($R_1$, $R’$, etc.) based on a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$).
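
The stochastic Markush replacement can be sketched from the probability table above (only the two probabilities quoted in these notes are used; the full table and the `rng` helper are illustrative):

```python
# Partial probability table from the notes; the paper's full table
# covers more labels (R', R2, ...).
R_LABEL_PROBS = {"R": 0.20, "R1": 0.15}

def maybe_markush(substituent, rng):
    """Replace a substituent with an R-group label according to the
    probability table; otherwise keep the substituent unchanged.

    rng: any object with a random() -> [0, 1) method.
    """
    draw = rng.random()
    cumulative = 0.0
    for label, prob in R_LABEL_PROBS.items():
        cumulative += prob
        if draw < cumulative:
            return label
    return substituent
```

With the partial table above, a given substituent would be replaced 35% of the time.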

Models

  • Architecture: “Image-to-Sequence” hybrid model.
    • Backbone: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.
    • Neck: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.
    • Decoder: Standard Transformer Decoder (6 layers).
  • Input: Images resized to $384 \times 384 \times 3$.
  • Output: Sequence of FG-SMILES tokens.
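
The encoder-free hand-off can be illustrated with a reshape of the reported feature-map shape (a NumPy sketch; the exact flattening order in the authors' code is an assumption):

```python
import numpy as np

# Dummy feature map with the truncated-ResNet-50 output shape (C, H, W).
feature_map = np.zeros((512, 48, 48))

# Flatten the spatial grid into a token sequence: each of the 48*48
# positions becomes one decoder "token" of dimension 512.
tokens = feature_map.reshape(512, 48 * 48).T  # shape (2304, 512)
```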

Evaluation

  • Metric: Binary “Exact Match” (valid/invalid).
    • Strict criteria: Stereo and R-group indices must match exactly (e.g., $R’$ vs $R_1$ is a failure).
  • Datasets:
    • Internal: 5% random split of generated data (500k samples).
    • External (Dataset A & B): Manually cropped real-world images from specified journals.
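
The strict exact-match metric can be sketched as string equality over canonicalized FG-SMILES (the paper canonicalizes via RDKit; here pre-canonicalized strings are assumed, so plain equality stands in):

```python
def exact_match_accuracy(pairs):
    """Binary exact-match accuracy over (predicted, reference) pairs of
    canonical FG-SMILES strings. Stereo markers and R-group indices must
    match exactly, so R' vs R1 counts as a failure.
    """
    if not pairs:
        return 0.0
    return sum(pred == ref for pred, ref in pairs) / len(pairs)
```

For example, `exact_match_accuracy([("C[R1]", "C[R1]"), ("C[R']", "C[R1]")])` scores 0.5, since the R-group index mismatch in the second pair fails the strict criterion.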

Hardware

  • Training: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.
  • Duration: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.
  • Optimizer: RAdam with learning rate $3 \cdot 10^{-4}$.

Citation

@article{khokhlovImage2SMILESTransformerBasedMolecular2022,
  title = {Image2SMILES: Transformer-Based Molecular Optical Recognition Engine},
  shorttitle = {Image2SMILES},
  author = {Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey},
  year = {2022},
  journal = {Chemistry-Methods},
  volume = {2},
  number = {1},
  pages = {e202100069},
  issn = {2628-9725},
  doi = {10.1002/cmtd.202100069},
  url = {https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}
}