Paper Information

Citation: Khokhlov, I., Krasnov, L., Fedorov, M. V., & Sosnin, S. (2022). Image2SMILES: Transformer-Based Molecular Optical Recognition Engine. Chemistry-Methods, 2(1), e202100069. https://doi.org/10.1002/cmtd.202100069

Publication: Chemistry-Methods 2022

What kind of paper is this?

This is primarily a Method paper with a significant Resource component.

  • Method: It proposes a specific neural architecture (ResNet backbone + Transformer Decoder) to solve the Optical Chemical Structure Recognition (OCSR) task, answering “How well does this work?” with extensive benchmarks against rule-based systems like OSRA.
  • Resource: A core contribution is the “Generate and Train!” paradigm, where the authors release a comprehensive synthetic data generator to overcome the lack of labeled training data in the field.

What is the motivation?

Retrieving chemical structure data from legacy scientific literature is a major bottleneck in cheminformatics.

  • Problem: Chemical structures are often “trapped” in image formats (PDFs, scans). Manual extraction is slow, and existing rule-based tools (e.g., OSRA) are brittle when facing diverse drawing styles, “Markush” structures (templates), or visual contamination.
  • Gap: Deep learning approaches require massive datasets, but no large-scale annotated dataset of chemical figures exists.
  • Goal: To create a robust, data-driven recognition engine that can handle the messiness of real-world chemical publications (e.g., text overlays, arrows, partial overlaps).

What is the novelty here?

  • “Generate and Train!” Paradigm: The authors argue that architecture is secondary to data simulation. They developed an advanced augmentation pipeline that simulates not just geometry (rotation, bonds), but also specific chemical drawing artifacts like “Markush” variables ($R_1$, $R_2$), functional group abbreviations (e.g., -OMe, -Ph), and visual “contamination” (stray text, arrows).
  • FG-SMILES: A modified SMILES syntax that represents functional groups and Markush templates as single tokens (pseudo-atoms), allowing the model to predict generalized scaffolds rather than only explicit atoms; see the tokenizer sketch after this list.
  • Encoder-Free Architecture: The authors found that a standard Transformer Encoder was unnecessary. They feed the flattened feature map from a ResNet backbone directly into the Transformer Decoder, which improved performance.
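
To make FG-SMILES concrete, below is a minimal tokenizer sketch in Python. The regex, the token set, and the example strings are illustrative assumptions rather than the paper's actual dictionary; the point is that a pseudo-atom such as [NO2] or [R1] is consumed as a single token.

import re

# Minimal FG-SMILES tokenizer sketch (hypothetical token set). Any bracket
# expression -- ordinary atoms like [nH], pseudo-atom functional groups like
# [OMe] or [NO2], and R-group labels like [R1] -- is kept as one token.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]"            # bracket expressions: [NO2], [OMe], [R1], [nH], ...
    r"|Br|Cl"                 # two-letter organic-subset atoms
    r"|[BCNOSPFI]|[bcnos]"    # one-letter and aromatic atoms
    r"|%[0-9]{2}|[0-9]"       # ring-closure digits
    r"|[()=#+/\\-])"          # bonds, branches, charges
)

def tokenize(fg_smiles: str):
    return TOKEN_RE.findall(fg_smiles)

# Nitrobenzene: the explicit nitro group collapses into one pseudo-atom.
print(tokenize("O=[N+]([O-])c1ccccc1"))  # explicit SMILES: 14 tokens
print(tokenize("[NO2]c1ccccc1"))         # FG-SMILES: 9 tokens
print(tokenize("[R1]c1ccc([OMe])cc1"))   # Markush scaffold with an R-group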

What experiments were performed?

  • Training: The model was trained on 10 million synthetically generated images derived from PubChem structures, selected via a complexity-biased sampling algorithm.
  • Validation (Synthetic): Evaluated on a held-out 5% split of the generated data (~500k images; see Evaluation below).
  • Validation (Real World):
    • Dataset A: 332 manually cropped structures from 10 specific articles, excluding reaction schemes.
    • Dataset B: 296 structures systematically extracted from Journal of Organic Chemistry (one paper per issue from 2020) to reduce selection bias.
  • Comparison: Benchmarked against OSRA (v2.11), a widely used rule-based OCSR tool.

What outcomes/conclusions?

  • Performance:
    • Synthetic: 90.7% exact match accuracy.
    • Real Data (Dataset A): Image2SMILES achieved 79.2% accuracy compared to OSRA’s 62.1%.
    • Real Data (Dataset B): Image2SMILES achieved 62.5% accuracy compared to OSRA’s 24.0%.
  • Confidence Correlation: The model’s confidence score correlates strongly with prediction correctness: thresholding at 0.995 yields 99.85% accuracy while rejecting 22% of the inputs, enabling high-precision automated pipelines (see the sketch after this list).
  • Key Failures: The model struggles with functional groups absent from its training dictionary (e.g., $\text{NMe}_2$, Ms), confuses R-group indices ($R’$ vs $R_1$), and mishandles explicit hydrogens drawn as standalone groups.

Reproducibility Details

Data

  • Source: A subset of 10 million molecules sampled from PubChem.
  • Selection Logic: Bias towards complex/rare structures using a per-molecule sampling probability (“full coefficient”, FC) built from molecule size and ring/atom rarity; its size term is sketched after this list.
    • Size term: $BC=0.1+1.2\left(\frac{n_{\max}-n}{n_{\max}}\right)^{3}$, where $n$ is the atom count and $n_{\max}=60$.
  • Generation: Uses RDKit for rendering with augmentations: rotation, font size, line thickness, whitespace, and CoordGen (20% probability).
  • Contamination: “Visual noise” is stochastically added, including parts of other structures, labels, and arrows cropped from real documents.
  • Target Format: FG-SMILES (Functional Group SMILES). Replaces common functional groups with pseudo-atoms (e.g., [Me], [Ph], [NO2]) and supports variable R-group positions using a v token.
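
A minimal sketch of the size term used as a rejection-sampling acceptance probability. The clamping to [0, 1] and the rejection loop are assumptions for illustration; the ring/atom-rarity terms the paper combines with this are omitted.

import random

N_MAX = 60  # size cap from the paper

def bias_coefficient(n_atoms: int) -> float:
    # Size term as reported: BC = 0.1 + 1.2 * ((n_max - n) / n_max) ** 3.
    # Decays from 1.3 for tiny molecules to 0.1 at the 60-atom cap.
    n = min(n_atoms, N_MAX)
    return 0.1 + 1.2 * ((N_MAX - n) / N_MAX) ** 3

def accept(n_atoms: int) -> bool:
    # Rejection-sampling step (an assumption): treat the coefficient,
    # clamped to [0, 1], as the probability of keeping the molecule.
    return random.random() < min(bias_coefficient(n_atoms), 1.0)

print(round(bias_coefficient(10), 3), round(bias_coefficient(55), 3))  # 0.794 0.101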

Algorithms

  • Contamination Augmentation: A dedicated algorithm simulates visual noise (arrows, text) touching or overlapping the main molecule to force robustness.
  • Functional Group Resolution: An algorithm matches functional-group templates (SMARTS) against the molecule and resolves overlapping matches to prevent nested-group conflicts (e.g., the CH$_3$ of a methoxy group must not also be tagged as a methyl).
  • Markush Support: Substituents are stochastically replaced with R-group labels ($R_1$, $R’$, etc.) according to a defined probability table (e.g., $P(R)=0.2$, $P(R_1)=0.15$). Both steps are sketched below.
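
A minimal sketch of both steps, assuming RDKit. The two templates, the largest-first resolution rule, and the truncated probability table are illustrative stand-ins for the paper's full dictionary and algorithm.

import random
from rdkit import Chem

# Illustrative templates, ordered so larger groups claim atoms first.
FG_TEMPLATES = [
    ("[OX2][CH3]", "[OMe]"),  # methoxy
    ("[CH3]", "[Me]"),        # methyl
]

def resolve_functional_groups(mol):
    # Match templates and drop any match that overlaps atoms already
    # claimed, so e.g. the CH3 inside a methoxy is not also tagged [Me].
    claimed, resolved = set(), []
    for smarts, label in FG_TEMPLATES:
        patt = Chem.MolFromSmarts(smarts)
        for match in mol.GetSubstructMatches(patt):
            if claimed.isdisjoint(match):
                claimed.update(match)
                resolved.append((label, match))
    return resolved

# Stochastic Markush substitution with the reported probabilities
# (table truncated here).
R_LABELS = {"R": 0.20, "R1": 0.15}

def sample_r_label():
    r = random.random()
    for label, p in R_LABELS.items():
        if r < p:
            return label  # replace the substituent with this R-group
        r -= p
    return None  # keep the substituent as drawn

mol = Chem.MolFromSmiles("COc1ccccc1")  # anisole
print(resolve_functional_groups(mol))   # [('[OMe]', (1, 0))]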

Models

  • Architecture: “Image-to-Sequence” hybrid model (a code sketch follows this list).
    • Backbone: ResNet-50, but with the last two residual blocks removed. Output shape: $512 \times 48 \times 48$.
    • Neck: No Transformer Encoder. CNN features are flattened and passed directly to the Decoder.
    • Decoder: Standard Transformer Decoder (6 layers).
  • Input: Images resized to $384 \times 384 \times 3$.
  • Output: Sequence of FG-SMILES tokens.
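
A minimal PyTorch sketch of the encoder-free layout, assuming torchvision's ResNet-50. Positional encodings are omitted, and hyperparameters not stated above (head count, vocabulary size, feed-forward width) are illustrative.

import torch
import torch.nn as nn
import torchvision

class Image2SMILESSketch(nn.Module):
    # Truncated ResNet-50 features feed a 6-layer Transformer decoder
    # directly -- there is no Transformer encoder in between.
    def __init__(self, vocab_size=300, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep conv1..layer2 only: dropping the last two residual stages
        # and the classifier head yields 512-channel, 48x48 feature maps.
        self.backbone = nn.Sequential(*list(resnet.children())[:6])
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        feats = self.backbone(images)              # (B, 512, 48, 48)
        memory = feats.flatten(2).transpose(1, 2)  # (B, 2304, 512), no encoder
        L = tokens.size(1)                         # causal mask for decoding
        mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(self.embed(tokens), memory, tgt_mask=mask)
        return self.head(out)                      # (B, L, vocab_size)

model = Image2SMILESSketch()
logits = model(torch.randn(1, 3, 384, 384), torch.zeros(1, 10, dtype=torch.long))
print(logits.shape)  # torch.Size([1, 10, 300])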

Evaluation

  • Metric: Binary “Exact Match”: each prediction is either fully correct or counted as a failure (a sketch follows this list).
    • Strict criteria: Stereochemistry and R-group indices must match exactly (e.g., predicting $R’$ for $R_1$ is a failure).
  • Datasets:
    • Internal: 5% random split of generated data (500k samples).
    • External (Dataset A & B): Manually cropped real-world images from specified journals.
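
A sketch of the exact-match check as canonical-SMILES comparison, assuming RDKit. Expanding FG-SMILES pseudo-atoms back to explicit atoms before canonicalization is a required step that is omitted here.

from rdkit import Chem

def exact_match(pred_smiles: str, true_smiles: str) -> bool:
    # Canonicalize both strings and compare. RDKit keeps stereochemistry
    # by default, so a missed or wrong stereocenter fails the check, in
    # line with the paper's strict criteria.
    pred = Chem.MolFromSmiles(pred_smiles)
    if pred is None:  # an unparseable prediction counts as a failure
        return False
    true = Chem.MolFromSmiles(true_smiles)
    return Chem.MolToSmiles(pred) == Chem.MolToSmiles(true)

print(exact_match("OC(=O)C", "CC(O)=O"))    # True: same molecule
print(exact_match("C[C@H](N)O", "CC(N)O"))  # False: stereo must match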

Hardware

  • Training: 4 $\times$ Nvidia V100 GPUs + 36 CPU cores.
  • Duration: ~2 weeks for training (5 epochs, ~63 hours/epoch). Data generation took 3 days on 80 CPUs.
  • Optimizer: RAdam with learning rate $3 \cdot 10^{-4}$ (a minimal setup sketch follows).
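
The reported optimizer configuration as a short PyTorch sketch; torch.optim.RAdam is assumed as the RAdam implementation (the authors may have used a standalone package).

import torch

model = torch.nn.Linear(512, 300)  # stand-in for the full model above
optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)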

Citation

@article{khokhlovImage2SMILESTransformerBasedMolecular2022,
  title = {Image2SMILES: Transformer-Based Molecular Optical Recognition Engine},
  shorttitle = {Image2SMILES},
  author = {Khokhlov, Ivan and Krasnov, Lev and Fedorov, Maxim V. and Sosnin, Sergey},
  year = {2022},
  journal = {Chemistry-Methods},
  volume = {2},
  number = {1},
  pages = {e202100069},
  issn = {2628-9725},
  doi = {10.1002/cmtd.202100069},
  url = {https://chemistry-europe.onlinelibrary.wiley.com/doi/10.1002/cmtd.202100069}
}