Paper Information

Citation: Krasnov, L., Khokhlov, I., Fedorov, M. V., & Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. Scientific Reports, 11(1), 14798. https://doi.org/10.1038/s41598-021-94082-y

Publication: Scientific Reports 2021

What kind of paper is this?

This is primarily a Method paper with significant elements of a Position paper.

  • Method: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).
  • Position: The authors explicitly argue for a paradigm shift, suggesting that “heavy” neural architectures should replace complex, costly rule-based legacy systems even for “exact” algorithmic tasks.

What is the motivation?

  • Complexity of Naming: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.
  • Lack of Open Source Tools: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse “Structure-to-Name” conversion at the time of writing.
  • Cost of Development: Developing rule-based converters “from scratch” is prohibitively expensive and time-consuming compared to training a neural model on existing data.

What is the novelty here?

  • Struct2IUPAC: The first effective open-source neural model for converting SMILES to IUPAC names, treating chemical translation as a Neural Machine Translation (NMT) problem.
  • Verification Loop: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.
  • Custom Tokenization: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers (see the sketch below).
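
A minimal sketch of such a rule-based IUPAC tokenizer, assuming an illustrative fragment vocabulary; the authors' curated token list is much larger and is not reproduced here:

```python
import re

# Illustrative fragment vocabulary -- an assumption for this demo, not the
# authors' curated list. Multi-character morphemes must win over single chars.
FRAGMENTS = [
    "methan", "ethan", "propan", "butan", "pentan", "hexan",
    "methyl", "ethyl", "phenyl", "cyclo", "benzo",
    "amino", "hydroxy", "chloro", "bromo", "fluoro", "oxy",
    "oic acid", "amide", "one", "ene", "yne", "ol", "al",
    "di", "tri", "tetra",
]

# Longest fragments first, then stereo descriptors, locants, punctuation,
# whitespace, and a single-character catch-all.
PATTERN = re.compile(
    "|".join(map(re.escape, sorted(FRAGMENTS, key=len, reverse=True)))
    + r"|\(\d+[RSEZ]\)|\d+|[()\[\],\-]|\s+|."
)

def tokenize_iupac(name: str) -> list[str]:
    """Split an IUPAC name into chemically meaningful tokens."""
    return [t for t in PATTERN.findall(name) if not t.isspace()]

print(tokenize_iupac("(2R)-2-amino-3-phenylpropanoic acid"))
# ['(2R)', '-', '2', '-', 'amino', '-', '3', '-', 'phenyl', 'propan', 'oic acid']
```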

What experiments were performed?

  • Accuracy Benchmarking: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).
  • Comparison to Rules: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.
  • Stress Testing:
    • Sequence Length: Evaluated performance across varying token lengths, identifying a “sweet spot” (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.
    • Stereochemistry: Tested on “stereo-dense” compounds, observing a performance drop but generally robust handling of stereocenters.
    • Tautomers: Verified the model’s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).
  • Latency Analysis: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.

What were the outcomes and conclusions drawn?

  • High Accuracy: The Struct2IUPAC model achieved 98.9% accuracy (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved 99.1%, comparable to OPSIN’s 99.4%.
  • Chemical “Intuition”: The model demonstrated an ability to infer logic rather than just memorize, correctly generating multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection).
  • Production Readiness: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.
  • Paradigm Shift: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.

Reproducibility Details

Data

The study utilized the PubChem database.

| Purpose  | Dataset | Size       | Notes                            |
|----------|---------|------------|----------------------------------|
| Total    | PubChem | ~95M       | Filtered for RDKit compatibility |
| Training | Split A | 47,312,235 | Random 50% split                 |
| Testing  | Split B | 47,413,850 | Random 50% split                 |

  • Cleaning: Molecules that could not be parsed by RDKit were removed, as were molecules containing tokens outside the tokenizer's vocabulary (e.g., aromatic selenium).
  • Availability: A subset of 100,000 test molecules is available on Zenodo and GitHub.
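
A minimal sketch of this cleaning step, assuming character-level SMILES tokens; `vocab` stands in for the tokenizer's character set and is an assumption, not the paper's published vocabulary:

```python
from rdkit import Chem

def clean(raw_smiles, vocab):
    """Keep molecules RDKit can parse whose canonical SMILES uses known tokens."""
    kept = []
    for smi in raw_smiles:
        mol = Chem.MolFromSmiles(smi)   # returns None if RDKit cannot parse
        if mol is None:
            continue                    # drop RDKit-incompatible molecules
        can = Chem.MolToSmiles(mol)     # canonical SMILES for training
        if any(ch not in vocab for ch in can):
            continue                    # drop out-of-vocabulary tokens, e.g. [se]
        kept.append(can)
    return kept
```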

Algorithms

  • Tokenization:
    • SMILES: Character-based tokenization.
    • IUPAC: Custom rule-based tokenizer splitting names into suffixes (-one, -al), prefixes (oxy-, di-), and special symbols such as brackets and the (R)/(S) stereo descriptors.
  • Verification Step:
    1. Generate $N$ names using Beam Search ($N=5$).
    2. Reverse translate the candidate name using OPSIN.
    3. Check if the OPSIN structure matches the original input SMILES.
    4. Return the first verified match; if no candidate passes verification, report failure.
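
A minimal sketch of this round-trip pipeline, assuming a hypothetical `model.translate` that returns beam-search candidates in order of model score, and a `name_to_smiles` wrapper around OPSIN (e.g., the OPSIN jar via a subprocess, or the py2opsin package); neither name comes from the authors' code:

```python
from rdkit import Chem

def canonical(smiles):
    """Canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def smiles_to_verified_name(model, name_to_smiles, smiles, beam_size=5):
    target = canonical(smiles)
    # Step 1: generate N candidate names via beam search.
    for name in model.translate(smiles, beam_size=beam_size):
        # Steps 2-3: reverse-translate with OPSIN, compare canonical SMILES.
        back = name_to_smiles(name)
        if back is not None and canonical(back) == target:
            return name       # Step 4: first candidate that round-trips
    return None               # no candidate survived verification
```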

Models

  • Architecture: Standard Transformer with 6 encoder layers and 6 decoder layers.
  • Hyperparameters:
    • Attention Heads: 8
    • Attention Dimension ($d_{\text{model}}$): 512
    • Feed-Forward Dimension ($d_{\text{ff}}$): 2048
  • Training: Two separate models were trained: Struct2IUPAC (SMILES $\to$ IUPAC) and IUPAC2Struct (IUPAC $\to$ SMILES).
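
For illustration, the hyperparameters above instantiated with PyTorch's stock `nn.Transformer`; the authors' actual training setup (embeddings, positional encoding, optimizer schedule) is not reproduced here:

```python
import torch.nn as nn

# Standard 6+6-layer Transformer with the paper's reported dimensions.
transformer = nn.Transformer(
    d_model=512,            # attention dimension
    nhead=8,                # attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,   # feed-forward dimension
)
```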

Evaluation

Evaluation was performed on a random subset of 100,000 molecules from the test set.

| Metric      | Task         | Beam Size | Accuracy |
|-------------|--------------|-----------|----------|
| Exact match | Struct2IUPAC | 1         | 96.1%    |
| Exact match | Struct2IUPAC | 5         | 98.9%    |
| Exact match | IUPAC2Struct | 1         | 96.6%    |
| Exact match | IUPAC2Struct | 5         | 99.1%    |

  • Robustness: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).
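
A sketch of the exact-match metric for the IUPAC2Struct direction, counting a prediction as correct only if it parses to the same canonical structure as the reference; `predict` is a hypothetical wrapper around the trained model:

```python
from rdkit import Chem

def exact_match_accuracy(pairs, predict):
    """pairs: iterable of (iupac_name, reference_smiles); predict: name -> SMILES."""
    hits, total = 0, 0
    for name, ref in pairs:
        total += 1
        pred = predict(name)                          # model's SMILES output
        mol_pred = Chem.MolFromSmiles(pred) if pred else None
        mol_ref = Chem.MolFromSmiles(ref)
        if (mol_pred is not None and mol_ref is not None
                and Chem.MolToSmiles(mol_pred) == Chem.MolToSmiles(mol_ref)):
            hits += 1                                 # same canonical structure
    return hits / total if total else 0.0
```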

Hardware

  • Training Infrastructure: 4 $\times$ Tesla V100 GPUs and 36 CPUs.
  • Training Time: Approximately 10 days under full load.
  • Inference Speed: <0.5 s per molecule on GPU; latency scales linearly with the number of output tokens.

Citation

@article{krasnovTransformerbasedArtificialNeural2021a,
  title = {Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations},
  author = {Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey},
  year = {2021},
  month = jul,
  journal = {Scientific Reports},
  volume = {11},
  number = {1},
  pages = {14798},
  publisher = {Nature Publishing Group},
  doi = {10.1038/s41598-021-94082-y}
}