Paper Information

Citation: Rajan, K., Zielesny, A., & Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. Journal of Cheminformatics, 13(1), 34. https://doi.org/10.1186/s13321-021-00512-4

Publication: Journal of Cheminformatics 2021

Additional Resources:

What kind of paper is this?

This is primarily a Method paper, with a strong secondary contribution as a Resource paper.

  • Method: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.
  • Resource: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.

What is the motivation?

The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon’s molconvert), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.

What is the novelty here?

  • Language Translation Approach: The authors treat chemical representations (SMILES/SELFIES) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.
  • Use of SELFIES: The work demonstrates the superiority of SELFIES (Self-Referencing Embedded Strings) over SMILES for deep learning tokenization in this specific task.
  • Hardware Acceleration: The paper explicitly benchmarks and highlights the necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time from months to days.

What experiments were performed?

  • Data Scale: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.
  • Hardware Benchmarking: Training efficiency was compared between an NVIDIA Tesla V100 GPU and Google TPU v3-8/v3-32 units.
  • Bidirectional Translation: The system was tested on two distinct tasks:
    1. Forward: SELFIES → IUPAC names
    2. Reverse: IUPAC names → SELFIES
  • Validation: Performance was evaluated on a held-out test set of 2.2 million molecules.

What were the outcomes and conclusions drawn?

  • High Accuracy: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index > 0.9 for both translation directions.
  • Generalization: Even when predictions were textually incorrect (low BLEU score), the chemical structures often remained highly similar or identical (high Tanimoto similarity), indicating the model learned the “language of chemistry”.
  • Impact of Data Size: The model trained on 60 million molecules consistently outperformed the 30 million molecule model.
  • Hardware Necessity: Training on TPUs was found to be up to 54 times faster than GPUs, making training on such large datasets feasible (reducing epoch time from ~27 hours to ~30 minutes).

Reproducibility Details

Data

The dataset was curated from PubChem (111 million molecules).

Preprocessing & Filtering:

  • Explicit hydrogens removed; converted to canonical SMILES.
  • Filtering Rules: MW < 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.
  • Ground Truth Generation: ChemAxon’s molconvert (Marvin Suite 20.15) was used to generate target IUPAC names for training.
  • Representation: All SMILES were converted to SELFIES for training.
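
The filtering rules above can be sketched as a single predicate. This is a minimal illustration written with RDKit; the paper's own pipeline is not reproduced here, so the function name `passes_filters` and the choice of RDKit are assumptions:

```python
# Sketch of the paper's dataset filtering rules (illustrative, not the
# authors' actual code), implemented with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors

ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_filters(smiles: str) -> bool:
    """Return True if a molecule survives the paper's filtering rules."""
    if "." in smiles:                      # multi-fragment SMILES: counter ions etc.
        return False
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if Descriptors.MolWt(mol) >= 1500:     # MW < 1500 Da
        return False
    atoms = list(mol.GetAtoms())
    if not all(a.GetSymbol() in ALLOWED_ELEMENTS for a in atoms):
        return False
    if any(a.GetFormalCharge() != 0 for a in atoms):                  # no charged groups
        return False
    if any(a.GetAtomicNum() == 1 and a.GetIsotope() for a in atoms):  # no D/T
        return False
    return 3 <= mol.GetNumBonds() <= 40    # 3-40 bonds
```

In the paper, the surviving canonical SMILES are then converted to SELFIES (the `selfies` package's `encoder` function performs this conversion) before tokenization.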

| Purpose  | Dataset          | Size      | Notes                                                                |
|----------|------------------|-----------|----------------------------------------------------------------------|
| Training | PubChem filtered | 30M & 60M | Two distinct training sets created.                                  |
| Testing  | PubChem held-out | 2.2M      | Molecules not present in the training sets; uniform token frequency. |

Algorithms

  • Tokenization:
    • SELFIES: Split by brackets [ and ].
    • IUPAC: Split by punctuation ((, ), {, }, [, ], -, ., ,) and a specific list of chemical morphemes (e.g., methyl, benzene, fluoro).
    • Padding: SELFIES sequences padded to a maximum length of 48 tokens; IUPAC names to 78 tokens. “Start” and “End” tokens added to every sequence.
  • Optimization: Adam optimizer with a learning rate of 0.0005.
  • Loss Function: Sparse categorical cross-entropy.
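
The tokenization and padding steps above can be sketched with plain regular expressions. The morpheme list below is a tiny illustrative subset of the paper's full list, and the sentinel token names are assumptions:

```python
import re

# Illustrative subset of the chemical morphemes used for IUPAC tokenization.
MORPHEMES = ["methyl", "benzene", "fluoro"]

def tokenize_selfies(s: str) -> list[str]:
    # SELFIES strings are sequences of bracketed symbols, e.g. "[C][O]".
    return re.findall(r"\[[^\]]*\]", s)

def tokenize_iupac(name: str) -> list[str]:
    # Split on punctuation and known morphemes, keeping the delimiters as tokens.
    pattern = r"([(){}\[\]\-.,]|" + "|".join(MORPHEMES) + r")"
    return [t for t in re.split(pattern, name) if t]

def pad(tokens: list[str], length: int, pad_token: str = "<pad>") -> list[str]:
    # "Start"/"End" sentinels added, then right-padded to a fixed length.
    seq = ["<start>"] + tokens + ["<end>"]
    return seq + [pad_token] * (length - len(seq))
```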

Models

  • Architecture: Encoder-Decoder sequence-to-sequence network with Attention.
  • Components:
    • Encoder/Decoder: Recurrent Neural Networks (RNN) using Gated Recurrent Units (GRU).
    • Attention: Bahdanau soft attention mechanism.
    • Embedding: The decoder input token passes through an embedding layer before concatenation with the attention context vector.
  • Implementation: Python 3 with TensorFlow 2.3.0.
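
A minimal sketch of this GRU encoder-decoder with Bahdanau attention in TensorFlow 2 (layer sizes and class names are illustrative, not the paper's exact configuration; training uses the Adam optimizer at learning rate 0.0005 with sparse categorical cross-entropy, as noted above):

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive (Bahdanau) soft attention over encoder outputs."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder state (batch, units); values: encoder outputs (batch, T, units)
        score = self.V(tf.nn.tanh(self.W1(tf.expand_dims(query, 1)) + self.W2(values)))
        weights = tf.nn.softmax(score, axis=1)             # (batch, T, 1)
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, units)
        return context, weights

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, x):
        # Returns per-step outputs and the final hidden state.
        return self.gru(self.embedding(x))

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(units)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, x, state, enc_outputs):
        # One decoding step: attend, embed the input token, concatenate, update GRU.
        context, _ = self.attention(state, enc_outputs)
        x = self.embedding(x)                                    # (batch, 1, embed)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)  # prepend context
        output, state = self.gru(x)
        return self.fc(tf.reshape(output, (-1, output.shape[2]))), state
```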

Evaluation

Metrics used to assess translation quality:

| Metric              | Details                                 | Result (60M Model)      | Notes                                                                                 |
|---------------------|-----------------------------------------|-------------------------|---------------------------------------------------------------------------------------|
| BLEU score          | NLTK sentence BLEU (unigram to 4-gram)  | 0.94 (IUPAC → SELFIES)  | Measures text similarity.                                                             |
| Tanimoto similarity | PubChem fingerprints via CDK            | 0.98 (valid IUPAC names) | Measures structural similarity after back-translating predictions to SMILES using OPSIN. |
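
Both metrics can be sketched briefly: `bleu` wraps the NLTK scorer named in the table, and `tanimoto` shows the underlying set formula (the paper computes it on CDK PubChem fingerprints after back-translating predictions with OPSIN; the helper names here are assumptions):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference_tokens: list[str], predicted_tokens: list[str]) -> float:
    # Default NLTK sentence BLEU averages unigram through 4-gram precision.
    return sentence_bleu([reference_tokens], predicted_tokens,
                         smoothing_function=SmoothingFunction().method1)

def tanimoto(fp_a: set, fp_b: set) -> float:
    # Tanimoto similarity over the "on" bits of two binary fingerprints:
    # |A ∩ B| / |A ∪ B|.
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```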

Hardware

Comparison of hardware efficiency for training large chemical language models:

| Hardware         | Batch Size    | Time per Epoch (15M subset) | Speedup Factor |
|------------------|---------------|-----------------------------|----------------|
| GPU (Tesla V100) | 256           | ~27 hours                   | 1×             |
| TPU v3-8         | 1024 (global) | ~2 hours                    | 13×            |
| TPU v3-32        | 1024 (global) | ~0.5 hours                  | 54×            |
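
Running the same Keras model on a TPU requires only a small amount of setup code; the config sketch below shows the standard TensorFlow 2 `TPUStrategy` pattern. The resolver address and `build_model` are placeholders, and this fragment only runs on actual TPU hardware:

```python
import tensorflow as tf

# Config sketch: connect to a TPU and build the model under its strategy.
# The TPU address is a placeholder; on Colab/GCP it is usually auto-discovered.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://10.0.0.2:8470")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# The global batch of 1024 is split evenly across replicas:
# 8 cores on a v3-8 -> 128 examples per core; 32 cores on a v3-32 -> 32 per core.
with strategy.scope():
    model = build_model()  # hypothetical builder for the seq2seq model above
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss="sparse_categorical_crossentropy")
```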

Citation

@article{rajanSTOUTSMILESIUPAC2021,
  title = {STOUT: SMILES to IUPAC Names Using Neural Machine Translation},
  shorttitle = {STOUT},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = {2021},
  month = apr,
  journal = {Journal of Cheminformatics},
  volume = {13},
  number = {1},
  pages = {34},
  issn = {1758-2946},
  doi = {10.1186/s13321-021-00512-4},
  urldate = {2025-09-22},
  abstract = {Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.},
  langid = {english},
  keywords = {Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES},
}