Contribution: Translating Chemistry as a Language

This is primarily a Method paper, with a strong secondary contribution as a Resource paper.

  • Method: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.
  • Resource: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.

Motivation: Democratizing IUPAC Nomenclature

The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon’s molconvert), open-source alternatives have been lacking for the scientific community. STOUT aims to fill this gap using a data-driven approach.

Core Innovation: Sequence-to-Sequence Naming

  • Language Translation Approach: The authors treat chemical representations (SMILES/SELFIES) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.
  • Use of SELFIES: The work establishes SELFIES (Self-Referencing Embedded Strings) as a more robust tokenization choice than SMILES for this task, since every SELFIES string decodes to a syntactically valid molecule (see the encoding sketch after this list).
  • Hardware Acceleration: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by an order of magnitude.
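
To make the "two languages" framing concrete, here is a minimal encoding sketch using the open-source selfies package; the example molecule (caffeine) is illustrative and not taken from the paper:

```python
# Minimal sketch using the open-source `selfies` package (pip install selfies).
# The example molecule (caffeine) is illustrative, not from the paper.
import selfies as sf

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"   # caffeine
selfies_str = sf.encoder(smiles)           # SMILES -> SELFIES
roundtrip = sf.decoder(selfies_str)        # SELFIES -> SMILES

print(selfies_str)  # e.g. "[C][N][C][=N]..." (a chain of bracketed tokens)
print(roundtrip)
```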

Methodology & Translation Validation

  • Data Scale: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.
  • Hardware Benchmarking: Training efficiency was compared between an Nvidia Tesla V100 GPU and Google TPU v3-8/v3-32 units.
  • Bidirectional Translation: The system was tested on two distinct tasks (both directions are shown in the usage sketch after this list):
    1. Forward: SELFIES → IUPAC names
    2. Reverse: IUPAC names → SELFIES
  • Validation: Performance was evaluated on a held-out test set of 2.2 million molecules.
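
For orientation, a minimal usage sketch of the released tool covering both translation directions. This assumes the API of the current STOUT-pypi package, which ships the V2.0 models rather than the V1 RNN evaluated in this paper:

```python
# Assumes the STOUT-pypi package (pip install STOUT-pypi), whose current
# release ships the V2.0 models; function names follow its documented API.
from STOUT import translate_forward, translate_reverse

smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"        # caffeine
iupac_name = translate_forward(smiles)          # forward: SMILES -> IUPAC
print(iupac_name)                               # e.g. 1,3,7-trimethylpurine-2,6-dione

back = translate_reverse("1,3,7-trimethylpurine-2,6-dione")  # reverse task
print(back)
```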

Translation Accuracy & Hardware Scaling

  • High Accuracy: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index > 0.9 for both translation directions.
  • Generalization: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.
  • Impact of Data Size: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.
  • Hardware Necessity: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making scaling highly computationally tractable.

Reproducibility

| Artifact | Type | License | Notes |
| --- | --- | --- | --- |
| STOUT (GitHub) | Code | MIT | Current repo hosts the STOUT V2.0 transformer models; V1 RNN code available in earlier commits |
| PubChem | Dataset | Public Domain | Source of the 111M molecules; 30M/60M training subsets not directly provided |

Data

The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.

Preprocessing & Filtering:

  • Explicit hydrogens removed; converted to canonical SMILES.
  • Filtering Rules: MW < 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups (approximated in the RDKit sketch below).
  • Ground Truth Generation: ChemAxon’s molconvert (Marvin Suite 20.15) was used to generate target IUPAC names for training.
  • Representation: All SMILES were converted to SELFIES for training.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training | PubChem (filtered) | 30M & 60M | Two distinct training sets created |
| Testing | PubChem (held-out) | 2.2M | Molecules not present in the training sets; uniform token frequency |
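
A sketch approximating the filtering rules above with RDKit. The authors' actual preprocessing code is not published, so the function and constant names here are assumptions; the thresholds follow the paper:

```python
# Approximation of the paper's filtering rules using RDKit; the authors'
# actual preprocessing code is not published, so names and details are mine.
from rdkit import Chem
from rdkit.Chem import Descriptors

ALLOWED_ELEMENTS = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

def passes_filters(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False                                    # unparsable structure
    if "." in Chem.MolToSmiles(mol):
        return False                                    # counter ions / multiple fragments
    if Descriptors.MolWt(mol) >= 1500:
        return False                                    # MW < 1500 Da
    if any(a.GetSymbol() not in ALLOWED_ELEMENTS for a in mol.GetAtoms()):
        return False                                    # restricted element set
    if any(a.GetIsotope() != 0 for a in mol.GetAtoms()):
        return False                                    # no isotopes (incl. D/T)
    if any(a.GetFormalCharge() != 0 for a in mol.GetAtoms()):
        return False                                    # no charged groups
    if not (3 <= mol.GetNumBonds() <= 40):
        return False                                    # 3-40 bonds
    return True
```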

Algorithms

  • Tokenization (see the sketch after this list):
    • SELFIES: Split iteratively on the brackets [ and ], so each bracketed symbol becomes one token.
    • IUPAC: Split on punctuation ((, ), {, }, [, ], -, ., ,) and a fixed set of sub-word chemical morphemes (e.g., methyl, benzene, fluoro).
    • Padding: SELFIES sequences padded to 48 tokens; IUPAC names padded to 78 tokens. Start- and end-of-sequence markers are added to every sequence.
  • Optimization: Adam optimizer with a learning rate of $0.0005$.
  • Objective Function: Sparse categorical cross-entropy over the vocabulary of size $V$, where $y$ is the one-hot target and $\hat{y}$ the predicted distribution: $$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$
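
A minimal sketch of the tokenization/padding and the stated training configuration. The regex split and the <start>/<end>/<pad> token names are assumptions, not the authors' exact code; the learning rate and loss follow the paper:

```python
# Illustrative tokenization/padding plus the stated training configuration.
# The regex split and the <start>/<end>/<pad> token names are assumptions;
# the learning rate and loss follow the paper.
import re
import tensorflow as tf

def tokenize_selfies(selfies_str):
    # Each bracketed SELFIES symbol: "[C][N][=O]" -> ["[C]", "[N]", "[=O]"],
    # wrapped in start/end markers as described above.
    return ["<start>"] + re.findall(r"\[[^\]]*\]", selfies_str) + ["<end>"]

def pad(tokens, max_len=48):  # 48 for SELFIES, 78 for IUPAC names
    return tokens + ["<pad>"] * (max_len - len(tokens))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```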

Models

  • Architecture: Encoder-decoder sequence-to-sequence network with a Bahdanau attention mechanism for context weighting.
  • Components:
    • Encoder/Decoder: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).
    • Attention: Bahdanau (additive) soft attention, which computes alignment scores between the previous decoder state $s_{t-1}$ and each encoder hidden state $h_j$ to softly weight the encoder states (see the sketch below): $$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$
    • Embedding: The previous decoder output token is passed through an embedding layer and concatenated with the attention context vector before entering the decoder GRU.
  • Implementation: Python 3 backend using TensorFlow 2.3.0. Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.
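
A minimal Bahdanau attention layer matching the alignment equation above, written in the style of the standard TensorFlow seq2seq tutorial; a sketch, not the authors' exact implementation:

```python
# Minimal additive (Bahdanau) attention matching the alignment score above;
# a sketch in the style of the TensorFlow NMT tutorial, not the authors' code.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W_a = tf.keras.layers.Dense(units)  # projects decoder state s_{t-1}
        self.U_a = tf.keras.layers.Dense(units)  # projects encoder outputs h_j
        self.v_a = tf.keras.layers.Dense(1)      # scores each encoder position

    def call(self, query, values):
        # query:  (batch, hidden) previous decoder state
        # values: (batch, src_len, hidden) encoder hidden states
        query = tf.expand_dims(query, 1)                                    # (batch, 1, hidden)
        scores = self.v_a(tf.nn.tanh(self.W_a(query) + self.U_a(values)))  # e_{tj}
        weights = tf.nn.softmax(scores, axis=1)                            # attention weights
        context = tf.reduce_sum(weights * values, axis=1)                  # weighted sum of h_j
        return context, weights
```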

Evaluation

The evaluation combines linguistic accuracy with cheminformatic (structural) correctness:

| Metric | Details | Result (60M Model) | Notes |
| --- | --- | --- | --- |
| BLEU Score | NLTK sentence BLEU (unigram to 4-gram) | 0.94 (IUPAC $\to$ SELFIES) | Exact text overlap; a strictly syntactic proxy |
| Tanimoto Similarity | PubChem fingerprints via CDK | 0.98 (valid IUPAC names) | Substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$ |
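
A sketch of both metrics in Python. Note the swap: the paper computes Tanimoto similarity over PubChem fingerprints via CDK, while this example uses RDKit Morgan fingerprints to stay self-contained:

```python
# Sketch of both evaluation metrics. The paper computes Tanimoto similarity
# over PubChem fingerprints via CDK; RDKit Morgan fingerprints are swapped in
# here for a self-contained Python example.
from nltk.translate.bleu_score import sentence_bleu
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def bleu(reference_tokens, predicted_tokens):
    # NLTK sentence BLEU with the default uniform 1- to 4-gram weights.
    return sentence_bleu([reference_tokens], predicted_tokens)

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_a), radius, nBits=n_bits)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles_b), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)  # |A ∩ B| / |A ∪ B|
```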

Hardware

Comparison of hardware efficiency for training large chemical language models:

| Hardware | Batch Size | Time per Epoch (15M subset) | Speedup Factor |
| --- | --- | --- | --- |
| GPU (Tesla V100) | 256 | ~27 hours | 1x |
| TPU v3-8 | 1024 (global) | ~2 hours | 13x |
| TPU v3-32 | 1024 (global) | ~0.5 hours | 54x |
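
For context, connecting a TensorFlow 2.x training job to a TPU looks roughly like this. A sketch only: build_model is a hypothetical stand-in for the encoder-decoder described above, and the global batch size of 1024 matches the table:

```python
# Sketch of TPU setup in TensorFlow 2.x; `build_model` is a hypothetical
# stand-in, not from the paper. Global batch size 1024 matches the table.
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver()  # resolve TPU from environment
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = build_model()  # hypothetical: the GRU encoder-decoder with attention
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```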

Paper Information

Citation: Rajan, K., Zielesny, A., & Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. Journal of Cheminformatics, 13(1), 34. https://doi.org/10.1186/s13321-021-00512-4

Publication: Journal of Cheminformatics 2021

@article{rajanSTOUTSMILESIUPAC2021,
  title = {STOUT: SMILES to IUPAC Names Using Neural Machine Translation},
  shorttitle = {STOUT},
  author = {Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph},
  year = {2021},
  month = apr,
  journal = {Journal of Cheminformatics},
  volume = {13},
  number = {1},
  pages = {34},
  issn = {1758-2946},
  doi = {10.1186/s13321-021-00512-4},
  urldate = {2025-09-22},
  abstract = {Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.},
  langid = {english},
  keywords = {Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}
}
