<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Chemical Name Translation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/</link><description>Recent content in Chemical Name Translation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 26 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/index.xml" rel="self" type="application/rss+xml"/><item><title>Transformer Name-to-SMILES with Atom Count Losses</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/transformer-chemical-name-to-smiles/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/transformer-chemical-name-to-smiles/</guid><description>A Transformer seq2seq model translates chemical compound names to SMILES, using atom-count constraints and SMILES/InChI multi-task learning.</description><content:encoded><![CDATA[<h2 id="translating-chemical-names-to-structures-with-transformers">Translating Chemical Names to Structures with Transformers</h2>
<p>This is a <strong>Method</strong> paper that proposes using Transformer-based sequence-to-sequence models to predict chemical compound structures (represented as SMILES strings) from chemical compound names. The primary contribution is the application of neural machine translation techniques to the name-to-structure problem, along with two domain-specific improvements: an atom-count constraint loss function and a multi-task learning approach that jointly predicts SMILES and InChI strings.</p>
<h2 id="why-rule-based-name-to-structure-fails-for-synonyms">Why Rule-Based Name-to-Structure Fails for Synonyms</h2>
<p>Chemical compound names come in several varieties. IUPAC names follow systematic nomenclature and are well handled by rule-based parsers like OPSIN. Database IDs (e.g., CAS registry numbers) can be resolved by dictionary lookup. The third category, Synonyms (abbreviations, common names, and other informal designations), is the problematic one: its naming patterns are complex and highly variable.</p>
<p>In preliminary experiments, rule-based tools achieved F-measures of 0.878 to 0.960 on IUPAC names but only 0.719 to 0.758 on Synonyms. This performance gap motivates a data-driven approach. The authors frame name-to-SMILES prediction as a machine translation problem: the source language is the chemical compound name and the target language is the SMILES string. A neural model trained on millions of name-SMILES pairs can learn patterns that rule-based systems miss, particularly for non-systematic nomenclature.</p>
<h2 id="atom-count-constraints-and-multi-task-learning">Atom-Count Constraints and Multi-Task Learning</h2>
<p>The paper introduces two improvements over a vanilla Transformer seq2seq model.</p>
<h3 id="atom-count-constraint-loss">Atom-Count Constraint Loss</h3>
<p>A correct structure prediction must contain the right number of atoms of each element. The authors add an auxiliary loss that penalizes the squared difference between the predicted and true atom counts for each element. The predicted atom counts are obtained by summing Gumbel-softmax outputs across all decoded positions.</p>
<p>For the $i$-th output token, the Gumbel-softmax probability vector is:</p>
<p>$$
y_{ij} = \frac{\exp\left((\log(\pi_{ij}) + g_{ij}) / \tau\right)}{\sum_{k=1}^{|\mathcal{V}|} \exp\left((\log(\pi_{ik}) + g_{ik}) / \tau\right)}
$$</p>
<p>where $\pi_{ij}$ is the model&rsquo;s softmax output, $g_{ij}$ is a Gumbel noise sample, and $\tau = 0.1$ is the temperature. The predicted token frequency vector is $\mathbf{y}^{pred} = \sum_{i=1}^{m} \mathbf{y}_i$, and the atom-count loss is:</p>
<p>$$
\mathcal{L}_{atom} = \frac{1}{|A|} \sum_{a \in A} \left(N_a(T) - y_{idx(a)}^{pred}\right)^2
$$</p>
<p>where $A$ is the set of chemical elements in the vocabulary, $N_a(T)$ returns the number of atoms of element $a$ in the correct SMILES string $T$, and $idx(a)$ returns the vocabulary index of element $a$. Only element tokens (e.g., &ldquo;C&rdquo;, &ldquo;O&rdquo;) are counted; bond symbols (e.g., &ldquo;=&rdquo;, &ldquo;#&rdquo;) are excluded.</p>
<p>The combined objective is:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{atom} \mathcal{L}_{atom}
$$</p>
<p>with $\lambda_{atom} = 0.7$.</p>
<h3 id="multi-task-smilesinchi-prediction">Multi-Task SMILES/InChI Prediction</h3>
<p>SMILES and InChI strings encode the same chemical structure in different formats. The authors hypothesize that jointly predicting both representations improves what the shared encoder learns. The multi-task model shares the encoder between a SMILES decoder and an InChI decoder, minimizing:</p>
<p>$$
\mathcal{L}_{smiles} + \lambda_{inchi} \mathcal{L}_{inchi}
$$</p>
<p>where $\mathcal{L}_{inchi} = -\log P(I | X; \boldsymbol{\theta}_{enc}, \boldsymbol{\theta}_{inchi})$ and $\lambda_{inchi} = 0.3$.</p>
<h2 id="experimental-setup-and-evaluation">Experimental Setup and Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The dataset was constructed from PubChem dump data (97M compound records). Chemical compound names categorized as Synonyms were paired with canonical SMILES strings (converted via RDKit). Database-like IDs were filtered out using regular expressions. Duplicate names mapping to different CIDs were removed.</p>
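<p>The ID-filtering step might look like the following sketch. The paper states that database-like IDs were removed with regular expressions but does not publish the patterns; the two patterns below are illustrative guesses, not the authors' actual filters.</p>

```python
import re

# Illustrative patterns only -- the paper's actual regexes are unpublished.
ID_PATTERNS = [
    re.compile(r"\d{2,7}-\d{2}-\d"),     # CAS registry number shape
    re.compile(r"[A-Z]{2,10}\d{2,}"),    # accession-style IDs, e.g. CHEMBL25
]

def looks_like_database_id(name):
    return any(p.fullmatch(name) for p in ID_PATTERNS)

def filter_names(names):
    """Keep only plausible chemical names for the name-SMILES pairs."""
    return [n for n in names if not looks_like_database_id(n)]
```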
<table>
  <thead>
      <tr>
          <th>Split</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>5,000,000</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>1,113</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>11,194</td>
      </tr>
  </tbody>
</table>
<h3 id="model-configuration">Model Configuration</h3>
<p>The Transformer uses 6 encoder/decoder layers, 8 attention heads, 512-dimensional embeddings, and 0.1 dropout. Training used label-smoothing cross-entropy ($\epsilon = 0.1$), Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$), and a warmup schedule with peak learning rate 0.0005 over 4,000 steps followed by inverse square root decay. Models were trained for 300,000 update steps. Final predictions averaged the last 10 checkpoints and used beam search (beam size 4, length penalty $\alpha = 0.6$, max output length 200).</p>
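<p>The warmup-then-inverse-square-root schedule can be written compactly. The paper reports the peak learning rate and warmup length; the linear shape of the warmup ramp below is an assumption (it matches the common fairseq-style default).</p>

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup=4000):
    """Linear warmup to peak_lr over `warmup` steps (ramp shape assumed),
    then inverse-square-root decay as described in the paper."""
    step = max(step, 1)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (warmup / step) ** 0.5
```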
<h3 id="tokenization">Tokenization</h3>
<p>Three tokenization strategies were compared:</p>
<ul>
<li><strong>BPE</strong>: Byte pair encoding learned on chemical compound names (500 merge operations) via fastBPE</li>
<li><strong>OPSIN-TK</strong>: The OPSIN rule-based tokenizer</li>
<li><strong>OPSIN-TK+BPE</strong>: A hybrid where OPSIN handles tokenizable names and BPE handles the rest</li>
</ul>
<p>SMILES tokens were identified by regular expressions (elements as single tokens, remaining symbols as characters). InChI strings were tokenized by SentencePiece (vocabulary size 1,000).</p>
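<p>A SMILES tokenizer of the kind described (multi-character elements kept whole, remaining symbols as single characters) can be sketched with one regular expression. The paper's exact pattern is not given; this one is adapted from a regex commonly used in the chemical-NLP literature.</p>

```python
import re

# Approximation of the paper's tokenizer: bracket atoms and two-letter
# elements (Br, Cl) stay whole; everything else is a single character.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|"
    r"\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9]"
)

def tokenize_smiles(smiles):
    return SMILES_TOKEN.findall(smiles)
```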
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>OPSIN</strong>: Open-source rule-based parser</li>
<li><strong>Tool A</strong> and <strong>Tool B</strong>: Two commercially available name-to-structure tools</li>
</ul>
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Tokenizer</th>
          <th>Recall</th>
          <th>Precision</th>
          <th>F-measure</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>OPSIN</td>
          <td>Rule-based</td>
          <td>0.693</td>
          <td>0.836</td>
          <td>0.758</td>
      </tr>
      <tr>
          <td>Tool A</td>
          <td>Rule-based</td>
          <td>0.711</td>
          <td>0.797</td>
          <td>0.752</td>
      </tr>
      <tr>
          <td>Tool B</td>
          <td>Rule-based</td>
          <td>0.653</td>
          <td>0.800</td>
          <td>0.719</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>BPE</td>
          <td>0.793</td>
          <td>0.806</td>
          <td>0.799</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>BPE</td>
          <td>0.798</td>
          <td>0.808</td>
          <td>0.803</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>BPE</td>
          <td>0.810</td>
          <td>0.819</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.763</td>
          <td>0.873</td>
          <td>0.814</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.768</td>
          <td>0.876</td>
          <td>0.818</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK+BPE</td>
          <td>0.779</td>
          <td>0.886</td>
          <td>0.829</td>
      </tr>
      <tr>
          <td>Transformer</td>
          <td>OPSIN-TK</td>
          <td>0.755</td>
          <td>0.868</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ atomnum</td>
          <td>OPSIN-TK</td>
          <td>0.757</td>
          <td>0.867</td>
          <td>0.808</td>
      </tr>
      <tr>
          <td>+ inchigen</td>
          <td>OPSIN-TK</td>
          <td>0.754</td>
          <td>0.869</td>
          <td>0.807</td>
      </tr>
  </tbody>
</table>
<p>The best configuration (inchigen with OPSIN-TK+BPE) achieved an F-measure of 0.829, surpassing OPSIN by 0.071 points. The multi-task learning approach (inchigen) consistently outperformed the atom-count constraint alone (atomnum) across all tokenizer settings.</p>
<h2 id="key-findings-and-error-analysis">Key Findings and Error Analysis</h2>
<p>The Transformer-based approach produced grammatically correct SMILES strings (parseable by RDKit) for 99% of test examples, compared to 81.6-88.4% for the rule-based tools. Even when predictions were incorrect, they tended to be structurally similar to the correct answer. Using MACCS fingerprints and Jaccard (Tanimoto) similarity, the average similarity between incorrectly predicted and correct structures was 0.753.</p>
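<p>The similarity measure used here is Jaccard (Tanimoto) similarity over fingerprint bits. In RDKit one would compare the on-bits of <code>MACCSkeys.GenMACCSKeys(mol)</code>; the sketch below uses plain Python sets in place of real fingerprints to show the computation.</p>

```python
def tanimoto(fp_a, fp_b):
    """Jaccard (Tanimoto) similarity between two fingerprint bit sets.
    Real MACCS fingerprints would supply the sets; plain sets stand in."""
    union = fp_a | fp_b
    return len(fp_a & fp_b) / len(union) if union else 1.0
```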
<p>The OPSIN-TK tokenizer yielded higher precision than BPE because approximately 11.5% (1,293 of 11,194) of test compounds could not be tokenized by OPSIN, reducing the number of outputs. BPE-based tokenizers achieved higher recall by covering all inputs. The hybrid OPSIN-TK+BPE approach balanced both, achieving the highest overall F-measure.</p>
<p><strong>Limitations</strong>: The paper does not evaluate on IUPAC names separately with the Transformer models (only comparing rule-based tools on IUPAC). The atom-count constraint and multi-task learning are not combined in a single model. The dataset is released but the training code is not. Hardware details and training times are not reported. The evaluation uses only exact-match F-measure and Jaccard similarity, without measuring partial credit for nearly-correct structures.</p>
<p><strong>Future work</strong>: The authors plan to explore additional tokenization methods, combine the atom-count constraint with multi-task learning, and apply the constraint loss to other chemistry problems including chemical reaction prediction.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>5,000,000 pairs</td>
          <td>Chemical compound names to canonical SMILES</td>
      </tr>
      <tr>
          <td>Development</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>1,113 pairs</td>
          <td>Filtered for duplicates</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>PubChem Synonyms (custom split)</td>
          <td>11,194 pairs</td>
          <td>Filtered for duplicates; released as benchmark</td>
      </tr>
  </tbody>
</table>
<p>The authors state the dataset is released for future research. The data was constructed from the PubChem dump (97M compound records) using RDKit for SMILES canonicalization. Database-like IDs were removed with regular expressions and duplicate names across CIDs were filtered.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer seq2seq (6 layers, 8 heads, 512-dim embeddings)</li>
<li>BPE tokenization via fastBPE (500 merge operations)</li>
<li>SentencePiece for InChI tokenization (vocabulary size 1,000)</li>
<li>Gumbel-softmax atom-count constraint ($\tau = 0.1$, $\lambda_{atom} = 0.7$)</li>
<li>Multi-task SMILES/InChI loss ($\lambda_{inchi} = 0.3$)</li>
<li>Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-8}$)</li>
<li>Label smoothing ($\epsilon = 0.1$), 300K training steps</li>
<li>Beam search (beam size 4, length penalty $\alpha = 0.6$)</li>
</ul>
<h3 id="models">Models</h3>
<p>Standard Transformer architecture following Vaswani et al. (2017). No pre-trained weights or model checkpoints are released.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value</th>
          <th>Model</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>F-measure</td>
          <td>0.829</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Precision</td>
          <td>0.886</td>
          <td>inchigen (OPSIN-TK+BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Recall</td>
          <td>0.810</td>
          <td>inchigen (BPE)</td>
          <td>Highest overall</td>
      </tr>
      <tr>
          <td>Grammatical correctness</td>
          <td>99%</td>
          <td>inchigen (BPE)</td>
          <td>SMILES parseable by RDKit</td>
      </tr>
      <tr>
          <td>Avg. Jaccard similarity (errors)</td>
          <td>0.753</td>
          <td>inchigen (BPE)</td>
          <td>On incorrect predictions only</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Omote, Y., Matsushita, K., Iwakura, T., Tamura, A., &amp; Ninomiya, T. (2020). Transformer-based Approach for Predicting Chemical Compound Structures. <em>Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing</em>, 154-162. <a href="https://doi.org/10.18653/v1/2020.aacl-main.19">https://doi.org/10.18653/v1/2020.aacl-main.19</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{omote2020transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer-based Approach for Predicting Chemical Compound Structures}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Omote, Yutaro and Matsushita, Kyoumoto and Iwakura, Tomoya and Tamura, Akihiro and Ninomiya, Takashi}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{154--162}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2020.aacl-main.19}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Neural Machine Translation of Chemical Nomenclature</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/nmt-chemical-nomenclature-en-zh/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/nmt-chemical-nomenclature-en-zh/</guid><description>Xu et al. apply CNN and LSTM seq2seq models to translate chemical nomenclature between English and Chinese, outperforming rule-based tools.</description><content:encoded><![CDATA[<h2 id="a-method-for-neural-translation-of-chemical-names">A Method for Neural Translation of Chemical Names</h2>
<p>This is a <strong>Method</strong> paper that introduces deep learning approaches for translating chemical nomenclature between English and Chinese. The primary contribution is demonstrating that character-level sequence-to-sequence neural networks (both CNN-based and LSTM-based) can serve as viable alternatives to hand-crafted rule-based translation systems for chemical names. The work compares two neural architectures against an existing rule-based tool on bilingual chemical name datasets.</p>
<h2 id="bridging-the-english-chinese-chemical-nomenclature-gap">Bridging the English-Chinese Chemical Nomenclature Gap</h2>
<p>English and Chinese are the two most widely used languages for chemical nomenclature worldwide. Translation between them is important for chemical data processing, especially for converting Chinese chemical names extracted via named entity recognition into English names that existing name-to-structure tools can parse. Rule-based translation between these languages faces considerable challenges:</p>
<ol>
<li>Chinese chemical names lack word boundaries (no spaces), making segmentation difficult.</li>
<li>Word order is often reversed between English and Chinese chemical names (e.g., &ldquo;ethyl acetate&rdquo; maps to characters meaning &ldquo;acetate-ethyl&rdquo; in Chinese).</li>
<li>The same English morpheme can map to different Chinese characters depending on chemical context (e.g., &ldquo;ethyl&rdquo; translates differently in &ldquo;ethyl acetate&rdquo; vs. &ldquo;ethyl alcohol&rdquo;).</li>
<li>Trivial names, especially for natural products, follow irregular translation patterns or are transliterations.</li>
</ol>
<p>Building comprehensive rule sets requires a formally trained chemist fluent in both languages, making rule-based approaches expensive and fragile.</p>
<h2 id="character-level-sequence-to-sequence-translation">Character-Level Sequence-to-Sequence Translation</h2>
<p>The core idea is to treat chemical name translation as a character-level machine translation task, applying encoder-decoder architectures with attention mechanisms. Two architectures are proposed:</p>
<p><strong>CNN-based architecture</strong>: Three 1D convolutional layers encode the input character sequence. A decoder with three 1D convolutional layers processes the target sequence offset by one timestep, combined with attention mechanism layers that connect encoder and decoder outputs. Two additional 1D convolutional layers produce the final decoded output sequence.</p>
<p><strong>LSTM-based architecture</strong>: An LSTM encoder converts the input sequence into two state vectors. An LSTM decoder is trained with teacher forcing, using the encoder&rsquo;s state vectors as its initial state, and generating the target sequence offset by one timestep.</p>
<p>Both models operate at the character level. Input chemical name strings are transformed into embedding vectors, with the vocabulary size equal to the number of unique characters in the respective language (100 unique characters for English names, 2,056 unique characters for Chinese names).</p>
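<p>Building the character vocabulary is straightforward. The paper only reports the resulting sizes (100 English, 2,056 Chinese characters); the special-marker tokens in this sketch are assumptions, since seq2seq decoders typically need start/end/padding symbols.</p>

```python
def build_char_vocab(names):
    """One index per unique character in the corpus, plus special
    markers (marker names assumed; the paper reports only sizes)."""
    vocab = {"<pad>": 0, "<s>": 1, "</s>": 2}
    for ch in sorted({c for name in names for c in name}):
        vocab[ch] = len(vocab)
    return vocab
```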
<h2 id="experimental-setup-and-comparison-with-rule-based-tool">Experimental Setup and Comparison with Rule-Based Tool</h2>
<h3 id="datasets">Datasets</h3>
<p>The authors built two directional datasets from a manually curated corpus of scientific literature maintained at their institution:</p>
<ul>
<li><strong>En2Ch (English to Chinese)</strong>: 30,394 name pairs after deduplication</li>
<li><strong>Ch2En (Chinese to English)</strong>: 37,207 name pairs after deduplication</li>
</ul>
<p>The datasets span the range from systematic compound names to trivial names. For names with multiple valid translations, the most commonly used translation was selected. Each dataset was split 80/20 for training and validation.</p>
<h3 id="model-configuration">Model Configuration</h3>
<p>Both neural network models used the following hyperparameters:</p>
<ul>
<li>Batch size: 64</li>
<li>Epochs: 100</li>
<li>Latent dimensionality: 256 (encoding and decoding space)</li>
<li>Implementation: Python 3.7 with Keras 2.3 and TensorFlow backend</li>
</ul>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<p>The models were evaluated on five metrics across both translation directions:</p>
<ul>
<li><strong>Success Rate</strong>: Percentage of inputs that produced any output</li>
<li><strong>String Matching Accuracy</strong>: Exact match with the single target name</li>
<li><strong>Data Matching Accuracy</strong>: Exact match allowing any valid translation from the corpus</li>
<li><strong>Manual Spot Check</strong>: Blind evaluation of 100 random samples per approach</li>
<li><strong>Running Time</strong>: Wall-clock time on the same hardware</li>
</ul>
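<p>The three accuracy-style metrics above differ only in what counts as a hit, which a few lines make concrete (using <code>None</code> to represent a failed translation):</p>

```python
def success_rate(outputs):
    """Fraction of inputs that produced any output (None = failure)."""
    return sum(o is not None for o in outputs) / len(outputs)

def string_match_acc(preds, targets):
    """Exact match against the single reference translation."""
    return sum(p == t for p, t in zip(preds, targets)) / len(preds)

def data_match_acc(preds, valid_sets):
    """Exact match allowing any valid translation from the corpus."""
    return sum(p in vs for p, vs in zip(preds, valid_sets)) / len(preds)
```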
<h3 id="baseline">Baseline</h3>
<p>The rule-based comparison system operates in three steps: disassemble the input name into word fragments, translate each fragment, and reassemble into the target language. This tool had been deployed as an online service with over one million uses at the time of publication.</p>
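<p>A toy version of that three-step pipeline also shows why it fails. The two-entry fragment dictionary below is an invented placeholder (real systems encode thousands of curated rules): naive lookup renders &ldquo;ethyl acetate&rdquo; as 乙酸乙基, whereas the correct name is 乙酸乙酯 (&ldquo;ethyl&rdquo; becomes 乙酯 in the ester context), which is exactly the context-dependence problem listed above; and any name containing an unknown fragment yields no output at all, depressing the success rate.</p>

```python
# Placeholder fragment dictionary -- real systems use thousands of rules.
FRAGMENTS = {"ethyl": "乙基", "acetate": "乙酸"}

def rule_based_translate(name):
    parts = name.split()                        # step 1: disassemble
    try:
        pieces = [FRAGMENTS[p] for p in parts]  # step 2: translate fragments
    except KeyError:
        return None                             # unknown fragment -> no output
    return "".join(reversed(pieces))            # step 3: reassemble (reversed)
```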
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>CNN</th>
          <th>LSTM</th>
          <th>Rule-based</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate En2Ch</td>
          <td>100%</td>
          <td>100%</td>
          <td>75.97%</td>
      </tr>
      <tr>
          <td>Success Rate Ch2En</td>
          <td>100%</td>
          <td>100%</td>
          <td>59.90%</td>
      </tr>
      <tr>
          <td>String Match En2Ch</td>
          <td>82.92%</td>
          <td>89.64%</td>
          <td>39.81%</td>
      </tr>
      <tr>
          <td>String Match Ch2En</td>
          <td>78.11%</td>
          <td>55.44%</td>
          <td>43.77%</td>
      </tr>
      <tr>
          <td>Data Match En2Ch</td>
          <td>84.44%</td>
          <td>90.82%</td>
          <td>45.15%</td>
      </tr>
      <tr>
          <td>Data Match Ch2En</td>
          <td>80.22%</td>
          <td>57.40%</td>
          <td>44.91%</td>
      </tr>
      <tr>
          <td>Manual Check En2Ch</td>
          <td>90.00%</td>
          <td>89.00%</td>
          <td>80.00%</td>
      </tr>
      <tr>
          <td>Manual Check Ch2En</td>
          <td>82.00%</td>
          <td>61.00%</td>
          <td>78.00%</td>
      </tr>
      <tr>
          <td>Time En2Ch (s)</td>
          <td>1423</td>
          <td>190</td>
          <td>288</td>
      </tr>
      <tr>
          <td>Time Ch2En (s)</td>
          <td>1876</td>
          <td>303</td>
          <td>322</td>
      </tr>
  </tbody>
</table>
<p>Both neural approaches achieved 100% success rate (always producing output), while the rule-based tool failed on 24% and 40% of inputs for En2Ch and Ch2En respectively. The rule-based tool&rsquo;s failures were concentrated on Chinese names lacking word boundaries and on trivial names of natural products.</p>
<p>For English-to-Chinese translation, LSTM performed best at 89.64% string matching accuracy (90.82% data matching), followed by CNN at 82.92%. For Chinese-to-English, CNN substantially outperformed LSTM (78.11% vs. 55.44% string matching), suggesting that LSTM had difficulty with long-term dependencies in Chinese character sequences. The authors observed that many LSTM errors appeared at the ends of chemical names.</p>
<h3 id="analysis-by-name-type">Analysis by Name Type</h3>
<p>The CNN-based approach outperformed LSTM on CAS names (80% vs. 52% in manual checks) and was more robust for longer names. The rule-based tool showed consistent performance regardless of name length, suggesting it was more suited to regular systematic names but struggled with the diversity of real-world chemical nomenclature.</p>
<h3 id="limitations">Limitations</h3>
<ul>
<li>Performance depends heavily on training data quality and quantity.</li>
<li>Neither neural approach was validated on an external test set outside the institution&rsquo;s corpus.</li>
<li>The CNN model was considerably slower (5-6x) than the other two approaches.</li>
<li>No comparison against modern transformer-based NMT architectures (the study predates widespread adoption of transformers for this task).</li>
<li>The dataset is relatively small by modern NMT standards (30-37K pairs).</li>
<li>The authors noted that some neural translations were actually better than the target labels, suggesting the evaluation metrics understate true performance.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest that combining CNN and LSTM architectures could yield further improvements, and that the approach has practical applications in scientific publishing (Chinese journals requiring English abstracts) and chemical database interoperability.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Validation (En2Ch)</td>
          <td>Curated bilingual corpus</td>
          <td>30,394 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Training/Validation (Ch2En)</td>
          <td>Curated bilingual corpus</td>
          <td>37,207 pairs</td>
          <td>80/20 split, from SIOC chemical data system</td>
      </tr>
      <tr>
          <td>Testing (En2Ch)</td>
          <td>Held-out validation split</td>
          <td>6,079 records</td>
          <td>Same source</td>
      </tr>
      <tr>
          <td>Testing (Ch2En)</td>
          <td>Held-out validation split</td>
          <td>7,441 records</td>
          <td>Same source</td>
      </tr>
  </tbody>
</table>
<p>Training data, Python code for both models, and result data are provided as supplementary files with the paper.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Character-level CNN encoder-decoder with attention (3+3+2 conv layers)</li>
<li>Character-level LSTM encoder-decoder with teacher forcing</li>
<li>Batch size: 64, epochs: 100, latent dim: 256</li>
</ul>
<h3 id="models">Models</h3>
<p>Both models implemented in Python 3.7 with Keras 2.3 / TensorFlow. No pre-trained weights are released separately, but the training code is provided as supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Value (En2Ch)</th>
          <th>Best Value (Ch2En)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Success Rate</td>
          <td>100% (both DL)</td>
          <td>100% (both DL)</td>
          <td>Rule-based: 75.97% / 59.90%</td>
      </tr>
      <tr>
          <td>String Matching</td>
          <td>89.64% (LSTM)</td>
          <td>78.11% (CNN)</td>
          <td>Best neural model per direction</td>
      </tr>
      <tr>
          <td>Data Matching</td>
          <td>90.82% (LSTM)</td>
          <td>80.22% (CNN)</td>
          <td>Allows multiple valid translations</td>
      </tr>
      <tr>
          <td>Manual Spot Check</td>
          <td>90.00% (CNN)</td>
          <td>82.00% (CNN)</td>
          <td>Blind evaluation of 100 samples</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper. Running times are reported, but hardware details are not provided.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.1186/s13321-020-00457-0">Supplementary files</a></td>
          <td>Code + Data</td>
          <td>CC-BY 4.0</td>
          <td>Training data, CNN/LSTM code, results (Additional files 1-6)</td>
      </tr>
      <tr>
          <td><a href="https://www.organchem.csdb.cn/translate">SIOC Translation Tool</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Rule-based baseline tool, online service</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Xu, T., Chen, W., Zhou, J., Dai, J., Li, Y., &amp; Zhao, Y. (2020). Neural machine translation of chemical nomenclature between English and Chinese. <em>Journal of Cheminformatics</em>, 12, 50. <a href="https://doi.org/10.1186/s13321-020-00457-0">https://doi.org/10.1186/s13321-020-00457-0</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{xu2020neural,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Neural machine translation of chemical nomenclature between English and Chinese}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xu, Tingjun and Chen, Weiming and Zhou, Junhong and Dai, Jingfang and Li, Yingyong and Zhao, Yingli}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{12}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{50}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-020-00457-0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Translating InChI to IUPAC Names with Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/handsel-inchi-iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/handsel-inchi-iupac-2021/</guid><description>Sequence-to-sequence Transformer translating InChI identifiers to IUPAC names with 91% accuracy on organic compounds.</description><content:encoded><![CDATA[<h2 id="primary-contribution-a-transformer-based-method">Primary Contribution: A Transformer-Based Method</h2>
<p>This is primarily a <strong>Method</strong> paper. It adapts a specific architecture (Transformer) to a specific task (InChI-to-IUPAC translation) and evaluates its performance against both machine learning and commercial baselines. It also has a secondary <strong>Resource</strong> contribution, as the trained model and scripts are released as open-source software.</p>
<h2 id="motivation-the-bottleneck-in-algorithmic-iupac-nomenclature">Motivation: The Bottleneck in Algorithmic IUPAC Nomenclature</h2>
<p>Generating correct IUPAC names is difficult due to the comprehensive but complex rules defined by the International Union of Pure and Applied Chemistry. Commercial software generates names from structures but remains closed-source with opaque methodologies and frequent inter-package disagreements. Open identifiers like InChI and SMILES lack direct human readability. This creates a need for an open, automated method to generate informative IUPAC names from standard identifiers like InChI, which are ubiquitous in online chemical databases.</p>
<h2 id="novelty-treating-chemical-translation-as-a-character-level-sequence">Novelty: Treating Chemical Translation as a Character-Level Sequence</h2>
<p>The key novelty is treating chemical nomenclature translation as a character-level sequence-to-sequence problem using a Transformer architecture, specifically using <a href="/notes/computational-chemistry/molecular-representations/inchi-2013/">InChI</a> as the source language.</p>
<ul>
<li>Standard Neural Machine Translation (NMT) typically uses word- or sub-word tokenization; this model instead processes the InChI and predicts the IUPAC name character by character.</li>
<li>It demonstrates that character-level tokenization outperforms byte-pair encoding or unigram models for this specific chemical task.</li>
<li>It uses InChI&rsquo;s standardization to avoid the canonicalization issues inherent in SMILES-based approaches.</li>
<li>The attention mechanism allows the decoder to align specific parts of the generated IUPAC name with corresponding structural features in the source InChI string, operating via the standard scaled dot-product attention:
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$</li>
</ul>
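<p>As a concrete illustration (not the paper&rsquo;s code), the scaled dot-product attention above can be computed in a few lines of NumPy; the shapes here (3 positions, $d_k = 4$) are arbitrary toy values:</p>

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with softmax taken row-wise over keys."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

# Toy query/key/value matrices: 3 sequence positions, d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

<p>The returned <code>weights</code> matrix is what the authors visualize when aligning IUPAC name fragments with regions of the source InChI.</p>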
<h2 id="methodology--experimental-validation">Methodology &amp; Experimental Validation</h2>
<ul>
<li><strong>Training:</strong> The model was trained on 10 million InChI/IUPAC pairs sampled from PubChem using a character-level objective. The model is supervised using categorical cross-entropy loss across the vocabulary of characters:
$$ \mathcal{L} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) $$</li>
<li><strong>Ablation Studies:</strong> The authors experimentally validated architecture choices, finding that LSTM models and sub-word tokenization (BPE) performed worse than the Transformer with character tokenization. They also optimized dropout rates.</li>
<li><strong>Performance Benchmarking:</strong> The model was evaluated on a held-out test set of 200,000 samples. Performance was quantified primarily by Whole-Name Accuracy and Normalized Edit Distance (based on the Damerau-Levenshtein distance, scaled by the maximum string length).</li>
<li><strong>Commercial Comparison:</strong> The authors compared their model against four major commercial packages (ACD/I-Labs, ChemAxon, Mestrelab, and PubChem&rsquo;s Lexichem). However, this evaluation used a limited test set of only 100 molecules, which restricts the statistical confidence of the external comparison.</li>
<li><strong>Error Analysis:</strong> They analyzed performance across different chemical classes (organics, charged species, macrocycles, inorganics) and visualized attention coefficients to interpret model focus.</li>
</ul>
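<p>To make the character-level objective concrete, here is a minimal sketch (ours, not the authors&rsquo;) of the summed cross-entropy over a toy character vocabulary; the vocabulary and probability values are invented for illustration:</p>

```python
import math

def char_cross_entropy(target_chars, predicted_probs, vocab):
    """-sum_i log p(y_i): the one-hot target at each position selects the
    probability the decoder assigned to the correct character."""
    total = 0.0
    for ch, probs in zip(target_chars, predicted_probs):
        total -= math.log(probs[vocab.index(ch)])
    return total

vocab = ["C", "H", "4", "/"]   # toy character vocabulary
target = "CH4"                 # target characters, one per decoding step
# Hypothetical decoder outputs: one distribution over vocab per step
probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.8, 0.05, 0.05],
    [0.2, 0.1, 0.6, 0.1],
]
loss = char_cross_entropy(target, probs, vocab)
```

<p>The real model computes this over vocabularies of 66 (InChI) and 70 (IUPAC) characters, with label smoothing applied.</p>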
<h2 id="key-results-and-the-inorganic-challenge">Key Results and the Inorganic Challenge</h2>
<ul>
<li><strong>High Accuracy on Organics:</strong> The model achieved 91% whole-name accuracy on the test set, performing particularly well on organic compounds.</li>
<li><strong>Comparable to Commercial Tools:</strong> On the limited 100-molecule benchmark, the edit distance between the model&rsquo;s predictions and commercial packages (15-23%) was similar to the variation found <em>between</em> the commercial packages themselves (16-21%).</li>
<li><strong>Limitations on Inorganics:</strong> The model performed poorly on inorganic (14% accuracy) and organometallic compounds (20% accuracy). This is attributed to inherent data limitations in the standard InChI format (which deliberately disconnects metal atoms from their ligands) and low training data coverage for those classes.</li>
<li><strong>Character-Level Superiority:</strong> Character-level tokenization was found to be essential; byte-pair encoding reduced accuracy significantly.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The dataset was derived from <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem&rsquo;s public FTP server</a> (<code>CID-SMILES.gz</code> and <code>CID-IUPAC.gz</code>).</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Raw</strong></td>
          <td>PubChem</td>
          <td>100M pairs</td>
          <td>Filtered for length (InChI &lt; 200 chars, IUPAC &lt; 150 chars). 132k unparseable SMILES dropped.</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Subsampled</td>
          <td>10M pairs</td>
          <td>Random sample from the filtered set.</td>
      </tr>
      <tr>
          <td><strong>Validation</strong></td>
          <td>Held-out</td>
          <td>10,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Test</strong></td>
          <td>Held-out</td>
          <td>200,000 samples</td>
          <td>Limited to InChI length &gt; 50 chars.</td>
      </tr>
      <tr>
          <td><strong>Tokenization</strong></td>
          <td>Vocab</td>
          <td>InChI: 66 chars<br>IUPAC: 70 chars</td>
          <td>Character-level tokenization. Spaces treated as tokens.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework</strong>: OpenNMT-py 2.0.0 (using PyTorch). Training scripts and vocabularies are available as supplementary files to the original publication. Pre-trained model weights are hosted on <a href="https://doi.org/10.5281/zenodo.5081159">Zenodo</a>.</li>
<li><strong>Architecture Type</strong>: Transformer Encoder-Decoder.</li>
<li><strong>Optimization</strong>: ADAM optimizer ($\beta_1=0.9, \beta_2=0.998$).</li>
<li><strong>Learning Rate</strong>: Linear warmup over 8000 steps to 0.0005, then decayed by inverse square root of iteration.</li>
<li><strong>Regularization</strong>:
<ul>
<li>Dropout: 0.1 (applied to dense and attentional layers).</li>
<li>Label Smoothing: Magnitude 0.1.</li>
</ul>
</li>
<li><strong>Training Strategy</strong>: Teacher forcing used for both training and validation.</li>
<li><strong>Gradient Accumulation</strong>: Gradients accumulated over 4 batches before updating parameters.</li>
<li><strong>Inference</strong>: Beam search with width 10 and length penalty 1.0.</li>
</ul>
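<p>The warmup-then-decay schedule can be written out explicitly. This is a standard &ldquo;Noam&rdquo;-style sketch matching the numbers above (8000 warmup steps, peak 0.0005), not code taken from the paper:</p>

```python
def learning_rate(step, peak=5e-4, warmup=8000):
    """Linear warmup to `peak` over `warmup` steps, then decay
    proportional to the inverse square root of the step count."""
    step = max(step, 1)  # avoid division by zero at step 0
    return peak * min(step / warmup, (warmup / step) ** 0.5)
```

<p>Halfway through warmup the rate is half the peak; after warmup it falls off as $\sqrt{\text{warmup}/\text{step}}$, e.g. back to half the peak by step 32,000.</p>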
<h3 id="models">Models</h3>
<ul>
<li><strong>Structure</strong>: 6 layers in encoder, 6 layers in decoder.</li>
<li><strong>Attention</strong>: 8 heads per attention sub-layer.</li>
<li><strong>Dimensions</strong>:
<ul>
<li>Feed-forward hidden state size: 2048.</li>
<li>Embedding vector length: 512.</li>
</ul>
</li>
<li><strong>Initialization</strong>: Glorot&rsquo;s method.</li>
<li><strong>Position</strong>: Positional encoding added to word vectors.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics reported include <strong>Whole-Name Accuracy</strong> (percentage of exact matches) and <strong>Normalized Edit Distance</strong> (Damerau-Levenshtein, scale 0-1).</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Accuracy (All)</td>
          <td>91%</td>
          <td>N/A</td>
          <td>Test set of 200k samples.</td>
      </tr>
      <tr>
          <td>Accuracy (Inorganic)</td>
          <td>14%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Organometallic)</td>
          <td>20%</td>
          <td>N/A</td>
          <td>Limited by InChI format and data.</td>
      </tr>
      <tr>
          <td>Accuracy (Charged)</td>
          <td>79%</td>
          <td>N/A</td>
          <td>Test set subset.</td>
      </tr>
      <tr>
          <td>Accuracy (Rajan)</td>
          <td>72%</td>
          <td>N/A</td>
          <td>Comparative ML model (STOUT).</td>
      </tr>
      <tr>
          <td>Edit Dist (Organic)</td>
          <td>$0.02 \pm 0.03$</td>
          <td>N/A</td>
          <td>Very high similarity for organics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Inorganic)</td>
          <td>$0.32 \pm 0.20$</td>
          <td>N/A</td>
          <td>Poor performance on inorganics.</td>
      </tr>
      <tr>
          <td>Edit Dist (Organometallic)</td>
          <td>$0.37 \pm 0.24$</td>
          <td>N/A</td>
          <td>Poor performance on organometallics.</td>
      </tr>
  </tbody>
</table>
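<p>The Normalized Edit Distance reported above can be sketched as the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, scaled by the longer string&rsquo;s length; this is our illustration of the metric, not the authors&rsquo; implementation:</p>

```python
def normalized_edit_distance(a, b):
    """Optimal-string-alignment Damerau-Levenshtein distance (insertions,
    deletions, substitutions, adjacent transpositions), divided by the
    longer string's length: 0 = identical, 1 = maximally different."""
    if not a and not b:
        return 0.0
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n] / max(m, n)
```

<p>On this scale the reported $0.02$ for organics means predicted names differ from the reference by roughly one edit per fifty characters.</p>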
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU</strong>: Tesla K80.</li>
<li><strong>Training Time</strong>: 7 days.</li>
<li><strong>Throughput</strong>: ~6000 tokens/sec (InChI) and ~3800 tokens/sec (IUPAC).</li>
<li><strong>Batch Size</strong>: 4096 tokens (approx. 30 compounds).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.5081159">InChI to IUPAC model</a></td>
          <td>Model</td>
          <td>CC BY 4.0</td>
          <td>Pre-trained Transformer weights (551 MB), requires OpenNMT-py 2.0.0</td>
      </tr>
      <tr>
          <td><a href="https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/">PubChem FTP</a></td>
          <td>Dataset</td>
          <td>Public Domain</td>
          <td>Source data: CID-SMILES.gz and CID-IUPAC.gz</td>
      </tr>
      <tr>
          <td>Training scripts &amp; vocabularies</td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Included as supplementary files with the publication</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Handsel, J., Matthews, B., Knight, N. J., &amp; Coles, S. J. (2021). Translating the InChI: Adapting Neural Machine Translation to Predict IUPAC Names from a Chemical Identifier. <em>Journal of Cheminformatics</em>, 13(1), 79. <a href="https://doi.org/10.1186/s13321-021-00535-x">https://doi.org/10.1186/s13321-021-00535-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{handselTranslatingInChIAdapting2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Translating the {{InChI}}: Adapting Neural Machine Translation to Predict {{IUPAC}} Names from a Chemical Identifier}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{Translating the {{InChI}}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Handsel, Jennifer and Matthews, Brian and Knight, Nicola J. and Coles, Simon J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = oct,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{79}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00535-x}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-12-20}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{We present a sequence-to-sequence machine learning model for predicting the IUPAC name of a chemical from its standard International Chemical Identifier (InChI). The model uses two stacks of transformers in an encoder-decoder architecture, a setup similar to the neural networks used in state-of-the-art machine translation. Unlike neural machine translation, which usually tokenizes input and output into words or sub-words, our model processes the InChI and predicts the IUPAC name character by character. The model was trained on a dataset of 10 million InChI/IUPAC name pairs freely downloaded from the National Library of Medicine&#39;s online PubChem service. Training took seven days on a Tesla K80 GPU, and the model achieved a test set accuracy of 91\%. The model performed particularly well on organics, with the exception of macrocycles, and was comparable to commercial IUPAC name generation software. The predictions were less accurate for inorganic and organometallic compounds. This can be explained by inherent limitations of standard InChI for representing inorganics, as well as low coverage in the training data.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention,GPU,InChI,IUPAC,seq2seq,Transformer}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Struct2IUPAC: Translating SMILES to IUPAC via Transformers</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/struct2iupac-2021/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/struct2iupac-2021/</guid><description>A Transformer-based model for translating between SMILES strings and IUPAC names, trained on 47M PubChem examples, achieving 98.9% accuracy with verification.</description><content:encoded><![CDATA[<h2 id="struct2iupac-as-a-methodological-shift">Struct2IUPAC as a Methodological Shift</h2>
<p>This is primarily a <strong>Method</strong> paper with significant elements of <strong>Position</strong>.</p>
<ul>
<li><strong>Method</strong>: The authors propose a specific neural architecture (Transformer with custom tokenization) and a verification pipeline (round-trip check) to solve the SMILES $\leftrightarrow$ IUPAC translation task. They rigorously benchmark this against rule-based baselines (OPSIN).</li>
<li><strong>Position</strong>: The authors explicitly argue for a paradigm shift, suggesting that &ldquo;heavy&rdquo; neural architectures should replace complex, costly rule-based legacy systems even for &ldquo;exact&rdquo; algorithmic tasks.</li>
</ul>
<h2 id="the-cost-of-rule-based-chemical-naming">The Cost of Rule-Based Chemical Naming</h2>
<ul>
<li><strong>Complexity of Naming</strong>: Generating IUPAC names manually is error-prone and requires deep algorithmic knowledge.</li>
<li><strong>Lack of Open Source Tools</strong>: While open-source tools exist for Name-to-Structure (e.g., OPSIN), there were no open-source tools for the inverse &ldquo;Structure-to-Name&rdquo; conversion at the time of writing.</li>
<li><strong>Cost of Development</strong>: Developing rule-based converters &ldquo;from scratch&rdquo; is prohibitively expensive and time-consuming compared to training a neural model on existing data.</li>
</ul>
<h2 id="struct2iupac-core-innovation">Struct2IUPAC Core Innovation</h2>
<ul>
<li><strong>Struct2IUPAC</strong>: The first effective open-source neural model for <a href="/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout-v2/">converting SMILES to IUPAC names</a>, treating chemical translation as a Neural Machine Translation (NMT) problem.</li>
<li><strong>Verification Loop</strong>: A novel inference pipeline that generates multiple candidates via beam search and validates them using a reverse converter (OPSIN) to ensure the generated name maps back to the original structure.</li>
<li><strong>Custom Tokenization</strong>: A manually curated rule-based tokenizer for IUPAC names that handles specific chemical suffixes, prefixes, and stereochemical markers.</li>
</ul>
<h2 id="experimental-setup-and-stress-testing">Experimental Setup and Stress Testing</h2>
<ul>
<li><strong>Accuracy Benchmarking</strong>: The models were tested on a held-out subset of 100,000 molecules from PubChem. The authors measured accuracy across different beam sizes (1, 3, 5).</li>
<li><strong>Comparison to Rules</strong>: The neural IUPAC2Struct model was compared directly against the rule-based OPSIN tool.</li>
<li><strong>Stress Testing</strong>:
<ul>
<li><strong>Sequence Length</strong>: Evaluated performance across varying token lengths, identifying a &ldquo;sweet spot&rdquo; (10-60 tokens) and failure modes for very short (e.g., methane) or long molecules.</li>
<li><strong>Stereochemistry</strong>: Tested on &ldquo;stereo-dense&rdquo; compounds. The authors define a &ldquo;stereo-density&rdquo; index ($I$) as the ratio of stereocenters ($S$) to total tokens ($N$):
$$I = \frac{S}{N}$$
They observed a performance drop for these dense molecules, though the model still handled many stereocenters robustly.</li>
<li><strong>Tautomers</strong>: Verified the model&rsquo;s ability to handle different tautomeric forms (e.g., Guanine and Uracil variants).</li>
</ul>
</li>
<li><strong>Latency Analysis</strong>: Benchmarked inference speeds on CPU vs. GPU relative to output sequence length.</li>
</ul>
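<p>The stereo-density index can be sketched directly from a tokenized SMILES string. The tokenizer and the choice to count <code>@</code>-annotated atoms as stereocenters are our simplifications for illustration; the paper&rsquo;s exact token definition may differ:</p>

```python
import re

# Minimal SMILES token pattern: bracket atoms, common two-letter
# elements, then any single character (a simplification of full
# SMILES tokenization).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

def stereo_density(smiles):
    """I = S / N: stereocenters (here, '@'-annotated bracket atoms)
    divided by the total token count."""
    tokens = TOKEN_RE.findall(smiles)
    stereocenters = sum(1 for t in tokens if "@" in t)
    return stereocenters / len(tokens)
```

<p>For alanine, <code>C[C@H](N)C(=O)O</code>, this gives one stereocenter over eleven tokens, $I \approx 0.09$.</p>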
<h2 id="benchmarks-and-outcomes">Benchmarks and Outcomes</h2>
<ul>
<li><strong>High Accuracy</strong>: The Struct2IUPAC model achieved <strong>98.9% accuracy</strong> (Beam 5 with verification). The reverse model (IUPAC2Struct) achieved <strong>99.1%</strong>, comparable to OPSIN&rsquo;s 99.4%.</li>
<li><strong>Distribution Modeling vs. Intuition</strong>: The authors claim the model infers &ldquo;chemical logic,&rdquo; because it correctly generates multiple valid IUPAC names for single molecules where naming ambiguity exists (e.g., parent group selection). However, this more likely reflects the Transformer successfully modeling the high-frequency conditional probability distribution of synonymous names present in the PubChem training data, rather than learning intrinsic chemical rules.</li>
<li><strong>Production Readiness</strong>: Inference on GPU takes less than 0.5 seconds even for long names, making it viable for production use.</li>
<li><strong>Paradigm Shift</strong>: The authors conclude that neural networks are a viable, cost-effective alternative to developing rule-based algorithms for legacy notation conversion.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The study utilized the PubChem database.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Total</strong></td>
          <td>PubChem</td>
          <td>~95M</td>
          <td>Filtered for RDKit compatibility</td>
      </tr>
      <tr>
          <td><strong>Training</strong></td>
          <td>Split A</td>
          <td>47,312,235</td>
          <td>Random 50% split</td>
      </tr>
      <tr>
          <td><strong>Testing</strong></td>
          <td>Split B</td>
          <td>47,413,850</td>
          <td>Random 50% split</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Cleaning</strong>: Molecules that could not be processed by RDKit were removed. Molecules containing tokens not in the tokenizer (e.g., aromatic selenium) were excluded.</li>
<li><strong>Availability</strong>: A subset of 100,000 test molecules is available on GitHub (<code>data/test_100000.csv</code>) and Zenodo. The full train/test splits are not explicitly provided.</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SMILES</strong>: Character-based tokenization.</li>
<li><strong>IUPAC</strong>: Custom rule-based tokenizer splitting suffixes (<code>-one</code>, <code>-al</code>), prefixes (<code>-oxy</code>, <code>-di</code>), and special symbols (brackets and stereodescriptors such as <code>(R)</code>/<code>(S)</code>).</li>
</ul>
</li>
<li><strong>Verification Step</strong>:
<ol>
<li>Generate $N$ names using Beam Search ($N=5$).</li>
<li>Reverse translate the candidate name using OPSIN.</li>
<li>Check if the OPSIN structure matches the original input SMILES.</li>
<li>Display the first verified match; otherwise, report failure.</li>
</ol>
</li>
</ul>
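<p>The verification loop above can be sketched as a small round-trip check. The beam-search candidates and the OPSIN call are mocked here with a lookup table; a real pipeline would invoke the trained model and the actual OPSIN parser, and canonicalize structures with a toolkit such as RDKit:</p>

```python
def verify_name(smiles, candidates, reverse_convert, canonicalize=lambda s: s):
    """Round-trip check: return the first candidate name whose
    reverse-translated structure matches the input SMILES."""
    target = canonicalize(smiles)
    for name in candidates:
        back = reverse_convert(name)  # e.g., OPSIN name -> SMILES
        if back is not None and canonicalize(back) == target:
            return name
    return None  # no candidate survived verification

# Mocked components for illustration only:
mock_opsin = {"ethanol": "CCO", "methanol": "CO"}.get
candidates = ["methanol", "ethanol"]  # beam-search outputs, ranked
best = verify_name("CCO", candidates, mock_opsin)
```

<p>Here the top-ranked candidate fails the round trip and the second is returned, which is exactly how verification lifts exact-match accuracy from the beam-1 to the beam-5 figures.</p>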
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Standard Transformer with 6 encoder layers and 6 decoder layers.</li>
<li><strong>Hyperparameters</strong>:
<ul>
<li>Attention Heads: 8</li>
<li>Attention Dimension ($d_{\text{model}}$): 512</li>
<li>Feed-Forward Dimension ($d_{\text{ff}}$): 2048</li>
</ul>
</li>
<li><strong>Training Objective</strong>: The models were trained using standard autoregressive cross-entropy loss over the target token sequence $y$ given the input string $x$:
$$\mathcal{L} = - \sum_{t=1}^{T} \log P(y_t \mid y_{&lt;t}, x)$$</li>
<li><strong>Training</strong>: Two separate models were trained: <code>Struct2IUPAC</code> (SMILES $\to$ IUPAC) and <code>IUPAC2Struct</code> (IUPAC $\to$ SMILES).</li>
<li><strong>Availability</strong>: Code for model architecture is provided in the GitHub repository. Pre-trained weights for the IUPAC2Struct model are available, but the Struct2IUPAC model weights are not publicly released, meaning researchers would need to retrain that model on their own PubChem data to reproduce those results.</li>
</ul>
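<p>The reported hyperparameters fit in a small configuration object. The sizes are those listed above; the parameter estimate is our rough weight-matrix count (it ignores embeddings, biases, and layer norms, so treat it as an order-of-magnitude sketch):</p>

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    encoder_layers: int = 6
    decoder_layers: int = 6
    num_heads: int = 8
    d_model: int = 512
    d_ff: int = 2048

    def approx_params(self):
        """Rough weight count: 4*d^2 per attention block (Q, K, V, output
        projections), 2*d*d_ff per feed-forward block; each decoder layer
        carries two attention blocks (self- and cross-attention)."""
        attn = 4 * self.d_model ** 2
        ffn = 2 * self.d_model * self.d_ff
        enc = self.encoder_layers * (attn + ffn)
        dec = self.decoder_layers * (2 * attn + ffn)
        return enc + dec
```

<p>With the defaults this comes to roughly 44M weights in the encoder-decoder stacks, consistent with a standard &ldquo;base&rdquo;-sized Transformer.</p>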
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation was performed on a random subset of 100,000 molecules from the test set.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task</th>
          <th>Beam Size</th>
          <th>Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>1</td>
          <td>96.1%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Struct2IUPAC</td>
          <td>5</td>
          <td>98.9%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>1</td>
          <td>96.6%</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>IUPAC2Struct</td>
          <td>5</td>
          <td>99.1%</td>
      </tr>
  </tbody>
</table>
<ul>
<li><strong>Robustness</strong>: Accuracy drops significantly for augmented (non-canonical) SMILES (37.16%) and stereo-enriched compounds (66.52%).</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Infrastructure</strong>: 4 $\times$ Tesla V100 GPUs and 36 CPUs.</li>
<li><strong>Training Time</strong>: Approximately 10 days under full load.</li>
<li><strong>Inference Speed</strong>: &lt;0.5s per molecule on GPU; latency scales linearly with output token length.</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/sergsb/IUPAC2Struct">IUPAC2Struct (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Transformer code and pre-trained IUPAC2Struct model</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.4280814">Test data (Zenodo)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>100k test molecules, OPSIN failure cases, model failure cases</td>
      </tr>
      <tr>
          <td><a href="https://app.syntelly.com/smiles2iupac">Struct2IUPAC web demo</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Online interface for SMILES to IUPAC conversion</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Krasnov, L., Khokhlov, I., Fedorov, M. V., &amp; Sosnin, S. (2021). Transformer-based artificial neural networks for the conversion between chemical notations. <em>Scientific Reports</em>, 11(1), 14798. <a href="https://doi.org/10.1038/s41598-021-94082-y">https://doi.org/10.1038/s41598-021-94082-y</a></p>
<p><strong>Publication</strong>: Scientific Reports 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{krasnovTransformerbasedArtificialNeural2021a,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{Transformer-Based Artificial Neural Networks for the Conversion between Chemical Notations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Krasnov, Lev and Khokhlov, Ivan and Fedorov, Maxim V. and Sosnin, Sergey}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2021</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = jul,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{14798}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span> = <span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1038/s41598-021-94082-y}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/sergsb/IUPAC2Struct">GitHub Repository</a></li>
<li><a href="https://app.syntelly.com/smiles2iupac">Web Demo</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT: SMILES to IUPAC Names via Neural Machine Translation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout/</guid><description>A deep-learning neural machine translation approach to translate between SMILES strings and IUPAC names using the STOUT model.</description><content:encoded><![CDATA[<h2 id="contribution-translating-chemistry-as-a-language">Contribution: Translating Chemistry as a Language</h2>
<p>This is primarily a <strong>Method</strong> paper, with a strong secondary contribution as a <strong>Resource</strong> paper.</p>
<ul>
<li><strong>Method</strong>: It proposes a neural machine translation (NMT) architecture to approximate the complex, rule-based algorithm of IUPAC naming, treating it as a language translation task.</li>
<li><strong>Resource</strong>: It provides an open-source tool and trained models to the community, addressing a gap where such functionality was previously limited to proprietary software.</li>
</ul>
<h2 id="motivation-democratizing-iupac-nomenclature">Motivation: Democratizing IUPAC Nomenclature</h2>
<p>The International Union of Pure and Applied Chemistry (IUPAC) naming scheme is universally accepted but algorithmically complex. Generating these names correctly is challenging for humans, and automated generation is largely missing from major open-source toolkits like CDK, RDKit, or Open Babel. While reliable commercial tools exist (e.g., ChemAxon&rsquo;s <code>molconvert</code>), there was a lack of open-source alternatives for the scientific community. STOUT aims to fill this gap using a data-driven approach.</p>
<h2 id="core-innovation-sequence-to-sequence-naming">Core Innovation: Sequence-to-Sequence Naming</h2>
<ul>
<li><strong>Language Translation Approach</strong>: The authors treat chemical representations (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>/<a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a>) and IUPAC names as two different languages, applying Neural Machine Translation (NMT) to translate between them.</li>
<li><strong>Use of SELFIES</strong>: The work establishes SELFIES (Self-Referencing Embedded Strings) as a robust choice over SMILES for deep learning tokenization in this specific task, capitalizing on its syntactic robustness.</li>
<li><strong>Hardware Acceleration</strong>: The paper benchmarks GPU versus TPU training and highlights the practical necessity of Tensor Processing Units (TPUs) for training large-scale chemical language models, reducing training time by more than an order of magnitude.</li>
</ul>
<h2 id="methodology--translation-validation">Methodology &amp; Translation Validation</h2>
<ul>
<li><strong>Data Scale</strong>: The model was trained on datasets of 30 million and 60 million molecules derived from PubChem.</li>
<li><strong>Hardware Benchmarking</strong>: Training efficiency was compared between an NVIDIA Tesla V100 GPU and Google TPU v3-8/v3-32 units.</li>
<li><strong>Bidirectional Translation</strong>: The system was tested on two distinct tasks:
<ol>
<li><strong>Forward</strong>: SELFIES → IUPAC names</li>
<li><strong>Reverse</strong>: IUPAC names → SELFIES</li>
</ol>
</li>
<li><strong>Validation</strong>: Performance was evaluated on a held-out test set of 2.2 million molecules.</li>
</ul>
<h2 id="translation-accuracy--hardware-scaling">Translation Accuracy &amp; Hardware Scaling</h2>
<ul>
<li><strong>High Accuracy</strong>: The model achieved an average BLEU score of ~90% and a Tanimoto similarity index &gt; 0.9 for both translation directions.</li>
<li><strong>Generalization</strong>: Even when predictions were textually mismatched (low BLEU score), the underlying chemical structures often remained highly similar (high Tanimoto similarity), suggesting the system captures fundamental chemical semantics rather than merely memorizing strings.</li>
<li><strong>Impact of Data Size</strong>: Expanding training from 30 million to 60 million molecules yielded consistent performance gains without saturating.</li>
<li><strong>Hardware Necessity</strong>: Training on TPUs proved up to 54 times faster than a standard GPU baseline (Tesla V100), making scaling highly computationally tractable.</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Current repo hosts STOUT V2.0 transformer models; V1 RNN code available in earlier commits</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td style="text-align: left">Dataset</td>
          <td style="text-align: left">Public Domain</td>
          <td style="text-align: left">Source of 111M molecules; 30M/60M training subsets not directly provided</td>
      </tr>
  </tbody>
</table>
<h3 id="data">Data</h3>
<p>The dataset was curated from PubChem (111 million molecules). Note that the specific 30M and 60M subsets are not directly linked in the publication repository, which means a user would have to reconstruct the filtering process.</p>
<p><strong>Preprocessing &amp; Filtering</strong>:</p>
<ul>
<li>Explicit hydrogens removed; converted to canonical SMILES.</li>
<li><strong>Filtering Rules</strong>: MW &lt; 1500 Da, no counter ions, limited element set (C, H, O, N, P, S, F, Cl, Br, I, Se, B), no hydrogen isotopes, 3-40 bonds, no charged groups.</li>
<li><strong>Ground Truth Generation</strong>: ChemAxon&rsquo;s <code>molconvert</code> (Marvin Suite 20.15) was used to generate target IUPAC names for training.</li>
<li><strong>Representation</strong>: All SMILES were converted to SELFIES for training.</li>
</ul>
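<p>As a rough illustration of the element-set rule above (a sketch only; the paper's actual pipeline relied on cheminformatics toolkits, and the MW, bond-count, and charge filters need a real SMILES parser), a pure-Python scan of atom symbols might look like:</p>

```python
import re

# Allowed elements per the paper's filtering rules.
ALLOWED = {"C", "H", "O", "N", "P", "S", "F", "Cl", "Br", "I", "Se", "B"}

# Crude atom scanner: bracket atoms like [Se], the two-letter halogens,
# and single-letter organic-subset symbols (aromatic lowercase included).
ATOM_RE = re.compile(r"\[([A-Z][a-z]?)[^\]]*\]|(Cl|Br)|([BCNOSPFI])|([bcnops])")

def allowed_elements_only(smiles: str) -> bool:
    """Return True if every atom symbol found in the SMILES is allowed."""
    for m in ATOM_RE.finditer(smiles):
        symbol = next(g for g in m.groups() if g).capitalize()
        if symbol not in ALLOWED:
            return False
    return True
```

For example, aspirin (<code>CC(=O)Oc1ccccc1C(=O)O</code>) passes, while a silicon-containing molecule such as <code>C[Si](C)(C)C</code> is rejected.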
<table>
  <thead>
      <tr>
          <th style="text-align: left">Purpose</th>
          <th style="text-align: left">Dataset</th>
          <th style="text-align: left">Size</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>Training</strong></td>
          <td style="text-align: left">PubChem Filtered</td>
          <td style="text-align: left">30M &amp; 60M</td>
          <td style="text-align: left">Two distinct training sets created.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Testing</strong></td>
          <td style="text-align: left">PubChem Held-out</td>
          <td style="text-align: left">2.2M</td>
          <td style="text-align: left">Molecules not present in training sets; uniform token frequency.</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>:
<ul>
<li><strong>SELFIES</strong>: Split iteratively by brackets <code>[</code> and <code>]</code>.</li>
<li><strong>IUPAC</strong>: Split via punctuation (<code>(</code>, <code>)</code>, <code>{</code>, <code>}</code>, <code>[</code>, <code>]</code>, <code>-</code>, <code>.</code>, <code>,</code>) and a discrete set of sub-word chemical morphemes (e.g., <code>methyl</code>, <code>benzene</code>, <code>fluoro</code>).</li>
<li><strong>Padding</strong>: SELFIES padded to 48 tokens; IUPAC padded to 78 tokens. &ldquo;Start&rdquo; and &ldquo;End&rdquo; markers are appended to each sequence.</li>
</ul>
</li>
<li><strong>Optimization</strong>: Adam optimizer with a learning rate of $0.0005$.</li>
<li><strong>Objective Function</strong>: Sparse categorical cross-entropy over a vocabulary of size $V$, where $y_i$ is the one-hot target and $\hat{y}_i$ the predicted probability for class $i$:
$$ \mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i) $$</li>
</ul>
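<p>The two tokenizers described above can be sketched in pure Python (the morpheme vocabulary here is illustrative; the paper uses a much larger set):</p>

```python
import re

def tokenize_selfies(selfies: str) -> list[str]:
    """Split a SELFIES string into its bracketed tokens, e.g. [C][=C]."""
    return re.findall(r"\[[^\]]*\]", selfies)

# Illustrative (not exhaustive) morpheme vocabulary.
MORPHEMES = ["methyl", "benzene", "fluoro", "chloro", "amino", "yl", "ane", "ene"]
PUNCT = r"[\(\)\{\}\[\]\-\.,]"

def tokenize_iupac(name: str) -> list[str]:
    """Greedy split on punctuation and known chemical morphemes;
    longer morphemes are tried first so 'methyl' wins over 'yl'."""
    pattern = PUNCT + "|" + "|".join(sorted(MORPHEMES, key=len, reverse=True))
    tokens, pos = [], 0
    for m in re.finditer(pattern, name):
        if m.start() > pos:               # leftover text between known tokens
            tokens.append(name[pos:m.start()])
        tokens.append(m.group())
        pos = m.end()
    if pos < len(name):
        tokens.append(name[pos:])
    return tokens
```

For instance, <code>tokenize_iupac("2-methylpropane")</code> yields <code>["2", "-", "methyl", "prop", "ane"]</code>.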
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: Encoder-decoder sequence-to-sequence network with a Bahdanau attention mechanism for context weighting.</li>
<li><strong>Components</strong>:
<ul>
<li><strong>Encoder/Decoder</strong>: Recurrent Neural Networks (RNN) constructed using Gated Recurrent Units (GRU).</li>
<li><strong>Attention</strong>: Bahdanau (additive) soft attention, which computes alignment scores $e_{tj}$ from the previous decoder state $s_{t-1}$ and the encoder hidden states $h_j$ to softly weight those encoder states:
$$ e_{tj} = v_a^\top \tanh(W_a s_{t-1} + U_a h_j) $$</li>
<li><strong>Embedding</strong>: The previous decoder output token is passed through an embedding layer before being concatenated with the attention context vector.</li>
</ul>
</li>
<li><strong>Implementation</strong>: Python 3 backend using TensorFlow 2.3.0. <em>Note: The linked GitHub repository currently defaults to the STOUT V2.0 transformer models, so researchers aiming to reproduce this specific V1 RNN paper should reference the older tag/commit history.</em></li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p>Metrics heavily emphasize both linguistic accuracy and cheminformatic structural correctness:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">Details</th>
          <th style="text-align: left">Result (60M Model)</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>BLEU Score</strong></td>
          <td style="text-align: left">NLTK sentence BLEU (unigram to 4-gram)</td>
          <td style="text-align: left">0.94 (IUPAC $\to$ SELFIES)</td>
          <td style="text-align: left">Exact text overlap. Serves as a strictly syntactic proxy.</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>Tanimoto Similarity</strong></td>
          <td style="text-align: left">PubChem fingerprints via CDK</td>
          <td style="text-align: left">0.98 (Valid IUPAC names)</td>
          <td style="text-align: left">Evaluates substructure alignment over bit vectors, $T(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert}$.</td>
      </tr>
  </tbody>
</table>
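<p>The Tanimoto index from the table can be computed as below, here over toy fingerprint bit sets rather than actual CDK PubChem fingerprints:</p>

```python
def tanimoto(a: set[int], b: set[int]) -> float:
    """Tanimoto/Jaccard similarity over the 'on' bit positions of two fingerprints."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy fingerprints: 4 shared bits out of 6 distinct bits total.
fp_true = {1, 5, 9, 42, 87}
fp_pred = {1, 5, 9, 42, 90}
similarity = tanimoto(fp_true, fp_pred)  # 4/6 ≈ 0.667
```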
<h3 id="hardware">Hardware</h3>
<p>Comparison of hardware efficiency for training large chemical language models:</p>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Hardware</th>
          <th style="text-align: left">Batch Size</th>
          <th style="text-align: left">Time per Epoch (15M subset)</th>
          <th style="text-align: left">Speedup Factor</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><strong>GPU (Tesla V100)</strong></td>
          <td style="text-align: left">256</td>
          <td style="text-align: left">~27 hours</td>
          <td style="text-align: left">1x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-8</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~2 hours</td>
          <td style="text-align: left">13x</td>
      </tr>
      <tr>
          <td style="text-align: left"><strong>TPU v3-32</strong></td>
          <td style="text-align: left">1024 (Global)</td>
          <td style="text-align: left">~0.5 hours</td>
          <td style="text-align: left">54x</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2021). STOUT: SMILES to IUPAC names using neural machine translation. <em>Journal of Cheminformatics</em>, 13(1), 34. <a href="https://doi.org/10.1186/s13321-021-00512-4">https://doi.org/10.1186/s13321-021-00512-4</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2021</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTSMILESIUPAC2021,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{STOUT: SMILES to IUPAC Names Using Neural Machine Translation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{STOUT}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = apr,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{13}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{34}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-021-00512-4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">urldate</span> = <span style="color:#e6db74">{2025-09-22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">abstract</span> = <span style="color:#e6db74">{Chemical compounds can be identified through a graphical depiction, a suitable string representation, or a chemical name. A universally accepted naming scheme for chemistry was established by the International Union of Pure and Applied Chemistry (IUPAC) based on a set of rules. Due to the complexity of this ruleset a correct chemical name assignment remains challenging for human beings and there are only a few rule-based cheminformatics toolkits available that support this task in an automated manner. Here we present STOUT (SMILES-TO-IUPAC-name translator), a deep-learning neural machine translation approach to generate the IUPAC name for a given molecule from its SMILES string as well as the reverse translation, i.e. predicting the SMILES string from the IUPAC name. In both cases, the system is able to predict with an average BLEU score of about 90% and a Tanimoto similarity index of more than 0.9. Also incorrect predictions show a remarkable similarity between true and predicted compounds.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">langid</span> = <span style="color:#e6db74">{english}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">keywords</span> = <span style="color:#e6db74">{Attention mechanism,Chemical language,Deep neural network,DeepSMILES,IUPAC names,Neural machine translation,Recurrent neural network,SELFIES,SMILES}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">GitHub Repository</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout-v2/">STOUT V2.0 Note</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/chemical-name-translation/struct2iupac-2021/">Struct2IUPAC Note</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/chemical-name-translation/handsel-inchi-iupac-2021/">HandSEL Note (InChI to IUPAC)</a></li>
</ul>
]]></content:encoded></item><item><title>STOUT V2.0: Transformer-Based SMILES to IUPAC Translation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout-v2/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout-v2/</guid><description>A Transformer-based model for translating SMILES to IUPAC names, trained on ~1 billion molecules, achieving ~0.99 BLEU score on benchmarks.</description><content:encoded><![CDATA[<h2 id="paper-contribution--methodological-scope">Paper Contribution &amp; Methodological Scope</h2>
<p><strong>Method (Primary) / Resource (Secondary)</strong></p>
<p>This paper presents a <strong>Methodological</strong> contribution by developing and validating a Transformer-based neural machine translation model (STOUT V2) for bidirectional chemical nomenclature (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> $\leftrightarrow$ IUPAC). It systematically compares this new architecture against previous RNN-based baselines (<a href="/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout/">STOUT V1</a>) and performs ablation studies on tokenization strategies.</p>
<p>It also serves as a significant <strong>Resource</strong> contribution by generating a massive training dataset of nearly 1 billion SMILES-IUPAC pairs (curated via commercial Lexichem software) and releasing the resulting models and code as open-source tools for chemical naming.</p>
<h2 id="the-need-for-robust-open-source-iupac-nomenclature-rules">The Need for Robust Open-Source IUPAC Nomenclature Rules</h2>
<p>Assigning systematic IUPAC names to chemical structures requires adherence to a complex ruleset, which makes consistent manual naming difficult. Deterministic, rule-based software options like OpenEye Lexichem and ChemAxon are reliable commercial solutions, while existing open-source tools like OPSIN address only the reverse direction, parsing names into structures.</p>
<p>The previous version of STOUT (V1), based on RNNs/GRUs, achieved ~90% BLEU accuracy, with known limitations in capturing long-distance dependencies required for stereochemistry handling. This work uses the sequence-learning capabilities of Transformers combined with large-scale datasets to create a competitive open-source IUPAC naming tool.</p>
<h2 id="architectural-shift-and-billion-scale-training">Architectural Shift and Billion-Scale Training</h2>
<p>The primary advancements over previous iterations address both architecture and dataset scale:</p>
<ol>
<li><strong>Architecture Shift</strong>: Moving from an RNN-based Seq2Seq model to a <strong>Transformer-based architecture</strong> (4 layers, 8 heads), which captures intricate chemical patterns better than GRUs.</li>
<li><strong>Billion-Scale Training</strong>: Training on a dataset of nearly <strong>1 billion molecules</strong> (combining PubChem and ZINC15), significantly larger than the 60 million used for STOUT V1.</li>
<li><strong>Tokenization Strategy</strong>: Determining that <strong>character-wise tokenization</strong> for IUPAC names is superior to word-wise tokenization in terms of both accuracy and training efficiency (15% faster).</li>
</ol>
<h2 id="experimental-validation-and-scaling-limits">Experimental Validation and Scaling Limits</h2>
<p>The authors conducted three primary experiments to validate bidirectional translation (SMILES $\rightarrow$ IUPAC and IUPAC $\rightarrow$ SMILES):</p>
<ul>
<li><strong>Experiment 1 (Optimization)</strong>: Assessed the impact of dataset size (1M vs 10M vs 50M) and tokenization strategy on SMILES-to-IUPAC performance.</li>
<li><strong>Experiment 2 (Scaling)</strong>: Trained models on 110 million PubChem molecules for <strong>both</strong> forward and reverse translation tasks to test performance on longer sequences.</li>
<li><strong>Experiment 3 (Generalization)</strong>: Trained on the full ~1 billion dataset (PubChem + ZINC15) for both translation directions.</li>
<li><strong>External Validation</strong>: Benchmarked against an external dataset from ChEBI (1,485 molecules) and ChEMBL34 to test generalization to unseen data.</li>
</ul>
<p><strong>Evaluation Metrics</strong>:</p>
<ul>
<li><strong>Textual Accuracy</strong>: BLEU scores (1-4) and Exact String Match.</li>
<li><strong>Chemical Validity</strong>: Retranslation of generated names back to SMILES using OPSIN, followed by Tanimoto similarity checks (PubChem fingerprints) against the original input.</li>
</ul>
<h2 id="translation-accuracy-and-structural-validity">Translation Accuracy and Structural Validity</h2>
<ul>
<li><strong>Superior Performance</strong>: STOUT V2 achieved an average BLEU score of <strong>0.99</strong> (vs 0.94 for V1). While exact string matches varied by experiment (83-89%), the model notably achieved a perfect BLEU score (1.0) on <strong>97.49%</strong> of a specific test set where STOUT V1 only reached 66.65%.</li>
<li><strong>Structural Validity (&ldquo;Near Misses&rdquo;)</strong>: When the generated name differed from the ground-truth string, the re-generated structure often remained chemically valid. For these divergent names, the average Tanimoto similarity between the bit-vector fingerprints $A$ and $B$ of the true and predicted structures was <strong>0.68</strong>, defined as:
$$ T(A,B) = \frac{\sum (A \cap B)}{\sum (A \cup B)} $$
<em>Critique</em>: Note that an average Tanimoto coefficient of 0.68 typically suggests moderate structural similarity/drift, not an almost-identical &ldquo;near miss&rdquo; (which would be $&gt;0.85$). This implies the model constructs chemically related but structurally distinct outputs when it fails exact string matching.</li>
<li><strong>Tokenization</strong>: Character-level splitting for IUPAC names outperformed word-level splitting and was more computationally efficient.</li>
<li><strong>Data Imbalance &amp; Generalization</strong>: The model&rsquo;s drop in performance for sequences &gt;600 characters highlights a systemic issue in open chemical databases: long, highly complex SMILES strings are significantly underrepresented. Even billion-scale training datasets are still bound by the chemical diversity of their source material.</li>
<li><strong>Limitations</strong>:
<ul>
<li><strong>Preferred Names (PINs)</strong>: The model mimics Lexichem&rsquo;s naming conventions, generating valid IUPAC names distinct from strict <em>Preferred IUPAC Names</em> (PINs).</li>
<li><strong>Sequence Length</strong>: Performance degrades for very long SMILES (&gt;600 characters) due to scarcity in the training data.</li>
<li><strong>Algorithmic Distillation Bottleneck</strong>: Because the 1 billion training pairs were generated entirely by OpenEye&rsquo;s Lexichem, STOUT V2 acts as a knowledge distillation of that specific commercial algorithm. The model learns Lexichem’s heuristic mapping, specific dialects, and potential systematic errors, rather than deriving true nomenclature rules from first principles.</li>
</ul>
</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>The training data was derived from PubChem and ZINC15. Ground truth IUPAC names were generated using OpenEye Lexichem TK 2.8.1 to ensure consistency.</p>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Training (Exp 1)</strong></td>
          <td>PubChem Subset</td>
          <td>1M, 10M, 50M</td>
          <td>Selected via MaxMin algorithm for diversity</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 2)</strong></td>
          <td>PubChem</td>
          <td>110M</td>
          <td>Filtered for SMILES length &lt; 600</td>
      </tr>
      <tr>
          <td><strong>Training (Exp 3)</strong></td>
          <td>PubChem + ZINC15</td>
          <td>~1 Billion</td>
          <td>999,637,326 molecules total</td>
      </tr>
      <tr>
          <td><strong>Evaluation</strong></td>
          <td>ChEBI</td>
          <td>1,485</td>
          <td>External validation set, non-overlapping with training</td>
      </tr>
  </tbody>
</table>
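<p>The MaxMin diversity selection noted in the table can be sketched as a greedy loop over set-based fingerprints (RDKit ships an optimized MaxMinPicker; this toy version only illustrates the idea):</p>

```python
def maxmin_pick(fps: list[set[int]], n_pick: int, seed_idx: int = 0) -> list[int]:
    """Greedy MaxMin: repeatedly pick the candidate whose distance to its
    nearest already-picked item is largest, starting from an arbitrary seed."""
    def dist(a: set[int], b: set[int]) -> float:
        union = len(a | b)
        return 1.0 - (len(a & b) / union if union else 1.0)

    picked = [seed_idx]
    # Minimum distance from each candidate to the picked set so far.
    min_d = [dist(fp, fps[seed_idx]) for fp in fps]
    while len(picked) < n_pick:
        nxt = max(range(len(fps)), key=lambda i: min_d[i])
        picked.append(nxt)
        min_d = [min(d, dist(fp, fps[nxt])) for d, fp in zip(min_d, fps)]
    return picked
```

With fingerprints <code>[{1,2}, {1,2,3}, {7,8}, {1,2}]</code> and a seed at index 0, the second pick is index 2, the only disjoint (maximally distant) item.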
<p><strong>Preprocessing</strong>:</p>
<ul>
<li><strong>SMILES</strong>: Canonicalized, isomeric, and kekulized using RDKit (v2023.03.1).</li>
<li><strong>Formatting</strong>: Converted to TFRecord format in 100 MB chunks for TPU efficiency.</li>
</ul>
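<p>The 100 MB chunking step can be illustrated without TensorFlow as a generic byte-budgeted sharder (a sketch of the idea only, not the authors' TFRecord writer):</p>

```python
def shard_records(records: list[bytes], max_shard_bytes: int = 100 * 1024 * 1024):
    """Group serialized records into shards capped at a byte budget,
    mirroring the idea of writing ~100 MB TFRecord files for TPU throughput."""
    shards, current, current_size = [], [], 0
    for rec in records:
        if current and current_size + len(rec) > max_shard_bytes:
            shards.append(current)          # close the full shard
            current, current_size = [], 0
        current.append(rec)
        current_size += len(rec)
    if current:
        shards.append(current)              # flush the final partial shard
    return shards
```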
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>SMILES Tokenization</strong>: Regex-based splitting. Atoms (e.g., &ldquo;Cl&rdquo;, &ldquo;Au&rdquo;), bonds, brackets, and digits are separate tokens.</li>
<li><strong>IUPAC Tokenization</strong>: <strong>Character-wise split</strong> was selected as the optimal strategy (treating every character as a token).</li>
<li><strong>Optimization</strong>: Adam optimizer with a custom learning rate scheduler based on model dimensions.</li>
<li><strong>Loss Function</strong>: Trained to minimize a masked Sparse Categorical Cross-Entropy $L$ over the $N$ output positions, where $y_i$ is the target indicator and $p_i$ the predicted probability at position $i$:
$$ L = - \sum_{i=1}^{N} m_i \, y_{i} \log(p_{i}) $$
where $m_i$ masks padded positions.</li>
<li><strong>Code Availability</strong>: The <a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">main STOUT V2 repository</a> contains the inference package. The training pipeline/instructions (originally linked to a separate repo that is currently a 404) can still be found within the <a href="https://doi.org/10.5281/zenodo.6559438">Zenodo archive release</a>.</li>
</ul>
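<p>A sketch of the two tokenization strategies above: a regex pattern widely used for SMILES in the chemical language modeling literature (not necessarily the authors' exact pattern), and the trivial character-wise IUPAC split:</p>

```python
import re

# Common SMILES tokenization regex: bracket atoms, two-letter halogens,
# organic-subset atoms, bonds, branches, and ring-closure digits.
SMILES_TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Regex-split a SMILES string into atom/bond/digit tokens."""
    return SMILES_TOKEN_RE.findall(smiles)

def tokenize_iupac_charwise(name: str) -> list[str]:
    """Character-wise tokenization: every character is its own token."""
    return list(name)
```

For example, <code>tokenize_smiles("c1ccccc1Br")</code> keeps <code>Br</code> as one token while each ring atom and digit stands alone.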
<h3 id="models">Models</h3>
<p>The model follows the standard Transformer architecture from &ldquo;Attention is All You Need&rdquo; (Vaswani et al.).</p>
<ul>
<li><strong>Architecture</strong>: 4 Transformer layers (encoder/decoder stack).</li>
<li><strong>Attention</strong>: Multi-head attention with <strong>8 heads</strong>.</li>
<li><strong>Dimensions</strong>: Embedding size ($d_{model}$) = 512; Feed-forward dimension ($d_{ff}$) = 2048.</li>
<li><strong>Regularization</strong>: Dropout rate of 0.1.</li>
<li><strong>Context Window</strong>: Max input length (SMILES) = 600; Max output length (IUPAC) = 700-1000.</li>
<li><strong>Weights</strong>: Model weights for forward and reverse architectures are <a href="https://doi.org/10.5281/zenodo.13318286">available via Zenodo (v3)</a>.</li>
</ul>
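<p>The listed hyperparameters imply a rough weight count, sketched below (ignoring biases, layer norms, embeddings, and the output projection; this order-of-magnitude estimate is an inference from the stated dimensions, not a figure reported in the paper):</p>

```python
def transformer_param_estimate(n_layers: int = 4, d_model: int = 512, d_ff: int = 2048) -> int:
    """Rough weight count for the encoder+decoder stacks described above."""
    attn = 4 * d_model * d_model      # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff          # two feed-forward matrices
    enc_layer = attn + ffn            # self-attention + FFN
    dec_layer = 2 * attn + ffn        # self-attention + cross-attention + FFN
    return n_layers * (enc_layer + dec_layer)

total = transformer_param_estimate()  # ≈ 29.4M weights
```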
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation focused on both string similarity and chemical structural integrity.</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Scope</th>
          <th>Method</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>BLEU Score</strong></td>
          <td>N-gram overlap</td>
          <td>Compared predicted IUPAC string to Ground Truth.</td>
      </tr>
      <tr>
          <td><strong>Exact Match</strong></td>
          <td>Accuracy</td>
          <td>Binary 1/0 check for identical strings.</td>
      </tr>
      <tr>
          <td><strong>Tanimoto</strong></td>
          <td>Structural Similarity</td>
          <td>Predicted Name $\rightarrow$ OPSIN $\rightarrow$ SMILES $\rightarrow$ Fingerprint comparison to input.</td>
      </tr>
  </tbody>
</table>
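<p>Simplified stand-ins for the string metrics above (the paper used NLTK's sentence BLEU; shown here are only exact match and the clipped unigram component of BLEU, without smoothing or brevity penalty):</p>

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> int:
    """Binary 1/0 check for identical strings."""
    return int(pred == ref)

def unigram_precision(pred_tokens: list[str], ref_tokens: list[str]) -> float:
    """Clipped unigram precision, the 1-gram component of BLEU:
    each predicted token counts at most as often as it appears in the reference."""
    if not pred_tokens:
        return 0.0
    pred_counts, ref_counts = Counter(pred_tokens), Counter(ref_tokens)
    clipped = sum(min(c, ref_counts[t]) for t, c in pred_counts.items())
    return clipped / len(pred_tokens)
```

For character-tokenized names, <code>unigram_precision(list("hexane"), list("hexene"))</code> gives 5/6: only the <code>a</code> is unmatched.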
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/egonw/Smiles-TO-iUpac-Translator">STOUT V2 GitHub</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Inference package (PyPI: STOUT-pypi)</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13318286">Model Weights (Zenodo v3)</a></td>
          <td>Model</td>
          <td>Unknown</td>
          <td>Forward and reverse translation weights</td>
      </tr>
      <tr>
          <td><a href="https://zenodo.org/records/6559438">Code Snapshot (Zenodo)</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Training pipeline archive</td>
      </tr>
      <tr>
          <td><a href="https://stout.decimer.ai">Web Application</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Demo with Ketcher, bulk submission, DECIMER integration</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was conducted entirely on Google Cloud Platform (GCP) TPUs.</p>
<ul>
<li><strong>STOUT V1</strong>: Trained on TPU v3-8.</li>
<li><strong>STOUT V2</strong>: Trained on <strong>TPU v4-128 pod slices</strong> (128 nodes).</li>
<li><strong>Large Scale (Exp 3)</strong>: Trained on <strong>TPU v4-256 pod slice</strong> (256 nodes).</li>
<li><strong>Training Time</strong>: Average of <strong>15 hours and 2 minutes per epoch</strong> for the 1 billion dataset.</li>
<li><strong>Framework</strong>: TensorFlow 2.15.0-pjrt with Keras.</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Rajan, K., Zielesny, A., &amp; Steinbeck, C. (2024). STOUT V2.0: SMILES to IUPAC name conversion using transformer models. <em>Journal of Cheminformatics</em>, 16(146). <a href="https://doi.org/10.1186/s13321-024-00941-x">https://doi.org/10.1186/s13321-024-00941-x</a></p>
<p><strong>Publication</strong>: Journal of Cheminformatics 2024</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{rajanSTOUTV20SMILES2024,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span> = <span style="color:#e6db74">{{{STOUT V2}}.0: {{SMILES}} to {{IUPAC}} Name Conversion Using Transformer Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">shorttitle</span> = <span style="color:#e6db74">{{{STOUT V2}}.0}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span> = <span style="color:#e6db74">{Rajan, Kohulan and Zielesny, Achim and Steinbeck, Christoph}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span> = <span style="color:#ae81ff">2024</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">month</span> = dec,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span> = <span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span> = <span style="color:#e6db74">{16}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span> = <span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span> = <span style="color:#e6db74">{146}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">issn</span> = <span style="color:#e6db74">{1758-2946}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span> = <span style="color:#e6db74">{10.1186/s13321-024-00941-x}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://stout.decimer.ai">Web Application</a> (Includes Ketcher drawing, bulk submission, and DECIMER integration)</li>
<li><a href="https://decimer.ai">DECIMER Project</a></li>
<li><a href="/notes/computational-chemistry/chemical-language-models/chemical-name-translation/stout/">STOUT V1 Note</a></li>
<li><a href="https://zenodo.org/records/6559438">Zenodo Archive (Code Snapshot)</a></li>
</ul>
]]></content:encoded></item></channel></rss>