Molecular Representations

SELFormer architecture diagram showing SELFIES token input flowing through a RoBERTa transformer encoder to molecular property predictions

SELFormer: A SELFIES-Based Molecular Language Model

SELFormer is a transformer-based chemical language model that uses SELFIES instead of SMILES as input. Pretrained on 2M ChEMBL compounds via masked language modeling, it achieves strong classification performance on MoleculeNet tasks, outperforming ChemBERTa-2 by ~12% on average across BACE, BBBP, and HIV.

Molecular Representations

ChemBERTa-3 visualization showing muscular arms lifting a stack of building blocks representing molecular data with SMILES notation, symbolizing the power and scalability of the open-source training framework

ChemBERTa-3: Open Source Chemical Foundation Models

ChemBERTa-3 provides a unified, scalable infrastructure for pretraining and benchmarking chemical foundation models. It addresses reproducibility gaps in previous studies like MoLFormer through standardized scaffold splitting and open-source tooling.

Molecular Representations

ChemBERTa-2 visualization showing flowing SMILES strings in blue tones representing molecular data streams

ChemBERTa-2: Scaling Molecular Transformers to 77M

This work investigates the scaling hypothesis for molecular transformers, training RoBERTa models on 77M SMILES from PubChem. It compares Masked Language Modeling (MLM) against Multi-Task Regression (MTR) pretraining, finding that MTR yields better downstream performance but is computationally heavier.

Molecular Representations

ChemBERTa masked language modeling visualization showing SMILES string CC(=O)O with masked tokens

ChemBERTa: Molecular Property Prediction via Transformers

This paper introduces ChemBERTa, a RoBERTa-based model pretrained on 77M SMILES strings. It systematically evaluates the impact of pretraining dataset size, tokenization strategies, and input representations (SMILES vs. SELFIES) on downstream MoleculeNet tasks, finding that performance scales positively with data size.

Molecular Representations

Diagram showing molecular structure passing through a neural network to produce IUPAC chemical nomenclature document

STOUT V2.0: Transformer-Based SMILES to IUPAC Translation

STOUT V2.0 uses Transformers trained on ~1 billion SMILES-IUPAC pairs to accurately translate chemical structures into systematic names (and vice-versa), outperforming its RNN predecessor.

Molecular Representations

Vintage wooden device labeled 'The Molecular Interpreter - Model 1974' with vacuum tubes, showing SMILES to IUPAC name translation

STOUT: SMILES to IUPAC Names via Neural Machine Translation

STOUT (SMILES-TO-IUPAC-name translator) uses neural machine translation to convert chemical line notations to IUPAC names and vice versa, achieving ~90% BLEU score. It addresses the lack of open-source tools for algorithmic IUPAC naming.

Molecular Representations

Diagram showing Struct2IUPAC workflow: molecular structure (SMILES) passing through Transformer to generate IUPAC name, with round-trip verification loop

Struct2IUPAC: Translating SMILES to IUPAC via Transformers

This paper proposes a Transformer-based approach (Struct2IUPAC) to convert chemical structures to IUPAC names, challenging the dominance of rule-based systems. Trained on ~47M PubChem examples, it achieves near-perfect accuracy using a round-trip verification step with OPSIN.

Molecular Representations

Transformer encoder-decoder architecture processing InChI string character-by-character to produce IUPAC chemical name

Translating InChI to IUPAC Names with Transformers

This study presents a sequence-to-sequence Transformer model that translates InChI identifiers into IUPAC names character-by-character. Trained on 10 million PubChem pairs, it achieves 91% accuracy on organic compounds, performing comparably to commercial software.

Molecular Representations

Invalid SMILES Benefit Chemical Language Models: A Study

A 2024 Nature Machine Intelligence paper providing causal evidence that invalid SMILES generation improves chemical language model performance by filtering low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.

Molecular Representations

SELFIES and the Future of Molecular String Representations

This 2022 perspective paper reviews 250 years of chemical notation evolution and proposes 16 concrete research projects to extend SELFIES beyond traditional organic chemistry into polymers, crystals, and reactions.

Molecular Representations

D-glucose open-chain aldehyde form converting to beta-D-glucopyranose ring form, illustrating ring-chain tautomerism

InChI and Tautomerism: Toward Comprehensive Treatment

A comprehensive 2020 analysis of the tautomerism problem in chemical databases, compiling 86 tautomeric transformation rules (20 existing, 66 new) and validating them across 400M+ structures to inform algorithmic improvements for InChI V2.

Molecular Representations

2D molecular structure diagram of tricyclohexylphosphine showing a central phosphorus atom bonded to three cyclohexyl groups

InChI: The Worldwide Chemical Structure Identifier Standard

A comprehensive 2013 review explaining how InChI emerged as the global standard for chemical structure identifiers, covering its history as a response to the Internet’s need for non-proprietary molecular linking, its governance under IUPAC, and the technical layers that ensure uniqueness across diverse chemical databases.