String notations for encoding molecular structure, from the original SMILES format through SELFIES and InChI, plus tokenization methods and representation surveys.

Concept References

| Paper | Key Idea |
| --- | --- |
| SMILES: A Compact Notation for Chemical Structures | Comprehensive overview of SMILES ASCII string notation |
| SELFIES: A Robust Molecular String Representation | Overview of SELFIES, where every string decodes to a valid molecule |
| InChI: The International Chemical Identifier | IUPAC’s layered chemical identifier for database interoperability |

SMILES Family

| Year | Paper | Key Idea |
| --- | --- | --- |
| 1988 | SMILES Notation: The Original Paper by Weininger (1988) | Introduced SMILES encoding rules for molecular strings |
| 2018 | DeepSMILES: Adapting SMILES Syntax for Machine Learning | Postfix notation eliminating unbalanced parentheses and ring closures |
| 2019 | Randomized SMILES Improve Molecular Generative Models | Non-canonical SMILES augmentation improves RNN generation |
| 2024 | Invalid SMILES Benefit Chemical Language Models: A Study | Invalid SMILES generation improves CLMs via quality filtering |
| 2024 | t-SMILES: Tree-Based Fragment Molecular Encoding | Fragment-based binary tree traversal reduces nesting depth |
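The DeepSMILES entry refers to two syntax errors that generated SMILES strings commonly contain: parentheses that do not balance and ring-closure labels that are opened but never closed. The sketch below only detects those two errors in a plain SMILES string; it is not the DeepSMILES conversion itself and performs no chemical validation. The function name `smiles_syntax_problems` is hypothetical and not taken from any toolkit.

```python
import re

# Bracket atoms such as [13CH4] or [NH4+] contain digits that are isotopes,
# hydrogen counts, or charges, not ring closures.
_BRACKET_ATOM = re.compile(r"\[[^\]]*\]")
_DIGITS = "0123456789"


def smiles_syntax_problems(smiles: str) -> list[str]:
    """Report unbalanced parentheses and unpaired ring closures (syntax only)."""
    problems = []
    # Replace every bracket atom with a wildcard so its digits are ignored.
    stripped = _BRACKET_ATOM.sub("*", smiles)

    depth = 0           # current branch nesting depth
    open_rings = set()  # ring-closure labels seen an odd number of times
    i = 0
    while i < len(stripped):
        ch = stripped[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without an opening one")
                depth = 0
        elif ch == "%":
            # Two-digit ring-closure label, written as %nn.
            label = stripped[i + 1 : i + 3]
            if len(label) == 2 and all(c in _DIGITS for c in label):
                open_rings ^= {int(label)}
                i += 2
        elif ch in _DIGITS:
            # Single-digit ring-closure label; a second occurrence closes it.
            open_rings ^= {int(ch)}
        i += 1

    if depth > 0:
        problems.append("unclosed opening parenthesis")
    for label in sorted(open_rings):
        problems.append(f"ring closure {label} opened but never closed")
    return problems
```

For example, `smiles_syntax_problems("c1ccccc1")` returns an empty list, while a truncated string such as `"c1ccccc"` reports the dangling ring closure.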

SELFIES Family

| Year | Paper | Key Idea |
| --- | --- | --- |
| 2020 | SELFIES: The Original Paper on Robust Molecular Strings | 100% robust molecular representation for generative ML |
| 2022 | SELFIES and the Future of Molecular String Representations | Perspective proposing 16 research directions for SELFIES |
| 2023 | Recent Advances in the SELFIES Library: 2023 Update | Streamlined grammar, aromatic support, performance upgrades |
| 2023 | Group SELFIES: Fragment-Based Molecular Strings | Fragment group tokens extending SELFIES for distribution learning |

InChI Family

| Year | Paper | Key Idea |
| --- | --- | --- |
| 2013 | InChI: The Worldwide Chemical Structure Identifier Standard | How InChI became the global standard for chemical identifiers |
| 2018 | RInChI: The Reaction International Chemical Identifier | Extends InChI to uniquely identify chemical reactions |
| 2019 | Mixfile & MInChI: Machine-Readable Mixture Formats | First standardized machine-readable formats for mixtures |
| 2020 | NInChI: Toward a Chemical Identifier for Nanomaterials | Extending InChI to represent multi-component nanomaterials |
| 2020 | InChI and Tautomerism: Toward Comprehensive Treatment | 86 tautomeric rules validated across 400M+ structures for InChI V2 |
| 2025 | Making InChI FAIR and Sustainable for Inorganic Chemistry | InChI v1.07 adds inorganic support and FAIR compliance |
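InChI's layered design is easiest to see by splitting an identifier on its `/` separators. The sketch below is a minimal illustration on the standard InChI for ethanol, not a full parser: it assumes a well-formed string and does not interpret the layer contents. The function name `split_inchi_layers` is hypothetical.

```python
def split_inchi_layers(inchi: str) -> list[tuple[str, str]]:
    """Split an InChI string into (layer name, content) pairs, in order."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    version, *layers = inchi[len(prefix):].split("/")
    result = [("version", version)]

    # The chemical formula layer, when present, is the only layer
    # that does not start with a lowercase prefix letter.
    if layers and not layers[0][:1].islower():
        result.append(("formula", layers.pop(0)))

    # Every remaining layer is identified by its one-letter prefix,
    # e.g. "c" for connectivity and "h" for hydrogen atoms. A list is used
    # rather than a dict because a prefix can occur more than once.
    for layer in layers:
        result.append((layer[:1], layer[1:]))
    return result


ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
for name, content in split_inchi_layers(ethanol):
    print(f"{name:8s} {content}")
```

The returned pairs keep the original layer order, which matters because later layers refine the ones before them.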

Tokenization

| Year | Paper | Key Idea |
| --- | --- | --- |
| 2021 | SPE: Data-Driven SMILES Substructure Tokenization | BPE-style algorithm learning chemically meaningful SMILES tokens |
| 2023 | Atom-in-SMILES: Better Tokens for Chemical Models | Environment-aware atomic tokens reduce token degeneration |
| 2024 | SMILES vs SELFIES Tokenization for Chemical LMs | Atom Pair Encoding tokenizer outperforms BPE on both formats |
| 2025 | SMI+AIS: Hybridizing SMILES with Environment Tokens | Hybrid SMILES + Atom-In-SMILES tokens improve generation |
| 2026 | Smirk: Complete Tokenization for Molecular Models | 165-token set achieving full OpenSMILES specification coverage |
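For orientation, the methods listed above refine or replace a simple atom-level baseline that splits a SMILES string with a regular expression. The sketch below shows that baseline only; it is not the tokenizer from any of the listed papers, and the pattern is a commonly used one that covers the organic-subset atoms and treats each bracket atom as a single token. The names `SMILES_TOKEN` and `tokenize_smiles` are hypothetical.

```python
import re

# Alternatives are tried left to right, so bracket atoms and the two-letter
# halogens (Br, Cl) are matched before single characters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"                # bracket atom, e.g. [C@@H] or [NH4+]
    r"|Br?|Cl?|N|O|S|P|F|I"      # aliphatic organic-subset atoms
    r"|b|c|n|o|s|p"              # aromatic atoms
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$"  # branches, bonds, misc.
    r"|%[0-9]{2}|[0-9]"          # ring-closure labels
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips characters it cannot match, so check
    # that the tokens reproduce the input exactly.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


print(tokenize_smiles("C[C@@H](Cl)Br"))
# ['C', '[C@@H]', '(', 'Cl', ')', 'Br']
```

Treating a whole bracket atom as one token keeps the splitting rule simple but lets the vocabulary grow with every new bracket atom seen, which is one of the trade-offs the entries in this table address in different ways.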

Surveys and Foundations

| Year | Paper | Key Idea |
| --- | --- | --- |
| 1931 | The Number of Isomeric Hydrocarbons of the Methane Series | Recursive formulas for counting alkane structural isomers |
| 2023 | Materials Representations for ML Review | Review of solid-state material representations for ML |
| 2025 | Review of Molecular Representation Learning Models | Survey of foundation models across five molecular modalities |