This section covers chemical language models: architectures that learn molecular representations directly from chemical string notations (SMILES, SELFIES, InChI). Notes here include encoder-only transformers like ChemBERTa and MoLFormer for property prediction, sequence-to-sequence models like Chemformer for reaction prediction, autoregressive generators like GP-MoLFormer for molecular generation, and translation models like STOUT for SMILES-to-IUPAC conversion. For multimodal and reasoning LLMs applied to chemistry, see LLMs for Chemistry.
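As a quick orientation, the sketch below shows the typical workflow for the encoder-only models in this section: tokenize SMILES strings, run them through a pretrained encoder, and pool the token embeddings into per-molecule vectors for a downstream property-prediction head. The checkpoint name "seyonec/ChemBERTa-zinc-base-v1" and the mean-pooling choice are assumptions for illustration, not a prescribed recipe from any of the papers noted here.

```python
# Sketch: per-molecule embeddings from an encoder-only chemical language model.
# Assumes the Hugging Face checkpoint "seyonec/ChemBERTa-zinc-base-v1"; swap in
# whichever ChemBERTa/MoLFormer checkpoint you actually use.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin
batch = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per molecule; these
# vectors can feed a downstream regressor/classifier for property prediction.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. torch.Size([3, 768])
```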

ChemBERTa-3: Open Source Chemical Foundation Models
ChemBERTa-3 provides a unified, scalable infrastructure for pretraining and benchmarking chemical foundation models. It addresses reproducibility gaps in earlier work such as MoLFormer by standardizing scaffold splits and releasing open-source tooling.
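To make the scaffold-splitting point concrete, here is a minimal sketch of a Bemis-Murcko scaffold split, assuming RDKit is installed. The grouping-by-scaffold step and the "largest groups go to train" convention are common practice in the field; ChemBERTa-3's own splitting utilities may differ in detail.

```python
# Sketch of a Bemis-Murcko scaffold split: molecules sharing a scaffold are kept
# on the same side of the split, so test molecules are structurally novel.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.2):
    """Group molecule indices by Murcko scaffold, then assign whole groups
    to train or test so similar molecules never straddle the split."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparseable SMILES
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(mol=mol)
        groups[scaffold].append(idx)

    # A common convention: fill train with the largest scaffold groups first,
    # leaving the smaller (rarer) scaffolds for the test set.
    n_train_target = int((1 - test_frac) * len(smiles_list))
    train, test = [], []
    for _, members in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

train_idx, test_idx = scaffold_split(["CCO", "c1ccccc1O", "c1ccccc1CC", "CCN"])
print(train_idx, test_idx)
```

The point of splitting by scaffold rather than at random is that random splits let near-duplicate ring systems appear in both train and test, which inflates benchmark scores; standardizing the scaffold split is one of the reproducibility fixes noted above.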







