This group covers models designed primarily to learn fixed-dimensional molecular representations from chemical string notations (SMILES, SELFIES, InChI). These encoders serve as feature extractors for downstream tasks like property prediction, virtual screening, and molecular similarity search. For models that fuse molecular strings or graphs with additional modalities (text, property vectors, knowledge graphs), see Multimodal Molecular Models.
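Since every model below consumes a tokenized string notation, a minimal sketch of the atom-level regex tokenization commonly used as the first step for SMILES language models may help (a simplification for illustration; real models add special tokens and persisted vocabularies, and exact patterns vary by paper):

```python
import re

# Atom-level SMILES tokenization pattern (a common community regex, not any
# single model's implementation). Bracketed atoms, two-letter elements (Br,
# Cl), ring-closure digits, and bond/branch symbols each become one token.
SMILES_TOKENIZER = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    """Split a SMILES string into atom- and bond-level tokens."""
    tokens = SMILES_TOKENIZER.findall(smiles)
    # Round-trip check: the tokens must reassemble the input exactly.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Note the ordering inside the alternation: `Br?` and `Cl?` are tried before single letters so that two-letter elements are not split in half.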

## Pre-trained Encoder Models

| Paper | Year | Architecture | Key Idea |
| --- | --- | --- | --- |
| Seq2seq Fingerprint | 2017 | GRU encoder-decoder | Unsupervised fingerprints learned from SMILES translation |
| Mol2vec | 2018 | Word2vec | Substructure embeddings inspired by NLP word vectors |
| CDDD | 2019 | Encoder-decoder | Descriptors learned by translating between SMILES and InChI |
| SMILES-BERT | 2019 | BERT | Masked pretraining on 18M+ SMILES from ZINC |
| SMILES Transformer | 2019 | Transformer | Unsupervised transformer fingerprints for low-data tasks |
| ChemBERTa | 2020 | RoBERTa | RoBERTa pretraining on up to 10M SMILES from PubChem |
| MolBERT | 2020 | BERT | Domain-relevant auxiliary task pretraining |
| X-MOL | 2020 | Transformer | Large-scale pretraining on 1.1 billion molecules |
| ChemBERTa-2 | 2022 | RoBERTa | Scaling to 77M molecules, comparing MLM vs. MTR objectives |
| MoLFormer | 2022 | Linear-attention Transformer | Linear attention at 1.1B-molecule scale |
| SELFormer | 2023 | RoBERTa | SELFIES-based pretraining on 2M ChEMBL molecules |
| BARTSmiles | 2024 | BART | Denoising pretraining on 1.7B SMILES from ZINC20 |
| SMI-TED | 2025 | Encoder-decoder Transformer | Foundation model trained on 91M PubChem molecules |
| ChemBERTa-3 | 2026 | RoBERTa | Open-source scalable chemical foundation models |
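Most encoders in the table above pretrain with BERT-style masked-token prediction over SMILES tokens. A minimal sketch of the standard 80/10/10 corruption scheme (a generic illustration on a toy vocabulary, not the implementation of any specific model listed):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """BERT-style masking: ~15% of positions are selected; each selected
    token becomes [MASK] 80% of the time, a random vocab token 10%,
    or is kept unchanged 10%. Returns (corrupted, labels), where labels
    holds the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this original token
            r = rng.random()
            if r < 0.8:
                corrupted.append("[MASK]")
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)  # position does not contribute to the loss
            corrupted.append(tok)
    return corrupted, labels

toks = list("CC(=O)Oc1ccccc1")
corr, labs = mask_tokens(toks, vocab=["C", "c", "O", "N", "(", ")", "=", "1"])
```

In practice the masking is applied per epoch ("dynamic masking" in RoBERTa-style models such as ChemBERTa), so each molecule is corrupted differently on every pass.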

## Scaling, Evaluation & Surveys

| Paper | Year | Key Idea |
| --- | --- | --- |
| Neural Scaling of Deep Chemical Models | 2023 | Power-law scaling relations for chemical LMs and GNNs |
| Systematic Review of Deep Learning CLMs (2020-2024) | 2024 | PRISMA review of 72 papers on chemical language models |
| Transformer CLMs for SMILES: Literature Review 2024 | 2024 | Review of transformer-based CLMs operating on SMILES |
| Survey of Transformer Architectures in Molecular Science | 2024 | Survey of 12 transformer architecture families in molecular science |
| AMORE | 2025 | Testing ChemLLM robustness to SMILES variants |
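Robustness evaluations of the AMORE kind compare an encoder's embeddings of chemically equivalent SMILES strings (e.g., canonical vs. randomized atom orderings). A minimal sketch of the comparison step using mean pairwise cosine similarity; `toy_embed` is a hypothetical stand-in for any encoder in the tables above, and in a real check the variants would be generated with a cheminformatics toolkit such as RDKit and embedded by the pretrained model:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def embedding_robustness(embed, variants):
    """Mean pairwise cosine similarity between embeddings of equivalent
    SMILES variants; 1.0 means the encoder is perfectly invariant."""
    vecs = [embed(s) for s in variants]
    sims = [cosine(vecs[i], vecs[j])
            for i in range(len(vecs))
            for j in range(i + 1, len(vecs))]
    return sum(sims) / len(sims)

# Hypothetical toy encoder: character-count vector. A real evaluation
# would call a pretrained chemical language model instead.
def toy_embed(smiles):
    alphabet = "CcOoNn()=1"
    return [smiles.count(ch) for ch in alphabet]

# Two equivalent SMILES spellings of aspirin.
variants = ["CC(=O)Oc1ccccc1C(=O)O", "O=C(O)c1ccccc1OC(C)=O"]
score = embedding_robustness(toy_embed, variants)
```

The toy encoder scores highly here only because character counts happen to coincide for these two spellings; a sequence-order-sensitive model can score much lower, which is precisely the failure mode such robustness tests probe.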