This group covers models designed primarily to learn fixed-dimensional molecular representations from chemical string notations (SMILES, SELFIES, InChI). These encoders serve as feature extractors for downstream tasks like property prediction, virtual screening, and molecular similarity search. For models that fuse molecular strings or graphs with additional modalities (text, property vectors, knowledge graphs), see Multimodal Molecular Models.
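The common pattern across these encoders is: tokenize a molecular string, embed each token, and pool into one fixed-dimensional vector usable as a fingerprint. A toy, dependency-free sketch of that pattern, where character tokens and hashed pseudo-embeddings stand in for a learned vocabulary and trained weights (`smiles_to_vector` is a hypothetical helper, not any listed model's API):

```python
import hashlib

def smiles_to_vector(smiles: str, dim: int = 8) -> list[float]:
    """Toy illustration of the string-encoder pattern: map each token to a
    deterministic pseudo-embedding, then mean-pool into a fixed-size vector.
    Real models replace both steps with learned tokenizers and trained weights."""
    tokens = list(smiles)  # character-level tokenization for simplicity
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode()).digest()
        for i in range(dim):
            vec[i] += (digest[i] / 255.0) - 0.5  # pseudo-embedding component
    return [v / len(tokens) for v in vec]

# Inputs of any length map to the same dimensionality:
print(len(smiles_to_vector("CCO")))          # ethanol → 8
print(len(smiles_to_vector("c1ccccc1O")))    # phenol  → 8
```

The resulting vectors can feed any downstream regressor or similarity search, which is exactly the feature-extractor role described above.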

| Paper | Year | Architecture | Key Idea |
|---|---|---|---|
| Seq2seq Fingerprint | 2017 | GRU encoder-decoder | Unsupervised fingerprints learned from SMILES translation |
| Mol2vec | 2018 | Word2vec | Substructure embeddings inspired by NLP word vectors |
| CDDD | 2019 | Encoder-decoder | Descriptors learned by translating between SMILES and InChI |
| SMILES-BERT | 2019 | BERT | Masked pretraining on 18M+ SMILES from ZINC |
| SMILES Transformer | 2019 | Transformer | Unsupervised transformer fingerprints for low-data tasks |
| ChemBERTa | 2020 | RoBERTa | RoBERTa evaluation on 77M PubChem SMILES |
| MolBERT | 2020 | BERT | Domain-relevant auxiliary task pretraining |
| X-MOL | 2020 | Transformer | Large-scale pretraining on 1.1 billion molecules |
| ChemBERTa-2 | 2022 | RoBERTa | Scaling to 77M molecules, comparing MLM vs MTR objectives |
| MoLFormer | 2022 | Linear-attention Transformer | Linear attention at 1.1B molecule scale |
| SELFormer | 2023 | RoBERTa | SELFIES-based pretraining on 2M ChEMBL molecules |
| BARTSmiles | 2024 | BART | Denoising pretraining on 1.7B SMILES from ZINC20 |
| SMI-TED | 2025 | Encoder-decoder Transformer | Foundation model trained on 91M PubChem molecules |
| AMORE | 2025 | Evaluation framework | Testing ChemLLM robustness to SMILES variants |
| ChemBERTa-3 | 2026 | RoBERTa | Open-source scalable chemical foundation models |
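Several of the BERT-style entries above (SMILES-BERT, ChemBERTa, MolBERT) pretrain with a masked-language-model (MLM) objective: random tokens are hidden and the model is trained to reconstruct them. A minimal sketch of the masking step only, with hypothetical helper names and a `None` sentinel marking positions the loss would ignore (analogous in spirit to the `-100` label convention in common MLM implementations):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with [MASK] for an MLM-style objective.
    Returns (inputs, labels); labels are None where no loss is computed."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility here
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must reconstruct this token
        else:
            inputs.append(tok)
            labels.append(None)  # ignored by the training loss
    return inputs, labels

toks = list("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, character-level tokens
inp, lab = mask_tokens(toks, mask_prob=0.3)
```

ChemBERTa-2's MTR (multi-task regression) alternative instead supervises the encoder with computed physicochemical properties rather than token reconstruction.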
