This group covers models designed primarily to learn fixed-dimensional molecular representations from chemical string notations (SMILES, SELFIES, InChI). These encoders serve as feature extractors for downstream tasks like property prediction, virtual screening, and molecular similarity search. For models that fuse molecular strings or graphs with additional modalities (text, property vectors, knowledge graphs), see Multimodal Molecular Models.
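The common pattern across these encoders is: tokenize a molecular string, embed each token, and pool into one fixed-dimensional vector usable as a fingerprint. A toy, dependency-free sketch of that pattern, where character tokens and hashed pseudo-embeddings stand in for a learned vocabulary and trained weights (`smiles_to_vector` is a hypothetical helper, not any listed model's API):

```python
import hashlib

def smiles_to_vector(smiles: str, dim: int = 8) -> list[float]:
    """Toy illustration of the string-encoder pattern: map each token to a
    deterministic pseudo-embedding, then mean-pool into a fixed-size vector.
    Real models replace both steps with learned tokenizers and trained weights."""
    tokens = list(smiles)  # character-level tokenization for simplicity
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode()).digest()
        for i in range(dim):
            vec[i] += (digest[i] / 255.0) - 0.5  # pseudo-embedding component
    return [v / len(tokens) for v in vec]

# Inputs of any length map to the same dimensionality:
print(len(smiles_to_vector("CCO")))          # ethanol → 8
print(len(smiles_to_vector("c1ccccc1O")))    # phenol  → 8
```

The resulting vectors can feed any downstream regressor or similarity search, which is exactly the feature-extractor role described above.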

| Paper | Year | Architecture | Key Idea |
|---|---|---|---|
| Seq2seq Fingerprint | 2017 | GRU encoder-decoder | Unsupervised fingerprints learned from SMILES translation |
| Mol2vec | 2018 | Word2vec | Substructure embeddings inspired by NLP word vectors |
| CDDD | 2019 | Encoder-decoder | Descriptors learned by translating between SMILES and InChI |
| SMILES-BERT | 2019 | BERT | Masked pretraining on 18M+ SMILES from ZINC |
| SMILES Transformer | 2019 | Transformer | Unsupervised transformer fingerprints for low-data tasks |
| ChemBERTa | 2020 | RoBERTa | RoBERTa evaluation on 77M PubChem SMILES |
| MolBERT | 2020 | BERT | Domain-relevant auxiliary task pretraining |
| X-MOL | 2020 | Transformer | Large-scale pretraining on 1.1 billion molecules |
| ChemBERTa-2 | 2022 | RoBERTa | Scaling to 77M molecules, comparing MLM vs MTR objectives |
| MoLFormer | 2022 | Linear-attention Transformer | Linear attention at 1.1B molecule scale |
| SELFormer | 2023 | RoBERTa | SELFIES-based pretraining on 2M ChEMBL molecules |
| BARTSmiles | 2024 | BART | Denoising pretraining on 1.7B SMILES from ZINC20 |
| SMI-TED | 2025 | Encoder-decoder Transformer | Foundation model trained on 91M PubChem molecules |
| AMORE | 2025 | Evaluation framework | Testing ChemLLM robustness to SMILES variants |
| ChemBERTa-3 | 2026 | RoBERTa | Open-source scalable chemical foundation models |
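Several of the BERT-style entries above (SMILES-BERT, ChemBERTa, MolBERT) pretrain with a masked-language-model (MLM) objective: random tokens are hidden and the model is trained to reconstruct them. A minimal sketch of the masking step only, with hypothetical helper names and a `None` sentinel marking positions the loss would ignore (analogous in spirit to the `-100` label convention in common MLM implementations):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with [MASK] for an MLM-style objective.
    Returns (inputs, labels); labels are None where no loss is computed."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility here
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must reconstruct this token
        else:
            inputs.append(tok)
            labels.append(None)  # ignored by the training loss
    return inputs, labels

toks = list("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, character-level tokens
inp, lab = mask_tokens(toks, mask_prob=0.3)
```

ChemBERTa-2's MTR (multi-task regression) alternative instead supervises the encoder with computed physicochemical properties rather than token reconstruction.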
