This group covers models designed primarily to learn fixed-dimensional molecular representations from chemical string notations (SMILES, SELFIES, InChI). These encoders serve as feature extractors for downstream tasks like property prediction, virtual screening, and molecular similarity search. For models that fuse molecular strings or graphs with additional modalities (text, property vectors, knowledge graphs), see Multimodal Molecular Models.
| Paper | Year | Architecture | Key Idea |
|---|---|---|---|
| Seq2seq Fingerprint | 2017 | GRU encoder-decoder | Unsupervised fingerprints from a SMILES-to-SMILES seq2seq recovery task |
| Mol2vec | 2018 | Word2vec | Substructure embeddings inspired by NLP word vectors |
| CDDD | 2019 | Encoder-decoder | Descriptors learned by translating between equivalent representations (e.g., random to canonical SMILES) |
| SMILES-BERT | 2019 | BERT | Masked pretraining on 18M+ SMILES from ZINC |
| SMILES Transformer | 2019 | Transformer | Unsupervised transformer fingerprints for low-data tasks |
| ChemBERTa | 2020 | RoBERTa | RoBERTa-style masked pretraining on up to 10M PubChem SMILES |
| MolBERT | 2020 | BERT | Domain-relevant auxiliary task pretraining |
| X-MOL | 2020 | Transformer | Large-scale pretraining on 1.1 billion molecules |
| ChemBERTa-2 | 2022 | RoBERTa | Scaling to 77M molecules, comparing MLM vs. multi-task regression (MTR) objectives |
| MoLFormer | 2022 | Linear-attention Transformer | Linear attention at 1.1B molecule scale |
| SELFormer | 2023 | RoBERTa | SELFIES-based pretraining on 2M ChEMBL molecules |
| BARTSmiles | 2024 | BART | Denoising pretraining on 1.7B SMILES from ZINC20 |
| SMI-TED | 2025 | Encoder-decoder Transformer | Foundation model trained on 91M PubChem molecules |
| AMORE | 2025 | Evaluation framework | Benchmark of chemical language models' robustness to equivalent SMILES variants |
| ChemBERTa-3 | 2026 | RoBERTa | Open-source scalable chemical foundation models |
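
As a concrete illustration of the feature-extraction workflow these encoders share, the sketch below pulls fixed-dimensional molecular embeddings from a pretrained SMILES encoder via Hugging Face Transformers. The checkpoint id is an assumption (it is one of the public ChemBERTa-2 releases); any encoder from the table with a compatible tokenizer would slot in the same way.

```python
# Minimal sketch: fixed-dimensional molecular embeddings from a pretrained
# SMILES encoder. Checkpoint id is an assumed public release, not the only option.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "DeepChem/ChemBERTa-77M-MLM"  # assumed checkpoint; swap in any table model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin
batch = tokenizer(smiles, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over real tokens (mask out padding) to get one vector per molecule.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (3, hidden_dim); hidden_dim depends on the checkpoint
```

From here, cosine similarity between embedding rows supports nearest-neighbor screening, or the vectors can be fed to a downstream property-prediction model.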