This group covers models that generate molecules autoregressively, producing tokens one at a time from chemical string representations. The collection spans early RNN baselines, large-scale pre-trained transformers, and alternative architectures such as state-space models.
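To make "one token at a time" concrete, here is a minimal sketch of autoregressive string generation: a toy character-level bigram model fit on a few SMILES strings, sampled left to right until an end marker. This is an illustration of the sampling loop only, not any specific model from the table; the corpus, markers, and function names are invented for the example.

```python
import random
from collections import defaultdict

# Toy SMILES corpus (ethanol, benzene, acetic acid) -- illustrative only.
corpus = ["CCO", "c1ccccc1", "CC(=O)O"]

BOS, EOS = "^", "$"  # begin/end-of-sequence markers

# Count character bigrams, including the begin/end markers.
counts = defaultdict(lambda: defaultdict(int))
for smiles in corpus:
    chars = [BOS] + list(smiles) + [EOS]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def sample_smiles(rng, max_len=20):
    """Generate one string token-by-token: each character is drawn
    conditioned only on the previous one, stopping at EOS."""
    out, prev = [], BOS
    for _ in range(max_len):
        nxt = counts[prev]
        chars, weights = zip(*nxt.items())
        c = rng.choices(chars, weights=weights)[0]
        if c == EOS:
            break
        out.append(c)
        prev = c
    return "".join(out)

print(sample_smiles(random.Random(0)))
```

The models in the table below follow the same loop but condition on the full prefix (via an LSTM hidden state, transformer attention, or a state-space recurrence) rather than just the previous character, and are trained on millions of molecules rather than three.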

| Paper | Year | Architecture | Key Idea |
|---|---|---|---|
| LSTM Drug-Like Generation | 2017 | LSTM | Character-level SMILES generation trained on 509K ChEMBL molecules |
| ChemGE | 2018 | Grammatical evolution | Population-based search over SMILES grammar productions |
| Back-Translation | 2021 | Transformer | Semi-supervised generation using NLP back-translation with unlabeled ZINC data |
| Chemformer | 2022 | BART | Denoising pre-training on 100M molecules for generation and property prediction |
| RetMol | 2023 | Transformer + retrieval | Few-shot property steering via retrieval-augmented generation |
| 3D CLMs | 2023 | Transformer | Generating 3D molecules, crystals, and proteins from XYZ/CIF/PDB strings |
| MolGen | 2024 | Transformer | SELFIES pre-training with chemical feedback alignment |
| S4 | 2024 | State-space model | Structured state spaces outperforming LSTMs and GPTs on bioactivity learning |
| GP-MoLFormer | 2025 | Transformer | 1.1B SMILES pre-training with pair-tuning for property optimization |
