This group covers models that generate molecules autoregressively, producing tokens one at a time from chemical string representations. The collection spans early RNN baselines, large-scale pre-trained transformers, and alternative architectures such as state-space models.
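To make "one token at a time" concrete, here is a minimal sketch of autoregressive string generation: a toy character-level bigram model fit on a few SMILES strings, sampled left to right until an end marker. This is an illustration of the sampling loop only, not any specific model from the table; the corpus, markers, and function names are invented for the example.

```python
import random
from collections import defaultdict

# Toy SMILES corpus (ethanol, benzene, acetic acid) -- illustrative only.
corpus = ["CCO", "c1ccccc1", "CC(=O)O"]

BOS, EOS = "^", "$"  # begin/end-of-sequence markers

# Count character bigrams, including the begin/end markers.
counts = defaultdict(lambda: defaultdict(int))
for smiles in corpus:
    chars = [BOS] + list(smiles) + [EOS]
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1

def sample_smiles(rng, max_len=20):
    """Generate one string token-by-token: each character is drawn
    conditioned only on the previous one, stopping at EOS."""
    out, prev = [], BOS
    for _ in range(max_len):
        nxt = counts[prev]
        chars, weights = zip(*nxt.items())
        c = rng.choices(chars, weights=weights)[0]
        if c == EOS:
            break
        out.append(c)
        prev = c
    return "".join(out)

print(sample_smiles(random.Random(0)))
```

The models in the table below follow the same loop but condition on the full prefix (via an LSTM hidden state, transformer attention, or a state-space recurrence) rather than just the previous character, and are trained on millions of molecules rather than three.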

| Paper | Year | Architecture | Key Idea |
|---|---|---|---|
| LSTM Drug-Like Generation | 2017 | LSTM | Character-level SMILES generation trained on 509K ChEMBL molecules |
| ChemGE | 2018 | Grammatical evolution | Population-based search over SMILES grammar productions |
| Back-Translation | 2021 | Transformer | Semi-supervised generation using NLP back-translation with unlabeled ZINC data |
| Chemformer | 2022 | BART | Denoising pre-training on 100M molecules for generation and property prediction |
| RetMol | 2023 | Transformer + retrieval | Few-shot property steering via retrieval-augmented generation |
| 3D CLMs | 2023 | Transformer | Generating 3D molecules, crystals, and proteins from XYZ/CIF/PDB strings |
| MolGen | 2024 | Transformer | SELFIES pre-training with chemical feedback alignment |
| S4 | 2024 | State-space model | Structured state spaces outperforming LSTMs and GPTs on bioactivity learning |
| GP-MoLFormer | 2025 | Transformer | 1.1B SMILES pre-training with pair-tuning for property optimization |
