Molecular Generation
Bar chart comparing PMO benchmark scores with and without chemical quality filters across five generative methods

Re-evaluating Sample Efficiency in Molecule Generation

A critical reassessment of the PMO benchmark for de novo molecule generation, showing that adding molecular weight, LogP, and diversity filters substantially re-ranks generative models, with Augmented Hill-Climb emerging as the top method.

Molecular Generation
Horizontal bar chart showing REINVENT 4 unified framework supporting seven generative model types

REINVENT 4: Open-Source Generative Molecule Design

Overview of REINVENT 4, an open-source generative molecular design framework from AstraZeneca that combines RNN and transformer generators with reinforcement learning, transfer learning, and curriculum learning for molecular optimization.

Molecular Generation
Bar chart showing deep generative architecture types for molecular design: RNN, VAE, GAN, RL, and hybrid methods

Review: Deep Learning for Molecular Design (2019)

An early and influential review cataloging 45 papers on deep generative modeling for molecules, comparing RNN, VAE, GAN, and reinforcement learning architectures across SMILES and graph-based representations.

Molecular Generation
Bar chart comparing RNN and Transformer Wasserstein distances across drug-like, peptide-like, and polymer-like generation tasks

RNNs vs Transformers for Molecular Generation Tasks

Compares RNN-based and Transformer-based chemical language models across three molecular generation tasks of increasing complexity, finding that RNNs excel at local features while Transformers handle large molecules better.

Molecular Generation
Diagram showing the dual formulation of S4 models with convolution during training and recurrence during generation for SMILES-based molecular design

S4 Structured State Space Models for De Novo Drug Design

This paper introduces structured state space sequence (S4) models to chemical language modeling, showing they combine the strengths of LSTMs (efficient recurrent generation) and GPTs (holistic sequence learning) for de novo molecular design.
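
The convolution/recurrence duality mentioned above can be illustrated with a toy linear state space model. This is a minimal sketch, assuming a diagonal state matrix and already-discretized parameters; the names `a`, `b`, `c`, and the helper functions are hypothetical and do not reproduce the actual S4 parameterization used in the paper.

```python
def ssm_recurrent(a, b, c, u):
    """Step-by-step view (generation): x_k = a * x_{k-1} + b * u_k, y_k = c . x_k."""
    x = [0.0] * len(a)
    ys = []
    for u_k in u:
        # Diagonal state update: each state dimension evolves independently.
        x = [a_i * x_i + b_i * u_k for a_i, x_i, b_i in zip(a, x, b)]
        ys.append(sum(c_i * x_i for c_i, x_i in zip(c, x)))
    return ys


def ssm_kernel(a, b, c, length):
    """Convolution kernel K_j = sum_i c_i * a_i**j * b_i for j = 0 .. length-1."""
    return [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(length)
    ]


def ssm_convolutional(a, b, c, u):
    """Whole-sequence view (training): y_t = sum_{j=0..t} K_j * u_{t-j}."""
    k = ssm_kernel(a, b, c, len(u))
    return [sum(k[j] * u[t - j] for j in range(t + 1)) for t in range(len(u))]


# Toy parameters (assumed values, chosen only for illustration).
a = [0.5, 0.9]
b = [1.0, 0.5]
c = [1.0, -1.0]
u = [1.0, 0.0, 2.0, -1.0]

y_rec = ssm_recurrent(a, b, c, u)
y_conv = ssm_convolutional(a, b, c, u)
```

Both functions compute the same outputs up to floating-point error. The convolutional form processes the whole sequence at once, while the recurrent form needs only the previous state to emit the next output.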

Molecular Representations
Bar chart comparing binding affinity scores across SMILES, AIS, and SMI+AIS hybrid tokenization strategies

SMI+AIS: Hybridizing SMILES with Environment Tokens

Proposes SMI+AIS, a hybrid molecular representation combining standard SMILES tokens with chemical-environment-aware Atom-In-SMILES tokens, demonstrating improved molecular generation for drug design targets.

Molecular Representations
Bar chart showing SMILES Pair Encoding reduces mean sequence length from 40 to 6 tokens

SPE: Data-Driven SMILES Substructure Tokenization

Introduces SMILES Pair Encoding (SPE), a data-driven tokenization algorithm that learns high-frequency SMILES substrings from ChEMBL to produce shorter, chemically interpretable token sequences for deep learning.
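
The pair-merging idea behind this kind of tokenizer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SPE implementation: it starts from single characters rather than atom-level tokens (so multi-character atoms such as `Cl` are not handled), uses a three-molecule toy corpus instead of ChEMBL, and the function names are hypothetical.

```python
from collections import Counter


def apply_merge(seq, left, right):
    """Replace each adjacent (left, right) pair in seq with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(corpus, num_merges):
    """Greedily learn up to num_merges merges of the most frequent adjacent pair."""
    seqs = [list(s) for s in corpus]  # assumption: character-level start
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # stop once no pair repeats
            break
        merges.append((left, right))
        seqs = [apply_merge(seq, left, right) for seq in seqs]
    return merges


def tokenize(smiles, merges):
    """Tokenize a SMILES string by replaying the learned merges in order."""
    seq = list(smiles)
    for left, right in merges:
        seq = apply_merge(seq, left, right)
    return seq


corpus = ["CCO", "CCN", "CCC"]  # toy corpus, for illustration only
merges = learn_merges(corpus, num_merges=5)
tokens = tokenize("CCO", merges)
```

On this toy corpus the only repeated pair is `C` followed by `C`, so a single merge is learned and `CCO` is split into two tokens instead of three.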

Molecular Representations
Bar chart showing SPMM supports bidirectional tasks: molecule to property, property to molecule, molecule optimization, and property interpolation

SPMM: A Bidirectional Molecular Foundation Model

SPMM pre-trains a dual-stream transformer on SMILES and 53 molecular property vectors using contrastive learning and cross-attention, enabling bidirectional structure-property generation, property prediction, and reaction prediction through a single model.

Molecular Representations
Bar chart showing CLM architecture publication trends from 2020 to 2024, with transformers overtaking RNNs

Systematic Review of Deep Learning CLMs (2020-2024)

A PRISMA-based systematic review of 72 papers on chemical language models for molecular generation, comparing architectures and biased (property-targeted) generation methods using MOSES metrics.

Molecular Representations
Diagram showing the t-SMILES pipeline from molecular graph fragmentation to binary tree traversal producing a string representation

t-SMILES: Tree-Based Fragment Molecular Encoding

t-SMILES represents molecules by fragmenting them into substructures, building full binary trees, and traversing them breadth-first to produce SMILES-type strings that reduce nesting depth and outperform SMILES, DeepSMILES, and SELFIES on generation benchmarks.
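
The breadth-first traversal step can be sketched as follows. This is a simplified illustration, assuming the fragmentation and binary-tree construction have already been done; the `Node` class, the `PAD` and `SEP` placeholder symbols, and the padding rule (every real node contributes two child slots) are assumptions for the example and are not taken from the t-SMILES specification.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

PAD = "&"  # placeholder symbol for an empty child slot (assumption)
SEP = "^"  # placeholder separator between fragments (assumption)


@dataclass
class Node:
    fragment: str                   # SMILES string of one substructure
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bfs_string(root):
    """Traverse the tree level by level and join the fragments into one string."""
    tokens = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            tokens.append(PAD)      # empty slot: emit padding, no children
            continue
        tokens.append(node.fragment)
        queue.append(node.left)     # always enqueue both child slots
        queue.append(node.right)
    return SEP.join(tokens)


# Hypothetical fragment tree: a benzene ring with two substituent fragments.
tree = Node("c1ccccc1", Node("CC"), Node("O"))
encoded = bfs_string(tree)
```

Because every real node contributes exactly two slots, the position of each token in the string determines its parent, which is what makes the flat string decodable back into a tree.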

Molecular Generation
Bar chart comparing Char-RNN and Molecular VAE on validity and novelty metrics

VAE for Automatic Chemical Design (2018 Seminal)

This foundational paper introduces a variational autoencoder (VAE) that encodes SMILES strings into a continuous latent space, allowing gradient-based optimization of molecular properties. Joint training with a property predictor organizes the latent space by chemical properties, and Bayesian optimization over the latent space discovers drug-like molecules with improved QED and synthetic accessibility.

Molecular Representations
Horizontal bar chart showing X-MOL achieves best performance across five molecular tasks

X-MOL: Pre-training on 1.1B Molecules for SMILES

X-MOL applies large-scale Transformer pre-training on 1.1 billion molecules with a generative SMILES-to-SMILES strategy, then fine-tunes for five molecular analysis tasks including property prediction, reaction analysis, and de novo generation.