Molecular Generation
Density plot showing training vs generated physicochemical property distribution

Molecular Sets (MOSES): A Generative Modeling Benchmark

MOSES introduces a comprehensive benchmarking platform for molecular generative models, offering standardized datasets, evaluation metrics, and baselines. By providing a unified measuring stick, it aims to resolve reproducibility challenges in chemical distribution learning.

Molecular Generation
GP-MoLFormer architecture showing large-scale SMILES input, linear-attention transformer decoder, and property optimization via pair-tuning soft prompts

GP-MoLFormer: Molecular Generation via Transformers

This methodological paper proposes a linear-attention transformer decoder trained on 1.1 billion molecules. It introduces pair-tuning for efficient property optimization and establishes empirical scaling laws relating inference compute to generation novelty.

Machine Learning
Comparison of linear interpolation (teleportation) showing double peaks versus displacement interpolation (transportation) showing smooth single peak

A Convexity Principle for Interacting Gases (McCann 1997)

A theoretical paper that introduces displacement interpolation (optimal transport) to establish a new convexity principle for energy functionals. It proves the uniqueness of ground states for interacting gases and generalizes the Brunn-Minkowski inequality, providing mathematical tools later used in flow matching and optimal transport-based generative models.

Generative Modeling
Visualization of probability density flow from initial distribution ρ₀ to target distribution ρ₁ over time through space

Building Normalizing Flows with Stochastic Interpolants

Proposes ‘InterFlow’, a method to learn continuous normalizing flows between arbitrary densities using stochastic interpolants. It avoids ODE backpropagation by minimizing a quadratic objective on the velocity field, enabling scalable ODE-based generation. On CIFAR-10, NLL matches ScoreSDE (2.99 bits per dim) with simulation-free training, though FID (10.27) trails dedicated image models (ScoreSDE: 2.92); the primary strength is tractable likelihood with efficient training cost.

Generative Modeling
Visualization comparing Optimal Transport (straight paths) vs Diffusion (curved paths) for Flow Matching

Flow Matching for Generative Modeling: Scalable CNFs

Introduces Flow Matching, a scalable method for training CNFs by regressing vector fields of conditional probability paths. It generalizes diffusion and enables Optimal Transport paths for straighter, more efficient sampling.

Generative Modeling
Visualization showing linear interpolation, learned ODE trajectories, and the reflow straightening process for rectified flow

Rectified Flow: Learning to Generate and Transfer Data

Introduces ‘Rectified Flow,’ a method to transport distributions via ODEs with straight paths. Uses a ‘reflow’ procedure to iteratively straighten trajectories, enabling high-quality 1-step generation with optional lightweight distillation.

Generative Modeling
Denoising Score Matching Intuition - Vectors point from corrupted samples back to clean data, approximating the score

Score Matching and Denoising Autoencoders: A Connection

This paper provides a rigorous probabilistic foundation for Denoising Autoencoders by proving they are mathematically equivalent to Score Matching on a kernel-smoothed data distribution. It derives a specific energy function for DAEs and justifies the use of tied weights.

Generative Modeling
Forward and Reverse SDE trajectories showing the diffusion process from data to noise and back

Score-Based Generative Modeling with SDEs (Song 2021)

This paper unifies previous score-based methods (SMLD and DDPM) under a continuous-time SDE framework. It introduces Predictor-Corrector samplers for improved generation and Probability Flow ODEs for near-exact likelihood computation, setting new records on CIFAR-10.

Computational Biology
DynamicFlow illustration showing the transformation from apo pocket to holo pocket with ligand molecule generation

DynamicFlow: Integrating Protein Dynamics into Drug Design

This paper introduces DynamicFlow, a full-atom stochastic flow matching model that simultaneously generates ligand molecules and transforms protein pockets from apo to holo states. It also contributes a new dataset of MD-simulated apo-holo pairs derived from MISATO.

Computational Biology
InvMSAFold generates diverse protein sequences from structure using a Potts model

InvMSAFold: Generative Inverse Folding with Potts Models

InvMSAFold replaces autoregressive decoding with a Potts model parameter generator, enabling diverse protein sequence sampling orders of magnitude faster than ESM-IF1.

Molecular Simulation
MOFFlow assembles metal nodes and organic linkers into Metal-Organic Framework structures

MOFFlow: Flow Matching for MOF Structure Prediction

MOFFlow is the first deep generative model tailored for Metal-Organic Framework (MOF) structure prediction. It utilizes Riemannian flow matching on SE(3) to assemble rigid building blocks (metal nodes and organic linkers), achieving higher accuracy and scalability than atom-based methods on large systems.

Molecular Representations
SELFIES robustness demonstration

Invalid SMILES Benefit Chemical Language Models: A Study

A 2024 Nature Machine Intelligence paper providing causal evidence that invalid SMILES generation improves chemical language model performance by filtering low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.