Computational Chemistry
GP-MoLFormer architecture showing large-scale SMILES input, linear-attention transformer decoder, and property optimization via pair-tuning soft prompts

GP-MoLFormer: Molecular Generation via Transformers

This methodological paper presents GP-MoLFormer, a linear-attention transformer decoder trained on SMILES for 1.1 billion molecules. It introduces pair-tuning, a soft-prompt method for parameter-efficient property optimization, and establishes empirical scaling laws relating inference compute to generation novelty.
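
A minimal sketch of the soft-prompt idea behind pair-tuning: the pretrained decoder stays frozen and only a small block of prompt embeddings is optimized. The class and parameter names here are illustrative, not the paper's actual API.

```python
import torch
import torch.nn as nn

class SoftPromptedDecoder(nn.Module):
    """Hypothetical wrapper: trainable prompt vectors prepended to frozen inputs."""
    def __init__(self, decoder: nn.Module, embed: nn.Embedding, n_prompt: int = 8):
        super().__init__()
        self.decoder = decoder              # frozen pretrained decoder stack
        self.embed = embed                  # frozen token-embedding table
        for p in self.parameters():
            p.requires_grad = False         # freeze everything registered so far
        # The only trainable parameters: a handful of prompt vectors.
        self.prompt = nn.Parameter(torch.randn(n_prompt, embed.embedding_dim) * 0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(token_ids)                           # (B, T, D)
        pfx = self.prompt.expand(token_ids.size(0), -1, -1)   # (B, P, D)
        return self.decoder(torch.cat([pfx, tok], dim=1))     # (B, P+T, D)
```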

Computational Chemistry
ChemBERTa masked language modeling visualization showing SMILES string CC(=O)O with masked tokens

ChemBERTa: Molecular Property Prediction via Transformers

This paper introduces ChemBERTa, a RoBERTa-based model pretrained on 77M SMILES strings. It systematically evaluates the impact of pretraining dataset size, tokenization strategies, and input representations (SMILES vs. SELFIES) on downstream MoleculeNet tasks, finding that performance scales positively with data size.
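
The masked-LM objective is easy to probe via HuggingFace; the checkpoint id below is the one commonly associated with the paper's release, but treat it as an assumption and substitute your own if it differs.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

# Mask one token of acetic acid's SMILES and ask the model to restore it.
smiles = f"CC(=O){tok.mask_token}"
inputs = tok(smiles, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
top = logits[0, mask_pos].topk(5).indices
print(tok.convert_ids_to_tokens(top.tolist()))  # "O" should rank highly
```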

Computational Chemistry
Chemformer pre-training on 100M SMILES strings flowing into BART model, which then enables reaction prediction and property prediction tasks

Chemformer: A Pre-trained Transformer for Computational Chemistry

This paper introduces Chemformer, a BART-based sequence-to-sequence model pre-trained on 100M SMILES strings using a ‘combined’ masking and augmentation task. It achieves state-of-the-art top-1 accuracy on reaction prediction benchmarks while significantly reducing training time through transfer learning.
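
The two ingredients of the ‘combined’ task are easy to sketch with RDKit: random-order SMILES enumeration for augmentation, plus token masking for the denoising target. The character-level tokenization, mask symbol, and masking rate below are illustrative simplifications, not the paper's exact setup.

```python
import random
from rdkit import Chem

def augment(smiles: str) -> str:
    """Return an alternative, randomly ordered SMILES for the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)

def mask_spans(tokens: list[str], p: float = 0.15, mask: str = "<MASK>") -> list[str]:
    """Replace random tokens with a mask symbol; the decoder learns to restore them."""
    return [mask if random.random() < p else t for t in tokens]

src = mask_spans(list(augment("CC(=O)Oc1ccccc1C(=O)O")))  # corrupted aspirin
tgt = "CC(=O)Oc1ccccc1C(=O)O"                              # canonical target
print("".join(src), "->", tgt)
```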

Generative Modeling
Visualization of probability density flow from initial distribution ρ₀ to target distribution ρ₁ over time through space

Building Normalizing Flows with Stochastic Interpolants

Proposes ‘InterFlow’, a method to learn continuous normalizing flows between arbitrary densities using stochastic interpolants. It avoids ODE backpropagation by minimizing a quadratic objective on the velocity field, enabling scalable ODE-based generation. On CIFAR-10, NLL matches ScoreSDE (2.99 bits per dim) with simulation-free training, though FID (10.27) trails dedicated image models (ScoreSDE: 2.92); the primary strength is tractable likelihood with efficient training cost.
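
A minimal, simulation-free training step, assuming the simple linear interpolant x_t = (1 - t) x0 + t x1 (the paper treats more general interpolants, including ones with an added noise term); the densities here are toy stand-ins.

```python
import torch
import torch.nn as nn

dim = 2
v = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(v.parameters(), lr=1e-3)

for step in range(1000):
    x0 = torch.randn(256, dim)                  # base density rho_0
    x1 = torch.randn(256, dim) + 3.0            # stand-in for target rho_1
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                  # interpolant I(t, x0, x1)
    dxt = x1 - x0                               # time derivative of I
    pred = v(torch.cat([xt, t], dim=-1))
    # Quadratic objective: regress v onto d/dt I -- no ODE backpropagation.
    loss = ((pred - dxt) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```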

Generative Modeling
Visualization comparing Optimal Transport (straight paths) vs Diffusion (curved paths) for Flow Matching

Flow Matching for Generative Modeling: Scalable CNFs

Introduces Flow Matching, a scalable method for training CNFs by regressing vector fields of conditional probability paths. It generalizes diffusion and enables Optimal Transport paths for straighter, more efficient sampling.
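
A sketch of the conditional Flow Matching loss with the paper's Optimal Transport path, x_t = (1 - (1 - s) t) x0 + t x1 with x0 ~ N(0, I), whose conditional target field is the straight line u_t = x1 - (1 - s) x0. The network and data are toy stand-ins.

```python
import torch
import torch.nn as nn

dim, s = 2, 1e-3  # s = sigma_min, the small terminal path width
net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def cfm_step(x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), 1)
    xt = (1 - (1 - s) * t) * x0 + t * x1           # OT conditional path
    ut = x1 - (1 - s) * x0                         # conditional target field
    pred = net(torch.cat([xt, t], dim=-1))
    return ((pred - ut) ** 2).mean()

for step in range(1000):
    data = torch.randn(256, dim) * 0.5 + 2.0       # toy data distribution
    loss = cfm_step(data)
    opt.zero_grad(); loss.backward(); opt.step()
```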

Machine Learning Fundamentals
Comparison of Residual Network vs ODE Network architectures showing discrete layers versus continuous transformations

Neural ODEs: Continuous-Depth Deep Learning Models

This paper replaces discrete network layers with a continuous-depth model: a neural network parameterizes the derivative of the hidden state, and an ODE solver computes the output, allowing adaptive computation depth and constant memory cost during training via the adjoint sensitivity method. It also introduces Continuous Normalizing Flows and latent ODEs for irregularly-sampled time series.
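
The torchdiffeq library released alongside the paper makes the pattern concrete: define dy/dt as a network and differentiate through the solve with the adjoint method. A toy block, assuming torchdiffeq is installed:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint  # O(1)-memory backward pass

class ODEFunc(nn.Module):
    """Defines dy/dt = f(t, y): the 'continuous layer'."""
    def __init__(self, dim: int = 2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))

    def forward(self, t, y):
        return self.net(y)

func = ODEFunc()
y0 = torch.randn(16, 2)             # batch of initial states
t = torch.tensor([0.0, 1.0])        # integrate from t=0 to t=1
y1 = odeint(func, y0, t)[-1]        # final state via an adaptive solver
loss = y1.pow(2).mean()
loss.backward()                     # gradients computed by the adjoint ODE
```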

Generative Modeling
Denoising Score Matching Intuition - Vectors point from corrupted samples back to clean data, approximating the score

Score Matching and Denoising Autoencoders: A Connection

This paper provides a rigorous probabilistic foundation for Denoising Autoencoders by proving they are mathematically equivalent to Score Matching on a kernel-smoothed data distribution. It derives a specific energy function for DAEs and justifies the use of tied weights.
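
The equivalence is easy to see numerically: a network trained to point from a Gaussian-corrupted sample back toward the clean one is regressing onto the score of the smoothed density, (x - x̃)/σ². A toy sketch:

```python
import torch
import torch.nn as nn

dim, sigma = 2, 0.5
score = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(score.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(256, dim) * 0.3 + 1.0     # clean data samples
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps                 # sample from kernel-smoothed density
    # Denoising score matching: point back toward the clean sample.
    target = (x - x_noisy) / sigma**2         # equals -eps / sigma
    loss = ((score(x_noisy) - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```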

Computational Biology
DynamicFlow illustration showing the transformation from apo pocket to holo pocket with ligand molecule generation

DynamicFlow: Integrating Protein Dynamics into Drug Design

This paper introduces DynamicFlow, a full-atom stochastic flow matching model that simultaneously generates ligand molecules and transforms protein pockets from apo to holo states. It also contributes a new dataset of MD-simulated apo-holo pairs derived from MISATO.
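
Schematically (this is not the paper's code), one velocity field can drive both flows: ligand atoms move from noise to the bound pose while pocket atoms move from apo to holo coordinates. Atom counts, shapes, and names below are illustrative.

```python
import torch
import torch.nn as nn

n_lig, n_pocket = 20, 50                       # illustrative atom counts
vel = nn.Sequential(nn.Linear(3 * (n_lig + n_pocket) + 1, 256), nn.SiLU(),
                    nn.Linear(256, 3 * (n_lig + n_pocket)))

def joint_fm_loss(lig_holo, apo, holo):
    t = torch.rand(lig_holo.size(0), 1)
    lig0 = torch.randn_like(lig_holo)          # ligand starts from noise
    lig_t = (1 - t) * lig0 + t * lig_holo
    pocket_t = (1 - t) * apo + t * holo        # pocket flows apo -> holo
    state = torch.cat([lig_t, pocket_t, t], dim=-1)
    target = torch.cat([lig_holo - lig0, holo - apo], dim=-1)
    return ((vel(state) - target) ** 2).mean()

lig = torch.randn(8, 3 * n_lig)                # flattened toy coordinates
apo, holo = torch.randn(8, 3 * n_pocket), torch.randn(8, 3 * n_pocket)
print(joint_fm_loss(lig, apo, holo))
```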

Computational Chemistry
InstructMol architecture showing molecular graph and text inputs feeding through two-stage training to produce property predictions, descriptions, and reactions

InstructMol: Multi-Modal Molecular LLM for Drug Discovery

InstructMol integrates a pre-trained molecular graph encoder (MoleculeSTM) with a Vicuna-7B LLM using a linear projector. It employs a two-stage training process (alignment pre-training followed by task-specific instruction tuning with LoRA) to excel at property prediction, description generation, and reaction analysis.
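
The stage-1 alignment reduces to a single trainable linear map between embedding spaces. The sketch below assumes Vicuna-7B's 4096-dim hidden size and a typical 300-dim graph-encoder output; both are placeholders, as are the random tensors standing in for the frozen encoders.

```python
import torch
import torch.nn as nn

d_graph, d_llm = 300, 4096
projector = nn.Linear(d_graph, d_llm)          # the only module trained in stage 1

graph_tokens = torch.randn(1, 32, d_graph)     # frozen graph-encoder output (stand-in)
text_embeds = torch.randn(1, 64, d_llm)        # frozen LLM prompt embeddings (stand-in)
mol_embeds = projector(graph_tokens)           # lift graph tokens into LLM space
llm_input = torch.cat([mol_embeds, text_embeds], dim=1)  # sequence fed to the LLM
```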

Computational Biology
InvMSAFold generates diverse protein sequences from structure using a Potts model

InvMSAFold: Generative Inverse Folding with Potts Models

InvMSAFold replaces autoregressive decoding with a network that generates the parameters of a Potts model over sequences, enabling sampling of diverse protein sequences orders of magnitude faster than ESM-IF1.
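
Once fields h and couplings J are emitted for a structure, sequences can be drawn with cheap Gibbs sweeps rather than autoregressive decoding. Random parameters stand in for the generator's output in this sketch.

```python
import torch

L, q = 50, 20                                  # sequence length, amino-acid alphabet
h = torch.randn(L, q) * 0.1                    # fields
J = torch.randn(L, L, q, q) * 0.01
J = (J + J.permute(1, 0, 3, 2)) / 2            # symmetrize couplings

def gibbs_sweep(seq: torch.Tensor) -> torch.Tensor:
    for i in range(L):
        # Conditional logits for each residue type at position i.
        logits = h[i] + sum(J[i, j, :, seq[j]] for j in range(L) if j != i)
        seq[i] = torch.distributions.Categorical(logits=logits).sample()
    return seq

seq = torch.randint(q, (L,))
for sweep in range(20):
    seq = gibbs_sweep(seq)
print(seq)  # one sampled sequence; restart from new seeds for diversity
```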

Computational Chemistry
MERMaid pipeline diagram showing PDF processing through VisualHeist segmentation, DataRaider VLM mining, and KGWizard graph construction to produce chemical knowledge graphs

MERMaid: Multimodal Chemical Reaction Mining from PDFs

MERMaid leverages fine-tuned vision models and VLM reasoning to mine chemical reaction data directly from PDF figures and tables. By handling context inference and coreference resolution, it builds high-fidelity knowledge graphs with 87% end-to-end accuracy.
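
Purely as a schematic of the data flow (the function bodies and toy record are placeholders, not the project's API), the pipeline composes three stages:

```python
from dataclasses import dataclass, field

@dataclass
class ReactionRecord:
    reactants: list[str]
    products: list[str]
    conditions: dict = field(default_factory=dict)

def visualheist(pdf_path: str) -> list[str]:
    """Stage 1 (placeholder): segment reaction figures/tables out of a PDF."""
    return [f"{pdf_path}#figure-1"]            # would return cropped images

def dataraider(figure: str) -> ReactionRecord:
    """Stage 2 (placeholder): VLM extraction with coreference resolution."""
    return ReactionRecord(["CC(=O)O", "OCC"], ["CC(=O)OCC"], {"catalyst": "H2SO4"})

def kgwizard(records: list[ReactionRecord]) -> list[tuple]:
    """Stage 3: flatten records into (subject, relation, object) triples."""
    return [(r.reactants, "yields", r.products) for r in records]

triples = kgwizard([dataraider(f) for f in visualheist("paper.pdf")])
print(triples)
```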

Computational Chemistry
Overview of the OCSAug pipeline showing DDPM training, masked RePaint augmentation, and OCSR fine-tuning phases.

OCSAug: Diffusion-Based Augmentation for Hand-Drawn OCSR

OCSAug uses a Denoising Diffusion Probabilistic Model (DDPM) and the RePaint algorithm with custom masking to generate synthetic hand-drawn chemical structure images, improving OCSR performance by factors of 1.918 to 3.820 on the DECIMER benchmark.
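
The core of a RePaint-style masked reverse step is a one-line blend: re-noise the known region from the clean image, synthesize the masked region with the trained DDPM, and combine via the mask. `ddpm_reverse_step` below is a placeholder for a trained model's sampler.

```python
import torch

def repaint_step(x_t, x_orig, mask, t, alphas_cumprod, ddpm_reverse_step):
    """mask == 1 marks pixels to keep; mask == 0 marks regions to synthesize."""
    a = alphas_cumprod[t - 1]
    noise = torch.randn_like(x_orig)
    # Known pixels: forward-diffuse the clean original to noise level t-1.
    x_known = a.sqrt() * x_orig + (1 - a).sqrt() * noise
    # Unknown pixels: one reverse step of the trained DDPM (placeholder call).
    x_unknown = ddpm_reverse_step(x_t, t)
    return mask * x_known + (1 - mask) * x_unknown
```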