Time Series Forecasting
Forecasting comparison of different neural architectures on the Multiscale Lorenz-96 system

Optimizing Sequence Models for Dynamical Systems

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

Computational Chemistry
ChemBERTa-3 visualization showing muscular arms lifting a stack of building blocks representing molecular data with SMILES notation, symbolizing the power and scalability of the open-source training framework

ChemBERTa-3: Open Source Chemical Foundation Models

ChemBERTa-3 provides a unified, scalable infrastructure for pretraining and benchmarking chemical foundation models. It addresses reproducibility gaps in previous studies like MoLFormer through standardized scaffold splitting and open-source tooling.

Computational Chemistry
Chemical structures and molecular representations feeding into a neural network model that processes atomized chemical knowledge

ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge

ChemDFM-R is a 14B-parameter chemical reasoning model that integrates a 101B-token dataset of atomized chemical knowledge. Using a mix-sourced distillation strategy and domain-specific reinforcement learning, it outperforms similarly sized models and DeepSeek-R1 on ChemEval.

Computational Chemistry
ChemBERTa-2 visualization showing flowing SMILES strings in blue tones representing molecular data streams

ChemBERTa-2: Scaling Molecular Transformers to 77M

This work investigates the scaling hypothesis for molecular transformers, training RoBERTa models on 77M SMILES from PubChem. It compares Masked Language Modeling (MLM) against Multi-Task Regression (MTR) pretraining, finding that MTR yields better downstream performance but is computationally heavier.

Computational Chemistry
GP-MoLFormer architecture showing large-scale SMILES input, linear-attention transformer decoder, and property optimization via pair-tuning soft prompts

GP-MoLFormer: Molecular Generation via Transformers

This methodological paper proposes a linear-attention transformer decoder trained on 1.1 billion molecules. It introduces pair-tuning for efficient property optimization and establishes empirical scaling laws relating inference compute to generation novelty.

Computational Chemistry
ChemBERTa masked language modeling visualization showing SMILES string CC(=O)O with masked tokens

ChemBERTa: Molecular Property Prediction via Transformers

This paper introduces ChemBERTa, a RoBERTa-based model pretrained on 77M SMILES strings. It systematically evaluates the impact of pretraining dataset size, tokenization strategies, and input representations (SMILES vs. SELFIES) on downstream MoleculeNet tasks, finding that performance scales positively with data size.

Computational Chemistry
Chemformer pre-training on 100M SMILES strings flowing into BART model, which then enables reaction prediction and property prediction tasks

Chemformer: A Pre-trained Transformer for Comp Chem

This paper introduces Chemformer, a BART-based sequence-to-sequence model pre-trained on 100M molecules using a ‘combined’ masking and augmentation task. It achieves top-1 accuracy on reaction prediction benchmarks while significantly reducing training time through transfer learning.

Generative Modeling
Visualization of probability density flow from initial distribution ρ₀ to target distribution ρ₁ over time through space

Building Normalizing Flows with Stochastic Interpolants

Proposes ‘InterFlow’, a method to learn continuous normalizing flows between arbitrary densities using stochastic interpolants. It avoids ODE backpropagation by minimizing a quadratic objective on the velocity field, enabling scalable ODE-based generation. On CIFAR-10, NLL matches ScoreSDE (2.99 nats) with simulation-free training, though FID (10.27) trails dedicated image models (ScoreSDE: 2.92); the primary strength is tractable likelihood with efficient training cost.

Generative Modeling
Visualization comparing Optimal Transport (straight paths) vs Diffusion (curved paths) for Flow Matching

Flow Matching for Generative Modeling: Scalable CNFs

Introduces Flow Matching, a scalable method for training CNFs by regressing vector fields of conditional probability paths. It generalizes diffusion and enables Optimal Transport paths for straighter, more efficient sampling.

Machine Learning Fundamentals
Comparison of Residual Network vs ODE Network architectures showing discrete layers versus continuous transformations

Neural ODEs: Continuous-Depth Deep Learning

This paper replaces discrete network layers with continuous ordinary differential equations (ODEs), allowing for adaptive computation depth and constant memory cost during training via the adjoint sensitivity method. It introduces Continuous Normalizing Flows and latent ODEs for time-series.

Generative Modeling
Denoising Score Matching Intuition - Vectors point from corrupted samples back to clean data, approximating the score

Score Matching and Denoising Autoencoders

This paper provides a rigorous probabilistic foundation for Denoising Autoencoders by proving they are mathematically equivalent to Score Matching on a kernel-smoothed data distribution. It derives a specific energy function for DAEs and justifies the use of tied weights.

Generative Modeling
Forward and Reverse SDE trajectories showing the diffusion process from data to noise and back

Score-Based Generative Modeling with SDEs

This paper unifies previous score-based methods (SMLD and DDPM) under a continuous-time SDE framework. It introduces Predictor-Corrector samplers for improved generation and Probability Flow ODEs for near-exact likelihood computation, setting new records on CIFAR-10.