Time Series Forecasting
Chart comparing the forecasting performance of different neural architectures on the Multiscale Lorenz-96 system

Optimizing Sequence Models for Dynamical Systems

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.
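The recurrent building block in question, a Recurrent Highway Network cell, can be sketched in a few lines. This is a minimal single-micro-step version with illustrative weight names, not the paper's exact attention-augmented architecture:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def rhn_cell(x, s, Wh, Wt, Rh, Rt):
    """One Recurrent Highway Network micro-step (sketch): the new state
    is a gated mix of a candidate state and the carried-over state."""
    h = np.tanh(x @ Wh + s @ Rh)   # candidate state
    t = sigmoid(x @ Wt + s @ Rt)   # transform gate
    return h * t + s * (1 - t)     # highway update: carry gate = 1 - t
```

Stacking several such micro-steps per time step gives the "deep transition" that distinguishes RHNs from plain RNN cells; the paper's variant augments this recurrence with attention over past states.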

Molecular Simulation
Diagram showing conformation autoencoder architecture with internal coordinate encoding and decoding

Conformation Autoencoder for 3D Molecules

A conformation autoencoder converts molecular 3D arrangements into fixed-size latent representations using internal coordinates and graph neural networks, enabling conformer generation and spatial property optimization.
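Internal coordinates (bond lengths, bond angles, torsions) are the rotation- and translation-invariant quantities such an encoder consumes. A minimal sketch of computing them from Cartesian positions (function names are illustrative, not the paper's API):

```python
import numpy as np

def bond_length(a, b):
    # Distance between two atoms.
    return np.linalg.norm(b - a)

def bond_angle(a, b, c):
    # Angle at atom b (radians) formed by a-b-c.
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def dihedral(a, b, c, d):
    # Signed torsion angle (radians) about the b-c axis.
    b1, b2, b3 = b - a, c - b, d - c
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(np.dot(m1, n2), np.dot(n1, n2))
```

A full molecule is then described by one length per bond, one angle per bonded triple, and one torsion per bonded quadruple, which is what makes a fixed-size latent encoding of the 3D arrangement feasible.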

Machine Learning
Three-panel diagram showing symmetry group decomposition, equivariant mapping from world states to representations, and block-diagonal disentangled decomposition

Defining Disentangled Representations via Group Theory

Proposes the first principled mathematical definition of disentangled representations by connecting symmetry group decompositions to independent subspaces in a representation’s vector space.
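The definition can be stated compactly; the following is a paraphrased sketch with assumed notation, not the paper's verbatim statement:

```latex
% The symmetry group acting on world states decomposes as a direct product:
G = G_1 \times G_2 \times \cdots \times G_n
% The map f from world states W to representation space V is equivariant:
f(g \cdot w) = g \cdot f(w) \quad \forall g \in G,\; w \in W
% The representation is disentangled w.r.t. this decomposition if
V = V_1 \oplus V_2 \oplus \cdots \oplus V_n
% where each subgroup G_i acts nontrivially only on its subspace V_i
% (and trivially on every V_j with j \neq i).
```

In the block-diagonal picture from the figure, each group action restricted to its own subspace gives one independent factor of variation.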

Machine Learning
Three-panel diagram showing DGCNN point cloud processing: input space k-NN graph, EdgeConv operation, and semantic feature space clustering

DGCNN: Dynamic Graph CNN for Point Cloud Learning

DGCNN introduces the EdgeConv operator, which constructs k-nearest neighbor graphs dynamically in feature space at each network layer. This enables the model to capture both local geometry and long-range semantic relationships for point cloud classification and segmentation.
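The EdgeConv operator described above can be sketched directly: build a k-NN graph from the *current* features, form per-edge features `[x_i, x_j - x_i]`, apply a shared MLP, and max-aggregate. This is a plain-numpy illustration, not the paper's optimized implementation:

```python
import numpy as np

def edge_conv(X, k, mlp):
    """One EdgeConv layer (sketch). X: (N, F) point features; mlp maps
    (k, 2F) edge features to (k, F'). The k-NN graph is recomputed from
    X itself, so deeper layers group points by semantic similarity."""
    N = X.shape[0]
    # Pairwise squared distances in the current feature space.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)           # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]   # k nearest neighbors per point
    out = []
    for i in range(N):
        e = np.concatenate([np.repeat(X[i:i + 1], k, axis=0),  # x_i
                            X[nbrs[i]] - X[i]], axis=1)        # x_j - x_i
        out.append(mlp(e).max(axis=0))     # symmetric max aggregation
    return np.stack(out)
```

Because `nbrs` is derived from features rather than fixed input coordinates, distant but semantically similar points can become neighbors in later layers, which is the "dynamic" part of the name.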

Time Series Forecasting
LSTNet architecture diagram showing convolutional, recurrent, recurrent-skip, and autoregressive components

LSTNet: Long- and Short-Term Time Series Network

LSTNet is a deep learning framework for multivariate time series forecasting that uses convolutional layers for local dependencies, a recurrent-skip component for periodic long-term patterns, and an autoregressive component for scale robustness.
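The scale-robustness trick is the simplest piece to illustrate: the final forecast adds a per-series linear autoregressive term to the neural output, so predictions track the raw input scale. A hedged sketch (variable names are illustrative):

```python
import numpy as np

def lstnet_output(h_nonlinear, X, w_ar, b_ar):
    """Final LSTNet-style forecast (sketch): neural prediction plus an
    independent linear AR model per series.
    X: (T, n_series) input window; w_ar: (p,) shared AR weights."""
    p = len(w_ar)
    ar = X[-p:].T @ w_ar + b_ar    # (n_series,) AR forecast from last p values
    return h_nonlinear + ar
```

The recurrent-skip component works analogously by feeding hidden states from `t - p, t - 2p, ...` (with `p` the known period, e.g. 24 for hourly data) into the recurrence, so daily or weekly patterns need not survive a long gradient path.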

Natural Language Processing
SpeechT5 architecture diagram showing shared encoder-decoder with speech and text pre/post-nets

SpeechT5: Unified Speech-Text Pre-Training Framework

SpeechT5 proposes a unified encoder-decoder pre-training framework that jointly learns from unlabeled speech and text data, achieving strong results on ASR, TTS, speech translation, voice conversion, speech enhancement, and speaker identification.

Natural Language Processing
Diagram showing the three-step nested pipeline from small-scale training to large-model loss prediction across data mixtures

Data Mixing Laws for LM Pretraining Optimization

Ye et al. find that language model loss on each domain follows an exponential function of the training mixture proportions. By nesting these data mixing laws with scaling laws for training steps and model size, small-scale experiments can predict and optimize mixtures for large models; their optimized mixture matches the performance of the default mixture trained for 48% more steps.
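The functional form makes mixture optimization cheap once coefficients are fitted: predict per-domain loss for any candidate mixture, then search the simplex. The coefficients below are illustrative numbers for a two-domain toy, not values from the paper:

```python
import numpy as np

# Hypothetical fitted coefficients for 2 domains (illustrative only).
c = np.array([1.8, 2.1])            # irreducible loss per domain
k = np.array([0.9, 0.7])            # scale of the mixture-dependent term
T = np.array([[-2.0, -0.3],         # cross-domain sensitivity matrix
              [-0.4, -1.5]])

def domain_losses(r):
    # Data mixing law (sketch): per-domain loss is exponential in the mixture r.
    return c + k * np.exp(T @ r)

# Grid-search the 1-simplex for the mixture minimizing average predicted loss.
grid = np.linspace(0, 1, 101)
best = min(grid, key=lambda p: domain_losses(np.array([p, 1 - p])).mean())
```

In the paper's pipeline, the analogous fit is done from small-scale runs and then extrapolated with step and model-size scaling laws before any large run is launched.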

Natural Language Processing
Bar chart comparing baseline and DoReMi domain weights across 12 Pile domains, showing Pile-CC upweighted 5.4x

DoReMi: Optimizing Data Mixtures for LM Pretraining

Xie et al. propose DoReMi, which trains a 280M proxy model using Group DRO to find optimal domain mixture weights, then uses those weights to train an 8B model 2.6x faster with 6.5% better downstream accuracy.
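The heart of DoReMi is a Group DRO-style reweighting of domains by how much the proxy model underperforms a reference model. A simplified sketch of one exponentiated-gradient step (details such as step size and smoothing are assumptions here):

```python
import numpy as np

def doremi_update(alpha, excess_loss, eta=1.0, smoothing=1e-3):
    """One Group DRO-style domain reweighting step (sketch).
    excess_loss: per-domain proxy-model loss minus reference-model loss."""
    a = alpha * np.exp(eta * excess_loss)  # upweight hard (high-excess) domains
    a = a / a.sum()                        # renormalize to the simplex
    u = np.ones_like(a) / len(a)           # smooth toward uniform for stability
    return (1 - smoothing) * a + smoothing * u
```

Averaging these weights over proxy training yields the final mixture used to train the large model.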

Natural Language Processing
Chart showing effective data as a function of epochs with exponential decay, with the 4-epoch safe zone and 16-epoch half-life marked

Scaling Data-Constrained Language Models

Muennighoff et al. train 400+ models to study how data repetition affects scaling. They propose a data-constrained scaling law with exponential decay for repeated tokens, finding that up to 4 epochs have negligible impact on loss, returns diminish around 16 epochs, and code augmentation provides a 2x effective data boost.
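The decay law can be sketched as an "effective data" formula: each additional epoch of repeated tokens is worth exponentially less fresh data. The constant below (~15) is the flavor of the paper's fitted value; treat the exact number as an assumption:

```python
import math

def effective_data(unique_tokens, epochs, r_star=15.4):
    """Effective fresh-data equivalent under repetition (sketch of the
    paper's exponential-decay form). epochs=1 means no repetition."""
    repeats = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-repeats / r_star)))
```

With this form, 4 epochs are worth nearly 4x the unique tokens (the "safe zone"), while the value of extra repetition saturates around the ~16-epoch half-life, matching the findings summarized above.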

Natural Language Processing
Table comparing multi-task mixing strategies showing examples-proportional and temperature-scaled mixing results

T5: Exploring Transfer Learning Limits

Raffel et al. introduce T5, a unified text-to-text framework for NLP transfer learning. Through systematic ablation of architectures, pre-training objectives, datasets, and multi-task mixing strategies, they identify best practices and scale to 11B parameters, achieving state-of-the-art results across multiple benchmarks.
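One of the ablated mixing strategies is easy to state in code: examples-proportional mixing with an artificial dataset-size limit K, optionally softened by a temperature. The default values below follow my reading of the paper's setup and should be treated as assumptions:

```python
import numpy as np

def mixing_rates(sizes, K=2**19, T=2.0):
    """T5-style multi-task mixing (sketch): sample each task proportional
    to its example count clipped at K, then temperature-scale by 1/T so
    huge datasets do not drown out small ones."""
    r = np.minimum(np.asarray(sizes, dtype=float), K)  # artificial size limit
    r = r ** (1.0 / T)                                 # T > 1 flattens the mix
    return r / r.sum()
```

With `T=1` this reduces to clipped examples-proportional mixing; larger `T` moves the mixture toward uniform.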

Natural Language Processing
Diagram showing block-recurrent transformer architecture with vertical and horizontal processing directions

Block-Recurrent Transformers for Long Sequences

A transformer architecture that applies a recurrent cell over blocks of tokens, achieving linear complexity in sequence length while outperforming Transformer-XL baselines on PG19, arXiv, and GitHub datasets.
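The control flow behind the linear complexity claim is simple: slide over the sequence one block at a time and carry a recurrent state between blocks. The sketch below uses toy stand-ins (a tanh mix) for the attention and gating inside the real cell:

```python
import numpy as np

def block_recurrent(x, block_len, W_state, W_in):
    """Block-recurrent sketch: T/block_len sequential steps, each of
    bounded cost, so total cost is linear in sequence length T.
    x: (T, d) token features."""
    T, d = x.shape
    state = np.zeros(d)
    outputs = []
    for start in range(0, T, block_len):
        block = x[start:start + block_len]
        # Tokens attend within the block and read the carried state
        # (the tanh is a stand-in for cross-attention to the state).
        outputs.append(np.tanh(block @ W_in + state))
        # Update the state from the block (stand-in for the recurrent cell).
        state = np.tanh(state @ W_state + block.mean(axis=0))
    return np.concatenate(outputs), state
```

Within each block the real model still uses full self-attention, so the quadratic cost is confined to `block_len**2` rather than `T**2`.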

Molecular Simulation
Diagram showing the Ewald decomposition of long-range interactions into short-range and Fourier-space components for molecular graph neural networks

Ewald Message Passing for Molecular Graphs

Proposes Ewald message passing, a Fourier-space scheme inspired by Ewald summation that captures long-range interactions in molecular graphs. The method is architecture-agnostic and improves energy MAEs by 10% on OC20 and 16% on OE62 across four baseline GNN models.
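The Fourier-space trick can be sketched concretely: aggregate all node features into structure factors over a set of frequencies, then read them back at each node, giving all-to-all messages in O(N·K) instead of O(N²). Function and variable names below are illustrative, not the paper's API:

```python
import numpy as np

def fourier_messages(pos, h, kvecs, filt):
    """Fourier-space message passing (sketch of the Ewald-inspired idea).
    pos: (N, 3) positions; h: (N, F) node features;
    kvecs: (K, 3) frequency vectors; filt: (K,) learnable frequency filter."""
    phase = np.exp(1j * pos @ kvecs.T)   # (N, K) plane-wave phases
    S = phase.conj().T @ h               # (K, F) structure factors
    msg = (phase * filt) @ S             # (N, F) long-range messages
    return msg.real
```

This reproduces the dense pairwise sum `sum_j K(r_i - r_j) h_j` for the kernel implied by `filt`, which is why it can be bolted onto any short-range GNN as an extra message channel.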