Machine Learning Fundamentals
Diagram showing the three-step nested pipeline from small-scale training to large-model loss prediction across data mixtures

Data Mixing Laws for LM Pretraining Optimization

Ye et al. find that language model loss on each domain follows an exponential function of training mixture proportions. By nesting data mixing laws with scaling laws for steps and model size, small-scale experiments can predict and optimize mixtures for large models, achieving 48% training efficiency gains.
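
To make the idea concrete, here is a minimal sketch of optimizing a two-domain mixture under an exponential per-domain law. The functional form `c + k * exp(t . r)` is one way to write "an exponential function of the mixture proportions" and is assumed here; the parameter values are hypothetical and chosen only so the example has an interior optimum.

```python
import numpy as np

def domain_loss(mixture, c, k, t):
    # Assumed exponential law: a constant plus a scaled exponential of a
    # linear combination of the mixture proportions.
    return c + k * np.exp(np.dot(t, mixture))

# Hypothetical fitted parameters for two validation domains (illustrative only).
params = [
    {"c": 1.5, "k": 0.8, "t": np.array([-2.0, 0.5])},  # loss on domain A
    {"c": 1.5, "k": 0.8, "t": np.array([0.5, -2.0])},  # loss on domain B
]

def total_loss(p):
    # p is the proportion of domain A; the remainder goes to domain B.
    mixture = np.array([p, 1.0 - p])
    return sum(domain_loss(mixture, **q) for q in params)

# Grid search over candidate mixtures using the cheap fitted law
# instead of training a model per mixture.
grid = np.linspace(0.0, 1.0, 101)
losses = np.array([total_loss(p) for p in grid])
best_p = float(grid[np.argmin(losses)])
```

Because the two hypothetical domains are mirror images of each other, the search settles on an even split; with real fitted parameters the optimum would generally be asymmetric.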

Bar chart comparing baseline and DoReMi domain weights across 12 Pile domains, showing Pile-CC upweighted 5.4x

DoReMi: Optimizing Data Mixtures for LM Pretraining

Xie et al. propose DoReMi, which trains a 280M proxy model using Group DRO to find optimal domain mixture weights, then uses those weights to train an 8B model 2.6x faster with 6.5% better downstream accuracy.
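
The domain reweighting step can be sketched as an exponentiated-gradient update, a standard way to implement Group DRO over a simplex of weights. This is a simplified illustration, not the authors' code: it assumes per-domain losses are available from the proxy model and from a separately trained reference model, and the names `step_size` and `smoothing` are hypothetical.

```python
import numpy as np

def update_domain_weights(weights, proxy_loss, reference_loss,
                          step_size=1.0, smoothing=1e-3):
    # Excess loss: how much worse the proxy is than the reference on each
    # domain, clipped at zero so well-learned domains are not upweighted.
    excess = np.maximum(proxy_loss - reference_loss, 0.0)
    # Multiplicative (exponentiated-gradient) update, done in log space.
    logits = np.log(weights) + step_size * excess
    new_weights = np.exp(logits - logits.max())
    new_weights /= new_weights.sum()
    # Mix with the uniform distribution so no domain weight collapses to zero.
    uniform = np.full_like(new_weights, 1.0 / new_weights.size)
    return (1.0 - smoothing) * new_weights + smoothing * uniform

# Toy usage with three domains; only the first has positive excess loss.
weights = np.full(3, 1.0 / 3.0)
updated = update_domain_weights(
    weights,
    proxy_loss=np.array([3.0, 2.0, 2.5]),
    reference_loss=np.array([2.0, 2.0, 2.6]),
)
```

In this toy call the first domain gains weight at the expense of the other two, which is the qualitative behavior the method relies on.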

Chart showing effective data as a function of epochs with exponential decay, with the 4-epoch safe zone and 16-epoch half-life marked

Scaling Data-Constrained Language Models

Muennighoff et al. train 400+ models to study how data repetition affects scaling. They propose a data-constrained scaling law in which the value of repeated tokens decays exponentially, finding that repeating data for up to 4 epochs has negligible impact on loss compared with unique data, that returns diminish sharply around 16 epochs, and that code augmentation provides a 2x effective data boost.
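
A small sketch of how exponentially decaying returns on repeated tokens might be computed. The saturating form below and the default `r_star=15.0` are assumptions for illustration, not values taken from this summary; `r_star` plays the role of a decay constant for repetitions.

```python
import math

def effective_tokens(unique_tokens, epochs, r_star=15.0):
    # Number of repetitions beyond the first pass over the data.
    repeats = max(epochs - 1.0, 0.0)
    # Each extra pass is worth less than the one before; the total value
    # of repeats saturates at r_star times the unique token count.
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

one_epoch = effective_tokens(100.0, 1)
four_epochs = effective_tokens(100.0, 4)
many_epochs = effective_tokens(100.0, 1000)
```

Under these assumed values, four epochs over 100 unique tokens are worth more than 90% of 400 fresh tokens, while very many epochs saturate at a fixed multiple of the unique data.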

Bar chart comparing average benchmark accuracy across seven domain combination configurations showing diversity improves performance

SlimPajama-DC: Data Combinations for LLM Training

Shen et al. empirically analyze how different domain combinations and deduplication strategies in the SlimPajama dataset affect 1.3B model performance. Global deduplication across sources outperforms local deduplication, and increasing domain diversity consistently improves average accuracy, with findings transferring to 7B scale.

Table comparing multi-task mixing strategies showing examples-proportional and temperature-scaled mixing results

T5: Exploring Transfer Learning Limits

Raffel et al. introduce T5, a unified text-to-text framework for NLP transfer learning. Through systematic ablation of architectures, pre-training objectives, datasets, and multi-task mixing strategies, they identify best practices and scale to 11B parameters, achieving state-of-the-art results across multiple benchmarks.
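
The two multi-task mixing strategies named in the table can be sketched as follows. Examples-proportional mixing samples each task in proportion to its dataset size, capped at an artificial limit; temperature-scaled mixing raises those rates to the power 1/T and renormalizes. The function names and the toy dataset sizes are illustrative, not taken from the paper's code.

```python
def examples_proportional(sizes, limit):
    # Cap each dataset size so very large tasks do not dominate the mixture.
    clipped = [min(n, limit) for n in sizes]
    total = sum(clipped)
    return [c / total for c in clipped]

def temperature_scaled(sizes, limit, temperature):
    # Flatten the capped rates: temperature 1 leaves them unchanged,
    # larger temperatures move the mixture toward uniform sampling.
    rates = examples_proportional(sizes, limit)
    scaled = [r ** (1.0 / temperature) for r in rates]
    total = sum(scaled)
    return [s / total for s in scaled]

# Toy dataset sizes for three tasks (hypothetical numbers).
sizes = [1000, 100, 10]
proportional = examples_proportional(sizes, limit=500)
flattened = temperature_scaled(sizes, limit=500, temperature=2.0)
```

Raising the temperature gives the smallest task a larger share than it gets under purely proportional sampling, which is the trade-off these strategies are meant to control.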

Log-log plots showing power-law scaling of ChemGPT validation loss versus model size and GNN force field loss versus dataset size

Neural Scaling of Deep Chemical Models

Frey et al. discover empirical power-law scaling relations for both chemical language models (ChemGPT, up to 1B parameters) and equivariant GNN interatomic potentials, finding that neither domain has saturated with respect to model size, data, or compute.
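
A power-law relation appears as a straight line on a log-log plot, so its exponent can be estimated with a linear fit in log space. The sketch below assumes the simplest form, loss = a * x^(-alpha) with no irreducible-loss term, and uses synthetic data rather than any numbers from the paper.

```python
import numpy as np

def fit_power_law(x, loss):
    # Linear regression of log(loss) on log(x): the slope is -alpha and
    # the intercept is log(a).
    slope, intercept = np.polyfit(np.log(x), np.log(loss), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic example: losses generated from a known power law.
model_sizes = np.array([1e6, 1e7, 1e8, 1e9])
synthetic_loss = 5.0 * model_sizes ** -0.1
a, alpha = fit_power_law(model_sizes, synthetic_loss)
```

On real measurements the fit would be noisier, and a saturating trend would show up as curvature away from the fitted line.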

Three-panel diagram showing an original sequence, its time-warped version, and the gate values derived from requiring time warping invariance

Can Recurrent Neural Networks Warp Time? (ICLR 2018)

Tallec and Ollivier show that requiring invariance to time transformations in recurrent models leads to gating mechanisms, recovering key LSTM components from first principles. They propose the chrono initialization for gate biases that improves learning of long-term dependencies.
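
A minimal sketch of chrono initialization for LSTM gate biases: forget-gate biases are drawn as the log of a uniform sample between 1 and T_max - 1, and input-gate biases are set to their negatives. Here `t_max` is the longest dependency range the model is expected to capture; the function name and the `seed` argument are illustrative.

```python
import numpy as np

def chrono_init(hidden_size, t_max, seed=0):
    rng = np.random.default_rng(seed)
    # Forget-gate bias: log of a characteristic timescale sampled
    # uniformly from [1, t_max - 1].
    forget_bias = np.log(rng.uniform(1.0, t_max - 1.0, size=hidden_size))
    # Input-gate bias: the negative of the forget-gate bias.
    input_bias = -forget_bias
    return forget_bias, input_bias

forget_bias, input_bias = chrono_init(hidden_size=8, t_max=100)
```

Larger forget-gate biases push the gate toward 1, so units initialized this way start out retaining information over a spread of timescales up to roughly `t_max`.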

Graph network block diagram showing input graph transformed through edge, node, and global update steps to produce an updated graph

Relational Inductive Biases in Deep Learning (2018)

Battaglia et al. argue that combinatorial generalization requires structured representations, systematically analyze the relational inductive biases in standard deep learning architectures (MLPs, CNNs, RNNs), and present the graph network as a unifying framework that generalizes and extends prior graph neural network approaches.
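
The edge, node, and global update steps of a graph network block can be sketched in a few lines. This toy version assumes all features share one dimension, uses sum aggregation, and substitutes a fixed `tanh` of summed inputs for the learned update functions, so it illustrates the data flow rather than a trainable model.

```python
import numpy as np

def gn_block(nodes, edges, senders, receivers, global_attr):
    # 1. Edge update: each edge sees its own features, its sender and
    #    receiver nodes, and the global attribute.
    new_edges = np.tanh(edges + nodes[senders] + nodes[receivers] + global_attr)
    # 2. Node update: sum updated edges arriving at each node, then
    #    combine with the node's own features and the global attribute.
    incoming = np.zeros_like(nodes)
    np.add.at(incoming, receivers, new_edges)
    new_nodes = np.tanh(nodes + incoming + global_attr)
    # 3. Global update: aggregate all updated edges and nodes.
    new_global = np.tanh(global_attr + new_edges.sum(axis=0) + new_nodes.sum(axis=0))
    return new_nodes, new_edges, new_global

# Toy graph: three nodes in a chain 0 -> 1 -> 2, feature dimension 2.
nodes = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
edges = np.zeros((2, 2))
senders = np.array([0, 1])
receivers = np.array([1, 2])
global_attr = np.zeros(2)
new_nodes, new_edges, new_global = gn_block(nodes, edges, senders, receivers, global_attr)
```

The output has the same shapes as the input graph, which is what allows such blocks to be stacked or applied repeatedly.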

Log-log plot comparing scaling laws across six architectures showing the vanilla Transformer has the steepest slope

Scaling Laws vs Model Architectures: Inductive Bias

Tay et al. systematically compare scaling laws across ten diverse architectures (Transformers, Switch Transformers, Performers, MLP-Mixers, and others), finding that the vanilla Transformer has the best scaling coefficient and that the best-performing architecture changes across compute regions.

SE(3)-Transformer architecture showing invariant attention weights modulating equivariant value messages on a 3D point cloud

SE(3)-Transformers: Equivariant Attention for 3D Data

Fuchs et al. introduce the SE(3)-Transformer, which combines self-attention with SE(3)-equivariance for 3D point clouds and graphs. Invariant attention weights modulate equivariant value messages from tensor field networks, resolving angular filter constraints while enabling data-adaptive, anisotropic processing.

Comparison of planar CNN (translation only) versus spherical CNN (SO(3)-equivariant) showing how filters rotate on the sphere

Spherical CNNs: Rotation-Equivariant Networks on the Sphere

Cohen et al. introduce Spherical CNNs that achieve SO(3)-equivariance by defining cross-correlation on the sphere and rotation group, computed efficiently via generalized FFT algorithms from non-commutative harmonic analysis.

The three quarks of attention: multiplexing (additive), output gating (multiplicative output), and synaptic gating (multiplicative weight)

The Quarks of Attention: Building Blocks of Attention

Baldi and Vershynin systematically classify the fundamental building blocks of attention (activation attention, output gating, synaptic gating) by source, target, and mechanism, then prove capacity bounds showing that gating introduces quadratic terms sparsely, gaining expressiveness without the full cost of polynomial activations.