Language Models

Research notes on language model architectures and pretraining methods, covering transformer variants, data mixing strategies, and scaling laws.

Year	Paper	Key Idea
2020	T5: Exploring Transfer Learning Limits	Unified text-to-text framework with systematic ablation of NLP transfer
2022	Block-Recurrent Transformers	Recurrence over token blocks for linear-complexity long-sequence modeling
2023	DoReMi	Proxy-model DRO to learn optimal domain weights for LM pretraining
2023	RWKV	Transformer-level quality with linear-time, constant-memory RNN inference
2023	Scaling Data-Constrained Language Models	Scaling laws for repeated data: up to 4 epochs cause negligible loss
2023	SlimPajama-DC	Global deduplication and domain diversity improve LLM training
2025	Data Mixing Laws for LM Pretraining	Exponential law for loss over domain mixtures enables cheap optimization

Machine Learning Fundamentals

Diagram showing the three-step nested pipeline from small-scale training to large-model loss prediction across data mixtures

Data Mixing Laws for LM Pretraining Optimization

Ye et al. find that language model loss on each domain follows an exponential function of training mixture proportions. By nesting data mixing laws with scaling laws for steps and model size, small-scale experiments can predict and optimize mixtures for large models, achieving 48% training efficiency gains.

Machine Learning Fundamentals

Bar chart comparing baseline and DoReMi domain weights across 12 Pile domains, showing Pile-CC upweighted 5.4x

DoReMi: Optimizing Data Mixtures for LM Pretraining

Xie et al. propose DoReMi, which trains a 280M proxy model using Group DRO to find optimal domain mixture weights, then uses those weights to train an 8B model 2.6x faster with 6.5% better downstream accuracy.

Machine Learning Fundamentals

Chart showing effective data as a function of epochs with exponential decay, with the 4-epoch safe zone and 16-epoch half-life marked

Scaling Data-Constrained Language Models

Muennighoff et al. train 400+ models to study how data repetition affects scaling. They propose a data-constrained scaling law with exponential decay for repeated tokens, finding that up to 4 epochs have negligible impact on loss, returns diminish around 16 epochs, and code augmentation provides a 2x effective data boost.

Machine Learning Fundamentals

Bar chart comparing average benchmark accuracy across seven domain combination configurations showing diversity improves performance

SlimPajama-DC: Data Combinations for LLM Training

Shen et al. empirically analyze how different domain combinations and deduplication strategies in the SlimPajama dataset affect 1.3B model performance. Global deduplication across sources outperforms local deduplication, and increasing domain diversity consistently improves average accuracy, with findings transferring to 7B scale.

Machine Learning Fundamentals

Table comparing multi-task mixing strategies showing examples-proportional and temperature-scaled mixing results

T5: Exploring Transfer Learning Limits

Raffel et al. introduce T5, a unified text-to-text framework for NLP transfer learning. Through systematic ablation of architectures, pre-training objectives, datasets, and multi-task mixing strategies, they identify best practices and scale to 11B parameters, achieving state-of-the-art results across multiple benchmarks.

Machine Learning

Diagram showing block-recurrent transformer architecture with vertical and horizontal processing directions

Block-Recurrent Transformers for Long Sequences

A transformer architecture that applies a recurrent cell over blocks of tokens, achieving linear complexity in sequence length while outperforming Transformer-XL baselines on PG19, arXiv, and GitHub datasets.

Machine Learning

Diagram comparing RWKV inference complexity against Transformers and efficient variants

RWKV: Linear-Cost RNN with Transformer Training

RWKV is a novel sequence model that achieves transformer-level performance while maintaining linear time and constant memory complexity during inference, scaled up to 14 billion parameters.