<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Language Models on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/</link><description>Recent content in Language Models on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 09 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/natural-language-processing/language-models/index.xml" rel="self" type="application/rss+xml"/><item><title>T5: Exploring Transfer Learning Limits</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/</guid><description>Raffel et al. systematically study transfer learning for NLP with a text-to-text framework, ablating architectures, objectives, data, and multi-task mixing.</description><content:encoded><![CDATA[<h2 id="a-systematic-study-of-nlp-transfer-learning">A systematic study of NLP transfer learning</h2>
<p>This is a <strong>systematization paper</strong> that provides a comprehensive empirical survey of transfer learning techniques for NLP. Rather than proposing a single new method, T5 introduces a unified text-to-text framework and uses it as a testbed to systematically compare pre-training objectives, architectures, unlabeled data sources, transfer approaches, and multi-task mixing strategies. The scale of the ablation study (covering dozens of configurations) and the release of C4, pre-trained models, and code make it both a reference guide and a resource.</p>
<h2 id="unifying-nlp-tasks-as-text-to-text">Unifying NLP tasks as text-to-text</h2>
<p>The core design decision is to cast every NLP task as a text-to-text problem: both the input and output are text strings, with a task-specific prefix. Classification, regression, summarization, translation, and question answering all use the same model, loss function (cross-entropy on output tokens), and decoding procedure. This simplicity enables fair comparison across tasks and training strategies.</p>
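<p>As a concrete illustration, the shared interface can be sketched as below. This is a hedged sketch: the prefix strings ("translate English to German:", "cola sentence:", "summarize:") are from the paper, but the function and example fields are invented for this note.</p>

```python
# Hedged sketch of the text-to-text format: every task becomes
# (prefixed input string, target string). Prefixes are from the paper;
# the function and example dicts are illustrative.
def to_text_to_text(task, example):
    if task == "wmt_en_de":
        return "translate English to German: " + example["en"], example["de"]
    if task == "cola":
        # Classification targets are literal label strings, decoded as text.
        return "cola sentence: " + example["sentence"], example["label"]
    if task == "cnn_dm":
        return "summarize: " + example["article"], example["summary"]
    raise ValueError(f"unknown task: {task}")
```

<p>Every task then shares the same cross-entropy loss over output tokens; even STS-B regression fits this mold by rounding scores to the nearest 0.2 and emitting them as strings.</p>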
<p>The model architecture is a standard encoder-decoder Transformer. The paper finds that this form outperforms decoder-only (language model) and encoder-only (BERT-style) variants in the text-to-text setting, while having similar computational cost to decoder-only models despite twice the parameters (each token passes through only the encoder stack or only the decoder stack, so doubling the parameters does not double the compute per token).</p>
<h2 id="multi-task-mixing-strategies-and-findings">Multi-task mixing: strategies and findings</h2>
<p>The most thesis-relevant contribution is the systematic ablation of multi-task mixing strategies (Section 3.5.2). When training on multiple tasks simultaneously (which in the text-to-text framework simply means mixing data from different sources), the central question is how to set the proportion of data from each task.</p>
<h3 id="three-mixing-strategies">Three mixing strategies</h3>
<p><strong>Examples-proportional mixing.</strong> Sample in proportion to each dataset&rsquo;s size, with an artificial cap $K$ on the maximum dataset size. Without the cap, the unsupervised pre-training data (orders of magnitude larger) would dominate all batches. The mixing rate for task $m$ is:</p>
<p>$$
r_{m} = \frac{\min(e_{m}, K)}{\sum_{n} \min(e_{n}, K)}
$$</p>
<p>where $e_{m}$ is the number of examples in task $m$&rsquo;s dataset.</p>
<p><strong>Temperature-scaled mixing.</strong> Raise each mixing rate $r_{m}$ to the power $1/T$ and renormalize. At $T=1$ this equals examples-proportional mixing; as $T$ increases, proportions approach equal mixing. Uses a large cap $K = 2^{21}$.</p>
<p><strong>Equal mixing.</strong> Sample uniformly from all tasks. Included as a negative reference: the model overfits on low-resource tasks and underfits on high-resource tasks.</p>
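<p>The three strategies can be sketched in a few lines (a minimal illustration of the formulas above; the dataset sizes and $K$ values used in the comments are arbitrary, not the paper's experimental settings):</p>

```python
# Sketch of the three mixing strategies from the T5 ablation.
def examples_proportional(sizes, K):
    # r_m = min(e_m, K) / sum_n min(e_n, K); the cap K keeps huge
    # datasets from dominating every batch.
    capped = [min(e, K) for e in sizes]
    total = sum(capped)
    return [c / total for c in capped]

def temperature_scaled(sizes, T, K=2 ** 21):
    # Raise each rate to 1/T and renormalize; T=1 recovers
    # examples-proportional, large T approaches equal mixing.
    rates = examples_proportional(sizes, K)
    scaled = [r ** (1.0 / T) for r in rates]
    total = sum(scaled)
    return [s / total for s in scaled]

def equal_mixing(sizes):
    # Uniform over tasks, regardless of dataset size.
    return [1.0 / len(sizes)] * len(sizes)
```
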
<h3 id="results">Results</h3>
<table>
  <thead>
      <tr>
          <th>Mixing strategy</th>
          <th>GLUE</th>
          <th>CNN/DM</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
          <th>EnDe</th>
          <th>EnFr</th>
          <th>EnRo</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Baseline (pre-train/fine-tune)</td>
          <td>83.28</td>
          <td>19.24</td>
          <td>80.88</td>
          <td>71.36</td>
          <td>26.98</td>
          <td>39.82</td>
          <td>27.65</td>
      </tr>
      <tr>
          <td>Equal</td>
          <td>76.13</td>
          <td>19.02</td>
          <td>76.51</td>
          <td>63.37</td>
          <td>23.89</td>
          <td>34.31</td>
          <td>26.78</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{18}$</td>
          <td>81.67</td>
          <td>19.07</td>
          <td>78.17</td>
          <td>67.94</td>
          <td>24.57</td>
          <td>35.19</td>
          <td>27.39</td>
      </tr>
      <tr>
          <td>Examples-proportional, $K=2^{19}$</td>
          <td>81.42</td>
          <td>19.24</td>
          <td>79.78</td>
          <td>67.30</td>
          <td>25.21</td>
          <td>36.30</td>
          <td>27.76</td>
      </tr>
      <tr>
          <td>Temperature-scaled, $T=2$</td>
          <td>81.90</td>
          <td>19.28</td>
          <td>79.42</td>
          <td>69.92</td>
          <td>25.42</td>
          <td>36.72</td>
          <td>27.20</td>
      </tr>
  </tbody>
</table>
<p><strong>Key findings on mixing:</strong></p>
<ol>
<li>
<p><strong>Multi-task training underperforms pre-train-then-fine-tune on most tasks.</strong> No mixing strategy matches the baseline of unsupervised pre-training followed by task-specific fine-tuning.</p>
</li>
<li>
<p><strong>Equal mixing is worst.</strong> It dramatically degrades performance, confirming that proportions matter.</p>
</li>
<li>
<p><strong>There exists a task-specific sweet spot for the cap $K$.</strong> Most tasks have an optimal $K$ value; larger or smaller values hurt. The exception is very high-resource tasks (WMT English-French) that always benefit from higher mixing proportions.</p>
</li>
<li>
<p><strong>Temperature scaling at $T=2$ provides the best single compromise.</strong> It achieves reasonable performance across all tasks without requiring per-task tuning of $K$.</p>
</li>
<li>
<p><strong>Multi-task pre-training followed by fine-tuning closes the gap.</strong> When multi-task training is used as pre-training (not as the final training stage), followed by task-specific fine-tuning, performance becomes comparable to unsupervised pre-training alone. This suggests that multi-task exposure during pre-training provides useful early signal without the negative effects of forcing a single model to perform all tasks simultaneously.</p>
</li>
<li>
<p><strong>&ldquo;Leave-one-out&rdquo; training works.</strong> Pre-training on a multi-task mixture that excludes a target task, then fine-tuning on it, produces only slightly worse results. This indicates that multi-task pre-training builds general capabilities that transfer to unseen tasks without dramatic task interference.</p>
</li>
</ol>
<h2 id="data-repetition-degrades-performance">Data repetition degrades performance</h2>
<p>The paper also systematically tests the effect of pre-training dataset size by truncating C4 and training repeatedly over the smaller data:</p>
<table>
  <thead>
      <tr>
          <th>Unique tokens</th>
          <th>Repeats</th>
          <th>GLUE</th>
          <th>SQuAD</th>
          <th>SuperGLUE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Full dataset</td>
          <td>0</td>
          <td>83.28</td>
          <td>80.88</td>
          <td>71.36</td>
      </tr>
      <tr>
          <td>$2^{29}$</td>
          <td>64</td>
          <td>82.87</td>
          <td>80.97</td>
          <td>72.03</td>
      </tr>
      <tr>
          <td>$2^{27}$</td>
          <td>256</td>
          <td>82.62</td>
          <td>79.78</td>
          <td>69.97</td>
      </tr>
      <tr>
          <td>$2^{25}$</td>
          <td>1,024</td>
          <td>79.55</td>
          <td>76.27</td>
          <td>64.76</td>
      </tr>
      <tr>
          <td>$2^{23}$</td>
          <td>4,096</td>
          <td>76.34</td>
          <td>70.92</td>
          <td>59.29</td>
      </tr>
  </tbody>
</table>
<p>Performance degrades as data shrinks, with 64 repeats showing limited effects but 1,024+ repeats causing significant degradation. Training loss curves confirm memorization at high repetition counts. The paper recommends using large, diverse pre-training datasets whenever possible.</p>
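<p>The repeat counts follow from dividing a fixed pre-training budget by the truncated dataset size. A small sanity check, assuming the baseline budget of $2^{35}$ tokens ($2^{19}$ steps at $2^{16}$ tokens per batch, per the paper's baseline setup):</p>

```python
# Repeats = total pre-training tokens / unique tokens. The 2^35 budget
# (2^19 steps x 2^16 tokens per batch) is the assumed baseline setup.
TOTAL_TOKENS = 2 ** 19 * 2 ** 16  # = 2^35, about 34B tokens

for unique_exp in (29, 27, 25, 23):
    repeats = TOTAL_TOKENS // 2 ** unique_exp
    print(f"2^{unique_exp} unique tokens -> {repeats} repeats")
# -> 64, 256, 1024, 4096, matching the table above
```
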
<h2 id="scaling-and-final-configuration">Scaling and final configuration</h2>
<p>The paper compares scaling strategies: more data, larger models, and ensembles. Training a larger model for fewer steps generally outperforms training a smaller model on more data. Ensembles of independently pre-trained and fine-tuned models provide orthogonal gains.</p>
<p>The final T5-11B model combines the best choices from all ablations: encoder-decoder architecture, span corruption objective, C4 pre-training data, multi-task pre-training followed by fine-tuning, and scaling to 11B parameters trained on over 1 trillion tokens. It achieves state-of-the-art results on GLUE (90.3 average), SuperGLUE (88.9, near human performance of 89.8), SQuAD, and CNN/Daily Mail. It does not achieve state-of-the-art on WMT translation tasks, where methods using backtranslation and cross-lingual pre-training retain the lead.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The T5 paper&rsquo;s multi-task mixing findings are its most enduring contribution beyond the model itself. The core lessons: proportions matter enormously (equal mixing fails), examples-proportional mixing with a cap is a reasonable default, temperature scaling provides a single-knob alternative, and multi-task pre-training followed by fine-tuning can match pure unsupervised pre-training.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>All ablations use the same encoder-decoder architecture. Findings may not transfer to decoder-only models that dominate current practice.</li>
<li>The multi-task mixing experiments treat each task as a separate &ldquo;domain.&rdquo; Interactions between similar tasks (e.g., multiple classification tasks) are not isolated.</li>
<li>The paper does not provide a principled method for choosing $K$ or $T$; both require empirical search.</li>
<li>C4 has known quality issues (templated text, noisy content) that have been addressed in later datasets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, pre-trained models, and the C4 dataset are all publicly released.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>C4 (Colossal Clean Crawled Corpus)</td>
          <td>~750 GB</td>
          <td>Heuristically cleaned Common Crawl</td>
      </tr>
      <tr>
          <td>Downstream</td>
          <td>GLUE, SuperGLUE, SQuAD, CNN/DM, WMT (EnDe, EnFr, EnRo)</td>
          <td>Standard splits</td>
          <td>Text-to-text format</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>Encoder-decoder Transformer. Sizes: Small (60M), Base (220M), Large (770M), 3B, 11B. Baseline uses Base size. SentencePiece vocabulary with 32K tokens. Pre-trained for $2^{19}$ steps, fine-tuned for $2^{18}$ steps on individual tasks.</p>
<h3 id="algorithms">Algorithms</h3>
<p>Multi-task mixing: examples-proportional with cap $K \in \{2^{16}, \ldots, 2^{21}\}$, temperature-scaled with $T \in \{2, 4, 8\}$, and equal mixing. Unsupervised objective: span corruption (mean span length 3, 15% corruption rate). Training with Adafactor optimizer, inverse square root learning rate schedule.</p>
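<p>A simplified sketch of the span-corruption objective (15% corruption rate, mean span length 3). T5's actual span sampling and sentinel-token vocabulary handling differ in detail; this illustrates only the input/target construction:</p>

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Mask contiguous spans, replace each with a sentinel in the input,
    and build a target that spells the masked spans back out."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = max(1, round(n * corruption_rate))        # tokens to mask
    num_spans = max(1, round(budget / mean_span_len))  # spans to draw
    starts = sorted(rng.sample(range(n), num_spans))
    spans, prev_end = [], 0
    for s in starts:
        if s < prev_end or budget <= 0:
            continue  # skip overlapping starts (simplified sampling)
        length = min(budget, mean_span_len, n - s)
        spans.append((s, s + length))
        prev_end = s + length
        budget -= length
    inputs, targets, cursor = [], [], 0
    for i, (s, e) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:s])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[s:e])
        cursor = e
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # end-of-targets sentinel
    return inputs, targets
```
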
<h3 id="hardware">Hardware</h3>
<p>All models trained using Mesh TensorFlow on TPU slices. T5-11B pre-trained for 1M steps with batch size $2^{11}$ sequences of length 512 (~1 trillion tokens total). Exact TPU pod configurations per experiment not detailed.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer">T5 Code</a></td>
          <td>Code</td>
          <td>Apache 2.0</td>
          <td>Official TensorFlow implementation (JAX successor: T5X)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/google-research/text-to-text-transfer-transformer#released-model-checkpoints">T5 Models</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Pre-trained checkpoints (Small through 11B)</td>
      </tr>
      <tr>
          <td><a href="https://www.tensorflow.org/datasets/catalog/c4">C4 Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>~750 GB cleaned Common Crawl, via TensorFlow Datasets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{raffel2020exploring,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Raffel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{21}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{140}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1--67}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SlimPajama-DC: Data Combinations for LLM Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/slimpajama-dc-data-combinations/</guid><description>Shen et al. study how global deduplication and domain combinations in SlimPajama affect LLM training, finding diversity after dedup is key.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-data-domain-combinations">An empirical study of data domain combinations</h2>
<p>This is a <strong>discovery paper</strong> that empirically investigates how different combinations and proportions of data domains affect language model pretraining. Using the SlimPajama dataset (a globally deduplicated, 627B token refinement of RedPajama), the study trains seven 1.3B model configurations with varying domain mixtures to identify which combinations and deduplication strategies produce the best downstream performance.</p>
<h2 id="why-data-combination-strategy-matters">Why data combination strategy matters</h2>
<p>Multi-source pretraining datasets combine data from web crawls, code repositories, books, academic papers, and other sources. Two underexplored questions drive this work: (1) Does deduplication within each source (local) versus across all sources (global) meaningfully affect model quality? (2) When sources are thoroughly deduplicated, how does the combination and proportion of domains affect downstream performance? Most open-source LLM training datasets (RedPajama, The Pile) perform only local deduplication, leaving cross-source redundancy unaddressed.</p>
<h2 id="global-deduplication-and-the-slimpajama-dataset">Global deduplication and the SlimPajama dataset</h2>
<p>SlimPajama applies global MinHashLSH deduplication (Jaccard similarity threshold 0.8, 13-gram signatures) across all seven data sources simultaneously. This reduces RedPajama&rsquo;s 1.2T tokens to 627B tokens, a roughly 48% reduction. The heaviest deduplication hits CommonCrawl and GitHub, which had the most cross-source overlap.</p>
<p>The key processing steps:</p>
<ol>
<li><strong>Low-length document filtering</strong>: Remove documents below a minimum length threshold.</li>
<li><strong>Global deduplication</strong>: MinHashLSH across all sources simultaneously, requiring 64 CPU cores and 1.4TB peak memory. This removes both within-source and between-source duplicates.</li>
</ol>
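<p>A toy sketch of the MinHash idea behind step 2 (not the production pipeline: salted MD5 stands in for random hash permutations, and a real implementation buckets signatures with banded LSH rather than comparing all pairs):</p>

```python
import hashlib

def shingles(text, n=13):
    """13-word shingles, matching the signature granularity described above."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(shingle_set, num_perm=64):
    """One minimum per simulated permutation; equal minima between two
    documents occur with probability equal to their Jaccard similarity."""
    return [
        min(int(hashlib.md5(f"{p}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for p in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

<p>Documents whose estimated similarity exceeds the 0.8 threshold would be treated as duplicates and collapsed to a single copy.</p>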
<p>The resulting dataset composition:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>SlimPajama</th>
          <th>RedPajama</th>
          <th>LLaMA 1</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>52.2% (333B)</td>
          <td>72.6% (878B)</td>
          <td>67.0%</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>26.7% (170B)</td>
          <td>14.4% (175B)</td>
          <td>15.0%</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>5.2% (33B)</td>
          <td>4.9% (59B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>4.2% (27B)</td>
          <td>2.1% (26B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>4.6% (29B)</td>
          <td>2.3% (28B)</td>
          <td>2.5%</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>3.8% (24B)</td>
          <td>2.0% (24B)</td>
          <td>4.5%</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>3.3% (21B)</td>
          <td>1.7% (20B)</td>
          <td>2.0%</td>
      </tr>
  </tbody>
</table>
<h2 id="seven-domain-combination-configurations">Seven domain combination configurations</h2>
<p>All configurations train 1.3B parameter models on 330B tokens with identical architecture and hyperparameters. The configurations systematically vary domain diversity:</p>
<ul>
<li><strong>DC-1</strong>: CommonCrawl only (single source)</li>
<li><strong>DC-2</strong>: CommonCrawl + C4 (two web sources)</li>
<li><strong>DC-3</strong>: CommonCrawl + C4 with adjusted proportions</li>
<li><strong>DC-4</strong>: Wikipedia + Books + GitHub + ArXiv + StackExchange (no web crawl)</li>
<li><strong>DC-5</strong>: CommonCrawl + C4 + Wikipedia + Books (four sources, no code/academic)</li>
<li><strong>DC-6</strong>: All seven SlimPajama sources (maximum diversity)</li>
<li><strong>DC-7</strong>: RefinedWeb CommonCrawl (external single-source baseline)</li>
</ul>
<p>The experimental design probes: incremental diversity (DC-1 to DC-2 to DC-5 to DC-6), proportion sensitivity (DC-2 vs DC-3), source importance (DC-3 vs DC-4), and specialization vs generalization (individual vs combined).</p>
<h2 id="diversity-after-global-deduplication-drives-performance">Diversity after global deduplication drives performance</h2>
<h3 id="hugging-face-leaderboard-results">Hugging Face leaderboard results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Average</th>
          <th>ARC</th>
          <th>HellaSwag</th>
          <th>MMLU</th>
          <th>TruthfulQA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>RedPajama-1.3B</td>
          <td>38.0</td>
          <td>37.2</td>
          <td>55.8</td>
          <td>24.9</td>
          <td>34.3</td>
      </tr>
      <tr>
          <td>DC-1 (CC only)</td>
          <td>38.5</td>
          <td>36.3</td>
          <td>56.0</td>
          <td>27.0</td>
          <td>34.8</td>
      </tr>
      <tr>
          <td>DC-4 (no web)</td>
          <td>37.6</td>
          <td>33.4</td>
          <td>53.3</td>
          <td>26.0</td>
          <td>37.6</td>
      </tr>
      <tr>
          <td>DC-6 (all sources)</td>
          <td>40.0</td>
          <td>33.7</td>
          <td>61.0</td>
          <td>26.9</td>
          <td>38.4</td>
      </tr>
      <tr>
          <td>DC-7 (RefinedWeb)</td>
          <td>41.0</td>
          <td>35.1</td>
          <td>64.7</td>
          <td>26.2</td>
          <td>37.9</td>
      </tr>
  </tbody>
</table>
<p><strong>Key patterns:</strong></p>
<ol>
<li>
<p><strong>More domain diversity improves average performance.</strong> The progression DC-1 (38.5) to DC-2 (38.4) to DC-5 (38.6) to DC-6 (40.0) shows that adding domains lifts average accuracy overall, though not strictly monotonically, once global deduplication has removed cross-source redundancy.</p>
</li>
<li>
<p><strong>Global deduplication enables clean combination.</strong> All SlimPajama configurations except DC-4 outperform RedPajama-1.3B (38.0), which uses local deduplication only. The elimination of cross-source overlap means adding sources contributes genuinely new information.</p>
</li>
<li>
<p><strong>Removing web crawl data hurts.</strong> DC-4 (no CommonCrawl/C4) scores lowest (37.6), demonstrating that web text provides essential breadth even when specialized sources are included.</p>
</li>
<li>
<p><strong>Individual domains excel at specific tasks.</strong> DC-1 (CC only) achieves the highest ARC and MMLU scores. DC-4 leads on Winogrande. DC-5 leads on WSC273. No single combination dominates all tasks, reinforcing that diversity trades specialization for generalization.</p>
</li>
<li>
<p><strong>Findings transfer to 7B scale.</strong> Insights from the best 1.3B configurations were applied to a 7B model trained with large batch sizes, achieving 63.4 average accuracy across the extended benchmark suite.</p>
</li>
</ol>
<h3 id="training-loss-patterns">Training loss patterns</h3>
<p>DC-6 (all sources) achieves the lowest training loss among SlimPajama configurations, consistent with the downstream results. DC-4 (no web crawl) shows the highest training loss, confirming that the large, diverse web crawl data is the most important single component.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The central finding is that <strong>diversity matters most after deduplication</strong>. When cross-source redundancy is removed, each additional source contributes genuinely new signal. Without global deduplication, adding sources may just increase redundancy without proportional benefit.</p>
<p><strong>Limitations:</strong></p>
<ul>
<li>Only seven fixed configurations are tested. No systematic search over continuous mixture proportions (contrast with <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> or <a href="/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/">Data Mixing Laws</a>).</li>
<li>The configurations are not independent: DC-6 includes all sources from DC-1 through DC-5, making it difficult to isolate the contribution of any single addition.</li>
<li>Only 1.3B and 7B scales tested. Whether the diversity benefit continues scaling is unverified.</li>
<li>English-only. Cross-lingual diversity effects are not studied.</li>
<li>The paper is a technical report without formal peer review.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> All 1.3B models and datasets are publicly released under MIT license on HuggingFace.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>SlimPajama</td>
          <td>627B tokens</td>
          <td>Globally deduplicated from 1.2T RedPajama</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>RefinedWeb</td>
          <td>600B tokens</td>
          <td>External CC-only baseline</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>HF Leaderboard (ARC, HellaSwag, MMLU, TruthfulQA)</td>
          <td>Standard</td>
          <td>4 benchmarks</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Extended suite</td>
          <td>12 additional benchmarks</td>
          <td>Zero and few-shot</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>1.3B parameter Cerebras-GPT architecture with ALiBi positional encoding and SwiGLU activation. All configurations trained on 330B tokens. 7B model trained with large batch-size (LBS) strategy on Cerebras 16x CS-2 cluster (80 PFLOP/s in bf16).</p>
<h3 id="hardware">Hardware</h3>
<p>Cerebras 16x CS-2 cluster, 80 PFLOP/s in bf16 mixed precision.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/MBZUAI-LLM/SlimPajama-DC">SlimPajama-DC Models</a></td>
          <td>Model</td>
          <td>MIT</td>
          <td>All 1.3B DC configurations (select via revision)</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/MBZUAI-LLM/SlimPajama-627B-DC">SlimPajama-627B-DC Dataset</a></td>
          <td>Dataset</td>
          <td>-</td>
          <td>Source-split version of SlimPajama-627B</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2023slimpajamadc,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{SlimPajama-DC: Understanding Data Combinations for LLM Training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Zhiqiang and Tao, Tianhua and Ma, Liqun and Neiswanger, Willie and Liu, Zhengzhong and Wang, Hongyi and Tan, Bowen and Hestness, Joel and Vassilieva, Natalia and Soboleva, Daria and Xing, Eric}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2309.10818}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Scaling Data-Constrained Language Models</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/scaling-data-constrained-language-models/</guid><description>Muennighoff et al. extend Chinchilla scaling laws to repeated data, finding up to 4 epochs cause negligible loss and 16 epochs mark diminishing returns.</description><content:encoded><![CDATA[<h2 id="an-empirical-study-of-scaling-under-data-constraints">An empirical study of scaling under data constraints</h2>
<p>This is a <strong>discovery paper</strong> that systematically investigates what happens when language models are trained for multiple epochs on repeated data. It extends the Chinchilla scaling laws to the data-constrained regime by proposing a new scaling formula that accounts for the diminishing value of repeated tokens, validated across 400+ training runs ranging from 10M to 9B parameters and up to 1500 epochs.</p>
<h2 id="running-out-of-unique-training-data">Running out of unique training data</h2>
<p>The Chinchilla scaling laws assume unlimited unique data: for a given compute budget, there exists an optimal balance of model parameters and training tokens. But extrapolating these laws to larger models implies data requirements that exceed what is available. Villalobos et al. estimated that high-quality English text would be exhausted by 2024 under Chinchilla-optimal scaling. Most prior large language models trained for a single epoch, and some work explicitly warned against data reuse. The Galactica models (trained for 4.25 epochs) showed that multi-epoch training could work, but no systematic study had quantified the tradeoff between repeated data and fresh data, or how to allocate compute optimally when data is finite.</p>
<h2 id="effective-data-with-exponential-decay-for-repetition">Effective data with exponential decay for repetition</h2>
<p>The paper generalizes the Chinchilla scaling law by replacing raw token count $D$ with an effective data term $D'$ that accounts for the diminishing value of repeated tokens:</p>
<p>$$
L(N, D) = \frac{A}{N'^{\alpha}} + \frac{B}{D'^{\beta}} + E
$$</p>
<p>where the effective data is:</p>
<p>$$
D' = U_{D} + U_{D} R_{D}^{*} \left(1 - e^{-R_{D}/R_{D}^{*}}\right)
$$</p>
<p>Here $U_{D}$ is the number of unique tokens, $R_{D}$ is the number of repetitions (epochs minus 1), and $R_{D}^{*}$ is a learned constant representing the &ldquo;half-life&rdquo; of data repetition. When $R_{D} = 0$ (single epoch), $D' = U_{D} = D$ and the formula reduces to standard Chinchilla. When $R_{D} \ll R_{D}^{*}$, repeated data is worth almost the same as fresh data. As $R_{D}$ grows large, the value of repeated tokens decays to zero, and $D'$ saturates at $U_{D}(1 + R_{D}^{*})$, meaning no amount of repetition can substitute for more than $R_{D}^{*}$ epochs&rsquo; worth of fresh data.</p>
<p>A symmetric formula handles excess parameters:</p>
<p>$$
N' = U_{N} + U_{N} R_{N}^{*} \left(1 - e^{-R_{N}/R_{N}^{*}}\right)
$$</p>
<p>where $U_{N}$ is the compute-optimal parameter count for $U_{D}$ unique tokens and $R_{N}$ measures how much the model exceeds that count. The fitted values are $R_{D}^{<em>} \approx 15.0$ (data repetition half-life at ~16 epochs) and $R_{N}^{</em>} \approx 5.3$ (excess parameters decay faster than repeated data).</p>
<h2 id="experiments-across-400-models">Experiments across 400+ models</h2>
<p><strong>Scale.</strong> Models from 10M to 9B parameters, trained for up to 1500 epochs. Three experimental protocols: fixed unique data (100M, 400M, 1.5B tokens), fixed FLOPs, and parametric fitting across all runs. Training on C4 (English web text) with GPT-2 architecture decoder-only transformers.</p>
<h3 id="resource-allocation-epochs-scale-faster-than-parameters">Resource allocation: epochs scale faster than parameters</h3>
<p>With fixed unique data, results show that more than 50% loss reduction is possible by training beyond one epoch and increasing model size beyond the single-epoch optimum. The data-constrained efficient frontier recommends allocating most additional compute to more epochs rather than more parameters, because excess parameters decay faster ($R_{N}^{*} &lt; R_{D}^{*}$). This contrasts with Chinchilla, which recommends scaling both equally.</p>
<p>A concrete validation: training the data-constrained compute-optimal model for $9.3 \times 10^{21}$ FLOPs with 25B unique tokens, the recommended allocation (27% fewer parameters, more epochs) achieves better loss and downstream performance than the Chinchilla-optimal allocation.</p>
<h3 id="resource-return-the-4-epoch-safe-zone-and-16-epoch-half-life">Resource return: the 4-epoch safe zone and 16-epoch half-life</h3>
<table>
  <thead>
      <tr>
          <th>Epochs</th>
          <th>Loss impact</th>
          <th>Downstream impact</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>1 (baseline)</td>
          <td>Optimal</td>
          <td>Optimal</td>
      </tr>
      <tr>
          <td>Up to 4</td>
          <td>Negligible (+0.5% loss)</td>
          <td>No significant difference</td>
      </tr>
      <tr>
          <td>~16 ($R_{D}^{*}$)</td>
          <td>Diminishing returns begin sharply</td>
          <td>Measurable degradation</td>
      </tr>
      <tr>
          <td>Beyond 16</td>
          <td>Returns decay to near zero</td>
          <td>Significant degradation</td>
      </tr>
      <tr>
          <td>Extreme (44+)</td>
          <td>Training can diverge</td>
          <td>Failure</td>
      </tr>
  </tbody>
</table>
<p>The 8.7B parameter model trained for 4 epochs ($D_{C} = 44$B unique tokens) finishes with only 0.5% higher validation loss than the single-epoch model ($D_{C} = 178$B unique tokens). At the half-life ($R_{D} = R_{D}^{*}$, roughly 16 epochs), the cumulative repetition term $1 - e^{-R_{D}/R_{D}^{*}}$ has reached $1 - 1/e \approx 63\%$ of its asymptote, and a marginal repeated token is worth only $e^{-1} \approx 37\%$ of a fresh one.</p>
<h3 id="complementary-strategies-code-augmentation-and-filtering">Complementary strategies: code augmentation and filtering</h3>
<p>When data is limited, two strategies can extend the effective dataset:</p>
<p><strong>Code augmentation.</strong> Mixing Python code from The Stack with natural language data. Up to 50% code (42B tokens) shows no degradation on natural language benchmarks, effectively providing a 2x increase in useful training data. Some tasks (WebNLG generation, bAbI reasoning) actually improve with code, possibly because code trains long-range state-tracking capabilities.</p>
<p><strong>Filtering relaxation.</strong> Perplexity filtering (keeping the 25% lowest-perplexity samples) is effective on noisy datasets, but deduplication filtering does not improve downstream performance (though it may reduce memorization). The recommendation: reserve aggressive filtering for noisy data sources; for clean datasets, more data through reduced filtering is better than less data through strict filtering.</p>
<p><strong>Combined strategy</strong>: doubling available data with code and then repeating for 4 epochs yields 8x more training tokens with performance expected to match 8x more unique data.</p>
<h2 id="key-findings-and-limitations">Key findings and limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Multi-epoch training is beneficial, not harmful, up to moderate repetition counts.</li>
<li>The data-constrained scaling law accurately predicts loss under repetition using an exponential decay formulation.</li>
<li>Compute should be allocated to epochs faster than parameters when data is constrained.</li>
<li>Code augmentation and selective filtering extend effective data without quality degradation.</li>
</ul>
<p><strong>Limitations:</strong></p>
<ul>
<li>All experiments use the GPT-2 transformer architecture; applicability to other architectures or modalities is untested.</li>
<li>Only the entire dataset is repeated uniformly. Selectively repeating subsets (e.g., high-value data for more epochs) is not modeled.</li>
<li>Hyperparameter sensitivity (learning rate, dropout) to epoch count is unexplored. Higher learning rates may cause earlier onset of diminishing returns.</li>
<li>Focused on English text. Cross-lingual augmentation effects are not studied.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<p><strong>Status: Highly Reproducible.</strong> Code, models, datasets, and hyperparameters are all publicly released under Apache 2.0.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>C4 (English)</td>
          <td>Varies by experiment</td>
          <td>Fixed unique data: 100M, 400M, 1.5B tokens</td>
      </tr>
      <tr>
          <td>Code augmentation</td>
          <td>The Stack (Python)</td>
          <td>Up to 42B tokens</td>
          <td>Mixed with natural language</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>19 NL tasks</td>
          <td>Standard splits</td>
          <td>Zero to five-shot, 114 scores per model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data-constrained scaling law: $D' = U_{D} + U_{D} R_{D}^{*}(1 - e^{-R_{D}/R_{D}^{*}})$ with $R_{D}^{*} \approx 15.0$, $R_{N}^{*} \approx 5.3$. Fitted using the methodology of Hoffmann et al. (2022) adapted for the repetition terms. 400+ training runs used for fitting.</p>
<h3 id="models">Models</h3>
<p>GPT-2 architecture decoder-only transformers with GPT-2 tokenizer. Sizes: 10M to 8.7B parameters. Cosine learning rate schedule (max 2e-4, decay to 2e-5), Adam optimizer ($\beta_2 = 0.999$), dropout 0.1, weight decay 0.1, gradient clipping at 1.0. bfloat16 precision. Trained using Megatron-DeepSpeed.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Data-Constrained Optimal</th>
          <th>Chinchilla Optimal</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validation loss (9.3e21 FLOPs, 25B unique)</td>
          <td>Lower</td>
          <td>Higher</td>
          <td>27% fewer parameters</td>
      </tr>
      <tr>
          <td>Downstream (4 epochs vs 1)</td>
          <td>No significant difference</td>
          <td>Baseline</td>
          <td>8.7B params, 44B unique tokens</td>
      </tr>
      <tr>
          <td>Code augmentation (50% code)</td>
          <td>No NL degradation</td>
          <td>Baseline</td>
          <td>Some tasks improve</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Trained on the LUMI supercomputer (Finland) using AMD Instinct MI250X GPUs with data, tensor, and pipeline parallelism. Up to 256 GPUs (64 nodes) per run, with up to 2,200 nodes (~8,800 GPUs) used in parallel across all concurrent runs. Total compute: approximately 3 million GPU hours. The cluster runs on 100% renewable hydroelectric energy.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/huggingface/datablations">datablations</a></td>
          <td>Code + Models + Data</td>
          <td>Apache 2.0</td>
          <td>All 400+ models, datasets, and training code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/TurkuNLP/Megatron-DeepSpeed">Megatron-DeepSpeed fork</a></td>
          <td>Code</td>
          <td>-</td>
          <td>Training framework adapted for AMD ROCm</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{muennighoff2023scaling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Scaling Data-Constrained Language Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Muennighoff, Niklas and Rush, Alexander M. and Barak, Boaz and Le Scao, Teven and Piktus, Aleksandra and Tazi, Nouamane and Pyysalo, Sampo and Wolf, Thomas and Raffel, Colin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DoReMi: Optimizing Data Mixtures for LM Pretraining</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/</guid><description>DoReMi uses a small proxy model with distributionally robust optimization to learn domain weights that speed up large-scale language model pretraining by 2.6x.</description><content:encoded><![CDATA[<h2 id="a-method-for-automatic-domain-reweighting">A method for automatic domain reweighting</h2>
<p>This is a <strong>method paper</strong> that introduces Domain Reweighting with Minimax Optimization (DoReMi), an algorithm for automatically tuning the mixture proportions of pretraining data domains. Rather than relying on heuristics or expensive downstream-task-based tuning, DoReMi uses a small proxy model trained with <a href="https://en.wikipedia.org/wiki/Robust_optimization">group distributionally robust optimization (Group DRO)</a> to produce domain weights that transfer to much larger models.</p>
<h2 id="why-data-mixture-proportions-matter">Why data mixture proportions matter</h2>
<p>Language model pretraining datasets combine text from many domains: web crawls, Wikipedia, books, code, academic papers, and others. The mixture proportions (how much of each domain to include) significantly affect downstream performance, but existing approaches either set them by hand (<a href="https://en.wikipedia.org/wiki/The_Pile_(dataset)">The Pile</a> uses heuristic weights) or tune them against downstream tasks (GLaM/PaLM), which is expensive and risks overfitting to a specific evaluation set. No principled, task-agnostic method existed for determining mixture proportions.</p>
<h2 id="minimax-optimization-over-domain-excess-loss">Minimax optimization over domain excess loss</h2>
<p>DoReMi&rsquo;s core insight is to frame data mixture optimization as a minimax problem: find domain weights that minimize the worst-case excess loss across all domains. The algorithm has three steps.</p>
<p><strong>Step 1</strong>: Train a small reference model (280M parameters) on some default domain weights $\alpha_{\text{ref}}$ (e.g., proportional to raw token count).</p>
<p><strong>Step 2</strong>: Train a small proxy model $p_{\theta}$ using Group DRO, which solves the minimax objective:</p>
<p>$$
\min_{\theta} \max_{\alpha \in \Delta^{k}} \sum_{i=1}^{k} \alpha_{i} \cdot \frac{1}{\sum_{x \in D_{i}} |x|} \sum_{x \in D_{i}} \left[ \ell_{\theta}(x) - \ell_{\text{ref}}(x) \right]
$$</p>
<p>where $\ell_{\theta}(x) = -\log p_{\theta}(x)$ and $\ell_{\text{ref}}(x) = -\log p_{\text{ref}}(x)$. The excess loss $\ell_{\theta}(x) - \ell_{\text{ref}}(x)$ measures how much headroom the proxy has to improve on each example relative to the reference. The inner maximization upweights domains with high excess loss via exponentiated gradient ascent, while the outer minimization trains the proxy on those upweighted domains.</p>
<p>At each training step, the domain weights update as:</p>
<p>$$
\alpha_{t}' \leftarrow \alpha_{t-1} \exp(\eta \lambda_{t})
$$</p>
<p>where $\lambda_{t}[i]$ is the per-domain excess loss (clipped at zero), followed by renormalization and smoothing with a uniform component: $\alpha_{t} \leftarrow (1-c)\frac{\alpha_{t}'}{\sum_{i} \alpha_{t}'[i]} + cu$, with $c = 10^{-3}$.</p>
<p>The final domain weights are the average over all training steps: $\bar{\alpha} = \frac{1}{T}\sum_{t=1}^{T} \alpha_{t}$.</p>
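<p>A minimal sketch of this weight update, run on made-up per-domain excess losses (illustrative only; in the real algorithm $\lambda_{t}$ comes from minibatch losses of the proxy and reference models):</p>

```python
import numpy as np

def doremi_update(alpha_prev, excess_loss, eta=1.0, c=1e-3):
    """One exponentiated-gradient step on domain weights.

    alpha_prev: current weights over k domains (sums to 1).
    excess_loss: per-domain proxy-minus-reference loss.
    """
    lam = np.clip(excess_loss, 0.0, None)        # clip negative excess loss at zero
    alpha = alpha_prev * np.exp(eta * lam)       # exponentiated gradient ascent
    alpha = alpha / alpha.sum()                  # renormalize to the simplex
    k = len(alpha)
    return (1 - c) * alpha + c * np.ones(k) / k  # smooth with a uniform component

# The final mixture is the average of the weights over all steps.
alphas, alpha = [], np.ones(3) / 3
for lam in ([0.5, 0.1, 0.0], [0.3, 0.2, 0.0], [0.2, 0.2, 0.1]):
    alpha = doremi_update(alpha, np.array(lam))
    alphas.append(alpha)
alpha_bar = np.mean(alphas, axis=0)
```

<p>Domains with persistently high excess loss (here, domain 0) accumulate weight, while domains the proxy already matches the reference on decay toward the uniform floor set by $c$.</p>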
<p><strong>Step 3</strong>: Resample data according to $\bar{\alpha}$ and train the full-scale model using standard procedures.</p>
<p><strong>Iterated DoReMi</strong> extends this by running multiple rounds, using the previous round&rsquo;s optimized weights as the next round&rsquo;s reference weights. This converges within 3 rounds on the GLaM dataset.</p>
<h2 id="experiments-across-the-pile-and-glam-datasets">Experiments across The Pile and GLaM datasets</h2>
<p><strong>Datasets.</strong> The Pile (22 domains, 800GB) and the GLaM dataset (8 domains, also used for PaLM). On The Pile, baseline weights come from the dataset defaults. On GLaM, baseline weights are uniform, with downstream-tuned oracle weights available for comparison.</p>
<p><strong>Setup.</strong> Transformer decoder-only LMs trained with next-token prediction. All models use batch size 512 and sequence length 1024. Proxy and reference models are 280M parameters. Main models are 8B parameters (30x larger). Training runs: 200K steps (Pile) or 300K steps (GLaM). The domain weight optimization cost (training two 280M models) is 8% of the compute for the 8B main model.</p>
<p><strong>Evaluation.</strong> Per-domain held-out perplexity and one-shot generative accuracy on five tasks: TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, and LAMBADA.</p>
<h3 id="key-domain-weight-shifts">Key domain weight shifts</h3>
<p>On The Pile, DoReMi (280M) dramatically upweights diverse web text (Pile-CC: 0.112 to 0.606) while downweighting specialized domains like ArXiv (0.105 to 0.004), PubMed Central (0.107 to 0.005), and StackExchange (0.093 to 0.015). Smaller, underrepresented domains like YouTubeSubtitles and PhilPapers receive proportionally large increases.</p>
<h3 id="scaling-behavior">Scaling behavior</h3>
<p>DoReMi was tested with matched proxy/main model sizes (280M through 1B) and with varying proxy sizes (70M through 1B) feeding into an 8B main model.</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Speedup to baseline accuracy</th>
          <th>Downstream improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DoReMi (280M to 280M)</td>
          <td>4x</td>
          <td>+2% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (280M to 8B)</td>
          <td>2.6x</td>
          <td>+6.5% avg accuracy</td>
      </tr>
      <tr>
          <td>DoReMi (150M to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
      <tr>
          <td>DoReMi (1B to 8B)</td>
          <td>~2x</td>
          <td>Significant</td>
      </tr>
  </tbody>
</table>
<p>Improvements are consistent across all tested model scales (280M to 1B matched), with no sign of diminishing returns at larger sizes.</p>
<h2 id="perplexity-improves-everywhere-even-on-downweighted-domains">Perplexity improves everywhere, even on downweighted domains</h2>
<p>The most striking finding is that DoReMi improves perplexity on all 22 domains in The Pile, including domains it downweights. The proposed explanation: the lowest-entropy domains need few samples to learn (they&rsquo;re statistically simple), while the highest-entropy domains have token distributions close to the uniform initialization and also need fewer samples. Reallocating weight to medium-entropy domains generates positive transfer that lifts all domains.</p>
<p>On The Pile, DoReMi reaches the baseline&rsquo;s downstream accuracy in 75K steps versus 200K for the baseline (2.6x speedup) and achieves a 6.5% absolute improvement in average one-shot accuracy at 200K steps.</p>
<p>On the GLaM dataset, iterated DoReMi (round 2) matches the performance of domain weights that were tuned directly on downstream task performance, despite having no knowledge of downstream tasks. Domain weights converge within 3 iterations.</p>
<h3 id="ablations">Ablations</h3>
<p>Using only the proxy model&rsquo;s loss (prefer hardest domains) or only the negative reference loss (prefer easiest domains) both underperform the full excess loss formulation. Both components are necessary: the excess loss identifies domains where the proxy has room to improve relative to what is learnable.</p>
<p>The proxy model itself typically underperforms the main model trained on its weights, and this gap grows at larger proxy scales. A 1B proxy model underperforms the 1B baseline, yet its domain weights still improve 1B main model training by over 2x. This suggests the domain weight signal is robust even when the proxy model itself is not well-trained.</p>
<h3 id="limitations">Limitations</h3>
<p>The domain weight landscape may have multiple local optima: a 280M proxy puts most weight on Pile-CC, while a 1B proxy favors OpenWebText2 instead. Both configurations improve over baseline, but the optimal weights are not unique.</p>
<p>The granularity of &ldquo;domains&rdquo; matters. DoReMi works better with more domains (22 on The Pile versus 8 on GLaM). Domains are defined by data provenance, which is coarse-grained. Fine-grained domain definitions (e.g., via clustering) could improve results but also risk DRO putting all weight on a small set of worst-case examples.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>800 GB, 22 domains</td>
          <td>Default heuristic weights as baseline</td>
      </tr>
      <tr>
          <td>Pretraining</td>
          <td>GLaM dataset</td>
          <td>8 domains</td>
          <td>Uniform weights as baseline; downstream-tuned oracle available</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>TriviaQA, NaturalQuestions, WebQuestions, SQuADv2, LAMBADA</td>
          <td>Standard splits</td>
          <td>One-shot generative evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Group DRO with exponentiated gradient ascent for domain weight updates. Step size $\eta = 1$, smoothing $c = 10^{-3}$. Per-token excess loss clipped at zero. Domain weights averaged over all training steps. Iterated DoReMi converges when $|\bar{\alpha} - \alpha_{\text{ref}}|_{\infty} &lt; 10^{-3}$.</p>
<h3 id="models">Models</h3>
<p>Vanilla Transformer decoder-only models with 256K vocabulary. Sizes: 70M (3 layers), 150M (6 layers), 280M (12 layers), 510M (12 layers), 760M (12 layers), 1B (16 layers), 8B (32 layers). All use 64-dim attention heads except 8B (128-dim).</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>DoReMi (280M to 8B)</th>
          <th>Baseline (8B)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Avg one-shot accuracy</td>
          <td>+6.5% over baseline</td>
          <td>Reference</td>
          <td>5 generative tasks</td>
      </tr>
      <tr>
          <td>Worst-case log-perplexity</td>
          <td>1.46</td>
          <td>1.71</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Avg log-perplexity</td>
          <td>1.40</td>
          <td>1.64</td>
          <td>Across 22 Pile domains</td>
      </tr>
      <tr>
          <td>Domains beating baseline</td>
          <td>22/22</td>
          <td>0/22</td>
          <td>Per-domain perplexity</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Proxy and reference models (under 1B) trained on TPUv3. Models at 1B and 8B trained on TPUv4. Domain weight optimization (two 280M runs) costs 8% of 8B training FLOPs.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{xie2023doremi,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Xie, Sang Michael and Pham, Hieu and Dong, Xuanyi and Du, Nan and Liu, Hanxiao and Lu, Yifeng and Liang, Percy and Le, Quoc V. and Ma, Tengyu and Yu, Adams Wei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{36}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Data Mixing Laws for LM Pretraining Optimization</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</link><pubDate>Wed, 08 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/data-mixing-laws-pretraining/</guid><description>Ye et al. discover that LM loss follows an exponential law over domain mixture proportions, enabling cheap prediction and optimization of data mixtures.</description><content:encoded><![CDATA[<h2 id="an-empirical-discovery-of-predictable-mixture-loss-relationships">An empirical discovery of predictable mixture-loss relationships</h2>
<p>This is a <strong>discovery paper</strong> that identifies a quantitative, functional relationship between pretraining data mixture proportions and language model loss. The key finding is that domain-specific validation loss follows an exponential law over the linear combination of training domain proportions, and this law composes with standard scaling laws to enable cheap prediction of large-model performance under arbitrary mixtures.</p>
<h2 id="the-missing-quantitative-link-between-data-mixtures-and-performance">The missing quantitative link between data mixtures and performance</h2>
<p>Pretraining data for large language models combines text from many domains (web, code, academic, books, etc.), and mixture proportions significantly affect model quality. Existing approaches either set proportions by hand without disclosed criteria (LLaMA, Baichuan) or use algorithmic methods like <a href="/notes/natural-language-processing/language-models/doremi-data-mixture-optimization/">DoReMi</a> that optimize qualitatively but cannot predict the quantitative effect of a specific mixture before training. Scaling laws exist for model size and data quantity, but no equivalent existed for mixture proportions. This paper fills that gap.</p>
<h2 id="the-exponential-data-mixing-law">The exponential data mixing law</h2>
<p>The core finding: for a model of fixed size trained for a fixed number of steps, the validation loss on domain $i$ as a function of the training mixture proportions $r_{1 \dots M}$ follows:</p>
<p>$$
L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right)
$$</p>
<p>where $c_{i}$, $k_{i}$, and $t_{ij}$ are fitted parameters. The constant $c_{i}$ represents the irreducible loss (not affected by mixture changes). The interaction coefficients $t_{ij}$ capture how training domain $j$ affects validation loss on domain $i$: negative $t_{ij}$ means domain $j$ helps domain $i$, positive means it hurts.</p>
<p>This was discovered progressively:</p>
<ol>
<li><strong>Two domains</strong>: Log-reducible-loss is linear in domain proportion (univariate exponential).</li>
<li><strong>Three domains</strong>: The exponential generalizes to a linear combination over all domain proportions (Eq. above), outperforming alternatives with comparable parameter count.</li>
<li><strong>General validation</strong>: For a validation set composed of $K$ domains with proportions $s_{1 \dots K}$, the overall loss is:</li>
</ol>
<p>$$
L(r_{1 \dots M}) = \sum_{i=1}^{K} s_{i} \left[ c_{i} + k_{i} \exp\left(\sum_{j=1}^{M} t_{ij} r_{j}\right) \right]
$$</p>
<p>When the validation set composition is unknown, implicit domain aggregation treats $s_{i}$ as learnable parameters. Setting the number of implicit domains larger than the true number works well and is robust to overestimation.</p>
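<p>Once fitted, the law is cheap to evaluate over candidate mixtures. A sketch with toy coefficients (the real $c_{i}$, $k_{i}$, $t_{ij}$ must be fitted from sampled training runs):</p>

```python
import numpy as np

def mixing_law_loss(r, c, k, T, s=None):
    """L(r) = sum_i s_i * (c_i + k_i * exp(sum_j t_ij * r_j)).

    r: training mixture proportions (length M, sums to 1).
    c, k: per-validation-domain constants (length K).
    T: K x M interaction matrix; T[i, j] < 0 means training domain j
       helps validation domain i.
    s: validation-domain proportions (length K); uniform if omitted.
    """
    r, c, k, T = map(np.asarray, (r, c, k, T))
    per_domain = c + k * np.exp(T @ r)  # per-validation-domain losses L_i
    s = np.ones(len(c)) / len(c) if s is None else np.asarray(s)
    return float(s @ per_domain)

# Toy coefficients: t[0][0] < 0, so training domain 0 helps validation domain 0.
c, k = [1.0], [1.0]
T = [[-2.0, 0.0]]
loss_helpful = mixing_law_loss([1.0, 0.0], c, k, T)  # 1 + e^{-2}
loss_neutral = mixing_law_loss([0.0, 1.0], c, k, T)  # 1 + e^{0} = 2
```
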
<h3 id="domain-interaction-patterns">Domain interaction patterns</h3>
<p>Visualizing the fitted $t_{ij}$ coefficients across 5 coarse Pile domains reveals three relationship types: most domain pairs are <strong>unrelated</strong> (sparse interaction matrix where each domain&rsquo;s loss is dominated by its own training proportion), some show <strong>facilitation</strong> (e.g., dialogue data helps internet text), and some show <strong>conflict</strong> (e.g., symbolic data hurts prose). This sparsity explains why the law can be fitted with fewer samples than the quadratic parameter count would suggest.</p>
<h2 id="nested-scaling-pipeline-for-cheap-prediction">Nested scaling pipeline for cheap prediction</h2>
<p>Fitting data mixing laws directly at target scale is too expensive (requires many full training runs at different mixtures). The paper proposes nesting three scaling laws:</p>
<p><strong>Step 1</strong>: For each mixture $r_{i}$ and each small model size $N_{j}$, train for $S_{0}$ steps. Fit a <a href="https://en.wikipedia.org/wiki/Power_law">power law</a> $L(S) = E_{1} + B/S^{\beta}$ over steps to extrapolate to the target step count $S_{\text{target}}$.</p>
<p><strong>Step 2</strong>: With the step-extrapolated losses for each mixture, fit a power law $L(N) = E_{2} + A/N^{\alpha}$ over model sizes to extrapolate to the target model size $N_{\text{target}}$.</p>
<p><strong>Step 3</strong>: With the predicted losses at $(N_{\text{target}}, S_{\text{target}})$ for all sampled mixtures, fit the data mixing law and search for the optimal mixture.</p>
<p>This pipeline requires only training small models (70M to 410M) for short runs (30B tokens) to predict performance of a 1B model trained for 100B tokens.</p>
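<p>The paper fits these power laws with Huber loss and L-BFGS; the sketch below substitutes a simple grid-plus-log-linear fit to show the extrapolation mechanics of Steps 1 and 2 on synthetic data:</p>

```python
import numpy as np

def fit_step_law(steps, losses, E_grid):
    """Fit L(S) = E + B / S**beta.

    Stand-in for the paper's Huber/L-BFGS fit: grid-search the irreducible
    loss E, then solve (B, beta) by log-linear least squares.
    """
    best = None
    logS = np.log(steps)
    for E in E_grid:
        resid = losses - E
        if np.any(resid <= 0):
            continue  # E must sit below every observed loss
        slope, intercept = np.polyfit(logS, np.log(resid), 1)
        B, beta = np.exp(intercept), -slope
        err = np.sum((E + B / steps**beta - losses) ** 2)
        if best is None or err < best[0]:
            best = (err, E, B, beta)
    return best[1], best[2], best[3]

def extrapolate(S_target, E, B, beta):
    return E + B / S_target**beta

# Synthetic check: recover E=2.0, B=5.0, beta=0.5 from short "runs".
steps = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
losses = 2.0 + 5.0 / steps**0.5
E, B, beta = fit_step_law(steps, losses, np.linspace(1.5, 2.5, 101))
```

<p>The same machinery applies to Step 2 by swapping step counts for model sizes ($L(N) = E_{2} + A/N^{\alpha}$).</p>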
<h3 id="mixture-sampling-strategy">Mixture sampling strategy</h3>
<p>To get informative samples efficiently, the paper uses double-diminishing proportions: for each domain, enumerate proportions by halving from the maximum available. This distributes losses evenly across the exponential law&rsquo;s range. From 40 candidate mixtures trained at the smallest scale (70M), 20 are selected based on which subset minimizes data mixing law fitting error.</p>
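<p>One way to realize the double-diminishing enumeration (a sketch under the stated halving rule; the paper&rsquo;s exact candidate generation and 20-of-40 selection by fitting error are not reproduced here):</p>

```python
from itertools import product

def halved_levels(max_prop, levels=3):
    """Candidate proportions for one domain: halve down from its maximum."""
    return [max_prop / 2**i for i in range(levels)]

def candidate_mixtures(max_props, levels=3):
    """Cartesian product of per-domain halved proportions, renormalized to sum to 1."""
    grids = [halved_levels(m, levels) for m in max_props]
    return [tuple(p / sum(combo) for p in combo) for combo in product(*grids)]

# 3 domains x 2 halving levels each -> 2^3 = 8 candidate mixtures.
mixtures = candidate_mixtures([0.6, 0.3, 0.1], levels=2)
```

<p>Halving spreads the sampled proportions across orders of magnitude, which distributes the observed losses evenly over the exponential law&rsquo;s range.</p>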
<h2 id="experiments-on-redpajama-and-continual-pretraining">Experiments on RedPajama and continual pretraining</h2>
<p><strong>Main experiment.</strong> Models trained on RedPajama, validated on the Pile (mimicking the common scenario where validation data comes from a different distribution than training). Small models: 70M, 160M, 305M, 410M trained for 30B tokens. Target: 1B model for 100B tokens.</p>
<p>The optimized mixture dramatically redistributes weight compared to RedPajama defaults:</p>
<table>
  <thead>
      <tr>
          <th>Domain</th>
          <th>Default</th>
          <th>Optimized</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>CommonCrawl</td>
          <td>0.670</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>C4</td>
          <td>0.150</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>GitHub</td>
          <td>0.045</td>
          <td>0.141</td>
      </tr>
      <tr>
          <td>ArXiv</td>
          <td>0.045</td>
          <td>0.250</td>
      </tr>
      <tr>
          <td>Books</td>
          <td>0.045</td>
          <td>0.094</td>
      </tr>
      <tr>
          <td>StackExchange</td>
          <td>0.025</td>
          <td>0.125</td>
      </tr>
      <tr>
          <td>Wikipedia</td>
          <td>0.020</td>
          <td>0.016</td>
      </tr>
  </tbody>
</table>
<p>The optimized mixture reaches the default mixture&rsquo;s final performance in 73% of the training steps and eventually achieves performance equivalent to 48% more training on the default mixture.</p>
<p><strong>Comparison to DoReMi and DoGE.</strong> Data mixing laws outperform both: the predicted-optimal mixture achieves lower validation loss than DoReMi and DoGE (both universal and OOD settings) for 1B models trained for 100B tokens on RedPajama.</p>
<p><strong>Continual pretraining.</strong> The law extends to continual pretraining (Pythia-70M on Pile + Python code). It accurately predicts the critical mixture proportion that avoids <a href="https://en.wikipedia.org/wiki/Catastrophic_interference">catastrophic forgetting</a> on the original domain while improving the target domain. This suggests data mixing laws could guide dynamic data schedules across multi-stage pretraining.</p>
<h2 id="implications-and-limitations">Implications and limitations</h2>
<p>The data mixing law provides a predictive framework rather than just an optimization algorithm. Key implications:</p>
<ul>
<li>The interaction coefficients $t_{ij}$ make domain relationships quantitatively observable before full-scale training, identifying facilitation and conflict pairs.</li>
<li>The nested pipeline&rsquo;s cost is dominated by the small-model training runs (40 mixtures at 70M scale), which is orders of magnitude cheaper than even a single target-scale run.</li>
<li>The continual pretraining application opens the door to optimizing dynamic data schedules, where mixture proportions change across training stages.</li>
</ul>
<p><strong>Limitations</strong>: The &ldquo;domain&rdquo; concept remains loosely defined (provenance-based). The nested scaling laws introduce compounding errors at each step, and predictions tend to slightly underestimate actual loss. The number of required fitting samples, while subquadratic in practice due to sparsity, still scales with the number of domains. No theoretical justification for the exponential form is provided; it is a purely empirical finding.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (pilot)</td>
          <td>The Pile (GitHub, Pile-CC, Books3)</td>
          <td>30B tokens</td>
          <td>2-domain and 3-domain experiments</td>
      </tr>
      <tr>
          <td>Training (main)</td>
          <td>RedPajama</td>
          <td>100B tokens</td>
          <td>7 domains</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>The Pile validation set</td>
          <td>Standard split</td>
          <td>Out-of-distribution relative to RedPajama</td>
      </tr>
      <tr>
          <td>Continual pretraining</td>
          <td>Pile + Python code</td>
          <td>10B tokens</td>
          <td>Pythia-70M base model</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<p>Data mixing law: $L_{i}(r_{1 \dots M}) = c_{i} + k_{i} \exp(\sum_{j} t_{ij} r_{j})$, fitted via an AdaBoost regressor on sampled mixtures. Step scaling law: $L(S) = E_{1} + B/S^{\beta}$. Model size scaling law: $L(N) = E_{2} + A/N^{\alpha}$. Both scaling laws are fitted via Huber loss minimization with L-BFGS, decomposing the Chinchilla-style joint law into separate fits for stability. 40 candidate mixtures are sampled via double-diminishing proportions, of which 20 are selected for the final pipeline.</p>
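<p>As a concrete illustration of the first step, the per-domain law can be fitted with ordinary nonlinear least squares on synthetic data. This is a minimal sketch, not the paper's pipeline: the coefficients are invented, only three training domains are used, and the last interaction coefficient is fixed to zero as a gauge choice (on the simplex, a common shift of all $t_{ij}$ is absorbed into $k_i$).</p>

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative fit of the per-domain mixing law
#   L_i(r) = c_i + k_i * exp(sum_j t_ij r_j)
# on synthetic 3-domain mixtures. All coefficients below are invented;
# t_3 is fixed to 0 as a gauge choice.

rng = np.random.default_rng(0)

def mixing_law(R, c, k, t1, t2):
    # R: (n_mixtures, 3) mixture proportions summing to 1; gauge t3 = 0
    return c + k * np.exp(t1 * R[:, 0] + t2 * R[:, 1])

true = (1.8, 0.9, -2.0, 0.8)              # synthetic "ground truth"
R = rng.dirichlet(np.ones(3), size=40)    # 40 candidate mixtures
L = mixing_law(R, *true)                  # noise-free losses for a clean check

popt, _ = curve_fit(mixing_law, R, L, p0=[1.0, 1.0, -1.0, 0.0], maxfev=10000)
best = R[np.argmin(mixing_law(R, *popt))]  # predicted-best candidate mixture
print(np.round(popt, 3))  # should recover (1.8, 0.9, -2.0, 0.8)
```

<p>The <code>best</code> line mirrors the selection step: once the law is fitted, the predicted-best mixture among the candidates can be read off without any further training at that scale.</p>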
<h3 id="models">Models</h3>
<p>Transformer decoder-only LMs. Pilot: 70M, 160M. Main pipeline: 70M, 160M, 305M, 410M (for fitting), 1B (target). Batch size: 1M tokens. Cosine learning rate decay with 2K step warmup, decaying to 0.1x at 100K steps.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Optimized Mixture</th>
          <th>Default Mixture</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Steps to match default final loss</td>
          <td>73K (73%)</td>
          <td>100K (100%)</td>
          <td>27% training reduction</td>
      </tr>
      <tr>
          <td>Equivalent extra training</td>
          <td>+48%</td>
          <td>Baseline</td>
          <td>Estimated via step scaling law</td>
      </tr>
      <tr>
          <td>Validation loss (1B, 100B)</td>
          <td>Lowest</td>
          <td>Higher than optimized</td>
          <td>Also beats DoReMi and DoGE</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>8 A100 GPUs. Training times per 30B-token run: 3.5 hours (70M), 8 hours (160M), 16 hours (305M), 21 hours (410M).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>Pilot and validation data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/togethercomputer/RedPajama-Data">RedPajama</a></td>
          <td>Dataset</td>
          <td>Apache 2.0</td>
          <td>Main training data</td>
      </tr>
      <tr>
          <td><a href="https://github.com/EleutherAI/pythia">Pythia Suite</a></td>
          <td>Model</td>
          <td>Apache 2.0</td>
          <td>Model architecture configs; Pythia-70M checkpoint for continual pretraining</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> Datasets and base model checkpoints are public. No official code release for the data mixing law fitting pipeline, mixture sampling, or the nested scaling law prediction workflow.</p>
<hr>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{ye2025datamixinglaws,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Ye, Jiasheng and Liu, Peiju and Sun, Tianxiang and Zhan, Jun and Zhou, Yunhua and Qiu, Xipeng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>RWKV: Linear-Cost RNN with Transformer Training</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/</guid><description>RWKV combines parallelizable transformer training with constant-cost RNN inference using linear attention and channel-wise decay.</description><content:encoded><![CDATA[<h2 id="a-new-architecture-bridging-rnns-and-transformers">A New Architecture Bridging RNNs and Transformers</h2>
<p>This is a <strong>Method</strong> paper that introduces RWKV (Receptance Weighted Key Value), a novel sequence model architecture that combines the parallelizable training of Transformers with the efficient $O(Td)$ inference of RNNs. RWKV can be formulated equivalently as either a Transformer (for parallel training) or an RNN (for sequential inference), achieving the lowest computational and memory complexity among comparable architectures while matching Transformer-level performance. The authors scale RWKV to 14 billion parameters, making it the largest dense RNN ever trained at the time of publication.</p>
<h2 id="the-quadratic-cost-of-self-attention">The Quadratic Cost of Self-Attention</h2>
<p>Transformers have become the dominant architecture for NLP, powering models like GPT-3, LLaMA, and Chinchilla. Their self-attention mechanism captures both local and long-range dependencies while supporting parallelized training. However, self-attention scales quadratically with sequence length in both time ($O(T^2d)$) and space ($O(T^2 + Td)$), making it computationally and memory intensive for long sequences and resource-constrained deployment.</p>
<p>RNNs, by contrast, offer linear scaling in memory and computation, but suffer from the vanishing gradient problem and cannot parallelize across the time dimension during training. This limits their scalability and makes them unable to match Transformer performance in practice.</p>
<p>Prior work on efficient Transformers (Reformer, Performer, Linformer, AFT, MEGA) has attempted to reduce this quadratic cost, often at the expense of model expressivity. RWKV aims to combine the best of both worlds: Transformer-grade training efficiency with RNN-grade inference cost, without any approximation to the attention mechanism.</p>
<h2 id="linear-attention-via-channel-wise-decay">Linear Attention via Channel-Wise Decay</h2>
<p>RWKV is built on four core vectors that interact multiplicatively at each timestep:</p>
<ul>
<li><strong>R</strong> (Receptance): receives past information, acting as a gating signal</li>
<li><strong>W</strong> (Weight): a trainable positional weight decay vector</li>
<li><strong>K</strong> (Key): analogous to keys in standard attention</li>
<li><strong>V</strong> (Value): analogous to values in standard attention</li>
</ul>
<p>The architecture consists of stacked residual blocks, each containing a <strong>time-mixing</strong> sub-block and a <strong>channel-mixing</strong> sub-block.</p>
<h3 id="token-shift">Token Shift</h3>
<p>All linear projection vectors are produced by interpolating between the current input $x_t$ and the previous input $x_{t-1}$, creating a token shift mechanism:</p>
<p>$$
r_t = W_r \cdot (\mu_r \odot x_t + (1 - \mu_r) \odot x_{t-1})
$$</p>
<p>$$
k_t = W_k \cdot (\mu_k \odot x_t + (1 - \mu_k) \odot x_{t-1})
$$</p>
<p>$$
v_t = W_v \cdot (\mu_v \odot x_t + (1 - \mu_v) \odot x_{t-1})
$$</p>
<p>where $\mu_r$, $\mu_k$, $\mu_v$ are learnable interpolation parameters. This is implemented efficiently as a simple offset in the temporal dimension.</p>
<h3 id="the-wkv-operator">The WKV Operator</h3>
<p>The core attention-like computation replaces standard dot-product attention with a channel-wise weighted sum using exponential decay:</p>
<p>$$
wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} \odot v_i + e^{u + k_t} \odot v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}
$$</p>
<p>Here $w$ is the channel-wise time decay vector and $u$ is a separate bonus vector that attends specifically to the current token. Unlike AFT where $W$ is a pairwise matrix, RWKV treats $W$ as a channel-wise vector modified by relative position, enabling the recurrent formulation.</p>
<h3 id="output-gating">Output Gating</h3>
<p>The receptance vector gates the WKV output through a sigmoid:</p>
<p>$$
o_t = W_o \cdot (\sigma(r_t) \odot wkv_t)
$$</p>
<p>The channel-mixing block uses a similar gating mechanism with squared ReLU activation:</p>
<p>$$
o'_t = \sigma(r'_t) \odot (W'_v \cdot \max(k'_t, 0)^2)
$$</p>
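<p>A minimal NumPy sketch of the channel-mixing computation, with arbitrary dimensions and the token-shift interpolation omitted for brevity:</p>

```python
import numpy as np

# Channel-mixing sub-block: sigmoid receptance gate over a squared-ReLU MLP.
# Dimensions and weights are illustrative; token shift is omitted.

rng = np.random.default_rng(0)
d, d_ff = 8, 32
x = rng.normal(size=d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W_r = rng.normal(size=(d, d))
W_k = rng.normal(size=(d_ff, d))
W_v = rng.normal(size=(d, d_ff))

r = W_r @ x                       # receptance projection
k = W_k @ x                       # key projection into the hidden dimension
o = sigmoid(r) * (W_v @ np.maximum(k, 0.0) ** 2)  # squared ReLU, then gate
print(o.shape)  # (8,)
```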
<h3 id="dual-mode-operation">Dual-Mode Operation</h3>
<p>During <strong>training</strong>, RWKV operates in time-parallel mode. The matrix multiplications ($W_\lambda$ for $\lambda \in \{r, k, v, o\}$) dominate at $O(BTd^2)$ and parallelize identically to standard Transformers. The element-wise WKV computation is $O(BTd)$ and parallelizes along batch and channel dimensions.</p>
<p>During <strong>inference</strong>, RWKV switches to time-sequential mode. Each timestep updates a fixed-size state vector, giving constant $O(d)$ memory and $O(Td)$ total time for generating $T$ tokens, compared to $O(T^2d)$ for standard Transformers.</p>
<h3 id="optimizations">Optimizations</h3>
<p>Three additional design choices improve training:</p>
<ol>
<li><strong>Custom CUDA kernels</strong> for the sequential WKV computation, fusing it into a single kernel on training accelerators</li>
<li><strong>Small init embedding</strong>: initializing the embedding matrix with small values plus an additional LayerNorm, accelerating convergence</li>
<li><strong>Custom initialization</strong>: most weights initialized to zero with no biases, following identity-mapping principles from residual network design</li>
</ol>
<h2 id="scaling-to-14b-parameters-and-benchmark-evaluation">Scaling to 14B Parameters and Benchmark Evaluation</h2>
<h3 id="model-scaling">Model Scaling</h3>
<p>The authors train six RWKV models from 169M to 14B parameters, all for one epoch (330B tokens) on the Pile:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>Dimension</th>
          <th>Parameters</th>
          <th>FLOP/Token</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>12</td>
          <td>768</td>
          <td>$1.69 \times 10^8$</td>
          <td>$2.61 \times 10^8$</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>24</td>
          <td>1024</td>
          <td>$4.30 \times 10^8$</td>
          <td>$7.57 \times 10^8$</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>24</td>
          <td>2048</td>
          <td>$1.52 \times 10^9$</td>
          <td>$2.82 \times 10^9$</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>32</td>
          <td>2560</td>
          <td>$2.99 \times 10^9$</td>
          <td>$5.71 \times 10^9$</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>32</td>
          <td>4096</td>
          <td>$7.39 \times 10^9$</td>
          <td>$1.44 \times 10^{10}$</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>40</td>
          <td>5120</td>
          <td>$1.42 \times 10^{10}$</td>
          <td>$2.78 \times 10^{10}$</td>
      </tr>
  </tbody>
</table>
<p>The parameter count follows: $\text{params} = 2VD + 13D^2L + D(11L + 4)$, where $V = 50277$ is vocabulary size, $D$ is model dimension, and $L$ is layers. FLOPs match the standard transformer formula: $\text{FLOP} = 6 \cdot [\text{tokens}] \cdot [\text{params}]$.</p>
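<p>The parameter-count formula can be checked directly against the table, for instance for the smallest and largest models:</p>

```python
# Parameter-count formula from the paper:
#   params = 2*V*D + 13*D^2*L + D*(11*L + 4), with V = 50277.

V = 50277  # vocabulary size

def rwkv_params(D, L):
    return 2 * V * D + 13 * D * D * L + D * (11 * L + 4)

print(rwkv_params(768, 12))    # 169342464 -> the 169M model
print(rwkv_params(5120, 40))   # 14148597760 -> the 14B model
```

<p>The first value matches the table's $1.69 \times 10^8$ exactly; the second agrees with the $1.42 \times 10^{10}$ entry to within about half a percent, plausibly due to rounding or vocabulary padding in the released checkpoint.</p>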
<h3 id="scaling-laws">Scaling Laws</h3>
<p>Training 45 RWKV models across varied (dataset, parameters) pairs, the authors find that RWKV follows the same log-log linear scaling law established for Transformers. The linear fit to Pareto-optimal points achieves $r^2 = 0.994$, and extrapolating an additional order of magnitude still yields $r^2 = 0.875$. This contrasts with prior claims that LSTMs do not follow transformer-like scaling.</p>
<h3 id="nlp-benchmarks">NLP Benchmarks</h3>
<p>RWKV is compared against similarly-sized models trained on comparable token budgets: Pythia, OPT, and BLOOM (all FLOP-matched). Results span twelve benchmarks: ARC (Easy/Challenge), BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, and Winogrande.</p>
<p>RWKV performs competitively with Transformers across all model sizes. On average across benchmarks, RWKV tracks closely with Pythia and outperforms OPT and BLOOM at comparable scales.</p>
<h3 id="long-context-and-extended-finetuning">Long Context and Extended Finetuning</h3>
<p>RWKV can extend its context length after pretraining through progressive finetuning: doubling from 1024 to 2048 (10B tokens), then to 4096 (100B tokens), and finally to 8192 (100B tokens). Each doubling reduces test loss on the Pile, indicating effective use of longer context.</p>
<p>On the Long Range Arena (LRA) benchmark, which tests sequences from 1,000 to 16,000 tokens, RWKV performs second only to S4 across the five datasets.</p>
<h3 id="inference-efficiency">Inference Efficiency</h3>
<p>Benchmarking text generation on CPU (x86) and GPU (NVIDIA A100 80GB) at float32 precision shows that RWKV exhibits linear scaling in generation time, while Transformers scale quadratically. This advantage grows with sequence length: for long outputs, RWKV completes generation substantially faster at equivalent model sizes.</p>
<h2 id="competitive-performance-with-key-caveats">Competitive Performance with Key Caveats</h2>
<p>RWKV demonstrates that RNN-class models can match Transformer performance at scale, while maintaining $O(Td)$ time and $O(d)$ memory during inference. The key findings are:</p>
<ol>
<li><strong>Scaling laws hold</strong>: RWKV follows the same compute-optimal scaling as Transformers ($r^2 = 0.994$), contradicting earlier claims about RNN scaling behavior</li>
<li><strong>Competitive NLP performance</strong>: Across twelve benchmarks, RWKV matches similarly-sized Transformers trained on comparable data</li>
<li><strong>Linear inference cost</strong>: Generation time scales linearly rather than quadratically, with constant memory regardless of sequence length</li>
<li><strong>Context extension</strong>: Progressive finetuning effectively extends the context window post-training</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors identify two primary limitations:</p>
<p><strong>Information compression</strong>: Linear attention funnels all past information through a single fixed-size state vector. For tasks requiring recall of specific details over very long contexts, this is mechanistically more constrained than full self-attention, which maintains direct access to all previous tokens.</p>
<p><strong>Prompt sensitivity</strong>: RWKV is more sensitive to prompt engineering than standard Transformers. The linear attention mechanism limits how much prompt information carries forward, making the order of information in the prompt particularly important. Reordering prompts improved F1 from 44.2% to 74.8% on one task.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors suggest several avenues: applying parallel scan to reduce WKV cost to $O(B \log(T) d)$, extending RWKV to encoder-decoder and multimodal architectures, leveraging hidden states for interpretability and safety, and increasing internal state size to improve long-range recall.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official PyTorch training and inference implementation</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/BlinkDL/rwkv-4-pile-14b">Pre-trained weights (169M to 14B)</a></td>
          <td>Model</td>
          <td>Apache-2.0</td>
          <td>All six Pile-trained sizes on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>)</td>
      </tr>
      <tr>
          <td><a href="https://pile.eleuther.ai/">The Pile</a></td>
          <td>Dataset</td>
          <td>Mixed</td>
          <td>825 GiB pretraining corpus; component licenses vary by source</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Highly Reproducible. Training code (Apache-2.0), pre-trained weights for all six model sizes, the full training corpus, and complete hyperparameters (Appendix G) are all publicly available. The only missing detail is the specific GPU cluster configuration used for pretraining.</p>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>The Pile</td>
          <td>330B tokens</td>
          <td>One full epoch for all model sizes</td>
      </tr>
      <tr>
          <td>Context extension</td>
          <td>The Pile</td>
          <td>210B additional tokens</td>
          <td>Progressive doubling: 1024 to 8192</td>
      </tr>
      <tr>
          <td>NLP evaluation</td>
          <td>ARC, BoolQ, COPA, HeadQA, HellaSwag, LAMBADA, OpenBookQA, PIQA, ReCoRD, SciQ, Winogrande</td>
          <td>Various</td>
          <td>Zero-shot evaluation</td>
      </tr>
      <tr>
          <td>Long-range evaluation</td>
          <td>Long Range Arena (LRA)</td>
          <td>1K-16K tokens</td>
          <td>Five sub-tasks</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam ($\beta = (0.9, 0.99)$), no weight decay</li>
<li>Precision: bfloat16</li>
<li>Training context length: 1024 tokens</li>
<li>Learning rate: constant warmup, then exponential decay</li>
<li>Auxiliary loss from PaLM (softmax normalizer regularization)</li>
<li>Batch size: 128 or 256 sequences (dynamically switched)</li>
<li>Training organized into mini-epochs of 40,320 samples each (8,043 mini-epochs per Pile epoch)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Init LR</th>
          <th>Warmup Mini-Epochs</th>
          <th>End LR</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>169M</td>
          <td>6e-4</td>
          <td>361</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>430M</td>
          <td>4e-4</td>
          <td>411</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>1.5B</td>
          <td>3e-4</td>
          <td>443</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>3B</td>
          <td>1.5e-4</td>
          <td>451</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>7B</td>
          <td>1.5e-4</td>
          <td>465</td>
          <td>1e-5</td>
      </tr>
      <tr>
          <td>14B</td>
          <td>1e-4</td>
          <td>544</td>
          <td>7e-6</td>
      </tr>
  </tbody>
</table>
<p>All pretrained models (169M to 14B) are publicly released on HuggingFace (<code>BlinkDL/rwkv-4-pile-*</code>) under Apache-2.0. Training code is at <a href="https://github.com/BlinkDL/RWKV-LM">BlinkDL/RWKV-LM</a> (Apache-2.0).</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>All NLP benchmarks evaluated in zero-shot setting</li>
<li>FLOP-matched comparison against Pythia, OPT, BLOOM</li>
<li>Inference benchmarked on CPU (x86) and GPU (NVIDIA A100 80GB) at float32</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference experiments: NVIDIA A100 80GB GPU</li>
<li>Training hardware details not fully specified; FLOP budgets reported per model</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., &hellip; &amp; Zhu, R.-J. (2023). RWKV: Reinventing RNNs for the Transformer Era. In <em>Findings of the Association for Computational Linguistics: EMNLP 2023</em>, pp. 14048-14077.</p>
<p><strong>Publication</strong>: Findings of EMNLP 2023</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/BlinkDL/RWKV-LM">GitHub Repository (Apache-2.0)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{peng2023rwkv,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{RWKV: Reinventing RNNs for the Transformer Era}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and Du, Xingjian and Grella, Matteo and GV, Kranthi Kiran and He, Xuzheng and Hou, Haowen and Kazienko, Przemys{\l}aw and Koco{\&#39;n}, Jan and Kong, Jiaming and Koptyra, Bart{\l}omiej and Lau, Hayden and Lin, Jiaju and Mantri, Krishna Sri Ipsit and Mom, Ferdinand and Saito, Atsushi and Song, Guangyu and Tang, Xiangru and Wind, Johan S. and Wo{\&#39;z}niak, Stanis{\l}aw and Zhang, Zhenyuan and Zhou, Qinghua and Zhu, Jian and Zhu, Rui-Jie}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Findings of the Association for Computational Linguistics: EMNLP 2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{14048--14077}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.findings-emnlp.936}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Block-Recurrent Transformers for Long Sequences</title><link>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/natural-language-processing/language-models/block-recurrent-transformers/</guid><description>Block-Recurrent Transformers combine attention and recurrence for linear-complexity language modeling on long documents like books and code.</description><content:encoded><![CDATA[<h2 id="a-method-for-combining-attention-with-block-level-recurrence">A Method for Combining Attention with Block-Level Recurrence</h2>
<p>This is a <strong>Method</strong> paper that introduces the Block-Recurrent Transformer, a model architecture that integrates recurrence into the transformer framework at the block level. Rather than processing tokens one at a time (as in traditional RNNs) or attending over entire sequences (as in standard transformers), this approach applies a transformer layer recurrently across blocks of tokens. The result is a model with linear complexity in sequence length that maintains the parallelism benefits of transformers during training. A related approach, <a href="/notes/natural-language-processing/language-models/rwkv-rnn-transformer-architecture/">RWKV</a>, later explored similar ideas using linear attention with channel-wise decay.</p>
<h2 id="why-transformers-struggle-with-long-documents">Why Transformers Struggle with Long Documents</h2>
<p>Transformers have largely replaced RNNs for sequence modeling tasks, but their quadratic self-attention cost limits the length of sequences they can process. A transformer with a window size of 512 tokens cannot see information beyond that window, making it blind to long-range dependencies in books, technical papers, or source code repositories.</p>
<p>Prior approaches to this problem fall into several categories: sparse attention patterns (BigBird, Routing Transformers, Reformer), sequence compression (Linformer, Funnel Transformers), and linearized attention approximations. These methods either sacrifice the expressiveness of full softmax attention or introduce implementation complexity.</p>
<p>Traditional RNNs like LSTMs offer linear complexity but suffer from three key limitations: sequential processing prevents parallelism on modern hardware, a single state vector bottlenecks information capacity, and vanishing gradients limit effective memory to a few hundred tokens.</p>
<h2 id="block-level-recurrence-with-lstm-style-gates">Block-Level Recurrence with LSTM-Style Gates</h2>
<p>The core innovation is applying a standard transformer layer in a recurrent fashion along the sequence, operating on blocks of $W$ tokens rather than individual tokens. The recurrent cell maintains $S$ state vectors (typically $S = W = 512$) that are updated at each block boundary.</p>
<h3 id="the-recurrent-cell">The Recurrent Cell</h3>
<p>The cell has two processing directions:</p>
<ul>
<li><strong>Vertical direction</strong>: An ordinary transformer layer with self-attention over input tokens and cross-attention to recurrent states, producing output embeddings.</li>
<li><strong>Horizontal direction</strong>: Self-attention over current state vectors and cross-attention to input tokens, producing updated state vectors. Residual connections are replaced with gates.</li>
</ul>
<p>Self-attention and cross-attention are computed in parallel (not sequentially), with results concatenated and fed into a linear projection. Keys and values are shared between directions, while queries are separate, yielding four query sets: $Q_e^v$, $Q_s^v$ (vertical) and $Q_s^h$, $Q_e^h$ (horizontal).</p>
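<p>The shared-key/value, separate-query arrangement can be sketched compactly. This is a simplified single-head sketch with random toy inputs: the key, value, and query projection matrices are omitted (the inputs stand in for their own keys, values, and queries), and the final linear projection over the concatenated results is not shown.</p>

```python
import numpy as np

# Vertical and horizontal directions of the recurrent cell: keys/values are
# shared between directions, queries are separate. Single head, toy shapes.

rng = np.random.default_rng(0)
W, S, d = 6, 4, 8                  # block width, state count, head dimension

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

x = rng.normal(size=(W, d))        # token embeddings in the current block
s = rng.normal(size=(S, d))        # recurrent state vectors

K_e, V_e = x.copy(), x.copy()      # token keys/values (shared by both directions)
K_s, V_s = s.copy(), s.copy()      # state keys/values (shared by both directions)

# Vertical: tokens self-attend and cross-attend to states, results concatenated
vert = np.concatenate([attend(x, K_e, V_e), attend(x, K_s, V_s)], axis=-1)
# Horizontal: states self-attend and cross-attend to tokens
horiz = np.concatenate([attend(s, K_s, V_s), attend(s, K_e, V_e)], axis=-1)
print(vert.shape, horiz.shape)  # (6, 16) (4, 16)
```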
<h3 id="gating-mechanisms">Gating Mechanisms</h3>
<p>Two gate types are explored. The <strong>fixed gate</strong> uses a learned convex combination:</p>
<p>$$
g = \sigma(b_g)
$$</p>
<p>$$
c_{t+1} = c_t \odot g + z_t \odot (1 - g)
$$</p>
<p>where $g$ is constant after training, implementing an <a href="https://en.wikipedia.org/wiki/Moving_average">exponential moving average</a>.</p>
<p>The <strong>LSTM gate</strong> uses input and forget gates:</p>
<p>$$
i_t = \sigma(W_i h_t + b_i - 1)
$$</p>
<p>$$
f_t = \sigma(W_f h_t + b_f + 1)
$$</p>
<p>$$
c_{t+1} = c_t \odot f_t + z_t \odot i_t
$$</p>
<p>The bias offsets ($-1$ for input, $+1$ for forget) initialize the model to &ldquo;remember&rdquo; by default, which is critical for training stability. Without careful initialization, the model can fall into a local optimum where it ignores the recurrent state entirely. This echoes the <a href="/notes/machine-learning/model-architectures/can-recurrent-neural-networks-warp-time/">gate initialization challenges studied by Tallec and Ollivier</a>, who derived chrono initialization for LSTMs from time-warping invariance.</p>
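<p>A toy NumPy sketch of the two gate updates, with invented shapes and small random weights; note how the $\pm 1$ bias offsets make the LSTM gate start out retaining most of the old state:</p>

```python
import numpy as np

# The two gate types for the horizontal (state) direction. Shapes, weights,
# and inputs are illustrative, not the paper's configuration.

rng = np.random.default_rng(0)
S, d = 4, 8                        # number of state vectors, dimension

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

c = rng.normal(size=(S, d))        # current recurrent state
z = rng.normal(size=(S, d))        # candidate update from attention
h = rng.normal(size=(S, d))        # input to the gate projections

# Fixed gate: learned per-channel EMA coefficient, constant after training
b_g = np.zeros(d)                  # zero bias -> g = 0.5 everywhere
g = sigmoid(b_g)
c_fixed = c * g + z * (1.0 - g)

# LSTM-style gate: input gate biased low (-1), forget gate biased high (+1)
W_i = rng.normal(size=(d, d)) * 0.01
W_f = rng.normal(size=(d, d)) * 0.01
i_t = sigmoid(h @ W_i.T - 1.0)     # starts near 0.27: admit little new info
f_t = sigmoid(h @ W_f.T + 1.0)     # starts near 0.73: retain most old state
c_lstm = c * f_t + z * i_t
print(c_fixed.shape, c_lstm.shape)  # (4, 8) (4, 8)
```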
<h3 id="gate-configurations">Gate Configurations</h3>
<p>Three configurations are tested: <strong>dual</strong> (gates on both attention and MLP outputs), <strong>single</strong> (gate only on MLP output), and <strong>skip</strong> (gate only on attention output, no MLP). The skip configuration removes the large MLP from the recurrent layer entirely.</p>
<h3 id="learned-state-ids">Learned State IDs</h3>
<p>Since the same weights are applied to all state vectors, learned &ldquo;state IDs&rdquo; (analogous to position embeddings) are added so each state vector can issue distinct queries. <a href="/notes/natural-language-processing/language-models/t5-text-to-text-transfer-transformer/">T5</a>-style relative position bias is used for token self-attention, with no position bias for state-token cross-attention.</p>
<h2 id="language-modeling-on-pg19-arxiv-and-github">Language Modeling on PG19, arXiv, and GitHub</h2>
<h3 id="experimental-setup">Experimental Setup</h3>
<p>The base model is a 12-layer transformer with 150M parameters (8 heads of size 128, embedding dimension 1024, MLP hidden size 4096). The recurrent layer is placed at layer 10 with segment length $N = 4096$ and window size $W = 512$. The architecture is evaluated on three long-document datasets:</p>
<ul>
<li><strong>PG19</strong>: Full-length books from <a href="https://en.wikipedia.org/wiki/Project_Gutenberg">Project Gutenberg</a> (pre-1919)</li>
<li><strong>arXiv</strong>: Mathematics papers in LaTeX</li>
<li><strong>GitHub</strong>: Concatenated source code from open-source repositories</li>
</ul>
<p>All models report bits-per-token ($\log_2$ perplexity, lower is better).</p>
<h3 id="baselines">Baselines</h3>
<p>Five baselines are compared: Transformer-XL with window sizes of 512, 1024, and 2048, plus 12-layer and 13-layer sliding-window models. The 13-layer sliding-window model (Slide:13L) is the primary comparison, as it matches the recurrent models in computation cost and parameter count.</p>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Step Time</th>
          <th>PG19 (bytes)</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>XL:512</td>
          <td>0.88</td>
          <td>1.01</td>
          <td>3.62</td>
          <td>1.45</td>
          <td>1.21</td>
      </tr>
      <tr>
          <td>XL:2048</td>
          <td>2.11</td>
          <td>0.990</td>
          <td>3.58</td>
          <td>1.31</td>
          <td>1.01</td>
      </tr>
      <tr>
          <td>Slide:13L</td>
          <td>1.00</td>
          <td>0.989</td>
          <td>3.58</td>
          <td>1.42</td>
          <td>1.17</td>
      </tr>
      <tr>
          <td>Rec:fixed:skip</td>
          <td>0.99</td>
          <td>0.952</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Rec:fixed:dual</td>
          <td>1.01</td>
          <td>0.957</td>
          <td>3.52</td>
          <td>1.27</td>
          <td>0.991</td>
      </tr>
      <tr>
          <td>Feedback:fixed:skip</td>
          <td>1.35</td>
          <td>0.935</td>
          <td>3.49</td>
          <td>1.24</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Memorizing Trans. 64k</td>
          <td>1.94</td>
          <td>0.950</td>
          <td>3.53</td>
          <td>1.22</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The Rec:fixed:skip configuration achieves the best overall results while being slightly faster than the 13-layer baseline. It outperforms XL:2048, which runs over 2x slower. The block feedback variant (allowing all layers to cross-attend to recurrent states) improves perplexity further at ~35-40% higher step time.</p>
<h3 id="scaling-behavior">Scaling Behavior</h3>
<p>Models from 40M to 1.3B parameters show that the benefit of recurrence is <a href="/notes/machine-learning/model-architectures/scaling-laws-vs-model-architectures/">consistent across scales</a> and increases with model size. At larger sizes, adding recurrence provides a benefit greater than doubling the number of parameters. The 1.3B parameter model achieves 26.50 word-level perplexity on PG19, setting a new state of the art at the time of publication.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Layers</th>
          <th>PG19 Perplexity</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Compressive Transformer</td>
          <td>36</td>
          <td>33.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Routing Transformer</td>
          <td>22</td>
          <td>33.2</td>
          <td>490M</td>
      </tr>
      <tr>
          <td>Perceiver AR</td>
          <td>60</td>
          <td>28.9</td>
          <td>974.6M</td>
      </tr>
      <tr>
          <td>Block-Recurrent Transformer</td>
          <td>24</td>
          <td>26.50</td>
          <td>1.3B</td>
      </tr>
  </tbody>
</table>
<h3 id="ablations">Ablations</h3>
<ul>
<li><strong>Multiple recurrent layers</strong>: Two adjacent layers (9, 10) provide no benefit. Two separated layers (4, 10) help but no more than adding another non-recurrent layer.</li>
<li><strong>Number of states</strong>: Improvement up to 1024 states, degradation at 2048.</li>
<li><strong>Window size reduction</strong>: Reducing the sliding window hurts Transformer-XL dramatically but has a smaller impact on the recurrent model, which compensates via recurrence.</li>
<li><strong>Gate type</strong>: The fixed gate consistently outperforms the LSTM gate despite being theoretically less expressive.</li>
</ul>
<h3 id="qualitative-analysis">Qualitative Analysis</h3>
<p>Comparing per-token predictions against Transformer-XL on PG19 books, the recurrent model&rsquo;s advantage comes overwhelmingly from predicting proper names (17/20 top-improvement tokens). In 19/20 cases, the predicted word was outside the attention window, confirming it was stored in recurrent state. The model can remember book titles and authors across 60,000+ tokens.</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>The Block-Recurrent Transformer demonstrates that recurrence at the block level is a cost-effective way to improve language modeling on long sequences. The fixed:skip configuration (the simplest variant) performs best, suggesting the model primarily uses recurrence for long-range name lookup rather than complex reasoning. The fact that removing the MLP from the recurrent layer has minimal impact further supports this interpretation.</p>
<p>Key limitations: the model was evaluated only on language-modeling perplexity (no downstream tasks); the LSTM gate underperforms the simpler fixed gate, suggesting untapped potential for more expressive recurrence; and the authors acknowledge that training the recurrent layer to fully exploit its capacity for knowledge extraction will require further advances.</p>
<p>The authors note that evaluating on downstream tasks requiring long-range context (book summarization, long-document QA, code completion) is an important direction for future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Eval</td>
          <td>PG19</td>
          <td>~29k books</td>
          <td>Public domain, freely available</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>arXiv</td>
          <td>Mathematics papers</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
      <tr>
          <td>Training/Eval</td>
          <td>GitHub</td>
          <td>Open-source repos</td>
          <td>Obtained via private channels, not redistributable</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adafactor</li>
<li>Learning rate: 1.0 with inverse square root decay (initial experiments), cosine decay with max 0.01 (scaling experiments)</li>
<li>Warmup: 1000 steps</li>
<li>Dropout: 0.05</li>
<li>Vocabulary: 32k SentencePiece (T5 pretrained for initial, custom for scaling)</li>
<li>Gate initialization: bias of $+1$ for forget gate, $-1$ for input gate to ensure initial &ldquo;remember&rdquo; behavior</li>
</ul>
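<p>To make the gate initialization concrete, here is a minimal sketch of one plausible reading of the fixed gate described above: per-channel learned biases passed through a sigmoid, combining the old state and the new candidate. Names, shapes, and the exact update form are illustrative assumptions, not taken from the released Meliad code.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fixed_gate_update(state, candidate, b_forget, b_input):
    """Elementwise gated update of the recurrent state.

    In the 'fixed' gate, the gates depend only on learned per-channel
    biases, not on the current input (unlike an LSTM-style gate).
    """
    return state * sigmoid(b_forget) + candidate * sigmoid(b_input)

d = 8  # illustrative state width
state = np.ones(d)       # stand-in for the carried recurrent state
candidate = np.zeros(d)  # stand-in for the cross-attention update

# Initialization from the notes above: forget bias +1, input bias -1,
# so sigmoid(+1) ~ 0.73 of the old state is kept and only
# sigmoid(-1) ~ 0.27 of the new candidate is mixed in.
b_forget = np.full(d, +1.0)
b_input = np.full(d, -1.0)

new_state = fixed_gate_update(state, candidate, b_forget, b_input)
print(new_state[0])  # ~0.731: the initial behavior is "remember"
```

<p>With these biases, the state decays slowly at initialization, which matches the intended "remember by default" behavior before training adjusts the gates.</p>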
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>Layers</th>
          <th>Parameters</th>
          <th>Recurrent Layers</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>12 (+1 recurrent)</td>
          <td>~151-164M</td>
          <td>Layer 10</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>24 (+2 recurrent)</td>
          <td>650M</td>
          <td>Layers 10, 20</td>
      </tr>
      <tr>
          <td>XL</td>
          <td>24 (+2 recurrent)</td>
          <td>1.3B</td>
          <td>Layers 10, 20</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Best Model</th>
          <th>PG19 (tokens)</th>
          <th>arXiv</th>
          <th>GitHub</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Bits-per-token</td>
          <td>Rec:fixed:skip</td>
          <td>3.53</td>
          <td>1.24</td>
          <td>0.976</td>
      </tr>
      <tr>
          <td>Word-level PPL</td>
          <td>1.3B model</td>
          <td>26.50</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>Error bars on PG19 are between 0.002 and 0.007 (3 runs with different seeds).</p>
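<p>The two metrics in the table are related: word-level perplexity follows from bits-per-token once the average number of subword tokens per word is known, since total bits are conserved across tokenizations. A hedged sketch of the conversion (the tokens-per-word ratio below is a made-up illustration, not the paper's actual PG19 statistic):</p>

```python
def word_level_ppl(bits_per_token, tokens_per_word):
    """Convert subword bits-per-token to word-level perplexity.

    bits/word = bits/token * tokens/word, and perplexity = 2 ** (bits/word).
    """
    return 2.0 ** (bits_per_token * tokens_per_word)

# Illustrative only: at ~1.33 subword tokens per word, a model at
# 3.53 bits per token would land near 26 word-level perplexity.
print(round(word_level_ppl(3.53, 1.33), 2))  # ~25.9
```

<p>This is why the two numbers in the table are consistent with each other despite looking very different in magnitude.</p>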
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 32 Google TPU v4 replicas</li>
<li>Training time: ~48 hours for 500k steps on PG19</li>
<li>Batch size: 32 (segment length 4096) or 256 (segment length 512), adjusted so each model sees the same tokens per step</li>
</ul>
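<p>The batch-size pairing above keeps the tokens-per-step budget identical across segment lengths, so every model sees the same amount of data per step. A quick arithmetic check:</p>

```python
# Both configurations process the same number of tokens per step.
long_segments = 32 * 4096    # batch 32, segment length 4096
short_segments = 256 * 512   # batch 256, segment length 512
assert long_segments == short_segments

# Over the 500k-step PG19 run, that is a fixed total token budget.
total_tokens = long_segments * 500_000
print(long_segments)   # 131072
print(total_tokens)    # 65536000000
```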
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Available</th>
          <th>License</th>
          <th>URL</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Code (Meliad)</td>
          <td>Yes</td>
          <td>Apache 2.0</td>
          <td><a href="https://github.com/google-research/meliad">github.com/google-research/meliad</a></td>
      </tr>
      <tr>
          <td>PG19 Dataset</td>
          <td>Yes</td>
          <td>Public Domain</td>
          <td>Public</td>
      </tr>
      <tr>
          <td>arXiv Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>GitHub Dataset</td>
          <td>No</td>
          <td>Not redistributable</td>
          <td>Private</td>
      </tr>
      <tr>
          <td>Pretrained Models</td>
          <td>No</td>
          <td>-</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility Assessment</strong>: Partially Reproducible. Source code is available under Apache 2.0 and the PG19 dataset is public. However, two of three evaluation datasets (arXiv, GitHub) were obtained via private channels and are not redistributable. No pretrained model checkpoints are released.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Hutchins, D., Schlag, I., Wu, Y., Dyer, E., &amp; Neyshabur, B. (2022). Block-Recurrent Transformers. <em>Advances in Neural Information Processing Systems 35 (NeurIPS 2022)</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{hutchins2022block,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Block-Recurrent Transformers}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hutchins, DeLesley and Schlag, Imanol and Wu, Yuhuai and Dyer, Ethan and Neyshabur, Behnam}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2203.07852}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>