Research notes on language model architectures and pretraining methods, covering transformer variants, data mixing strategies, and scaling laws.

| Year | Paper | Key Idea |
|------|-------|----------|
| 2020 | T5: Exploring Transfer Learning Limits | Unified text-to-text framework with systematic ablation of NLP transfer |
| 2022 | Block-Recurrent Transformers | Recurrence over token blocks for linear-complexity long-sequence modeling |
| 2023 | DoReMi | Proxy-model group DRO to learn optimal domain weights for LM pretraining (see the weight-update sketch below) |
| 2023 | RWKV | Transformer-level quality with linear-time, constant-memory RNN inference |
| 2023 | Scaling Data-Constrained Language Models | Scaling laws for repeated data: up to ~4 epochs of repetition degrade loss negligibly |
| 2023 | SlimPajama-DC | Global deduplication and domain diversity improve LLM training |
| 2025 | Data Mixing Laws for LM Pretraining | Exponential law for loss as a function of the domain mixture enables cheap mixture optimization (see the fitting sketch below) |
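
To make the DoReMi entry concrete, here is a minimal sketch of the kind of exponentiated-gradient (multiplicative-weights) domain-weight update used in group DRO with a small proxy model. The excess-loss definition and the renormalization follow the general recipe; the step size, smoothing details, and all names below are illustrative, not the paper's exact values.

```python
import numpy as np

def update_domain_weights(weights, proxy_losses, reference_losses, step_size=1.0):
    """One DoReMi-style exponentiated-gradient step on domain mixture weights.

    weights:          current mixture weights over domains (sum to 1)
    proxy_losses:     per-domain loss of the small proxy model
    reference_losses: per-domain loss of a reference model trained on the
                      baseline mixture
    step_size:        DRO learning rate (illustrative value)
    """
    # Excess loss: how much worse the proxy is than the reference on each domain.
    excess = np.maximum(proxy_losses - reference_losses, 0.0)
    # Upweight domains with high excess loss (multiplicative-weights update)...
    new_weights = weights * np.exp(step_size * excess)
    # ...and renormalize back onto the simplex.
    return new_weights / new_weights.sum()

# Toy usage: three domains; the proxy lags the reference most on domain 2,
# so that domain receives the largest weight after the update.
w = np.array([1 / 3, 1 / 3, 1 / 3])
w = update_domain_weights(
    w,
    proxy_losses=np.array([3.1, 2.8, 3.6]),
    reference_losses=np.array([3.0, 2.9, 3.2]),
)
print(w)
```

The paper additionally smooths the updated weights toward a uniform distribution before reuse; that detail is omitted here for brevity.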
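
And for the data-mixing-laws entry, a sketch of fitting an exponential law of the assumed form L(r) ≈ c + k·exp(Σᵢ tᵢ rᵢ) to a handful of cheap small-scale runs at different mixtures, then picking the mixture the fitted law predicts to be best. The exact functional form and fitting procedure in the paper may differ, and the observations below are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

# Validation losses observed for a few cheap runs at different domain mixtures
# (synthetic numbers; columns are proportions of web, code, and books data).
mixtures = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.2, 0.2],
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
    [0.2, 0.2, 0.6],
])
losses = np.array([2.95, 2.88, 2.87, 2.90, 3.01, 2.92, 3.05])

def mixing_law(r, c, k, t1, t2, t3):
    # Assumed law: L(r) = c + k * exp(t . r), exponential in the mixture proportions.
    return c + k * np.exp(r @ np.array([t1, t2, t3]))

params, _ = curve_fit(mixing_law, mixtures, losses,
                      p0=[2.5, 0.5, 0.0, 0.0, 0.0], maxfev=20000)

# Search a grid of candidate mixtures and keep the one with the lowest
# predicted loss -- the "cheap optimization" the notes refer to.
grid = [(a, b, 1 - a - b)
        for a in np.linspace(0, 1, 21)
        for b in np.linspace(0, 1 - a, 21)]
best = min(grid, key=lambda r: mixing_law(np.array([r]), *params)[0])
print("predicted-best mixture (web, code, books):", best)
```

In practice the fitted law is only trusted near the region covered by the pilot runs; the point of the sketch is that a few small runs plus a parametric fit replace a sweep of full-scale training jobs.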