Different architectural choices encode different inductive biases: how a model processes sequences, aggregates information, or shares parameters all shape what it can learn efficiently. Notes in this section cover the design, analysis, and comparison of neural network architectures, including how structural decisions affect scaling properties, expressivity, and generalization. The focus is on understanding architectures along axes beyond specific symmetry groups (which fall under geometric deep learning).
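As a toy illustration of the parameter-sharing point (not drawn from any paper below): a dense layer assigns position-specific weights, while a convolution reuses one kernel at every position, so its parameter count is independent of sequence length. A minimal sketch, with made-up layer sizes:

```python
def dense_params(n_in: int, n_out: int) -> int:
    # Every input connects to every output, plus one bias per output.
    return n_in * n_out + n_out

def conv1d_params(kernel_size: int, c_in: int, c_out: int) -> int:
    # One kernel is shared across all positions, so the count does not
    # depend on sequence length.
    return kernel_size * c_in * c_out + c_out

seq_len, channels = 1024, 64
print(dense_params(seq_len * channels, seq_len * channels))  # grows with seq_len**2
print(conv1d_params(3, channels, channels))                  # constant in seq_len
```

The same inputs and outputs, but wildly different parameter counts: this gap is one concrete face of the inductive biases the notes below examine.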

| Year | Paper | Key Idea |
|------|-------|----------|
| 1949 | Communication in the Presence of Noise | Shannon's information theory, channel capacity, and sampling theorem |
| 1984 | Distributed Representations | Theoretical efficiency of coarse coding over local representations |
| 2018 | Can Recurrent Neural Networks Warp Time? | Deriving gating from time-warping invariance; chrono initialization |
| 2018 | Relational Inductive Biases in Deep Learning | Unifying graph neural network variants under a general GN framework |
| 2020 | Lagrangian Neural Networks | Energy-conserving dynamics from learned Lagrangians, no canonical coordinates |
| 2022 | Liquid-S4 | Input-dependent state transitions for structured state-space models |
| 2022 | Scaling Laws vs Model Architectures | Comparing scaling behavior across ten architectures |
| 2023 | The Quarks of Attention | Decomposing attention into fundamental building blocks with capacity bounds |
| 2023 | NaViT | Sequence packing for native-resolution Vision Transformers |