Model Architectures

Different architectural choices encode different inductive biases: how a model processes sequences, aggregates information, or shares parameters all shape what it can learn efficiently. Notes in this section cover the design, analysis, and comparison of neural network architectures, including how structural decisions affect scaling properties, expressivity, and generalization. The focus is on architectural properties other than equivariance to specific symmetry groups, which fall under geometric deep learning.

Year	Paper	Key Idea
1949	Communication in the Presence of Noise	Shannon’s information theory, channel capacity, and sampling theorem
1984	Distributed Representations	Theoretical efficiency of coarse coding over local representations
2018	Can Recurrent Neural Networks Warp Time?	Deriving gating from time-warping invariance; chrono initialization
2018	Relational Inductive Biases in Deep Learning	Unifying graph neural network variants under a general GN framework
2020	Lagrangian Neural Networks	Energy-conserving dynamics from learned Lagrangians, without canonical coordinates
2022	Liquid-S4	Input-dependent state transitions for structured state-space models
2022	Scaling Laws vs Model Architectures	Comparing scaling behavior across ten architectures
2023	The Quarks of Attention	Decomposing attention into fundamental building blocks with capacity bounds
2023	NaViT	Sequence packing for native-resolution Vision Transformers

Machine Learning

Diagram showing the Lagrangian Neural Network pipeline from coordinates through a learned Lagrangian to energy-conserving dynamics

Lagrangian Neural Networks for Physics

Lagrangian Neural Networks (LNNs) use neural networks to parameterize arbitrary Lagrangians, enabling energy-conserving learned dynamics without canonical coordinates. Unlike Hamiltonian approaches, LNNs handle relativistic systems and extend to graphs via Lagrangian Graph Networks.
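
For reference, the dynamics follow from the Euler-Lagrange equation applied to the learned Lagrangian. The notation here is an assumption rather than something this page defines: L(q, q̇) is the network output, q the generalized coordinates, and the Hessian with respect to q̇ is assumed invertible. Expanding the time derivative by the chain rule and solving for the acceleration gives:

```latex
\frac{d}{dt}\,\nabla_{\dot q} L = \nabla_{q} L
\quad\Longrightarrow\quad
\ddot q = \left(\nabla_{\dot q}\nabla_{\dot q}^{\top} L\right)^{-1}
\left[\nabla_{q} L - \left(\nabla_{q}\nabla_{\dot q}^{\top} L\right)\dot q\right]
```

Only derivatives of L with respect to q and q̇ appear, which is why no canonical momenta are needed.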

Visualization of Liquid-S4 kernel decomposition showing input signal, S4 kernel, liquid kernel, and combined output

Liquid-S4: Input-Dependent State-Space Models

Liquid-S4 extends the S4 framework by incorporating a linearized liquid time-constant formulation that introduces input-dependent state transitions. This yields an additional convolutional kernel capturing input correlations, improving generalization across long-range sequence tasks.

Diagram showing NaViT packing variable-resolution image patches into a single sequence

NaViT: Native Resolution Vision Transformer

NaViT applies sequence packing (Patch n’ Pack) to Vision Transformers, enabling training on images of arbitrary resolution and aspect ratio while improving training efficiency by up to 4x over standard ViT.
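
The packing idea can be illustrated with a minimal sketch. It uses a greedy first-fit strategy, which is an assumption for illustration and not necessarily the exact procedure in the paper. The function names, the `max_len` budget, and the patch counts are all hypothetical.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit: place each example in the first pack with room."""
    packs = []      # each pack is a list of example indices
    remaining = []  # free token slots left in each pack
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, exceeds max_len={max_len}")
        for p, free in enumerate(remaining):
            if n <= free:
                packs[p].append(idx)
                remaining[p] -= n
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append([idx])
            remaining.append(max_len - n)
    return packs, remaining


def segment_ids(pack, lengths, max_len):
    """Per-token example id (1-based) for one pack, with 0 marking padding."""
    ids = []
    for seg, idx in enumerate(pack, start=1):
        ids.extend([seg] * lengths[idx])
    return ids + [0] * (max_len - len(ids))


# Patch counts for images of different resolutions (made-up values).
lengths = [196, 64, 100, 49, 144, 30]
packs, remaining = pack_sequences(lengths, max_len=256)
ids_first_pack = segment_ids(packs[0], lengths, max_len=256)
```

The segment ids are what lets a packed sequence behave like separate images: attention and pooling can be masked so that tokens only interact when their ids are equal and nonzero.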

Three-panel diagram showing an original sequence, its time-warped version, and the gate values derived from requiring time warping invariance

Can Recurrent Neural Networks Warp Time? (ICLR 2018)

Tallec and Ollivier show that requiring invariance to time transformations in recurrent models leads to gating mechanisms, recovering key LSTM components from first principles. They also propose chrono initialization, a scheme for setting gate biases that improves learning of long-term dependencies.
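
Chrono initialization can be sketched in a few lines. It draws forget-gate biases as the log of a uniform sample on [1, T_max − 1] and sets input-gate biases to their negation. The function name, the `t_max` argument (the longest dependency range expected in the data), and the use of plain Python lists are assumptions for illustration.

```python
import math
import random


def chrono_init(hidden_size, t_max, seed=0):
    """Return (forget_bias, input_bias) lists for an LSTM-style cell."""
    if t_max < 2:
        raise ValueError("t_max must be at least 2")
    rng = random.Random(seed)
    # b_f = log(x) with x ~ U[1, t_max - 1]. Since sigmoid(log x) = x / (1 + x),
    # the forget gate starts between 1/2 and 1 - 1/t_max, which corresponds to
    # memory time scales between 2 and t_max steps.
    b_f = [math.log(rng.uniform(1.0, t_max - 1.0)) for _ in range(hidden_size)]
    # The input gate bias is the negation, so the two gates start complementary.
    b_i = [-b for b in b_f]
    return b_f, b_i


forget_bias, input_bias = chrono_init(hidden_size=8, t_max=100)
```

In a framework implementation these values would be copied into the corresponding slices of the cell's bias vector, whose layout depends on the library.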

Graph network block diagram showing input graph transformed through edge, node, and global update steps to produce an updated graph

Relational Inductive Biases in Deep Learning (2018)

Battaglia et al. argue that combinatorial generalization requires structured representations, systematically analyze the relational inductive biases in standard deep learning architectures (MLPs, CNNs, RNNs), and present the graph network as a unifying framework that generalizes and extends prior graph neural network approaches.
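
The edge, node, and global update steps of a graph network block can be sketched as follows. This is a toy version under stated assumptions: attributes are scalars, aggregation is a plain sum, and the update functions `phi_e`, `phi_v`, and `phi_u` are hand-written stand-ins for what would normally be learned networks.

```python
def gn_block(nodes, edges, u, phi_e, phi_v, phi_u):
    """One graph network block with sum aggregation.

    nodes: list of node attributes
    edges: list of (sender, receiver, attribute) triples
    u:     global attribute
    """
    # 1. Update each edge from its attribute, its endpoint nodes, and the global.
    new_edges = [(s, r, phi_e(e, nodes[r], nodes[s], u)) for s, r, e in edges]

    # 2. Aggregate updated edges per receiving node, then update each node.
    incoming = [0.0] * len(nodes)
    for _, r, e in new_edges:
        incoming[r] += e
    new_nodes = [phi_v(incoming[i], v, u) for i, v in enumerate(nodes)]

    # 3. Aggregate all updated edges and nodes, then update the global.
    new_u = phi_u(sum(e for _, _, e in new_edges), sum(new_nodes), u)
    return new_nodes, new_edges, new_u


# Toy update functions (hypothetical placeholders for learned networks).
def phi_e(e, v_receiver, v_sender, u):
    return e + v_sender


def phi_v(e_agg, v, u):
    return v + e_agg


def phi_u(e_agg, v_agg, u):
    return u + e_agg + v_agg


nodes = [1.0, 2.0, 3.0]
edges = [(0, 1, 0.5), (1, 2, 0.5), (0, 2, 1.0)]
new_nodes, new_edges, new_u = gn_block(nodes, edges, 0.0, phi_e, phi_v, phi_u)
```

Dropping or restricting individual inputs to the three update functions is one way to see how narrower graph neural network variants fit inside the same block.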

Log-log plot comparing scaling laws across six architectures showing the vanilla Transformer has the steepest slope

Scaling Laws vs Model Architectures: Inductive Bias

Tay et al. systematically compare scaling laws across ten diverse architectures (Transformers, Switch Transformers, Performers, MLP-Mixers, and others), finding that the vanilla Transformer has the best scaling coefficient and that the best-performing architecture changes across compute regions.
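
A scaling coefficient of this kind is commonly estimated as the slope of a straight-line fit in log-log space. The sketch below assumes a power-law form loss = a · compute^b and uses synthetic points generated from that form. Neither the functional form nor the numbers are taken from the paper.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) + b * log(compute)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope: the scaling exponent
    a = math.exp(mean_y - b * mean_x)    # intercept, mapped back from log space
    return a, b


# Synthetic data from a known law, so the fit should recover a = 5.0, b = -0.05.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]
a, b = fit_power_law(compute, loss)
```

Fitting each architecture separately and comparing the slopes is one way to rank scaling behavior. Comparing the fitted curves at a fixed compute budget shows how the best architecture can differ between compute regions.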

The three quarks of attention: multiplexing (additive), output gating (multiplicative output), and synaptic gating (multiplicative weight)

The Quarks of Attention: Building Blocks of Attention

Baldi and Vershynin classify the fundamental building blocks of attention by source, target, and mechanism, identifying three: additive activation attention (multiplexing), output gating, and synaptic gating. They then prove capacity bounds showing that gating introduces quadratic terms sparsely, gaining expressiveness without the full cost of polynomial activations.

Diagram showing distributed representations with three pools of units (AGENT, RELATIONSHIP, PATIENT) connected via role/identity bindings

Distributed Representations: A Foundational Theory

Geoffrey Hinton’s 1984 technical report formally derives the efficiency of distributed representations (coarse coding) and demonstrates their automatic generalization, content-addressability, and robustness to damage.

Sphere packing illustration showing Shannon's geometric interpretation of channel capacity

Communication in the Presence of Noise: Shannon's 1949 Paper

Shannon’s foundational 1949 paper establishes the mathematical framework for modern information theory, defines channel capacity as the fundamental limit for reliable communication over noisy channels, and introduces the sampling theorem (Nyquist-Shannon) that underpins digital signal processing.
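
The two results mentioned above can be written down directly for a band-limited channel with additive white Gaussian noise: capacity C = W log2(1 + P/N) and a minimum sampling rate of 2W. The function names and the example numbers below are illustrative assumptions.

```python
import math


def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """Shannon capacity in bits per second: C = W * log2(1 + P / N)."""
    if bandwidth_hz <= 0 or noise_power <= 0 or signal_power < 0:
        raise ValueError("bandwidth and noise must be positive, signal non-negative")
    return bandwidth_hz * math.log2(1.0 + signal_power / noise_power)


def min_sampling_rate(bandwidth_hz):
    """Sampling theorem: a signal band-limited to W Hz needs at least 2W samples/s."""
    return 2.0 * bandwidth_hz


# A 3 kHz channel with a signal-to-noise power ratio of 1023 (about 30 dB).
capacity = channel_capacity(3000.0, 1023.0, 1.0)
rate = min_sampling_rate(3000.0)
```

Because capacity grows linearly with bandwidth but only logarithmically with signal power, doubling the bandwidth helps far more than doubling the power at high signal-to-noise ratios.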