Hunter Heidenreich | ML Research Scientist — Page 3

Natural Language Processing
Chart showing effective data as a function of epochs with exponential decay, with the 4-epoch safe zone and 16-epoch half-life marked

Scaling Data-Constrained Language Models

Muennighoff et al. train more than 400 models to study how data repetition affects scaling. They propose a data-constrained scaling law in which the value of repeated tokens decays exponentially. Repeating data for up to 4 epochs has a negligible impact on loss, returns diminish sharply around 16 epochs, and augmenting with code provides a 2x boost in effective data.
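The exponential-decay idea can be sketched in a few lines. This is a minimal illustration, assuming the form in which effective data equals the unique tokens plus a saturating term for repeats. The function name `effective_data` and the decay constant `r_star = 15.0` are illustrative assumptions, not the fitted values from the paper.

```python
import math


def effective_data(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Unique-token equivalent of training for `epochs` passes over the data.

    Assumed form: D' = U + U * r_star * (1 - exp(-R / r_star)),
    where R = epochs - 1 is the number of repetitions and r_star is a
    decay constant (hypothetical value here, not a fitted one).
    """
    if epochs < 1:
        raise ValueError("epochs must be at least 1")
    repeats = epochs - 1.0
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


if __name__ == "__main__":
    unique = 100e9  # hypothetical budget of 100B unique tokens
    for epochs in (1, 4, 16, 64):
        ratio = effective_data(unique, epochs) / (unique * epochs)
        print(f"{epochs:>3} epochs: repeated tokens worth {ratio:.2f}x fresh tokens")
```

Under these assumptions the ratio stays close to 1 for the first few epochs and falls off as repetition grows, with total effective data saturating at `unique_tokens * (1 + r_star)`.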

Natural Language Processing
Bar chart comparing average benchmark accuracy across seven domain combination configurations showing diversity improves performance

SlimPajama-DC: Data Combinations for LLM Training

Shen et al. empirically analyze how different domain combinations and deduplication strategies in the SlimPajama dataset affect 1.3B model performance. Global deduplication across sources outperforms local deduplication, and increasing domain diversity consistently improves average accuracy, with findings transferring to 7B scale.

Natural Language Processing
Table comparing multi-task mixing strategies showing examples-proportional and temperature-scaled mixing results

T5: Exploring Transfer Learning Limits

Raffel et al. introduce T5, a unified text-to-text framework for NLP transfer learning. Through systematic ablation of architectures, pre-training objectives, datasets, and multi-task mixing strategies, they identify best practices and scale to 11B parameters, achieving state-of-the-art results across multiple benchmarks.
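The two mixing strategies named in the table can be sketched as follows. This is a minimal illustration: examples-proportional mixing samples each task in proportion to its dataset size clipped at an artificial limit, and temperature-scaled mixing raises those rates to the power 1/T and renormalizes. The task names, sizes, and limit are hypothetical.

```python
def examples_proportional(sizes: dict, limit: int) -> dict:
    """Mixing rate per task: min(size, limit), normalized to sum to 1."""
    clipped = {name: min(n, limit) for name, n in sizes.items()}
    total = sum(clipped.values())
    return {name: c / total for name, c in clipped.items()}


def temperature_scaled(sizes: dict, limit: int, temperature: float) -> dict:
    """Raise the proportional rates to 1/T and renormalize.

    T = 1 recovers examples-proportional mixing; larger T moves the
    rates toward equal mixing across tasks.
    """
    base = examples_proportional(sizes, limit)
    scaled = {name: r ** (1.0 / temperature) for name, r in base.items()}
    total = sum(scaled.values())
    return {name: s / total for name, s in scaled.items()}


if __name__ == "__main__":
    # Hypothetical task sizes, for illustration only.
    sizes = {"big_task": 1_000_000, "small_task": 1_000}
    print(examples_proportional(sizes, limit=100_000))
    print(temperature_scaled(sizes, limit=100_000, temperature=2.0))
```

Raising the temperature gives low-resource tasks a larger share of each batch, at the cost of seeing their examples more often.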

Natural Language Processing
Diagram showing block-recurrent transformer architecture with vertical and horizontal processing directions

Block-Recurrent Transformers for Long Sequences

A transformer architecture that applies a recurrent cell over blocks of tokens, achieving linear complexity in sequence length while outperforming Transformer-XL baselines on PG19, arXiv, and GitHub datasets.

Molecular Simulation
Diagram showing the Ewald decomposition of long-range interactions into short-range and Fourier-space components for molecular graph neural networks

Ewald Message Passing for Molecular Graphs

Proposes Ewald message passing, a Fourier-space scheme inspired by Ewald summation that captures long-range interactions in molecular graphs. The method is architecture-agnostic and reduces energy MAE by an average of 10% on OC20 and 16% on OE62 across four baseline GNN models.

Machine Learning
Diagram showing the Lagrangian Neural Network pipeline from coordinates through a learned Lagrangian to energy-conserving dynamics

Lagrangian Neural Networks for Physics

Lagrangian Neural Networks (LNNs) use neural networks to parameterize arbitrary Lagrangians, enabling energy-conserving learned dynamics without canonical coordinates. Unlike Hamiltonian approaches, LNNs handle relativistic systems and extend to graphs via Lagrangian Graph Networks.

Machine Learning
Visualization of Liquid-S4 kernel decomposition showing input signal, S4 kernel, liquid kernel, and combined output

Liquid-S4: Input-Dependent State-Space Models

Liquid-S4 extends the S4 framework by incorporating a linearized liquid time-constant formulation that introduces input-dependent state transitions. This yields an additional convolutional kernel capturing input correlations, improving generalization across long-range sequence tasks.

Natural Language Processing
Diagram comparing RWKV inference complexity against Transformers and efficient variants

RWKV: Linear-Cost RNN with Transformer Training

RWKV is a sequence model that trains in parallel like a transformer and runs as an RNN at inference, giving linear time in sequence length and constant memory per generated token. Scaled up to 14 billion parameters, it performs on par with similarly sized transformers.
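The constant-memory property comes from rewriting a decayed weighted average over past tokens as a running numerator and denominator. The sketch below is a simplified single-channel version of that idea, assuming a decay `w` and a current-token bonus `u`. It omits the gating, the token-shift mixing, and the numerical stabilization a real implementation needs, and the function names are hypothetical.

```python
import math


def wkv_direct(k, v, w, u):
    """Reference version: re-sums over all past tokens at each step (quadratic)."""
    out = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


def wkv_recurrent(k, v, w, u):
    """Same outputs using two running scalars, so memory does not grow with t."""
    a = 0.0  # running decayed sum of exp(k_i) * v_i over past tokens
    b = 0.0  # running decayed sum of exp(k_i) over past tokens
    decay = math.exp(-w)
    out = []
    for kt, vt in zip(k, v):
        bonus = math.exp(u + kt)
        out.append((a + bonus * vt) / (b + bonus))
        a = decay * a + math.exp(kt) * vt
        b = decay * b + math.exp(kt)
    return out


if __name__ == "__main__":
    keys = [0.1, -0.3, 0.5, 0.0]
    values = [1.0, 2.0, -1.0, 0.5]
    print(wkv_direct(keys, values, w=0.5, u=0.2))
    print(wkv_recurrent(keys, values, w=0.5, u=0.2))
```

Both functions compute the same sequence; only the recurrent form keeps a fixed-size state, which is what makes per-token inference cost independent of context length.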

Molecular Representations
Caffeine molecular structure with its InChIKey identifier

InChI: The International Chemical Identifier

InChI (International Chemical Identifier) is an open standard from IUPAC that represents molecular structures as hierarchical, layered strings optimized for database interoperability, unique identification, and web search via its hashed InChIKey.

Optical Chemical Structure Recognition
Dual-encoder architecture diagram for MarkushGrapher-2 showing vision and VTL encoding pipelines

MarkushGrapher-2: End-to-End Markush Recognition

An 831M-parameter encoder-decoder model that jointly encodes image, OCR text, and layout information through a two-stage training strategy, achieving state-of-the-art multimodal Markush structure recognition while remaining competitive on standard molecular structure recognition.

Molecular Representations
Overview of six categories of materials representations for machine learning

Materials Representations for ML Review

A comprehensive review of how solid-state materials can be numerically represented for machine learning, spanning structural features, graph neural networks, compositional descriptors, transfer learning, and generative models for inverse design.

Machine Learning
Diagram showing NaViT packing variable-resolution image patches into a single sequence

NaViT: Native Resolution Vision Transformer

NaViT applies sequence packing (Patch n’ Pack) to Vision Transformers, enabling training on images of arbitrary resolution and aspect ratio while improving training efficiency by up to 4x over standard ViT.
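The packing step can be illustrated with a toy example. This is a minimal sketch, assuming a first-fit greedy heuristic and square 16-pixel patches; the helper names `num_patches` and `pack_first_fit` are hypothetical, and a real pipeline would also need attention masking so that images packed together do not attend to one another.

```python
def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Number of whole patches an image yields at its native resolution."""
    return (height // patch) * (width // patch)


def pack_first_fit(lengths, max_len):
    """Assign each example to the first packed sequence with enough room.

    Returns a list of packed sequences, each a list of example indices.
    """
    packs = []  # example indices per packed sequence
    free = []   # remaining token budget per packed sequence
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, more than max_len")
        for p, room in enumerate(free):
            if n <= room:
                packs[p].append(idx)
                free[p] -= n
                break
        else:
            # No existing sequence has room, so start a new one.
            packs.append([idx])
            free.append(max_len - n)
    return packs


if __name__ == "__main__":
    # Three hypothetical images with different resolutions and aspect ratios.
    shapes = [(224, 224), (128, 384), (64, 64)]
    lengths = [num_patches(h, w) for h, w in shapes]
    print(lengths)                               # [196, 192, 16]
    print(pack_first_fit(lengths, max_len=256))  # [[0, 2], [1]]
```

Here the small third image fills the leftover space after the first one, so three images occupy two sequences instead of three padded ones.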