Hi, I’m Hunter.

I’m an ML Research Scientist at Roots.ai. I train large language and vision models at production scale on DGX H100 clusters, and my research roots are in scientific computing for molecular dynamics at Harvard. I publish at venues like COLING, EMNLP, and AIES, and I build open-source tools and datasets. I’m exploring how foundation model training transfers to scientific domains: computational chemistry, materials science, and molecular generation.
Document Processing
GutenOCR Mascot

GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

Time Series Forecasting
Forecasting comparison of different neural architectures on the Multiscale Lorenz-96 system

Optimizing Sequence Models for Dynamical Systems

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

Computational Chemistry
Diagram showing the Ewald decomposition of long-range interactions into short-range and Fourier-space components for molecular graph neural networks

Ewald Message Passing for Molecular Graphs

Ewald message passing is a Fourier-space scheme, inspired by Ewald summation, that captures long-range interactions in molecular graphs. The method is architecture-agnostic and improves energy MAEs by 10% on OC20 and 16% on OE62 across four baseline GNN models.

Machine Learning
Diagram comparing RWKV inference complexity against Transformers and efficient variants

RWKV: Linear-Cost RNN with Transformer Training

RWKV is a novel sequence model that achieves transformer-level performance while maintaining linear time and constant memory complexity during inference, scaled up to 14 billion parameters.
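
To make the "linear time and constant memory" claim concrete, here is a minimal single-channel sketch of a WKV-style weighted average computed two ways: as a running recurrence that keeps only two scalars of state, and as a direct sum over all past tokens. The function names, the scalar decay `w`, and the bonus `u` are illustrative assumptions; this is a simplified, numerically naive toy, not the released RWKV implementation, which adds learned projections, per-channel parameters, and a stabilized exponent.

```python
import math
from typing import List


def wkv_recurrent(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Decayed weighted average of values using O(1) state per step."""
    num, den = 0.0, 0.0  # running numerator and denominator over past tokens
    out = []
    for kt, vt in zip(k, v):
        cur = math.exp(u + kt)  # the current token gets a separate bonus u
        out.append((num + cur * vt) / (den + cur))
        # Decay the past by exp(-w), then fold the current token into the state.
        num = math.exp(-w) * num + math.exp(kt) * vt
        den = math.exp(-w) * den + math.exp(kt)
    return out


def wkv_direct(k: List[float], v: List[float], w: float, u: float) -> List[float]:
    """Same quantity computed by summing over all past tokens (quadratic cost)."""
    out = []
    for t in range(len(k)):
        num = math.exp(u + k[t]) * v[t]
        den = math.exp(u + k[t])
        for i in range(t):
            weight = math.exp(-(t - 1 - i) * w + k[i])
            num += weight * v[i]
            den += weight
        out.append(num / den)
    return out


keys = [0.1, -0.3, 0.5, 0.0]
values = [1.0, 2.0, -1.0, 0.5]
recurrent = wkv_recurrent(keys, values, w=0.5, u=0.2)
direct = wkv_direct(keys, values, w=0.5, u=0.2)
```

Both functions return the same sequence up to floating-point error. The recurrent form touches each token once and stores only `num` and `den`, which is the property that allows RNN-style inference.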

Computational Chemistry
Caffeine molecular structure with its InChIKey identifier

InChI: The International Chemical Identifier

InChI (International Chemical Identifier) is an open standard from IUPAC that represents molecular structures as hierarchical, layered strings optimized for database interoperability, unique identification, and web search via its hashed InChIKey.

Computational Chemistry
Dual-encoder architecture diagram for MarkushGrapher-2 showing vision and VTL encoding pipelines

MarkushGrapher-2: End-to-End Markush Recognition

An 831M-parameter encoder-decoder model that jointly encodes image, OCR text, and layout information through a two-stage training strategy, achieving state-of-the-art multimodal Markush structure recognition while remaining competitive on standard molecular structure recognition.

Computational Chemistry
Overview of six categories of materials representations for machine learning

Materials Representations for ML Review

A comprehensive review of how solid-state materials can be numerically represented for machine learning, spanning structural features, graph neural networks, compositional descriptors, transfer learning, and generative models for inverse design.

Machine Learning
Diagram showing NaViT packing variable-resolution image patches into a single sequence

NaViT: Native Resolution Vision Transformer

NaViT applies sequence packing (Patch n’ Pack) to Vision Transformers, enabling training on images of arbitrary resolution and aspect ratio while improving training efficiency by up to 4x over standard ViT.
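
The packing idea can be sketched as a bin-packing problem: each image yields a different number of patch tokens, and several images share one fixed-length sequence. The snippet below is a minimal greedy first-fit sketch; the helper names, the 16-pixel patch size, and the 256-token sequence length are assumptions chosen for illustration, and the attention masking and token dropping that a real implementation needs are omitted.

```python
from typing import List


def num_patches(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image (hypothetical helper)."""
    return (height // patch) * (width // patch)


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit: place each example in the first sequence with enough room."""
    bins: List[List[int]] = []  # example indices assigned to each packed sequence
    free: List[int] = []        # remaining token budget of each packed sequence
    for idx, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {idx} has {n} tokens, more than max_len={max_len}")
        for b in range(len(bins)):
            if free[b] >= n:
                bins[b].append(idx)
                free[b] -= n
                break
        else:
            # No existing sequence has room, so open a new one.
            bins.append([idx])
            free.append(max_len - n)
    return bins


# Four images with different resolutions and aspect ratios.
sizes = [(224, 224), (128, 384), (64, 64), (256, 128)]
lengths = [num_patches(h, w) for h, w in sizes]  # [196, 192, 16, 128]
packed = pack_sequences(lengths, max_len=256)    # [[0, 2], [1], [3]]
```

Here the small 64x64 image fills space left over by the 224x224 image instead of occupying a padded sequence of its own, which is where the efficiency gain comes from.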

Computational Chemistry
BioT5 architecture showing SELFIES molecules, amino acid proteins, and scientific text feeding into a T5 encoder-decoder

BioT5: Cross-Modal Integration of Biology and Chemistry

BioT5 uses SELFIES representations and separate tokenization to pre-train a unified T5 model across molecules, proteins, and text, achieving state-of-the-art results on 10 of 15 downstream tasks.

Computational Chemistry
ChatDrug pipeline from prompt design through ChatGPT to domain feedback and edited molecule output

ChatDrug: Conversational Drug Editing with ChatGPT

ChatDrug is a parameter-free framework that combines ChatGPT with retrieval-augmented domain feedback and iterative conversation to edit drugs across small molecules, peptides, and proteins.

Computational Chemistry
ChemCrow architecture with GPT-4 central planner connected to 18 chemistry tools via ReAct reasoning

ChemCrow: Augmenting LLMs with 18 Chemistry Tools

ChemCrow augments GPT-4 with 18 chemistry tools to autonomously plan and execute syntheses, discover novel chromophores, and solve diverse chemical reasoning tasks.

Computational Chemistry
ChemGE pipeline from integer chromosome through CFG grammar rules to valid SMILES output

ChemGE: Molecule Generation via Grammatical Evolution

ChemGE uses grammatical evolution over SMILES context-free grammars to generate diverse drug-like molecules in parallel, outperforming deep learning baselines in throughput and molecular diversity.
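
The chromosome-to-SMILES mapping in grammatical evolution can be sketched in a few lines: each integer (codon) selects, modulo the number of available rules, which production to apply to the leftmost nonterminal. The grammar below is a hypothetical toy covering only C, N, and O atoms with simple branches; it is not the grammar used by ChemGE, and the `decode` function and `max_steps` cutoff are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy context-free grammar (NOT ChemGE's actual SMILES grammar).
GRAMMAR: Dict[str, List[List[str]]] = {
    "<smiles>": [
        ["<atom>"],
        ["<atom>", "<smiles>"],
        ["<atom>", "(", "<smiles>", ")", "<smiles>"],
    ],
    "<atom>": [["C"], ["N"], ["O"]],
}


def decode(chromosome: List[int], start: str = "<smiles>", max_steps: int = 100) -> Optional[str]:
    """Map an integer chromosome to a string by expanding the leftmost nonterminal."""
    if not chromosome:
        raise ValueError("chromosome must contain at least one codon")
    symbols = [start]
    out: List[str] = []
    steps = 0  # number of rule applications so far
    while symbols:
        sym = symbols.pop(0)
        if sym not in GRAMMAR:
            out.append(sym)  # terminal symbol: emit it
            continue
        if steps >= max_steps:
            return None  # derivation did not terminate within the budget
        rules = GRAMMAR[sym]
        codon = chromosome[steps % len(chromosome)]  # wrap around the chromosome
        symbols = list(rules[codon % len(rules)]) + symbols
        steps += 1
    return "".join(out)


print(decode([1, 0, 0, 2]))        # CO
print(decode([2, 1, 0, 0, 0, 2]))  # N(C)O
```

Because every chromosome is decoded independently, a population of chromosomes can be mapped to molecules and scored in parallel, and mutation reduces to changing integers in a list.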