Hi, I’m Hunter.

I’m an ML Research Scientist at Roots.ai. I train large language and vision models at production scale on DGX H100 clusters, and my research roots are in scientific computing for molecular dynamics at Harvard. I publish at venues like COLING, EMNLP, and AIES, and I build open-source tools and datasets. I’m exploring how foundation model training transfers to scientific domains: computational chemistry, materials science, and molecular generation.

Document Processing
GutenOCR Mascot

GutenOCR: A Grounded Vision-Language Front-End for Documents

GutenOCR is a family of vision-language models designed to serve as a ‘grounded OCR front-end’, providing high-quality text transcription and explicit geometric grounding.

Time Series Forecasting
Forecasting comparison of different neural architectures on the Multiscale Lorenz-96 system

Optimizing Sequence Models for Dynamical Systems

We systematically ablate core mechanisms of Transformers and RNNs, finding that attention-augmented Recurrent Highway Networks outperform standard Transformers on forecasting high-dimensional chaotic systems.

Computational Biology
Graph-grammar expansion of a carbon fixation reaction network with ILP flow queries selecting short autocatalytic cycles

Graph Grammar and ILP for Carbon Fixation Pathways

A graph-grammar cheminformatics workflow expands the carbon fixation reaction network, then uses integer linear programming flow queries to surface short autocatalytic pathways producing Acetyl-CoA and Malate with efficiencies approaching the CETCH cycle.

Molecular Simulation
Schematic of polyalanine 1-mer functional groups interacting with water through CCSD(T)-fit 2-body PIPs.

MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs

Building on the gas-phase MB-nrg potential energy function (PEF) for polyalanine, Ruihan Zhou and Francesco Paesani add machine-learned 2-body terms for each backbone functional group interacting with water, fit to BSSE-corrected DLPNO-CCSD(T)/aug-cc-pVTZ data. They validate the resulting potential against alanine dipeptide-water dimer scans, free-energy surfaces in explicit MB-pol water, and hydration radial distribution functions.

Molecular Simulation
Schematic of polyalanine decomposed into overlapping n-mer building blocks fit to CCSD(T) energies.

MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine

Ruihan Zhou and co-authors extend the MB-nrg many-body formalism to covalently bonded biomolecules. They fragment polyalanine into functional-group n-mers and fit permutationally invariant polynomials to DLPNO-CCSD(T)/aug-cc-pVTZ reference energies. The resulting potential reproduces alanine dipeptide Ramachandran surfaces, harmonic frequencies, and AceAla9Nme secondary-structure dynamics more faithfully than Amber ff14SB and ff19SB.

Predictive Chemistry
Three panels comparing sampling strategies in a multi-modal fitness landscape: exhaustive enumeration, genetic algorithm clustering around few peaks, and ACSESS covering all peaks with fewer evaluations

ACSESS: Diverse Optimal Molecules in the SMU

Property-optimizing ACSESS combines diversity-biased sampling with iterative fitness thresholding to discover diverse sets of molecules with favorable properties in the small molecule universe (SMU). Tested on GDB-9 (dipole moment optimization) and NKp fitness landscapes, it outperforms standard genetic algorithms in diversity while matching or exceeding their fitness, using only ~30,000 evaluations to navigate a 300,000-molecule space.
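
The two ingredients named above, a rising fitness cutoff and a diversity-biased selection step, can be illustrated on a toy problem. The sketch below is not the published ACSESS implementation: it swaps molecules for points on a line, and the landscape `fitness`, the greedy maximin selection, the mutation scale, and the linear threshold schedule are all assumptions chosen for illustration.

```python
import math
import random


def fitness(x):
    # Hypothetical multi-modal landscape on [0, 10] (a stand-in for a property).
    return math.sin(3.0 * x) ** 2 * math.exp(-0.02 * (x - 5.0) ** 2)


def maximin_subset(points, k):
    """Greedily pick up to k points that are far apart from each other."""
    chosen = [points[0]]
    while len(chosen) < min(k, len(points)):
        # Take the point whose nearest already-chosen neighbor is farthest away.
        best = max(points, key=lambda p: min(abs(p - c) for c in chosen))
        chosen.append(best)
    return chosen


def acsess_like(n_iter=20, library_size=12, n_children=200, seed=0):
    rng = random.Random(seed)
    library = [rng.uniform(0.0, 10.0) for _ in range(library_size)]
    threshold = 0.0
    for step in range(n_iter):
        # Propose candidates by perturbing current library members.
        children = [
            min(10.0, max(0.0, rng.choice(library) + rng.gauss(0.0, 0.5)))
            for _ in range(n_children)
        ]
        pool = sorted(set(library + children))
        # Iterative fitness thresholding: raise the cutoff a little each round.
        threshold = 0.8 * (step + 1) / n_iter
        survivors = [x for x in pool if fitness(x) >= threshold]
        if len(survivors) < library_size:
            # Too few candidates pass the cutoff: fall back to the fittest ones.
            survivors = sorted(pool, key=fitness, reverse=True)[:library_size]
        # Diversity-biased selection among candidates that passed the cutoff.
        library = maximin_subset(survivors, library_size)
    return library, threshold


library, threshold = acsess_like()
for x in sorted(library):
    print(f"x = {x:5.2f}  fitness = {fitness(x):.3f}")
```

The design point is the order of operations: fitness only acts as a filter, and the final pick among survivors is made on spread alone, which is what keeps the library from collapsing onto one peak.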

Predictive Chemistry
Diagram showing AllChem's combinatorial synthon assembly pipeline: 7,000 building blocks transformed by 100 reactions into 5 million synthons, which combine in A-B-C topology to represent 10^20 structures

AllChem: Generating and Searching 10^20 Structures

AllChem generates ~5 million synthons by recursively applying ~100 reactions to ~7,000 building blocks, combinatorially representing up to 10^20 complete structures with an A-B-C topology. Topomer shape similarity enables efficient searching of this space, and every hit comes with a proposed synthetic route.
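
A quick back-of-the-envelope check of how ~5 million synthons can reach the 10^20 scale. This assumes, purely for illustration, that any synthon may fill any of the three A-B-C slots; real reaction compatibility rules would restrict the pairings, which is consistent with the "up to" in the summary.

```python
# Rough upper bound for an A-B-C topology.
# Assumption: every synthon is allowed in every slot (ignores compatibility).
n_synthons = 5_000_000
n_slots = 3

upper_bound = n_synthons ** n_slots
print(f"{upper_bound:.2e}")  # prints 1.25e+20
```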

Molecular Simulation
Pipeline showing atoms converted to smooth density, symmetrized via Haar integration, and projected to invariant features

Atom-Density Representations for Machine Learning

Introduces a Dirac notation formalism for atomic environments that unifies SOAP power spectra, Behler-Parrinello symmetry functions, and other density-based structural representations under a single theoretical framework.

Predictive Chemistry
CHX8 enumeration pipeline from 77,524 structures to 31,497 stable molecules, example strained scaffolds with RSE values, and box plots of relative strain energy distribution by heavy atom count

CHX8: Complete Eight-Carbon Hydrocarbon Space

CHX8 exhaustively enumerates all mathematically feasible hydrocarbons with up to eight carbon atoms (77,524 structures), then DFT-optimizes them to identify 31,497 stable molecules. A universal relative strain energy (RSE) metric referenced to cyclohexane serves as a synthesizability proxy. CHX8 covers 16x more C8 hydrocarbons than GDB-13 and reveals that over 90% of novel structures should be synthetically accessible.

Molecular Simulation
Diagram showing conformation autoencoder architecture with internal coordinate encoding and decoding

Conformation Autoencoder for 3D Molecules

A conformation autoencoder converts molecular 3D arrangements into fixed-size latent representations using internal coordinates and graph neural networks, enabling conformer generation and spatial property optimization.

Machine Learning
Three-panel diagram showing symmetry group decomposition, equivariant mapping from world states to representations, and block-diagonal disentangled decomposition

Defining Disentangled Representations via Group Theory

Proposes the first principled mathematical definition of disentangled representations by connecting symmetry group decompositions to independent subspaces in a representation’s vector space.

Machine Learning
Three-panel diagram showing DGCNN point cloud processing: input space k-NN graph, EdgeConv operation, and semantic feature space clustering

DGCNN: Dynamic Graph CNN for Point Cloud Learning

DGCNN introduces the EdgeConv operator, which constructs k-nearest neighbor graphs dynamically in feature space at each network layer. This enables the model to capture both local geometry and long-range semantic relationships for point cloud classification and segmentation.
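
A minimal pure-Python sketch of the EdgeConv idea: for each point, find its k nearest neighbors in the current feature space, apply an edge function to (x_i, x_j - x_i), and max-aggregate over the neighbors. The function names are hypothetical, and `toy_edge_fn` is a fixed ReLU standing in for the learned MLP, so this illustrates the data flow only and is not the DGCNN reference implementation.

```python
import math


def knn_indices(features, k):
    """For each point, return the indices of its k nearest neighbors (not itself)."""
    neighbors = []
    for i, fi in enumerate(features):
        dists = sorted(
            (math.dist(fi, fj), j) for j, fj in enumerate(features) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def toy_edge_fn(xi, diff):
    # Stand-in for a learned MLP: ReLU over the concatenation [x_i, x_j - x_i].
    return [max(0.0, v) for v in xi + diff]


def edge_conv(features, k, edge_fn):
    """One EdgeConv-style layer with channel-wise max aggregation."""
    out = []
    for i, nbrs in enumerate(knn_indices(features, k)):
        xi = features[i]
        edge_feats = [
            edge_fn(xi, [b - a for a, b in zip(xi, features[j])]) for j in nbrs
        ]
        out.append([max(channel) for channel in zip(*edge_feats)])
    return out


points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
h1 = edge_conv(points, k=2, edge_fn=toy_edge_fn)
# The "dynamic" part: the next layer rebuilds the k-NN graph from h1,
# so neighborhoods are defined in feature space, not input space.
h2 = edge_conv(h1, k=2, edge_fn=toy_edge_fn)
print(h1[0])  # prints [0.0, 0.0, 1.0, 1.0]
```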