Hunter Heidenreich | ML Research Scientist

Stream accuracy versus relative throughput for Mistral-7B and XGBoost models

LLMs for Insurance Document Automation

We explore LLM applications for page stream segmentation in insurance document processing. Parameter-efficient fine-tuning achieves strong accuracy, but significant calibration problems limit confidence in deployment.

Computational Chemistry

Diagram showing how Ring-Free Language decouples a molecular graph into skeleton, ring structures, and branch information

RFL: Simplifying Chemical Structure Recognition (AAAI 2025)

Proposes Ring-Free Language (RFL) to hierarchically decouple molecular graphs into skeletons, rings, and branches, solving issues with 1D serialization of complex 2D structures. Introduces the Molecular Skeleton Decoder (MSD) to progressively predict these components, achieving strong results on handwritten and printed chemical structure recognition benchmarks.

Document Processing

Diagram showing page stream segmentation workflow: an input stream of pages is processed through binary classification of page pairs to predict document breaks, producing segmented output documents

LLMs for Page Stream Segmentation

We create TabMe++, an enhanced page stream segmentation benchmark with commercial-grade OCR, and show that parameter-efficiently fine-tuned decoder-based LLMs like Mistral-7B achieve 80% straight-through processing rates, dramatically outperforming encoder-based models.
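
As a rough illustration of the workflow described here (a binary decision on each adjacent page pair, turned into output documents), the sketch below groups a stream once break predictions are available. The names `segment_stream` and `toy_is_break` are hypothetical, and the toy rule stands in for a real classifier such as a fine-tuned LLM; this is not the benchmark's code.

```python
from typing import Callable, List, Sequence


def segment_stream(
    pages: Sequence[str],
    is_break: Callable[[str, str], bool],
) -> List[List[str]]:
    """Split a page stream into documents using pairwise break predictions."""
    if not pages:
        return []
    documents = [[pages[0]]]  # the first page always opens a document
    for prev, curr in zip(pages, pages[1:]):
        if is_break(prev, curr):
            documents.append([curr])    # predicted break: start a new document
        else:
            documents[-1].append(curr)  # no break: extend the current document
    return documents


def toy_is_break(prev: str, curr: str) -> bool:
    # Hypothetical stand-in for a trained model: pages are labeled
    # "<doc>-<page>", and a break is any change in the "<doc>" prefix.
    return prev.split("-")[0] != curr.split("-")[0]


stream = ["A-1", "A-2", "B-1", "C-1", "C-2"]
docs = segment_stream(stream, toy_is_break)
```

Because every pair decision affects the final grouping, a single wrong prediction splits or merges whole documents, which is why stream-level metrics are stricter than page-pair accuracy.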

Evolutionary Biology

A reconstruction of LUCA within its evolutionary and ecological context

The Nature of LUCA and Its Impact on the Early Earth System

A comprehensive phylogenomic study dating LUCA to ~4.2 Ga and reconstructing it as a complex, anaerobic acetogen. The authors apply the cross-bracing molecular clock method alongside gene-tree-species-tree reconciliation to infer that LUCA possessed an early immune system and lived within a hydrogen-recycling ecosystem.

Computational Chemistry

Invalid SMILES Benefit Chemical Language Models: A Study

A 2024 Nature Machine Intelligence paper providing causal evidence that invalid SMILES generation improves chemical language model performance by filtering low-likelihood samples, while validity constraints (as in SELFIES) introduce structural biases that impair distribution learning.

Computational Chemistry

3D ball-and-stick model of butane molecule representing the structural isomer generation process

Synthetic Isomer Data Generation Pipeline

An end-to-end data factory for molecular machine learning that transforms raw chemical formulas (e.g., C6H14) into labeled 3D conformer datasets, using MAYGEN for structural isomer enumeration, RDKit for 3D embedding, and physics-based featurization to address data scarcity in computational drug discovery.

Generative Modeling

Variational Autoencoder architecture diagram showing encoder, latent space, and decoder

Modern PyTorch VAEs: A Detailed Implementation Guide

A complete guide to implementing modern Variational Autoencoders in PyTorch. Includes a copy-pasteable implementation, explanation of KL annealing to fix posterior collapse, and a deep dive into stable standard deviation parameterizations.
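
To make the KL annealing idea concrete, here is a minimal sketch of one common variant: a linear warm-up of the KL weight. The linear schedule, the function names, and the default `warmup_steps` are assumptions for illustration and may differ from what the guide implements.

```python
def kl_weight(step: int, warmup_steps: int = 1000, max_beta: float = 1.0) -> float:
    """Linearly ramp the KL weight from 0 to max_beta over warmup_steps."""
    if warmup_steps <= 0:
        return max_beta
    return max_beta * min(1.0, step / warmup_steps)


def annealed_vae_loss(
    recon_loss: float, kl_div: float, step: int, warmup_steps: int = 1000
) -> float:
    """Reconstruction term plus the KL term scaled by the current weight."""
    return recon_loss + kl_weight(step, warmup_steps) * kl_div
```

Early in training the KL term is nearly switched off, so the decoder learns to use the latent code before the prior starts pulling the posterior toward it.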

Natural Language Processing

Word vector illustration showing text classification and NLP concepts

Sarcasm Detection with Transformers: A Cautionary Tale

What happens when you achieve 99.8% accuracy on sarcasm detection? You might have accidentally built a domain classifier. A cautionary ML tale about dataset bias.

Computational Chemistry

3D ball-and-stick model of butane molecule showing linear carbon chain structure

Hearing Molecular Shape via Coulomb Matrix Eigenvalues

Can mathematical signatures capture molecular shape? We test whether Coulomb matrix eigenvalues can distinguish alkane constitutional isomers, from unsupervised clustering failures to supervised learning successes.

Computational Social Science

Top features for Armed Forces and National Security policy classification showing veterans, defense, military keywords

Classifying Congressional Bills with Machine Learning

We test three ML models on 48K congressional bills to see how well they can predict policy areas from bill text. Logistic regression achieves an 89% F1 score.

Computational Chemistry

Coulomb matrix heatmap visualization showing molecular structure encoding on logarithmic scale

Coulomb Matrices for Molecular Machine Learning

A practical introduction to Coulomb matrices: how they transform molecular 3D structures into ML features, complete with Python examples and an honest assessment of their limitations.
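
For readers who want the idea at a glance, the sketch below builds a Coulomb matrix under the standard definition: diagonal entries are 0.5 · Z^2.4 and off-diagonal entries are Z_i · Z_j divided by the interatomic distance. The function name and the H2 example geometry (1.4 bohr apart, in atomic units) are assumptions for illustration, not code from the post.

```python
import numpy as np


def coulomb_matrix(charges, coords):
    """Coulomb matrix from nuclear charges Z and 3D coordinates R.

    Coordinates are assumed to be in atomic units (bohr).
    """
    Z = np.asarray(charges, dtype=float)
    R = np.asarray(coords, dtype=float)
    # Pairwise distances |R_i - R_j| via broadcasting.
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid dividing by zero
    M = np.outer(Z, Z) / dist    # off-diagonal: Z_i * Z_j / |R_i - R_j|
    np.fill_diagonal(M, 0.5 * Z ** 2.4)  # diagonal: 0.5 * Z_i^2.4
    return M


# Hypothetical example: two hydrogen atoms 1.4 bohr apart.
h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])
```

The matrix depends on atom ordering, which is one of the limitations usually handled by sorting rows or by using the eigenvalue spectrum instead.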

Computational Social Science

Top features for Economics and Public Finance policy classification across Congresses

How Does Congress Actually Work? Data from 15K Bills

Only 2% of congressional bills become law. We analyze 15K bills from 2021-2023 to understand what drives legislative success and failure.
