Hunter Heidenreich | ML Research Scientist — Page 31

Molecular Simulation
Schematic showing atom-surface interaction using the method of images

Lennard-Jones on Adsorption and Diffusion on Surfaces

Lennard-Jones’s 1932 theoretical paper applies quantum mechanical potential energy surfaces to gas-solid interactions, providing the first unified framework that explains physisorption and chemisorption as different regions of the same energy landscape.

Computational Chemistry
GDB-13 molecule structure showing CCCC(O)(CO)CC1CC1CN

GDB-13: Chemical Universe Database (970M Molecules)

GDB-13 contains nearly 1 billion systematically generated small organic molecules with up to 13 atoms, achieving billion-scale chemical space exploration while maintaining drug-like properties.

2025-08-16 · Hunter Heidenreich
Computational Chemistry
GDB-17 molecule structure showing complex polycyclic architecture

GDB-17: Chemical Universe Database (166.4B Molecules)

GDB-17 contains 166.4 billion systematically generated small organic molecules with up to 17 atoms. It represents the most comprehensive exploration of drug-relevant chemical space achieved through computational enumeration.

2025-08-16 · Hunter Heidenreich
Natural Language Processing
Huffman Tree visualization for the input 'beep boop beer!' showing internal nodes with frequency counts and leaf nodes with characters

High-Performance Word2Vec in Pure PyTorch

A ground-up PyTorch implementation of Word2Vec that treats it as a systems engineering challenge. A “tensorized tree” architecture converts pointer-chasing Hierarchical Softmax into dense GPU operations, alongside infinite streaming datasets with Zipfian subsampling and torch.compile compatibility for production-grade efficiency.
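To make the “tensorized tree” idea concrete, here is a minimal sketch in plain Python rather than PyTorch. It builds a Huffman tree over the characters of 'beep boop beer!' and flattens each root-to-leaf path into fixed-width rows of node ids, branch bits, and a padding mask, which is the kind of dense layout a GPU could index in one batched gather. The function names `huffman_paths` and `pad_paths` and the padding scheme are assumptions for illustration, not the post's actual code.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_paths(text):
    """Build a Huffman tree over the symbols in `text`.

    Returns (paths, freqs) where paths[symbol] = (inner_node_ids, branch_bits)
    lists the internal nodes visited from the root and the 0/1 turn at each.
    """
    freqs = Counter(text)
    tie = count()  # tie-breaker so heapq never has to compare node payloads
    heap = [(f, next(tie), ("leaf", s)) for s, f in freqs.items()]
    heapq.heapify(heap)
    next_id = 0
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), ("inner", next_id, left, right)))
        next_id += 1
    root = heap[0][2]

    paths = {}
    stack = [(root, [], [])]
    while stack:
        node, ids, bits = stack.pop()
        if node[0] == "leaf":
            paths[node[1]] = (ids, bits)
        else:
            _, nid, left, right = node
            stack.append((left, ids + [nid], bits + [0]))
            stack.append((right, ids + [nid], bits + [1]))
    return paths, freqs


def pad_paths(paths):
    """Flatten variable-length paths into equal-width rows plus a mask."""
    symbols = sorted(paths)
    width = max(len(paths[s][0]) for s in symbols)
    node_ids, bits, mask = [], [], []
    for s in symbols:
        ids, bs = paths[s]
        pad = width - len(ids)
        node_ids.append(ids + [0] * pad)      # padded slots point at node 0
        bits.append(bs + [0] * pad)
        mask.append([1] * len(ids) + [0] * pad)  # 0 marks padding to ignore
    return symbols, node_ids, bits, mask


paths, freqs = huffman_paths("beep boop beer!")
symbols, node_ids, bits, mask = pad_paths(paths)
```

In a hierarchical softmax, each row of `node_ids` would select the internal-node vectors to score against, `bits` would supply the binary targets, and `mask` would zero out the padded positions in the loss.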

Computational Chemistry
3D conformer ensemble of a drug-like molecule from the GEOM dataset

GEOM Dataset: 3D Molecular Conformer Generation

Get a practical overview of the GEOM dataset and learn how it’s advancing 3D molecular machine learning by bridging static graphs and dynamic reality.

Optical Chemical Structure Recognition
Markush structure diagram

SubGrapher: Visual Fingerprinting of Chemical Structures

SubGrapher introduces a visual fingerprinting approach to Optical Chemical Structure Recognition that detects functional groups directly from images, enabling chemical database searches without full structure reconstruction and handling complex patent images including Markush structures.

Machine Learning
Comparison of standard 3D CNN versus 3D Steerable CNN for handling rotational symmetry

3D Steerable CNNs: Rotationally Equivariant Features

Weiler et al.’s NeurIPS 2018 paper introduces 3D Steerable CNNs, which achieve SE(3) equivariance through group representation theory and spherical harmonic convolution kernels. This eliminates the need for rotational data augmentation and improves data efficiency for scientific applications with rotational symmetry, such as molecular and protein structures.

Document Processing
Stream accuracy versus relative throughput for Mistral-7B and XGBoost models

LLMs for Insurance Document Automation

We explore LLM applications for page stream segmentation in insurance document processing, demonstrating that parameter-efficient fine-tuning achieves strong accuracy but revealing significant calibration challenges that limit deployment confidence.

Optical Chemical Structure Recognition
Diagram showing how Ring-Free Language decouples a molecular graph into skeleton, ring structures, and branch information

RFL: Simplifying Chemical Structure Recognition (AAAI 2025)

Proposes Ring-Free Language (RFL) to hierarchically decouple molecular graphs into skeletons, rings, and branches, solving issues with 1D serialization of complex 2D structures. Introduces the Molecular Skeleton Decoder (MSD) to progressively predict these components, achieving strong results on handwritten and printed chemical structure recognition benchmarks.

Document Processing
Diagram showing page stream segmentation workflow: an input stream of pages is processed through binary classification of page pairs to predict document breaks, producing segmented output documents

LLMs for Page Stream Segmentation

We create TabMe++, an enhanced page stream segmentation benchmark with commercial-grade OCR, and show that parameter-efficiently fine-tuned decoder-based LLMs like Mistral-7B achieve 80% straight-through processing rates, dramatically outperforming encoder-based models.
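As a rough illustration of the page-pair formulation, the sketch below turns a list of per-boundary break predictions into segmented documents and scores a batch of streams. It assumes that a straight-through stream is one whose predicted breaks all match the reference, so no human correction is needed. The helper names `segment_stream` and `straight_through_rate` are hypothetical and not taken from the post.

```python
def segment_stream(pages, is_break):
    """Split a page stream into documents.

    is_break[i] is True when a new document starts between
    pages[i] and pages[i + 1], so it has len(pages) - 1 entries.
    """
    if len(is_break) != max(len(pages) - 1, 0):
        raise ValueError("need exactly one break decision per adjacent page pair")
    if not pages:
        return []
    docs = [[pages[0]]]
    for page, brk in zip(pages[1:], is_break):
        if brk:
            docs.append([page])    # start a new document
        else:
            docs[-1].append(page)  # continue the current document
    return docs


def straight_through_rate(predicted, gold):
    """Fraction of streams whose predicted breaks all match the reference."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must cover the same streams")
    if not gold:
        return 0.0
    perfect = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return perfect / len(gold)


docs = segment_stream(["p1", "p2", "p3", "p4"], [False, True, False])
rate = straight_through_rate([[True, False], [False]], [[True, False], [True]])
```

Under this definition a single wrong boundary disqualifies the whole stream, which is why stream-level rates are much harder to move than per-pair accuracy.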

Molecular Generation
3D ball-and-stick model of butane molecule representing the structural isomer generation process

Synthetic Isomer Data Generation Pipeline

An end-to-end data factory for molecular machine learning that transforms raw chemical formulas (e.g., C6H14) into labeled 3D conformer datasets, using MAYGEN for structural isomer enumeration, RDKit for 3D embedding, and physics-based featurization to address data scarcity in computational drug discovery.

Generative Modeling
Variational Autoencoder architecture diagram showing encoder, latent space, and decoder

Modern PyTorch VAEs: A Detailed Implementation Guide

A complete guide to implementing modern Variational Autoencoders in PyTorch. Includes a copy-pasteable implementation, explanation of KL annealing to fix posterior collapse, and a deep dive into stable standard deviation parameterizations.
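The two ideas named in this summary can be sketched in a few lines of dependency-free Python. The sketch assumes a linear warmup for the KL weight and a softplus-based standard deviation with a small floor, which are common choices but not necessarily the ones the guide uses. The names `kl_weight`, `softplus_std`, and `gaussian_kl`, as well as the `eps` value, are illustrative.

```python
import math


def kl_weight(step, warmup_steps):
    """Linear KL annealing: ramp the KL weight from 0 to 1 over warmup_steps."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def softplus_std(raw, eps=1e-6):
    """Map an unconstrained encoder output to a strictly positive std.

    Uses the overflow-safe form max(x, 0) + log1p(exp(-|x|)) of softplus,
    plus a small floor so the std never reaches exactly zero.
    """
    return max(raw, 0.0) + math.log1p(math.exp(-abs(raw))) + eps


def gaussian_kl(mu, std):
    """KL( N(mu, std^2) || N(0, 1) ) for a single latent dimension."""
    return 0.5 * (mu * mu + std * std - 1.0) - math.log(std)


def annealed_loss(recon_loss, mu, raw_std, step, warmup_steps):
    """Reconstruction loss plus the annealed KL term for one latent dimension."""
    std = softplus_std(raw_std)
    return recon_loss + kl_weight(step, warmup_steps) * gaussian_kl(mu, std)
```

Early in training the KL term is nearly switched off, which gives the decoder time to start using the latent code before the prior begins pulling the posterior toward it.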