Hunter Heidenreich | ML Research Scientist — Page 2

Computational Chemistry
FDB-17 filtering pipeline from GDB-17 (166.4B) through fragment filters (4.6B) to even sampling (10M), with bar charts comparing size distribution and Fsp3 shape complexity against commercial fragments

FDB-17: Fragment Database (10M Molecules)

FDB-17 contains 10 million fragment-like molecules selected from GDB-17’s 166.4 billion entries. Fragment-likeness filters reduce GDB-17 by 36x to 4.6 billion molecules, then even sampling across (HAC, heteroatoms, stereocenters) triplets produces a 460x further reduction to a manageable, diverse library enriched in 3D-shaped molecules.
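The even-sampling step can be sketched in a few lines of Python: bucket molecules by their (HAC, heteroatoms, stereocenters) triplet, then draw a fixed number from each bucket so sparse corners of property space are represented as well as dense ones. The `even_sample` helper and the toy records below are illustrative, not FDB-17's actual pipeline.

```python
import random
from collections import defaultdict

def even_sample(molecules, key_fn, per_bucket, seed=0):
    """Group molecules by a property key and draw up to `per_bucket`
    from each group, flattening the property distribution."""
    buckets = defaultdict(list)
    for mol in molecules:
        buckets[key_fn(mol)].append(mol)
    rng = random.Random(seed)
    sample = []
    for group in buckets.values():
        rng.shuffle(group)
        sample.extend(group[:per_bucket])
    return sample

# Toy records: (name, heavy-atom count, heteroatoms, stereocenters).
mols = [("m%d" % i, 10 + i % 3, i % 2, i % 4) for i in range(100)]
picked = even_sample(mols, key_fn=lambda m: m[1:], per_bucket=2)
```

Because every (HAC, heteroatoms, stereocenters) bucket contributes the same quota, overrepresented buckets are downsampled hard while rare ones survive intact, which is what drives the 460x reduction.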

Computational Chemistry
GDBMedChem pipeline from GDB-17 through medicinal chemistry filters to 10M molecules, with Venn diagram showing 97% unique substructures and property comparison against known drugs

GDBMedChem: Drug-Like Subset of GDB-17 (10M Molecules)

GDBMedChem applies medicinal-chemistry-inspired functional-group and structural-complexity filters to GDB-17, reducing 166.4 billion molecules to 17.8 billion, then samples evenly across molecular size, stereochemistry, and polarity to produce 10 million drug-like molecules. 97% of its substructures are absent from known-molecule databases.

Time Series
LSTNet architecture diagram showing convolutional, recurrent, recurrent-skip, and autoregressive components

LSTNet: Long- and Short-Term Time Series Network

LSTNet is a deep learning framework for multivariate time series forecasting that uses convolutional layers for local dependencies, a recurrent-skip component for periodic long-term patterns, and a linear autoregressive component that keeps predictions sensitive to the raw scale of the inputs.
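Two of those components are simple enough to sketch in plain Python. Below, `lstnet_ar_head` is the autoregressive head (a linear model over the last q steps of each series) and `skip_indices` shows which hidden states the recurrent-skip component consumes; both are toy illustrations, not the paper's implementation.

```python
def lstnet_ar_head(history, q, weights, bias=0.0):
    """Autoregressive component (sketched): a linear model over the
    last q time steps of each series, added to the neural prediction
    so the output tracks the input's raw scale."""
    window = history[-q:]                      # last q rows, each (n_series,)
    return [sum(w * row[s] for w, row in zip(weights, window)) + bias
            for s in range(len(window[0]))]

def skip_indices(T, p, k):
    """Recurrent-skip component (sketched): indices of hidden states
    spaced one period p apart, taking the k most recent periods, so
    the RNN sees the same phase of earlier cycles."""
    return [T - 1 - i * p for i in range(k)][::-1]

# Two series over five steps; AR head averages the last two values.
hist = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0], [8.0, 9.0]]
pred = lstnet_ar_head(hist, q=2, weights=[0.5, 0.5])
```

Because the AR head is purely linear, rescaling the input rescales the prediction proportionally, which is the "scale robustness" the summary refers to.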

Computational Chemistry
Six molecules with atoms colored by divalent (blue, simple) vs non-divalent (red, complex) nodes, showing increasing MC1 complexity from hexane to pivaloyl methylamine

Molecular Complexity from the GDB Chemical Space

Buehler and Reymond introduce two molecular complexity measures, MC1 (fraction of non-divalent nodes) and MC2 (count of non-divalent nodes excluding carboxyl groups), derived from analyzing synthesizability patterns in GDB-enumerated molecules. They compare these measures against existing complexity scores across GDB-13s, ZINC, ChEMBL, and COCONUT.
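MC1 is easy to approximate from a molecular graph alone. The sketch below reads "divalent" as degree 2 in the heavy-atom graph, which is an assumption on my part; the paper's exact node classification (and MC2's carboxyl exclusion) may differ.

```python
def mc1(adjacency):
    """MC1 sketch: fraction of non-divalent nodes, treating a node as
    divalent when it has exactly two neighbors in the heavy-atom graph
    (an assumed reading of the paper's definition)."""
    degrees = [len(nbrs) for nbrs in adjacency]
    nondivalent = sum(1 for d in degrees if d != 2)
    return nondivalent / len(degrees)

# n-hexane as a path graph C1-C2-C3-C4-C5-C6: only the two terminal
# carbons are non-divalent, so MC1 is low.
hexane = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
```

A branched molecule like neopentane (one quaternary carbon bonded to four terminal carbons) has no divalent nodes at all, so this score reaches 1.0, matching the intuition that chains are simple and branch points are complex.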

Interdisciplinary
Side-by-side search tree diagrams comparing nauty depth-first and Traces breadth-first traversal strategies for graph isomorphism

nauty and Traces: Graph Isomorphism Algorithms

An updated description of nauty and an introduction to Traces, two programs for graph isomorphism testing and canonical labeling built on the individualization-refinement paradigm.
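The refinement half of that paradigm can be sketched as classic color refinement (1-dimensional Weisfeiler-Leman): repeatedly split color classes by the multiset of neighboring colors until the partition stabilizes. nauty and Traces use far more engineered refinement plus individualization, automorphism pruning, and different tree-traversal orders; the function below shows only the core subroutine.

```python
def color_refine(adj, colors=None):
    """Color refinement (1-WL) sketch: refine a vertex coloring until
    no color class can be split by its neighbors' colors. This is the
    'refinement' step that individualization-refinement repeats down
    the search tree."""
    n = len(adj)
    colors = list(colors) if colors else [0] * n
    while True:
        # Signature of v: its color plus the sorted colors of neighbors.
        sigs = [(colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in range(n)]
        relabel = {s: i for i, s in enumerate(sorted(set(sigs)))}
        new = [relabel[s] for s in sigs]
        if new == colors:
            return colors
        colors = new

# Path on three vertices: the endpoints separate from the middle.
path3 = color_refine([[1], [0, 2], [1]])
```

On vertex-transitive graphs such as a 4-cycle, refinement alone cannot split anything, which is exactly when individualization (fixing one vertex and refining again) becomes necessary.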

Computational Chemistry
Simulated QM9 property landscape scatter plot of HOMO-LUMO gap vs dipole moment, colored by heavy atom count, with example molecules rendered alongside

QM9: Quantum Chemistry Properties of 134k Molecules

QM9 provides B3LYP/6-31G(2df,p)-level geometric, energetic, electronic, and thermodynamic properties for 133,885 small organic molecules (up to 9 heavy atoms of C, N, O, F) drawn from the GDB-17 chemical universe. It is one of the most widely used benchmarks in molecular machine learning.

Natural Language Processing
SpeechT5 architecture diagram showing shared encoder-decoder with speech and text pre/post-nets

SpeechT5: Unified Speech-Text Pre-Training Framework

SpeechT5 proposes a unified encoder-decoder pre-training framework that jointly learns from unlabeled speech and text data, achieving strong results on ASR, TTS, speech translation, voice conversion, speech enhancement, and speaker identification.

Computational Chemistry
Three-stage canonical generation pipeline (geng, vcolg, multig) alongside a log-scale speed comparison showing Surge outperforming MOLGEN 5.0 by 42-161x across natural product molecular formulas

Surge: Fastest Open-Source Chemical Graph Generator

Surge is a constitutional isomer generator based on the canonical generation path method, using nauty for graph automorphism computation. Its three-stage pipeline (simple graph generation, vertex coloring for atom assignment, edge multiplicity for bond orders) generates 7-22 million molecules per second, outperforming MOLGEN 5.0 by 42-161x on natural product molecular formulas.
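The third stage can be illustrated with a brute-force Python sketch: for a fixed simple graph with atoms already assigned, enumerate edge multiplicities that respect each element's valence (implicit hydrogens fill whatever valence remains). Surge generates these canonically and orders of magnitude faster; the helper below only shows what the stage computes.

```python
from itertools import product

MAX_VALENCE = {"C": 4, "N": 3, "O": 2}  # standard organic valences

def bond_order_assignments(atoms, edges):
    """Sketch of the multig stage: enumerate bond orders (1-3) for each
    edge of a fixed colored simple graph, keeping assignments where no
    atom exceeds its valence. Implicit hydrogens take up the slack."""
    valid = []
    for orders in product((1, 2, 3), repeat=len(edges)):
        load = [0] * len(atoms)
        for (u, v), order in zip(edges, orders):
            load[u] += order
            load[v] += order
        if all(load[i] <= MAX_VALENCE[a] for i, a in enumerate(atoms)):
            valid.append(orders)
    return valid

# C-C-O skeleton: five valid bond-order patterns (e.g. single/single
# gives ethanol's skeleton, single/double gives acetaldehyde's).
patterns = bond_order_assignments(["C", "C", "O"], [(0, 1), (1, 2)])
```

This brute force is exponential in edge count; Surge's canonical generation path method avoids both the combinatorial blowup and duplicate isomers.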

Computational Chemistry
Grid of heteroaromatic ring systems rendered with RDKit, showing known ring systems in blue-tinted panels and predicted tractable rings in amber-tinted panels

VEHICLe: Heteroaromatic Rings of the Future

VEHICLe (Virtual Exploratory Heterocyclic Library) is a complete enumeration of 24,867 mono- and bicyclic heteroaromatic ring systems built from C, N, O, S, and H. Of these, only 1,701 have ever appeared in published compounds. A random forest classifier trained on known vs. unknown ring systems predicts that over 3,000 additional ring systems are synthetically tractable.

Computational Chemistry
VQM24 overview showing 9 included elements with valencies, combinatorial scaling of molecular geometries with heavy atom count, and ML learning curves comparing VQM24 vs QM9 difficulty

VQM24: 836k Molecules at DFT and Diffusion QMC

VQM24 exhaustively enumerates all neutral closed-shell molecules with up to 5 heavy atoms from C, N, O, F, Si, P, S, Cl, Br, yielding 258k constitutional isomers and 578k conformers (836k total). Properties are computed at the ωB97X-D3/cc-pVDZ level, with diffusion QMC energies for 10,793 molecules up to 4 heavy atoms. ML models show up to 8x higher errors than on QM9, making VQM24 a more challenging benchmark.

Machine Learning Fundamentals
Diagram showing the three-step nested pipeline from small-scale training to large-model loss prediction across data mixtures

Data Mixing Laws for LM Pretraining Optimization

Ye et al. find that language model loss on each domain follows an exponential function of training mixture proportions. By nesting data mixing laws with scaling laws for steps and model size, small-scale experiments can predict and optimize mixtures for large models, achieving 48% training efficiency gains.
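The mixing law itself is compact: per-domain loss is modeled as L_i(r) = c_i + k_i · exp(Σ_j t_ij · r_j) over mixture proportions r. The sketch below evaluates that functional form with made-up coefficients (c, k, t are illustrative, not fitted values from the paper) and grid-searches a two-domain simplex for the mixture minimizing average loss.

```python
import math

def mixing_law_loss(r, c, k, t):
    """Data mixing law of Ye et al.: per-domain validation loss as an
    exponential function of the mixture proportions r,
    L_i(r) = c_i + k_i * exp(sum_j t_ij * r_j)."""
    return [c[i] + k[i] * math.exp(sum(t[i][j] * r[j] for j in range(len(r))))
            for i in range(len(c))]

# Two domains with toy coefficients; t[i][j] < 0 means more of domain j
# lowers domain i's loss.
c, k = [1.8, 2.1], [1.0, 1.2]
t = [[-2.0, -0.3], [-0.4, -1.5]]

# Grid-search the 1-simplex for the mixture minimizing average loss.
best = min(((r1, 1 - r1) for r1 in (x / 20 for x in range(21))),
           key=lambda r: sum(mixing_law_loss(r, c, k, t)) / 2)
```

In the paper this fit happens at small scale and is chained with step- and size-scaling laws, so the mixture chosen here would be the one used for the large run.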

Machine Learning Fundamentals
Bar chart comparing baseline and DoReMi domain weights across 12 Pile domains, showing Pile-CC upweighted 5.4x

DoReMi: Optimizing Data Mixtures for LM Pretraining

Xie et al. propose DoReMi, which trains a 280M proxy model using Group DRO to find optimal domain mixture weights, then uses those weights to train an 8B model 2.6x faster with 6.5% better downstream accuracy.
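The core of DoReMi's Group DRO loop is a multiplicative-weights update: domains where the proxy model's loss exceeds a pretrained reference model's loss (the "excess loss") get upweighted. The function below is a simplified single step; the real DoReMi also smooths the update and averages the weights over all training steps to get the final mixture.

```python
import math

def doremi_weight_update(weights, proxy_losses, ref_losses, lr=1.0):
    """One simplified Group-DRO step in the spirit of DoReMi: compute
    each domain's excess loss (proxy minus reference, clipped at 0),
    upweight via multiplicative weights, then renormalize."""
    excess = [max(p - r, 0.0) for p, r in zip(proxy_losses, ref_losses)]
    new = [w * math.exp(lr * e) for w, e in zip(weights, excess)]
    total = sum(new)
    return [w / total for w in new]

# Domain 0 is harder for the proxy than for the reference, so it
# gains weight at the expense of domain 1.
w = doremi_weight_update([0.5, 0.5], [3.0, 2.0], [2.0, 2.0])
```

Intuitively, this concentrates training data on domains that are learnable (the reference can do better) but not yet learned (the proxy lags), which is how DoReMi ends up upweighting Pile-CC.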