Hunter Heidenreich | ML Research Scientist — Page 2

Computational Chemistry
Three data transfer methods for retrosynthesis: pre-training plus fine-tuning, multi-task learning, and self-training

Data Transfer Approaches for Seq-to-Seq Retrosynthesis

A systematic study of data transfer techniques (joint training, self-training, pre-training plus fine-tuning) applied to Transformer-based retrosynthesis. Pre-training on USPTO-Full followed by fine-tuning on USPTO-50K achieves the best results, improving top-1 accuracy from 35.3% to 57.4%.

Computational Chemistry
DrugAssist workflow from user instruction through LoRA fine-tuned Llama2 to optimized molecule output

DrugAssist: Interactive LLM Molecule Optimization

DrugAssist fine-tunes Llama2-7B-Chat on over one million molecule pairs for interactive, dialogue-based molecule optimization across six molecular properties.

Computational Chemistry
DrugChat architecture showing GNN encoder, linear adaptor, and Vicuna LLM for conversational drug analysis

DrugChat: Conversational QA on Drug Molecule Graphs

DrugChat is a prototype system that bridges molecular graph neural networks with large language models for interactive, multi-turn question answering about drug compounds. It trains only a lightweight linear adaptor between a frozen GNN encoder and Vicuna-13B using 143K curated QA pairs from ChEMBL and PubChem.

Computational Chemistry
Pareto front plot for multi-objective optimization alongside DrugEx v2 explorer-exploiter architecture

DrugEx v2: Pareto Multi-Objective RL for Drug Design

DrugEx v2 introduces Pareto-based multi-objective optimization and evolutionary exploration strategies into an RNN reinforcement learning framework for de novo drug design toward multiple protein targets.
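As a rough illustration of the Pareto-based selection mentioned above, the sketch below extracts the non-dominated set from a list of multi-objective scores. It is a generic Pareto front computation, not DrugEx v2's own code; the names `dominates` and `pareto_front` and the example scores are assumptions made for illustration, and higher is assumed to be better on every objective.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical (affinity for target A, affinity for target B) scores for five molecules.
scores = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
front = pareto_front(scores)
print(front)  # [0, 1, 2]
```

The quadratic pairwise comparison is fine for small batches; larger populations would usually call for a sorting-based approach.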

Computational Chemistry
Pipeline diagram showing natural language chemistry questions flowing through fine-tuned GPT-3 to chemical predictions across molecules, materials, and reactions

Fine-Tuning GPT-3 for Predictive Chemistry Tasks

Jablonka et al. show that GPT-3 fine-tuned on natural language chemistry questions matches or outperforms dedicated ML models across 15 benchmarks, with particular strength in low-data settings and inverse molecular design.

Computational Chemistry
Visualization of Galactica corpus composition and benchmark performance comparing Galactica 120B against baselines

Galactica: A Curated Scientific LLM from Meta AI

Galactica trains a decoder-only Transformer on a curated 106B-token scientific corpus spanning papers, proteins, and molecules, achieving strong results on scientific QA, mathematical reasoning, and citation prediction.

Computational Chemistry
Diagram comparing character-level VAE with low validity to Grammar VAE using parse tree constraints for molecular generation

Grammar VAE: Generating Valid Molecules via CFGs

The Grammar VAE replaces character-level decoding with context-free grammar production rules, using a stack-based masking mechanism to guarantee that all generated SMILES strings are syntactically valid. Applied to molecular optimization and symbolic regression, it learns smoother latent spaces and finds better molecules than character-level baselines.
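A minimal sketch of stack-based masking, assuming a toy grammar far smaller than a real SMILES grammar. The grammar, the names `rule_mask` and `decode`, and the use of argmax instead of sampling are all illustrative assumptions rather than the paper's implementation. The point it shows: at each step only the rules whose left-hand side matches the nonterminal on top of the stack can be chosen, so the output is always a valid derivation even when the raw scores favor an invalid rule.

```python
import math
from typing import List, Tuple

# Toy grammar (hypothetical). Each rule is (left-hand side, right-hand side).
RULES: List[Tuple[str, List[str]]] = [
    ("CHAIN", ["ATOM", "CHAIN"]),  # rule 0
    ("CHAIN", ["ATOM"]),           # rule 1
    ("ATOM", ["C"]),               # rule 2
    ("ATOM", ["N"]),               # rule 3
    ("ATOM", ["O"]),               # rule 4
]
NONTERMINALS = {lhs for lhs, _ in RULES}


def rule_mask(nonterminal: str) -> List[bool]:
    """Mark the rules that may expand the given nonterminal."""
    return [lhs == nonterminal for lhs, _ in RULES]


def decode(step_logits: List[List[float]]) -> str:
    """Greedy decoding: one masked rule choice per row of scores."""
    stack = ["CHAIN"]
    output: List[str] = []
    for logits in step_logits:
        # Terminals on top of the stack are emitted directly.
        while stack and stack[-1] not in NONTERMINALS:
            output.append(stack.pop())
        if not stack:
            break
        mask = rule_mask(stack.pop())
        masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
        rule = max(range(len(RULES)), key=masked.__getitem__)
        # Push the right-hand side reversed so its first symbol is on top.
        stack.extend(reversed(RULES[rule][1]))
    while stack and stack[-1] not in NONTERMINALS:
        output.append(stack.pop())
    if stack:
        raise ValueError("ran out of steps before the derivation finished")
    return "".join(output)


# The raw scores often prefer rules that are invalid at that step.
step_logits = [
    [1.0, 0.0, 9.0, 0.0, 0.0],
    [9.0, 0.0, 0.1, 0.5, 0.2],
    [0.0, 1.0, 9.0, 0.0, 0.0],
    [9.0, 9.0, 0.0, 0.0, 1.0],
]
print(decode(step_logits))  # NO
```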

Computational Chemistry
LatentGAN pipeline from SMILES encoder through latent space WGAN-GP to SMILES decoder

LatentGAN: Latent-Space GAN for Molecular Generation

LatentGAN decouples molecular generation from SMILES syntax by training a Wasserstein GAN on latent vectors from a pretrained heteroencoder, enabling de novo design of drug-like and target-biased compounds.

Computational Chemistry
SMolInstruct dataset feeding into four base models for chemistry instruction tuning

LlaSMol: Instruction-Tuned LLMs for Chemistry Tasks

LlaSMol fine-tunes Mistral, Llama 2, and other open-source LLMs on SMolInstruct, a 3.3M-sample instruction tuning dataset covering 14 chemistry tasks. The Mistral-based model outperforms GPT-4 and Claude 3 Opus across all tasks.

Computational Chemistry
LSTM cells generating SMILES characters alongside validity and novelty statistics for drug-like molecule generation

LSTM Neural Network for Drug-Like Molecule Generation

Ertl et al. train a character-level LSTM on 509K bioactive ChEMBL SMILES and generate one million novel, diverse molecules whose physicochemical properties, substructure features, and predicted bioactivity closely match the training distribution.

Computational Chemistry
Diagram showing how memory-assisted reinforcement learning explores multiple local maxima in chemical space compared to standard RL

Memory-Assisted RL for Diverse De Novo Mol. Design

This work introduces a memory unit that modifies the RL reward function to penalize previously explored chemical scaffolds, substantially increasing the diversity of generated molecules while maintaining relevance to known active ligands.
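One way to picture a reward-modifying memory is sketched below. It is a simplification under stated assumptions: scaffolds are passed in as precomputed strings, the class name `ScaffoldMemory` and the `bucket_size` parameter are hypothetical, and zeroing the reward once a scaffold's bucket is full is just one possible penalty, not necessarily the exact rule used in the work.

```python
from collections import defaultdict


class ScaffoldMemory:
    """Track how often each scaffold was rewarded and penalize repeats."""

    def __init__(self, bucket_size: int = 2) -> None:
        self.bucket_size = bucket_size  # max rewarded molecules per scaffold (assumed)
        self.counts = defaultdict(int)

    def adjusted_reward(self, scaffold: str, raw_reward: float) -> float:
        """Return the raw reward, or 0.0 once the scaffold's bucket is full."""
        if self.counts[scaffold] >= self.bucket_size:
            return 0.0
        self.counts[scaffold] += 1
        return raw_reward


memory = ScaffoldMemory(bucket_size=2)
# Three molecules sharing one scaffold, then one with a new scaffold.
rewards = [memory.adjusted_reward("c1ccccc1", 0.8) for _ in range(3)]
rewards.append(memory.adjusted_reward("c1ccncc1", 0.7))
print(rewards)  # [0.8, 0.8, 0.0, 0.7]
```

Because the third benzene-scaffold molecule earns nothing, the policy gradient stops pushing toward that scaffold and the agent is nudged toward unexplored regions.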

Computational Chemistry
Molecular graph being built atom-by-atom with BFS ordering and property optimization bars

MolecularRNN: Graph-Based Molecular Generation and RL

This work proposes MolecularRNN, a graph recurrent model that generates molecular graphs atom-by-atom with 100% validity via valency-based rejection sampling, then shifts property distributions using policy gradient reinforcement learning.
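The idea behind valency-based rejection sampling can be sketched as follows: a proposed bond order is resampled whenever it would exceed the remaining valence of either atom. The valence table, the function name `sample_bond_order`, the retry limit, and the fallback to "no bond" are assumptions for illustration, not details taken from MolecularRNN.

```python
import random
from typing import Sequence

# Simplified maximum valences (assumed; real chemistry has more cases).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2}


def sample_bond_order(
    probs: Sequence[float],
    remaining_a: int,
    remaining_b: int,
    rng: random.Random,
    max_tries: int = 100,
) -> int:
    """Sample a bond order (index into probs), rejecting valence violations."""
    orders = list(range(len(probs)))
    limit = min(remaining_a, remaining_b)
    for _ in range(max_tries):
        order = rng.choices(orders, weights=probs, k=1)[0]
        if order <= limit:
            return order
    return 0  # fall back to no bond if every try was rejected


rng = random.Random(0)
# Model scores that favor a triple bond, which an oxygen atom cannot form.
probs = [0.1, 0.2, 0.3, 0.4]  # P(no bond), P(single), P(double), P(triple)
samples = [
    sample_bond_order(probs, MAX_VALENCE["O"], MAX_VALENCE["C"], rng)
    for _ in range(200)
]
print(max(samples) <= MAX_VALENCE["O"])  # True
```

Rejection leaves the relative probabilities of the allowed bond orders unchanged, which is why it can enforce validity without retraining the model.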