Hunter Heidenreich | ML Research Scientist — Page 6

Predictive Chemistry
ReactionT5 two-stage pretraining from CompoundT5 to ReactionT5 with product prediction and yield results

ReactionT5: Pre-trained T5 for Reaction Prediction

ReactionT5 introduces a two-stage pretraining pipeline (compound then reaction) on the Open Reaction Database, enabling competitive product and yield prediction with as few as 30 fine-tuning reactions.

Molecular Generation
REINVENT pipeline showing Prior, Agent, and Scoring Function with augmented likelihood equation

REINVENT: Reinforcement Learning for Molecular Design

Introduces a policy-based reinforcement learning method that fine-tunes an RNN pre-trained on ChEMBL SMILES to generate molecules with specified desirable properties, using an augmented episodic likelihood that anchors the agent to its prior while optimizing a user-defined scoring function.
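The augmented episodic likelihood can be sketched in a few lines (a minimal illustration, not the REINVENT codebase; the function name and the sigma value are hypothetical, with sigma weighting the scoring function against the prior anchor):

```python
# Hypothetical sketch of REINVENT's augmented-likelihood objective.
# log_p_prior, log_p_agent: sequence log-likelihoods of a generated SMILES
# under the fixed prior and the trainable agent; score is S(A) in [0, 1].
def augmented_likelihood_loss(log_p_prior, log_p_agent, score, sigma=60.0):
    # Augmented likelihood: prior likelihood shifted by the scaled score.
    log_p_augmented = log_p_prior + sigma * score
    # The agent is trained to match the augmented likelihood.
    return (log_p_augmented - log_p_agent) ** 2
```

Minimizing this keeps the agent close to the prior (retaining chemical validity) while pulling it toward high-scoring molecules.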

Computational Chemistry
Three-stage progression from task-specific transformers through multimodal models to LLM chemistry agents

Transformers and LLMs for Chemistry and Drug Discovery

A review chapter tracing three stages of transformer adoption in chemistry: task-specific single-modality models (reaction prediction, retrosynthesis), multimodal approaches bridging spectra and text, and LLM-powered agents like ChemCrow for general chemical reasoning.

Molecular Representations
Diagram showing dual-view molecule pre-training with a SMILES Transformer branch and a GNN branch connected by a consistency loss

DMP: Dual-View Molecule Pre-training (SMILES+GNN)

DMP combines a SMILES Transformer and a GNN branch during pre-training, using masked language modeling plus a BYOL-inspired dual-view consistency loss to learn complementary molecular representations.
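The "pull the two views together" idea can be illustrated with a toy consistency loss (a sketch only; DMP's actual objective is BYOL-style with a predictor head and stop-gradient, which this omits):

```python
import math

def cosine_consistency_loss(z_smiles, z_graph):
    """Toy dual-view consistency: 1 - cosine similarity between the
    SMILES-branch and GNN-branch embeddings of the same molecule.
    Embeddings are plain lists of floats here for illustration."""
    dot = sum(a * b for a, b in zip(z_smiles, z_graph))
    norm_s = math.sqrt(sum(a * a for a in z_smiles))
    norm_g = math.sqrt(sum(b * b for b in z_graph))
    return 1.0 - dot / (norm_s * norm_g)
```

Identical views give a loss of 0; orthogonal views give 1, so minimizing it aligns the two branches' representations.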

Molecular Simulation
Bar chart comparing MAT average ROC-AUC against D-MPNN, GCN, and Weave baselines

MAT: Graph-Augmented Transformer for Molecules (2020)

Molecule Attention Transformer (MAT) augments Transformer self-attention with inter-atomic distances and graph adjacency, achieving strong property prediction across diverse molecular tasks with minimal hyperparameter tuning after self-supervised pretraining.
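MAT's molecule-aware attention is a weighted blend of three matrices, which can be sketched directly (illustrative only; mixing coefficients correspond to the paper's lambda_attention, lambda_distance, lambda_graph, and the values here are placeholders):

```python
def mat_attention_weights(softmax_attn, dist_weights, adjacency,
                          lambdas=(0.5, 0.25, 0.25)):
    """MAT-style attention mixing (sketch): blend standard softmax
    self-attention with an inter-atomic distance kernel and the graph
    adjacency matrix, entry by entry. All inputs are N x N lists."""
    la, ld, lg = lambdas
    return [
        [la * s + ld * d + lg * a
         for s, d, a in zip(srow, drow, arow)]
        for srow, drow, arow in zip(softmax_attn, dist_weights, adjacency)
    ]
```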

Predictive Chemistry
Bar chart showing RMSE improvement from SMILES augmentation across ESOL, FreeSolv, and lipophilicity datasets

Maxsmi: SMILES Augmentation for Property Prediction

A systematic study of SMILES augmentation strategies for molecular property prediction, showing that augmentation consistently improves CNN and RNN performance and that prediction variance across SMILES correlates with model uncertainty.
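The variance-as-uncertainty idea reduces to averaging one model's predictions over several SMILES of the same molecule (a minimal sketch; the function name is hypothetical and `model` stands in for any trained predictor):

```python
from statistics import mean, stdev

def predict_with_uncertainty(model, smiles_variants):
    """Maxsmi-style test-time augmentation: predict a property for each
    SMILES variant of one molecule, return the mean as the prediction
    and the standard deviation as an uncertainty proxy."""
    preds = [model(s) for s in smiles_variants]
    return mean(preds), stdev(preds)
```

A narrow spread across variants signals a confident prediction; a wide spread flags molecules the model is unsure about.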

Molecular Representations
Bar chart comparing MG-BERT vs GNN baselines on six MoleculeNet classification tasks

MG-BERT: Graph BERT for Molecular Property Prediction

MG-BERT combines GNN-style local attention with BERT’s masked pretraining on molecular graphs, learning context-sensitive atomic representations that improve ADMET property prediction across 11 benchmark datasets.

Molecular Representations
Bar chart comparing Mol2vec ESOL RMSE against ECFP4, MACCS, and Neural Fingerprint baselines

Mol2vec: Unsupervised ML with Chemical Intuition

Mol2vec treats molecular substructures as words and compounds as sentences, training Word2vec on 19.9M molecules to produce dense embeddings that capture chemical intuition and enable competitive property prediction.

Predictive Chemistry
Bar chart showing MTL-BERT combining pretraining, multitask learning, and SMILES enumeration for best improvement

MTL-BERT: Multitask BERT for Property Prediction

MTL-BERT pretrains a BERT model on 1.7M unlabeled SMILES, then fine-tunes jointly on 60 ADMET and molecular property tasks, using SMILES enumeration as data augmentation in both pretraining and fine-tuning.

Molecular Generation
Bar chart comparing AlphaDrug docking scores against known ligands across five protein targets

AlphaDrug: MCTS-Guided Target-Specific Drug Design

AlphaDrug generates drug candidates for specific protein targets by combining an Lmser Transformer (with hierarchical encoder-decoder skip connections) and Monte Carlo tree search guided by docking scores, achieving higher binding affinities than known ligands on 86% of test proteins.
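The tree-search side can be illustrated with a standard UCT selection step (a generic MCTS sketch, not AlphaDrug's implementation, which guides search with docking scores and the Lmser Transformer's policy; the dict structure is hypothetical):

```python
import math

def uct_select(children, c=1.4):
    """Pick the child node maximizing the UCT score: mean value (e.g. a
    docking-derived reward) plus an exploration bonus. `children` is a
    list of dicts with 'visits' and 'total_value' fields."""
    parent_visits = sum(ch["visits"] for ch in children)

    def score(ch):
        if ch["visits"] == 0:
            return float("inf")  # always expand unvisited branches first
        mean_value = ch["total_value"] / ch["visits"]
        return mean_value + c * math.sqrt(math.log(parent_visits) / ch["visits"])

    return max(children, key=score)
```

Each child corresponds to appending one SMILES token; repeated select/expand/evaluate/backpropagate cycles steer generation toward high-affinity candidates.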

Molecular Representations
Bar chart comparing SMILES tokens vs Atom-in-SMILES across molecular generation, retrosynthesis, and reaction prediction

Atom-in-SMILES: Better Tokens for Chemical Models

Introduces Atom-in-SMILES (AIS), a tokenization scheme that encodes local chemical environments into SMILES tokens, improving prediction quality across molecular generation, retrosynthesis, and reaction prediction tasks.
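For context, the atom-level splitting that AIS builds on looks like the widely used regex tokenizer from the Molecular Transformer line of work (shown as-is; AIS then replaces these generic atom tokens with environment-aware ones):

```python
import re

# Standard atom-level SMILES tokenization pattern: bracket atoms first,
# then two-letter elements (Br, Cl), single atoms, bonds, ring closures.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "lossy tokenization"
    return tokens
```

An AIS token would additionally carry the atom's local context (e.g. ring membership and neighbors) instead of a bare `C` or `c`.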

Molecular Generation
Bar chart showing Augmented Hill-Climb achieves up to 45x sample efficiency over REINVENT

Augmented Hill-Climb for RL-Based Molecule Design

Proposes Augmented Hill-Climb, a hybrid RL strategy for SMILES-based generative models that improves sample efficiency ~45-fold over REINVENT by filtering low-scoring molecules from the loss computation, with diversity filters to prevent mode collapse.
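The core modification is small enough to sketch: rank each sampled batch by score and keep only the top fraction for the likelihood update (an illustrative sketch; the function name and default fraction are placeholders):

```python
def hill_climb_batch(smiles_batch, scores, keep_fraction=0.5):
    """Augmented Hill-Climb's batch filtering (sketch): sort sampled
    molecules by score, descending, and return only the top fraction;
    low scorers are dropped from the loss computation entirely."""
    k = max(1, int(len(smiles_batch) * keep_fraction))
    ranked = sorted(zip(scores, smiles_batch),
                    key=lambda pair: pair[0], reverse=True)
    return [smi for _, smi in ranked[:k]]
```

Training only on the best-scoring samples each step is what yields the large sample-efficiency gain, while the diversity filters guard against the mode collapse this greediness would otherwise invite.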