Hunter Heidenreich | ML Research Scientist — Page 9

Molecular Representations
Encoder-decoder architecture diagram for translating chemical names between English and Chinese with performance comparison bar chart

Neural Machine Translation of Chemical Nomenclature

This paper applies character-level CNN and LSTM encoder-decoder networks to translate chemical names between English and Chinese, comparing them against an existing rule-based tool.

Computational Chemistry
Conceptual diagram showing natural language prompts flowing into code generation for chemistry tasks

NLP Models That Automate Programming for Chemistry

Hocky and White argue that NLP models capable of generating code from natural language prompts will fundamentally alter how chemists interact with scientific software, reducing barriers to computational research and reshaping programming pedagogy.

Molecular Generation
Distribution plot showing original QM9 logP shifted toward +6 and -6 targets via gradient-based dreaming

PASITHEA: Gradient-Based Molecular Design via Dreaming

PASITHEA adapts deep dreaming from computer vision to molecular design, directly optimizing SELFIES-encoded molecules for target chemical properties via gradient-based inversion of a trained regression network.

Molecular Generation
Bar chart showing PrefixMol Vina scores across different conditioning modes: target, property, combined, and scaffold

PrefixMol: Prefix Embeddings for Drug Molecule Design

PrefixMol prepends learnable condition vectors to a GPT transformer for SMILES generation, enabling joint control over binding pocket targeting and chemical properties like QED, SA, and LogP.

Molecular Generation
Bar chart comparing docking scores of generated vs known ligands for CDK2 and EGFR targets

Protein-to-Drug Molecule Translation via Transformer

Applies the Transformer architecture to generate drug-like molecules conditioned on protein amino acid sequences, treating target-specific de novo drug design as a sequence-to-sequence translation problem.

Molecular Representations
Bar chart showing randomized SMILES generate more of GDB-13 chemical space than canonical SMILES across training set sizes

Randomized SMILES Improve Molecular Generative Models

An extensive benchmark showing that training RNN generative models with randomized (non-canonical) SMILES strings yields more uniform, complete, and closed molecular output domains than canonical SMILES.

Molecular Generation
Bar chart comparing PMO benchmark scores with and without chemical quality filters across five generative methods

Re-evaluating Sample Efficiency in Molecule Generation

A critical reassessment of the PMO benchmark for de novo molecule generation, showing that adding molecular weight, LogP, and diversity filters substantially re-ranks generative models, with Augmented Hill-Climb emerging as the top method.

Molecular Generation
Horizontal bar chart showing REINVENT 4 unified framework supporting seven generative model types

REINVENT 4: Open-Source Generative Molecule Design

Overview of REINVENT 4, an open-source generative molecular design framework from AstraZeneca that unifies RNN and transformer generators within reinforcement learning, transfer learning, and curriculum learning optimization algorithms.

Molecular Generation
Bar chart showing deep generative architecture types for molecular design: RNN, VAE, GAN, RL, and hybrid methods

Review: Deep Learning for Molecular Design (2019)

An early and influential review cataloging 45 papers on deep generative modeling for molecules, comparing RNN, VAE, GAN, and reinforcement learning architectures across SMILES and graph-based representations.

Molecular Generation
Bar chart comparing RNN and Transformer Wasserstein distances across drug-like, peptide-like, and polymer-like generation tasks

RNNs vs Transformers for Molecular Generation Tasks

Compares RNN-based and Transformer-based chemical language models across three molecular generation tasks of increasing complexity, finding that RNNs excel at local features while Transformers handle large molecules better.

Molecular Generation
Diagram showing the dual formulation of S4 models with convolution during training and recurrence during generation for SMILES-based molecular design

S4 Structured State Space Models for De Novo Drug Design

This paper introduces structured state space sequence (S4) models to chemical language modeling, showing they combine the strengths of LSTMs (efficient recurrent generation) and GPTs (holistic sequence learning) for de novo molecular design.

Molecular Representations
Diagram showing SMILES string flowing through encoder to fixed-length fingerprint vector and back through decoder

Seq2seq Fingerprint: Unsupervised Molecular Embedding

A GRU-based sequence-to-sequence model that learns fixed-length molecular fingerprints by translating SMILES strings to themselves, enabling unsupervised representation learning for drug discovery tasks.