Computational Chemistry
Bar chart showing deep generative architecture types for molecular design: RNN, VAE, GAN, RL, and hybrid methods

Review: Deep Learning for Molecular Design (2019)

An early and influential review cataloging 45 papers on deep generative modeling for molecules, comparing RNN, VAE, GAN, and reinforcement learning architectures across SMILES and graph-based representations.

Bar chart comparing RNN and Transformer Wasserstein distances across drug-like, peptide-like, and polymer-like generation tasks

RNNs vs Transformers for Molecular Generation Tasks

Compares RNN-based and Transformer-based chemical language models across three molecular generation tasks of increasing complexity (drug-like, peptide-like, and polymer-like molecules), finding that RNNs capture local features better while Transformers handle large molecules better.
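
The chart for this entry reports Wasserstein distances between generated and reference molecules. As a rough illustration of that metric, the sketch below computes the empirical 1-Wasserstein distance between two equal-size samples of a scalar property. Which properties the paper compares is not stated here, so the molecular-weight values and the function name `wasserstein_1d` are assumptions made for illustration.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    if not xs:
        raise ValueError("samples must be non-empty")
    # For equal-size samples, the optimal transport plan pairs sorted values.
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical property values (e.g. molecular weight), for illustration only.
training_weights = [300.0, 320.0, 350.0, 410.0]
generated_weights = [310.0, 320.0, 360.0, 400.0]
distance = wasserstein_1d(training_weights, generated_weights)
```

A smaller distance means the generated molecules follow the reference property distribution more closely. For samples of unequal size, a library routine such as `scipy.stats.wasserstein_distance` is the more general choice.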

Diagram showing the dual formulation of S4 models with convolution during training and recurrence during generation for SMILES-based molecular design

S4 Structured State Space Models for De Novo Drug Design

This paper introduces structured state space sequence (S4) models to chemical language modeling, showing they combine the strengths of LSTMs (efficient recurrent generation) and GPTs (holistic sequence learning) for de novo molecular design.
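
The dual formulation mentioned for this entry (convolution during training, recurrence during generation) can be illustrated with a toy linear state space model. The sketch below is not S4 itself: it assumes a small diagonal state matrix with hand-picked values, and it does not reproduce S4's parameterization or its efficient kernel computation. All names and numbers are hypothetical.

```python
def ssm_recurrent(a, b, c, inputs):
    """Run a diagonal linear state space model one step at a time."""
    state = [0.0] * len(a)
    outputs = []
    for u in inputs:
        # x_k = A x_{k-1} + B u_k, with A diagonal
        state = [a_i * x_i + b_i * u for a_i, b_i, x_i in zip(a, b, state)]
        # y_k = C x_k
        outputs.append(sum(c_i * x_i for c_i, x_i in zip(c, state)))
    return outputs


def ssm_kernel(a, b, c, length):
    """Unroll the same model into a convolution kernel K_j = C A^j B."""
    return [sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
            for j in range(length)]


def ssm_convolutional(a, b, c, inputs):
    """Compute all outputs at once as a causal convolution with the kernel."""
    kernel = ssm_kernel(a, b, c, len(inputs))
    return [sum(kernel[j] * inputs[k - j] for j in range(k + 1))
            for k in range(len(inputs))]


# Toy parameters, chosen by hand for illustration.
A, B, C = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2]
signal = [1.0, 0.0, 2.0, -1.0]
y_rec = ssm_recurrent(A, B, C, signal)
y_conv = ssm_convolutional(A, B, C, signal)
```

Both functions produce the same outputs. The recurrent form needs only the current state to emit the next value, which suits token-by-token generation, while the convolutional form processes a whole sequence at once, which suits training.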

Diagram showing SMILES string flowing through encoder to fixed-length fingerprint vector and back through decoder

Seq2seq Fingerprint: Unsupervised Molecular Embedding

A GRU-based sequence-to-sequence model that learns fixed-length molecular fingerprints by translating SMILES strings to themselves, enabling unsupervised representation learning for drug discovery tasks.

Bar chart comparing SMI-TED ROC-AUC scores against ChemBERTa, ChemBERTa-2, MoLFormer, and GROVER on BBBP and HIV

SMI-TED: Encoder-Decoder Foundation Models for Chemistry

SMI-TED introduces encoder-decoder chemical foundation models (289M parameters) pre-trained on 91 million PubChem molecules, achieving strong results across property prediction, reaction yield, and molecule generation benchmarks.

Bar chart comparing binding affinity scores across SMILES, AIS, and SMI+AIS hybrid tokenization strategies

SMI+AIS: Hybridizing SMILES with Environment Tokens

Proposes SMI+AIS, a hybrid molecular representation combining standard SMILES tokens with chemical-environment-aware Atom-In-SMILES tokens, demonstrating improved molecular generation for drug design targets.

Diagram showing Transformer encoder-decoder architecture converting SMILES strings into molecular fingerprints

SMILES Transformer: Low-Data Molecular Fingerprints

A Transformer-based encoder-decoder pre-trained on 861K SMILES from ChEMBL24 produces 1024-dimensional molecular fingerprints that outperform ECFP and graph convolutions on 5 of 10 MoleculeNet tasks in low-data settings.

Bar chart comparing Atom Pair Encoding vs BPE tokenization on MoleculeNet classification tasks

SMILES vs SELFIES Tokenization for Chemical LMs

Introduces Atom Pair Encoding (APE), a chemistry-aware tokenizer for SMILES and SELFIES, and shows it consistently outperforms Byte Pair Encoding in RoBERTa-based molecular property classification on BBBP, HIV, and Tox21 benchmarks.

Bar chart comparing SMILES-BERT accuracy against baselines on HIV, LogP, and PCBA tasks

SMILES-BERT: BERT-Style Pre-Training for Molecules

SMILES-BERT pre-trains a Transformer encoder on 18M+ SMILES from ZINC using a masked recovery task, then fine-tunes for molecular property prediction, outperforming prior methods on three datasets (HIV, LogP, and PCBA).
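
A masked recovery objective of this kind hides some input tokens and asks the model to predict the originals. The sketch below shows only the data-corruption step, under stated assumptions: the 15% mask rate is borrowed from BERT-style conventions rather than taken from this summary, the `[MASK]` symbol and function name are hypothetical, and the SMILES is split into characters by hand, which works here only because aspirin has no multi-character atoms.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical mask symbol for this sketch


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return corrupted input and targets."""
    rng = random.Random(seed)
    # Always mask at least one token so there is something to recover.
    n_masked = max(1, round(mask_rate * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [MASK_TOKEN if i in positions else t
                 for i, t in enumerate(tokens)]
    # Targets map each masked position to the original token to be recovered.
    targets = {i: tokens[i] for i in sorted(positions)}
    return corrupted, targets


# Aspirin, split into character-level tokens for illustration.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
corrupted, targets = mask_tokens(tokens)
```

During pre-training, a model would receive `corrupted` as input and be trained to predict the entries of `targets`. No labels are needed, which is what lets this kind of objective use large unlabeled SMILES collections.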

Bar chart comparing SMILES2Vec and Graph Conv scores across five MoleculeNet tasks

SMILES2Vec: Interpretable Chemical Property Prediction

SMILES2Vec is a deep RNN that learns chemical features directly from SMILES strings, using a CNN-GRU architecture selected by Bayesian optimization. It matches graph convolution baselines on toxicity and activity prediction, and its explanation mask identifies chemically meaningful functional groups with 88% accuracy.

Visualization of tokenizer vocabulary coverage across chemical space

Smirk: Complete Tokenization for Molecular Models

Introduces Smirk and Smirk-GPE tokenizers that fully cover the OpenSMILES specification, proposes n-gram language models as low-cost proxies for evaluating tokenizer quality, and benchmarks 34 tokenizers across intrinsic and extrinsic metrics.
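
To give a sense of why n-gram language models are cheap to use as proxies, the sketch below trains a bigram model with add-one smoothing on tokenized sequences and reports the average cross-entropy per predicted token. It is a generic illustration, not the paper's evaluation protocol: the function names, the smoothing scheme, and the toy token sequences are all assumptions, and tokens unseen in training are handled only crudely.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def train_bigram(sequences):
    """Count bigrams and their contexts over tokenized sequences."""
    bigrams, contexts, vocab = Counter(), Counter(), {EOS}
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        vocab.update(seq)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def cross_entropy(model, sequences):
    """Average negative log2-likelihood per predicted token (add-one smoothing)."""
    bigrams, contexts, vocab = model
    total, count = 0.0, 0
    for seq in sequences:
        padded = [BOS] + list(seq) + [EOS]
        for prev, cur in zip(padded, padded[1:]):
            prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
            total -= math.log2(prob)
            count += 1
    return total / count


# Toy token sequences, as some tokenizer might emit them.
model = train_bigram([["C", "C", "O"], ["C", "O"]])
familiar = cross_entropy(model, [["C", "O"]])
unusual = cross_entropy(model, [["O", "O", "C"]])
```

Training is a single counting pass, so the same corpus can be re-tokenized and re-scored many times at little cost. Comparing scores across tokenizers would need care, since different tokenizers split the same molecule into different numbers of tokens.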

Bar chart showing SMILES Pair Encoding reduces mean sequence length from 40 to 6 tokens

SPE: Data-Driven SMILES Substructure Tokenization

Introduces SMILES Pair Encoding (SPE), a data-driven tokenization algorithm that learns high-frequency SMILES substrings from ChEMBL to produce shorter, chemically interpretable token sequences for deep learning.
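
The general idea of learning frequent substrings can be sketched as a byte-pair-encoding-style loop over atom-level tokens: count adjacent token pairs, merge the most frequent pair into one token, and repeat. This is a minimal sketch under stated assumptions, not the authors' implementation: the atom-level regular expression is simplified, the three-molecule corpus is a toy stand-in for ChEMBL, and the function names are hypothetical.

```python
import re
from collections import Counter

# Simplified atom-level pattern (an assumption; real SMILES has more cases).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into bracket atoms, Br/Cl, ring labels, or chars."""
    return ATOM_PATTERN.findall(smiles)


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Greedily merge the most frequent adjacent token pair, BPE-style."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for seq in sequences:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break  # nothing left to merge
        best = counts.most_common(1)[0][0]
        merges.append(best)
        sequences = [merge_pair(seq, best) for seq in sequences]
    return merges, sequences


# Toy corpus: ethanol, ethylamine, chloroethane.
merges, sequences = learn_merges(["CCO", "CCN", "CCCl"], num_merges=1)
```

On this toy corpus the first merge joins the two carbons shared by all three molecules, so each sequence shrinks from three tokens to two. Repeating the merge step on a large corpus is what yields the shorter token sequences described above.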