Surveys and Reviews on Hunter Heidenreich | ML Research Scientist

Transformers and LLMs for Chemistry and Drug Discovery

Sat, 28 Mar 2026 00:00:00 +0000

A Systematization of Transformers in Chemistry

This book chapter by Bran and Schwaller is a Systematization paper that organizes the growing body of work applying transformer architectures to chemistry and drug discovery. Rather than proposing a new method, the authors trace a three-stage evolution: (1) task-specific single-modality models operating on SMILES and reaction strings, (2) multimodal models bridging molecular representations with spectra, synthesis actions, and natural language, and (3) large language models and LLM-powered agents capable of general chemical reasoning.

Why Transformers for Chemistry?

The authors motivate the review by drawing analogies between natural language and chemical language. Just as text can be decomposed into subwords and tokens, molecules can be linearized into SMILES or SELFIES strings, and chemical reactions can be encoded as reaction SMILES. This structural parallel enabled direct transfer of transformer architectures, originally designed for machine translation, to chemical prediction tasks.

Several factors accelerated this adoption:

The publication of open chemical databases and benchmarks (e.g., MoleculeNet, Open Reaction Database, Therapeutics Data Commons)
Improvements in compute infrastructure and training algorithms
The success of attention mechanisms at capturing context-dependent relationships, which proved effective for learning chemical grammar and atom-level correspondences

The review positions the transformer revolution in chemistry as a natural extension of NLP advances, noting that the gap between chemical and natural language is progressively closing.

Molecular Representations as Language

A key section of the review covers text-based molecular representations that make transformer applications possible:

SMILES (Simplified Molecular Input Line Entry System): The dominant linearization scheme since the 1980s, encoding molecular graphs as character sequences with special symbols for bonds, branches, and rings.
SELFIES (Self-Referencing Embedded Strings): A newer representation that guarantees every string maps to a valid molecule, addressing the robustness issues of SMILES in generative settings.
Reaction SMILES: Extends molecular representations to encode full chemical reactions in the format “A.B > catalyst.reagent > C.D”, enabling reaction prediction as a sequence-to-sequence task.

The authors note that while IUPAC names, InChI, and DeepSMILES exist as alternatives, SMILES and SELFIES dominate practical applications.
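To make the reaction SMILES layout concrete, here is a minimal parsing sketch in plain Python. The `Reaction` type and `parse_reaction_smiles` helper are illustrative names, not part of the review or of any cheminformatics library. The sketch assumes that every `.` separates two species, so multi-fragment species such as salts are split apart.

```python
from typing import NamedTuple


class Reaction(NamedTuple):
    reactants: list[str]
    agents: list[str]
    products: list[str]


def parse_reaction_smiles(rxn: str) -> Reaction:
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = rxn.strip().split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got {len(parts) - 1}")
    # Molecules inside each role are separated by '.'; an empty role yields [].
    reactants, agents, products = (
        [mol.strip() for mol in part.split(".") if mol.strip()] for part in parts
    )
    return Reaction(reactants, agents, products)


# Fischer esterification: acetic acid + ethanol -> ethyl acetate + water,
# with sulfuric acid written in the agents slot.
example = parse_reaction_smiles("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O")
print(example.reactants, example.agents, example.products)
```

For forward prediction framed as translation, the reactants and agents would form the source sequence and the products the target sequence.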

Stage 1: Task-Specific Transformer Models

The first stage of transformer adoption focused on clearly defined chemical tasks, with models trained on a single data modality (molecular strings).

Chemical Translation Tasks

The encoder-decoder architecture was directly applied to tasks framed as translation:

Molecular Transformer (Schwaller et al.): Treated reaction prediction as translation from reactant SMILES to product SMILES, becoming a leading method for forward synthesis prediction.
Retrosynthetic planning: The reverse task, predicting reactants from products, with iterative application to construct full retrosynthetic trees mapping to commercially available building blocks.
Chemformer (Irwin et al.): A model pre-trained once and reused across multiple chemical tasks, transferring to new applications with improved performance.
Graph-to-sequence models (Tu and Coley): Used a custom graph encoder with a transformer decoder, achieving improvements through permutation-invariant molecular graph encoding.

Representation Learning and Feature Extraction

Encoder-only transformers proved valuable for generating molecular and reaction embeddings:

Reaction representations (Wang et al., SMILES-BERT): Trained models to generate reaction vectors that outperformed hand-engineered features on downstream regression tasks.
Reaction classification (Schwaller et al.): Replaced the decoder with a classification layer to map chemical reactions by class, revealing clustering patterns by reaction type, data source, and molecular properties.
Yield prediction: Regression heads attached to encoders achieved strong results on high-throughput experimentation datasets.
Protein language models (Rives et al., ESM): Trained on 250 million protein sequences using unsupervised learning, achieving strong performance on protein property and structure prediction.
RXNMapper (Schwaller et al.): A notable application where attention weight analysis revealed that transformers internally learn atom-to-atom mappings in chemical reactions, leading to an open-source atom mapping algorithm that outperformed existing approaches.

Stage 2: Multimodal Chemical Models

The second stage extended transformers beyond molecular strings to incorporate additional data types:

Molecular captioning: Describing molecules in natural language, covering scaffolds, sources, drug interactions, and other features (Edwards et al.).
Bidirectional molecule-text conversion: Models capable of generating molecules from text queries and performing molecule-to-molecule tasks (Christofidellis et al.).
Experimental procedure prediction: Generating actionable synthesis steps from reaction SMILES (Vaucher et al.), bridging the gap between retrosynthetic planning and laboratory execution.
Structural elucidation from IR spectra: Encoding IR spectra as text sequences alongside chemical formulas, then predicting SMILES from these inputs (Alberts et al.), achieving 45% accuracy in structure prediction and surpassing prior approaches for functional group identification.

Stage 3: Large Language Models and Chemistry Agents

The most recent stage builds on foundation models pre-trained on vast text corpora, adapted for chemistry through fine-tuning and in-context learning.

Scaling Laws and Emergent Capabilities

The authors discuss how model scaling leads to emergent capabilities relevant to chemistry:

Below certain compute thresholds, model performance on chemistry tasks appears random.
Above critical sizes, sudden improvements emerge, along with capabilities like chain-of-thought (CoT) reasoning and instruction following.
These emergent abilities enable chemistry tasks that require multi-step reasoning without explicit training on chemical data.

LLMs as Chemistry Tools

Key applications of LLMs in chemistry include:

Fine-tuning for low-data chemistry (Jablonka et al.): GPT-3 fine-tuned on limited chemistry datasets performed comparably to, and sometimes exceeded, specialized models with engineered features for tasks like predicting transition wavelengths and phase classification.
In-context learning: Providing LLMs with a few examples enables prediction on chemistry tasks without any parameter updates, particularly valuable when data is scarce.
Bayesian optimization with LLMs (Ramos et al.): Using GPT models for uncertainty-calibrated regression, enabling catalyst and molecular optimization directly from synthesis procedures without feature engineering.
3D structure generation (Flam-Shepherd and Aspuru-Guzik): Using language models to generate molecular structures with three-dimensional atomic positions in XYZ, CIF, and PDB formats, matching graph-based algorithms while overcoming representation limitations.
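The in-context learning setup described above amounts to formatting a handful of labeled examples ahead of an unlabeled query. Below is a minimal sketch of such a prompt builder. The prompt wording, the `build_few_shot_prompt` name, and the example labels are all assumptions for illustration; none of them come from the cited papers, and no model API is called.

```python
def build_few_shot_prompt(examples, query, task="Predict the aqueous solubility class"):
    """Format labeled (SMILES, label) pairs plus one unlabeled query as a prompt."""
    lines = [f"{task} for each molecule.", ""]
    for smiles, label in examples:
        lines.append(f"SMILES: {smiles}")
        lines.append(f"Answer: {label}")
        lines.append("")
    # The query repeats the pattern but leaves the answer open for the model.
    lines.append(f"SMILES: {query}")
    lines.append("Answer:")
    return "\n".join(lines)


# Illustrative labels only; they are not taken from any benchmark dataset.
demo_examples = [("CCO", "soluble"), ("CCCCCCCCCCCCCCCC", "insoluble")]
prompt = build_few_shot_prompt(demo_examples, "CC(=O)O")
print(prompt)
```

The resulting string would be sent to an LLM as-is, so no parameters are updated. Changing the examples changes the "training set".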

LLM-Powered Chemistry Agents

The review highlights the agent paradigm as the most impactful recent development:

14 LLM use-cases (Jablonka et al.): A large-scale collaborative effort demonstrating applications from computational tool wrappers to reaction optimization assistants and scientific question answering.
ChemCrow (Bran, Cox et al.): An LLM-powered agent equipped with curated computational chemistry tools, capable of planning and executing tasks across drug design, materials design, and synthesis. ChemCrow demonstrated that tool integration overcomes LLM hallucination issues by grounding responses in reliable data sources.
Autonomous scientific research (Boiko et al.): LLM-driven systems for autonomous experimentation, with a focus on operating cloud laboratories.

The agent paradigm offers tool composability through natural language interfaces, allowing users to chain multiple computational tools into custom pipelines.

Outlook and Limitations

The authors identify several themes for the future:

The three stages represent increasing generality, from task-specific single-modality models to open-ended agents.
Natural language interfaces are progressively closing the gap between chemical and human language.
Tool integration through agents provides grounding that mitigates hallucination, a known limitation of direct LLM application to chemistry.
The review acknowledges that LLMs have a “high propensity to generate false and inaccurate content” on chemical tasks, making tool-augmented approaches preferable to direct application.

The chapter does not provide quantitative benchmarks or systematic comparisons across the methods discussed, as its goal is to organize the landscape rather than evaluate individual methods.

Reproducibility Details

This is a review/survey chapter and does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the referenced works rather than the review itself.

Key Referenced Resources

Several open-source tools and datasets discussed in the review are publicly available:

Artifact	Type	License	Notes
RXNMapper	Code	MIT	Attention-based atom mapping
ChemCrow	Code	MIT	LLM-powered chemistry agent
MoleculeNet	Dataset	Various	Molecular ML benchmarks
Open Reaction Database	Dataset	CC-BY-SA-4.0	Curated reaction data
Therapeutics Data Commons	Dataset	MIT	Drug discovery ML datasets

Reproducibility Classification

Not applicable (review paper). Individual referenced works range from Highly Reproducible (open-source models like RXNMapper, ChemCrow) to Partially Reproducible (some models without released code) to Closed (proprietary LLMs like GPT-3/GPT-4 used in fine-tuning studies).

Paper Information

Citation: Bran, A. M., & Schwaller, P. (2024). Transformers and Large Language Models for Chemistry and Drug Discovery. In Drug Development Supported by Informatics (pp. 143-163). Springer Nature Singapore. https://doi.org/10.1007/978-981-97-4828-0_8

@incollection{bran2024transformers,
  title={Transformers and Large Language Models for Chemistry and Drug Discovery},
  author={Bran, Andres M. and Schwaller, Philippe},
  booktitle={Drug Development Supported by Informatics},
  pages={143--163},
  year={2024},
  publisher={Springer Nature Singapore},
  doi={10.1007/978-981-97-4828-0_8}
}

Transformers for Molecular Property Prediction Review

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Transformers for Molecular Property Prediction

This is a Systematization paper. Sultan et al. provide the first comprehensive, structured review of sequence-based transformer models applied to molecular property prediction (MPP). The review catalogs 16 models published between 2019 and 2023, organizes them by architecture type (encoder-decoder, encoder-only, decoder-only), and systematically examines seven key design decisions that arise when building a transformer for MPP. The paper’s primary contribution is identifying gaps in current evaluation practices and articulating what standardization the field needs for meaningful progress.

The Problem: Inconsistent Evaluation Hinders Progress

Molecular property prediction is essential for drug discovery, crop protection, and environmental science. Deep learning approaches, including transformers, have been increasingly applied to this task by learning molecular representations from string notations like SMILES and SELFIES. However, the field faces several challenges:

Small labeled datasets: Labeled molecular property datasets typically contain only hundreds or thousands of molecules, making supervised learning alone insufficient.
No standardized evaluation protocol: Different papers use different data splits (scaffold vs. random), different splitting implementations, different numbers of repetitions (3 to 50), and sometimes do not share their test sets. This makes direct comparison across models infeasible.
Unclear design choices: With many possible configurations for pre-training data, chemical language, tokenization, positional embeddings, model size, pre-training objectives, and fine-tuning approaches, the field lacks systematic analyses to guide practitioners.

The authors note that standard machine learning methods with fixed-size molecular fingerprints remain strong baselines for real-world datasets, illustrating that the promise of transformers for MPP has not yet been fully realized.

Seven Design Questions for Molecular Transformers

The central organizing framework of this review addresses seven questions practitioners must answer when building a transformer for MPP. For each, the authors synthesize findings across the 16 reviewed models.

Reviewed Models

The paper catalogs 16 models organized by architecture:

Architecture	Base Model	Models
Encoder-Decoder	Transformer, BART	ST, Transformer-CNN, X-Mol, ChemFormer
Encoder-Only	BERT	SMILES-BERT, MAT, MolBERT, Mol-BERT, Chen et al., K-BERT, FP-BERT, MolFormer
Encoder-Only	RoBERTa	ChemBERTa, ChemBERTa-2, SELFormer
Decoder-Only	XLNet	Regression Transformer (RT)

The core attention mechanism shared by all these models is the scaled dot-product attention:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V $$

where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_{k}$ is the dimension of the key vectors.
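As a concrete reading of the formula, the following pure-Python sketch computes single-head attention for small matrices given as lists of rows. It leaves out batching, masking, and multiple heads, and the function names are illustrative rather than taken from any of the reviewed models.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)
```

Because each query aligns more strongly with one key, each output row is pulled toward the matching value row without ever leaving the range spanned by the values.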

Question 1: Which Database and How Many Molecules?

Pre-training data sources vary considerably. The three main databases are ZINC (37 billion molecules in ZINC22), ChEMBL (2.4 million unique molecules with 20 million bioactivity measurements), and PubChem (111 million unique molecules). Pre-training set sizes ranged from 900K (ST on ChEMBL) to 1.1B molecules (MolFormer on ZINC + PubChem).

Model	Database	Size	Language
ST	ChEMBL	900K	SMILES
MolBERT	ChEMBL (GuacaMol)	1.6M	SMILES
ChemBERTa	PubChem	100K-10M	SMILES, SELFIES
ChemBERTa-2	PubChem	5M-77M	SMILES
MAT	ZINC	2M	List of atoms
MolFormer	ZINC + PubChem	1.1B	SMILES
Chen et al.	C, CP, CPZ	2M-775M	SMILES

A key finding is that larger pre-training datasets do not consistently improve downstream performance. MolFormer showed minimal difference between models trained on 100M vs. 1.1B molecules. ChemBERTa-2 found that the MLM model trained on 5M molecules performed comparably to the one trained on 77M molecules on BBBP (both around 0.70 ROC-AUC). Chen et al. reported comparable $R^{2}$ values of $0.925 \pm 0.01$, $0.917 \pm 0.012$, and $0.915 \pm 0.01$ for ESOL across datasets of 2M, 103M, and 775M molecules, respectively. The data composition and covered chemical space appear to matter more than raw size.

Question 2: Which Chemical Language?

Most models use SMILES. ChemBERTa, RT, and SELFormer also explored SELFIES. MAT uses a simple list of atoms with structural features, while Mol-BERT and FP-BERT use circular fingerprints.

Direct comparisons between SMILES and SELFIES (by ChemBERTa on Tox21 SR-p53 and RT for drug-likeness prediction) found no significant performance difference. The RT authors reported that SELFIES models performed approximately $0.004 \pm 0.01$ better on RMSE, while SMILES models performed approximately $0.004 \pm 0.01$ better on Pearson correlation. The choice of chemical language does not appear to be a major factor in prediction performance, and even non-string representations (atom lists in MAT, fingerprints in Mol-BERT) perform competitively.

Question 3: How to Tokenize?

Tokenization methods span atom-level (42-66 vocabulary tokens), regex-based (47-2,362 tokens), BPE (509-52K tokens), and substructure-based (3,357-13,325 tokens) approaches. No systematic comparison of tokenization strategies exists in the literature. The vocabulary size varied dramatically, from 42 tokens for MolBERT to over 52K for ChemBERTa. The authors argue that chemically meaningful tokenization (e.g., functional group-based fragmentation) could improve both performance and explainability.
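For reference, regex-based tokenization of the kind mentioned above can be sketched in a few lines. The pattern below follows a form widely used in the reaction-prediction literature. Treat the exact alternation as an assumption, not as the tokenizer of any specific reviewed model.

```python
import re

# Bracket atoms, two-letter halogens, aromatic atoms, bonds, branches, and
# ring-closure labels each become one token.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    tokens = SMILES_PATTERN.findall(smiles)
    # Refuse to silently drop characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(tokens)
print(tokenize_smiles("[NH4+].[Cl-]"))
```

Keeping `Cl`, `Br`, and whole bracket atoms as single tokens is what separates this from plain character-level splitting, and it is one reason reported vocabulary sizes differ even among atom-level schemes.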

Question 4: How to Add Positional Embeddings?

Most models inherited the absolute positional embedding from their NLP base models. MolBERT and RT adopted relative positional embeddings. MolFormer combined absolute and Rotary Positional Embedding (RoPE). MAT incorporated spatial information (inter-atomic 3D distances and adjacency) alongside self-attention.

MolFormer’s comparison showed that RoPE became superior to absolute embeddings only when the pre-training dataset was very large. The performance difference (MAE on QM9) between absolute and RoPE embeddings for models trained on 111K, 111M, and 1.1B molecules was approximately $-0.20 \pm 0.18$, $-0.44 \pm 0.22$, and $0.27 \pm 0.12$, respectively. Reading the sign as absolute minus RoPE, with lower MAE being better, only the largest setting favors RoPE.

The authors highlight that SMILES and SELFIES are linearizations of a 2D molecular graph, so consecutive tokens in a sequence are not necessarily spatially close. Positional embeddings that reflect 2D or 3D molecular structure remain underexplored.
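For readers unfamiliar with rotary embeddings, the sketch below shows the core idea in plain Python: each consecutive pair of dimensions is rotated by an angle proportional to the token position, so the dot product between a rotated query and key depends only on their relative offset. This follows the standard RoPE formulation with an assumed base of 10000. MolFormer's actual implementation may differ in detail.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    if len(vec) % 2:
        raise ValueError("RoPE needs an even embedding dimension")
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # later pairs rotate more slowly
        x, y = vec[i], vec[i + 1]
        out.extend([x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
# Same offset (3 tokens apart) at two different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
print(near, far)
```

Note that "relative" here still means relative along the SMILES string. It does not address the point above that neighboring tokens may be far apart in the molecular graph.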

Question 5: How Many Parameters?

Model sizes range from approximately 7M (ST, Mol-BERT) to over 100M parameters (MAT). Most chemical language models operate with 100M parameters or fewer, much smaller than NLP models like BERT (110M-330M) or GPT-3 (175B).

Model	Dimensions	Heads	Layers	Parameters
ST	256	4	4	7M
MolBERT	768	12	12	85M
MolFormer	768	12	6, 12	43M, 85M
SELFormer	768	12, 4	8, 12	57M, 85M
MAT	1024	16	8	101M
ChemBERTa	768	12	6	43M

SELFormer and MolFormer both tested different model sizes. SELFormer’s larger model (approximately 86M parameters) scored approximately 0.034 higher ROC-AUC on BBBP than the smaller model. MolFormer’s larger model (approximately 87M parameters) scored approximately 0.04 higher ROC-AUC on average across BBBP, HIV, BACE, and SIDER. The field lacks the systematic scaling analyses (analogous to Kaplan et al. and Hoffmann et al. in NLP) needed to establish proper scaling laws for chemical language models.
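The parameter counts in the table can be roughly sanity-checked from the hidden dimension and layer count alone. The sketch below is a back-of-the-envelope estimate. It assumes a standard encoder block with four attention projection matrices and a feed-forward width of four times the hidden dimension, and it ignores embeddings, biases, and layer norms. The reviewed models may deviate from these assumptions.

```python
def estimate_encoder_params(d_model: int, n_layers: int, ffn_multiplier: int = 4) -> int:
    """Weights in the attention and feed-forward blocks only."""
    attention = 4 * d_model * d_model                      # W_Q, W_K, W_V, W_O
    feed_forward = 2 * ffn_multiplier * d_model * d_model  # two linear maps
    return n_layers * (attention + feed_forward)


configs = [("MolBERT", 768, 12), ("SELFormer (8 layers)", 768, 8), ("MAT", 1024, 8)]
for name, d_model, n_layers in configs:
    print(f"{name}: {estimate_encoder_params(d_model, n_layers) / 1e6:.1f}M")
```

Under these assumptions the estimate comes to about 85M for 768 dimensions with 12 layers, about 57M with 8 layers, and about 101M for 1024 dimensions with 8 layers, in line with the table. The 6-layer case comes out near 42M, slightly below the listed 43M. ST comes out near 3M against the listed 7M, which is expected since the estimate ignores its decoder and embeddings.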

Question 6: Which Pre-training Objectives?

Pre-training objectives fall into domain-agnostic and domain-specific categories:

Model	Pre-training Objective	Fine-tuning
MolFormer	MLM	Frozen, Update
SMILES-BERT	MLM	Update
MolBERT	MLM, PhysChemPred, SMILES-EQ	Frozen, Update
K-BERT	Atom feature, MACCS prediction, CL	Update last layer
ChemBERTa-2	MLM, MTR	Update
MAT	MLM, 2D Adjacency, 3D Distance	Update
ChemFormer	Denoising Span MLM, Augmentation	Update
RT	PLM (Permutation Language Modeling)	-

Domain-specific objectives (predicting physico-chemical properties, atom features, or MACCS keys) showed promising but inconsistent results. MolBERT’s PhysChemPred objective alone nearly matched the full three-objective model (approximately $0.72 \pm 0.06$ vs. $0.71 \pm 0.06$ ROC-AUC in virtual screening). The SMILES-EQ objective (identifying equivalent SMILES) was found to lower performance when combined with other objectives. K-BERT’s contrastive learning objective did not significantly change performance (average ROC-AUC of 0.806 vs. 0.807 with and without CL).

ChemBERTa-2’s Multi-Task Regression (MTR) objective performed noticeably better than MLM alone on nearly all of the four classification tasks, across pre-training dataset sizes.

Question 7: How to Fine-tune?

Fine-tuning through weight updates generally outperforms frozen representations. SELFormer showed this most dramatically, with a difference of 2.187 RMSE between frozen and updated models on FreeSolv. MolBERT showed a much smaller difference (0.575 RMSE on FreeSolv), likely because its domain-specific pre-training objectives already produced representations closer to the downstream tasks.

Benchmarking Challenges and Performance Comparison

Downstream Datasets

The review focuses on nine benchmark datasets across three categories from MoleculeNet:

Dataset	Molecules	Tasks	Type	Application
ESOL	1,128	1 regression	Physical chemistry	Aqueous solubility
FreeSolv	642	1 regression	Physical chemistry	Hydration free energy
Lipophilicity	4,200	1 regression	Physical chemistry	LogD at pH 7.4
BBBP	2,050	1 classification	Physiology	Blood-brain barrier
ClinTox	1,484	2 classification	Physiology	Clinical trial toxicity
SIDER	1,427	27 classification	Physiology	Drug side effects
Tox21	7,831	12 classification	Physiology	Nuclear receptor/stress pathways
BACE	1,513	1 classification	Biophysics	Beta-secretase 1 binding
HIV	41,127	1 classification	Biophysics	Anti-HIV activity

Inconsistencies in Evaluation

The authors document substantial inconsistencies that prevent fair model comparison:

Data splitting: Models used different splitting methods (scaffold vs. random) and different implementations even when using the same method. Not all models adhered to scaffold splitting for classification tasks as recommended.
Different test sets: Even models using the same split type may not evaluate on identical test molecules due to different random seeds.
Varying repetitions: Repetitions ranged from 3 (RT) to 50 (Chen et al.), making some analyses more statistically robust than others.
Metric inconsistency: Most use ROC-AUC for classification and RMSE for regression, but some models report only averages without standard deviations, while others report standard errors.
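To illustrate why "scaffold split" alone does not pin down a test set, here is a minimal sketch of one common variant: molecules are grouped by scaffold and whole groups are assigned greedily, largest first. The scaffold keys are assumed to be precomputed (for example as Bemis-Murcko scaffold SMILES), and the tie-breaking and fill order are choices of this sketch, not a standard.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Return (train, valid, test) index lists; a scaffold never spans two sets."""
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first, ties broken by first index, so the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold keys for ten molecules.
keys = ["c1ccccc1"] * 5 + ["c1ccncc1"] * 3 + ["C1CCCCC1", "c1ccoc1"]
train, valid, test = scaffold_split(keys)
print(train, valid, test)
```

Changing the group ordering, the tie-breaking rule, or the scaffold definition yields a different test set from the same data. This is the kind of implementation difference the authors point to.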

Performance Findings

When comparing only models evaluated on the same test sets (Figure 2 in the paper), the authors observe that transformer models show comparable, but not consistently superior, performance to existing ML and DL models. The performance varies considerably across models and datasets.

For BBBP, the Mol-BERT model reported lower ROC-AUC than its corresponding MPNN (approximately 0.88 vs. 0.91), while MolBERT outperformed its corresponding CDDD model (approximately 0.86 vs. 0.76 ROC-AUC) and its SVM baseline (approximately 0.86 vs. 0.70 ROC-AUC). A similar mixed pattern appeared for HIV: ChemBERTa performed worse than its corresponding ML models, while MolBERT performed better than its ML (approximately 0.08 higher ROC-AUC) and DL (approximately 0.03 higher ROC-AUC) baselines. For SIDER, Mol-BERT performed approximately 0.1 better ROC-AUC than its corresponding MPNN.

For regression, MAT and MolBERT showed improved performance over their ML and DL baselines on ESOL, FreeSolv, and Lipophilicity. For example, MAT performed approximately 0.2 lower RMSE than an SVM model and approximately 0.03 lower RMSE than the Weave model on ESOL.

Key Takeaways and Future Directions

The review concludes with six main takeaways:

Performance: Transformers using SMILES show comparable but not consistently superior performance to existing ML and DL models for MPP.
Scaling: No systematic analysis of model parameter scaling relative to data size exists for chemical language models. Such analysis is essential.
Pre-training data: Dataset size alone is not the sole determinant of downstream performance. Composition and chemical space coverage matter.
Chemical language: SMILES and SELFIES perform similarly. Alternative representations (atom lists, fingerprints) also work when the architecture is adjusted.
Domain knowledge: Domain-specific pre-training objectives show promise, but tokenization and positional encoding remain underexplored.
Benchmarking: The community needs standardized data splitting, fixed test sets, statistical analysis, and consistent reporting to enable meaningful comparison.

The authors also highlight the need for attention visualization and explainability analysis, investigation of NLP-originated techniques (pre-training regimes, fine-tuning strategies like LoRA, explainability methods), and adaptation of these techniques to the specific characteristics of chemical data (smaller vocabularies, shorter sequences).

Reproducibility Details

Data

This is a review paper. No new data or models are introduced. All analyses use previously reported results from the 16 reviewed papers, with additional visualization and comparison. The authors provide a GitHub repository with the code and data used to generate their comparative figures.

Algorithms

Not applicable (review paper). The paper describes training strategies at a conceptual level, referencing the original publications for implementation details.

Models

Not applicable (review paper). The paper catalogs 16 models with their architecture details, parameter counts, and training configurations across Tables 1, 4, 5, 6, and 7.

Evaluation

The paper compiles performance across nine MoleculeNet datasets. Key comparison figures (Figures 2 and 7) restrict to models evaluated on the same test sets for fair comparison, using ROC-AUC for classification and RMSE for regression.
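For completeness, both metrics are simple to state in code. The sketch below uses plain Python rather than a library implementation. ROC-AUC is computed directly from its pairwise definition, which is quadratic in the number of samples and therefore only suitable for small illustrations.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for regression tasks (lower is better)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Both numbers are only comparable across papers when they are computed on the same test molecules, which is why the comparison figures are restricted in this way.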

Hardware

Not applicable (review paper).

Artifact	Type	License	Notes
Transformers4MPP_review	Code	MIT	Figure generation code and compiled data

Paper Information

Citation: Sultan, A., Sieg, J., Mathea, M., & Volkamer, A. (2024). Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years. Journal of Chemical Information and Modeling, 64(16), 6259-6280. https://doi.org/10.1021/acs.jcim.4c00747

@article{sultan2024transformers,
  title={Transformers for Molecular Property Prediction: Lessons Learned from the Past Five Years},
  author={Sultan, Afnan and Sieg, Jochen and Mathea, Miriam and Volkamer, Andrea},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={16},
  pages={6259--6280},
  year={2024},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.4c00747}
}

Transformer CLMs for SMILES: Literature Review 2024

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Transformer-Based Chemical Language Models

This paper is a Systematization (literature review) that surveys the landscape of transformer-based chemical language models (CLMs) operating on SMILES representations. It organizes the field into three architectural categories (encoder-only, decoder-only, encoder-decoder), discusses tokenization strategies, pre-training and fine-tuning methodologies, and identifies open challenges and future research directions. The review covers approximately 30 distinct CLMs published through early 2024.

Why Review Transformer CLMs for SMILES?

The chemical space is vast, with databases like ZINC20 exceeding 5.5 billion compounds, and the amount of unlabeled molecular data far outstrips available labeled data for specific tasks like toxicity prediction or binding affinity estimation. Traditional molecular representations (fingerprints, descriptors, graph-based methods) require expert-engineered features and extensive domain knowledge.

Transformer-based language models, originally developed for NLP, have emerged as a compelling alternative. By treating SMILES strings as a “chemical language,” these models can leverage large-scale unsupervised pre-training on abundant unlabeled molecules, then fine-tune on small labeled datasets for specific downstream tasks. Earlier approaches like Seq2Seq and Seq3Seq fingerprint methods used RNN-based encoder-decoders, but these suffered from vanishing gradients and sequential processing bottlenecks when handling long SMILES sequences.

The authors motivate this review by noting that no prior survey has comprehensively organized transformer-based CLMs by architecture type while simultaneously covering tokenization, embedding strategies, and downstream application domains.

Architectural Taxonomy: Encoder, Decoder, and Encoder-Decoder Models

The core organizational contribution is a three-way taxonomy of transformer CLMs based on their architectural backbone.

Encoder-Only Models (BERT Family)

These models capture bidirectional context, making them well suited for extracting molecular representations for property prediction tasks. The review covers:

BERT (Lee and Nam, 2022): Adapted for SMILES processing with linguistic knowledge infusion, using BPE tokenization
MOLBERT (Fabian et al., 2020): Chemistry-specific BERT for physicochemical property and bioactivity prediction
SMILES-BERT (Wang et al., 2019): BERT variant designed to learn molecular representations directly from SMILES without feature engineering
ChemBERTa / ChemBERTa-2 (Chithrananda et al., 2020; Ahmad et al., 2022): RoBERTa-based models optimized for chemical property prediction, with ChemBERTa-2 exploring multi-task pre-training
GPT-MolBERTa (Balaji et al., 2023): Combines GPT molecular features with a RoBERTa backbone
MoLFormer (Ross et al., 2022): Large-scale model trained on 1.1 billion molecules, published in Nature Machine Intelligence
SELFormer (Yuksel et al., 2023): Operates on SELFIES representations rather than SMILES
Mol-BERT / MolRoPE-BERT (Li and Jiang, 2021; Liu et al., 2023): Differ in positional embedding strategy, with MolRoPE-BERT using rotary position embedding to handle longer sequences
BET (Chen et al., 2021): Extracts predictive representations from hundreds of millions of molecules

Decoder-Only Models (GPT Family)

These models excel at generative tasks, including de novo molecular design:

GPT-2-based model (Adilov, 2021): Generative pre-training from molecules
MolXPT (Liu et al., 2023): Wraps molecules with text for generative pre-training, connecting chemical and natural language
BioGPT (Luo et al., 2022): Focuses on biomedical text generation and mining
MolGPT (Haroon et al., 2023): Uses relative attention to capture token distances and relationships for de novo drug design
Mol-Instructions (Fang et al., 2023): Large-scale biomolecular instruction dataset for LLMs

Encoder-Decoder Models

These combine encoding and generation capabilities for sequence-to-sequence tasks:

Chemformer (Irwin et al., 2022): BART-based model for reaction prediction and molecular property prediction
MolT5 (adapted T5): Unified text-to-text framework for molecular tasks
SMILES Transformer (Honda et al., 2019): Pre-trained molecular fingerprints for low-data drug discovery
X-MOL (Xue et al., 2020): Large-scale pre-training for molecular understanding
Regression Transformer (Born and Manica, 2023): Operates on SELFIES, enabling concurrent regression and generation
TransAntivirus (Mao et al., 2023): Specialized for antiviral drug design using IUPAC nomenclature

Tokenization, Embedding, and Pre-Training Strategies

SMILES Tokenization

The review identifies tokenization as a critical preprocessing step that affects downstream performance. SMILES tokenization differs from standard NLP tokenization because SMILES strings lack whitespace and use parentheses for branching rather than sentence separation. The key approaches include:

Strategy	Source	Description
Atom-in-SMILES (AIS)	Ucak et al. (2023)	Atom-level tokens preserving chemical identity
SMILES Pair Encoding (SPE)	Li and Fourches (2021)	BPE-inspired substructure tokenization
Byte-Pair Encoding (BPE)	Chithrananda et al. (2020); Lee and Nam (2022)	Standard subword tokenization adapted for SMILES
SMILESTokenizer	Chithrananda et al. (2020)	Character-level tokenization with chemical adjustments
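The BPE-style strategies in the table share one core operation: repeatedly merging the most frequent adjacent token pair into a new vocabulary entry. The sketch below shows a single merge step on a toy corpus. It starts from individual characters for brevity, which would wrongly split two-letter atoms such as Cl, so a realistic pipeline would start from atom-level tokens. The function names are illustrative.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Count adjacent token pairs over all tokenized strings."""
    pairs = Counter()
    for tokens in corpus:
        pairs.update(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus split into characters: ethanol, propanol, acetic acid.
corpus = [list("CCO"), list("CCCO"), list("CC(=O)O")]
pair = most_frequent_pair(corpus)
corpus = [merge_pair(tokens, pair) for tokens in corpus]
print(pair, corpus)
```

Repeating the step until a target vocabulary size is reached produces substructure-like tokens, which is the intuition behind SPE.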

Positional Embeddings

The models use various positional encoding strategies: absolute, relative key, relative key-query, rotary (RoPE), and sinusoidal. Notably, SMILES-based models omit segmentation embeddings since SMILES data consists of single sequences rather than sentence pairs.
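Of the listed strategies, the sinusoidal variant is the simplest to write down because it has no learned parameters. A minimal sketch, assuming the usual base of 10000 from the original transformer formulation:

```python
import math


def sinusoidal_encoding(position: int, d_model: int, base: float = 10000.0):
    """Fixed encoding: sine on even dimensions, cosine on odd dimensions."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share one frequency.
        angle = position / base ** (2 * (i // 2) / d_model)
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


print(sinusoidal_encoding(0, 4))
print(sinusoidal_encoding(1, 4))
```

The resulting vector is added to the token embedding at that position. Learned absolute embeddings replace this fixed function with a lookup table.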

Pre-Training and Fine-Tuning Pipeline

The standard workflow follows two phases:

Pre-training: Unsupervised training on large unlabeled SMILES databases (ZINC, PubChem, ChEMBL) using masked language modeling (MLM), where the model learns to predict masked tokens within SMILES strings
Fine-tuning: Supervised adaptation on smaller labeled datasets for specific tasks (classification or regression)
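The masking step of the pre-training phase can be sketched as follows. The 15% mask rate and the `[MASK]` token follow the common BERT convention and are assumptions here, as is the one-token-per-character split. BERT's additional random-replacement and keep-original cases are left out for brevity.

```python
import random


def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=None):
    """Return (masked_tokens, targets); targets map masked positions to originals."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = mask_token
    return masked, targets


tokens = list("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, one token per character
masked, targets = mask_tokens(tokens, seed=0)
print(masked)
print(targets)
```

During pre-training the model sees `masked` as input and is scored only on recovering the entries in `targets`. Fine-tuning then swaps this objective for a task-specific head.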

The self-attention mechanism, central to all transformer CLMs, is formulated as:

$$ Z = \text{Softmax}\left(\frac{(XW^Q)(XW^K)^T}{\sqrt{d_k}}\right) XW^V $$

where $X \in \mathbb{R}^{N \times M}$ is the input feature matrix, $W^Q$, $W^K$, $W^V \in \mathbb{R}^{M \times d_k}$ are learnable weight matrices, and $\sqrt{d_k}$ is the scaling factor.

Benchmark Datasets and Evaluation Landscape

The review catalogs the standard evaluation ecosystem for CLMs. Pre-training databases include ZINC, PubChem, and ChEMBL. Fine-tuning and evaluation rely heavily on MoleculeNet benchmarks:

Category	Datasets	Task Type	Example Size
Physical Chemistry	ESOL, FreeSolv, Lipophilicity	Regression	642 to 4,200
Biophysics	PCBA, MUV, HIV, PDBbind, BACE	Classification/Regression	11,908 to 437,929
Physiology	BBBP, Tox21, ToxCast, SIDER, ClinTox	Classification	1,427 to 8,575

The authors also propose four new fine-tuning datasets, three of them disease-focused: COVID-19 drug compounds, antimalarial drugs (Plasmodium falciparum targets), and cancer gene expression/drug response data, plus a cocrystal formation dataset.

Challenges, Limitations, and Future Directions

Current Challenges

The review identifies several persistent limitations:

Data efficiency: Despite transfer learning, transformer CLMs still require substantial pre-training data, and labeled datasets for specific tasks remain scarce
Interpretability: The complexity of transformer architectures makes it difficult to understand how specific molecular features contribute to predictions
Computational cost: Training large-scale models demands significant GPU resources, limiting accessibility
Handling rare molecules: Models struggle with molecular structures that deviate significantly from training data distributions
SMILES limitations: Non-unique representations, invalid strings, exceeded atom valency, and inadequate spatial information capture

SMILES Representation Issues

The authors highlight five specific problems with SMILES as an input representation:

Non-canonical representations reduce string uniqueness for the same molecule
Many symbol combinations produce chemically invalid outputs
Valid SMILES strings can encode chemically impossible molecules (e.g., exceeded valency)
Spatial information is inadequately captured
Syntactic and semantic robustness is limited

Future Research Directions

The review proposes several directions:

Alternative molecular representations: Exploring SELFIES, DeepSMILES, IUPAC, and InChI beyond SMILES
Role of SMILES token types: Strategic masking of metals, non-metals, bonds, and branches during MLM pre-training to identify which components are most critical
Few-shot learning: Combining few-shot approaches with large-scale pre-trained CLMs for data-scarce scenarios
Drug repurposing: Training CLMs to distinguish identical compounds with different biological activity profiles across therapeutic domains
Improved benchmarks: Incorporating disease-specific datasets (malaria, cancer, COVID-19) for more realistic evaluation
Ethical considerations: Addressing dual-use risks, data biases, and responsible open-source release of CLMs

Reproducibility Details

This is a literature review paper. It does not introduce new models, code, or experimental results. The reproducibility assessment focuses on the accessibility of the reviewed works and proposed datasets.

Data

Purpose	Dataset	Size	Notes
Pre-training	ZINC20	5.5B+ compounds	Publicly available
Pre-training	PubChem	100M+ compounds	Publicly available
Pre-training	ChEMBL	2M+ compounds	Publicly available
Fine-tuning	MoleculeNet (8 datasets)	642 to 437,929	Standard benchmark suite
Proposed	COVID-19 drug compounds	740	From Harigua-Souiai et al. (2021)
Proposed	Cocrystal formation	3,282	From Mswahili et al. (2021)
Proposed	Antimalarial drugs	4,794	From Mswahili et al. (2024)
Proposed	Cancer gene/drug response	201 drugs, 734 cell lines	From Kim et al. (2021)

Artifacts

Artifact	Type	License	Notes
DAI Lab website	Other	N/A	Authors’ research lab

No code, models, or evaluation scripts are released with this review. The paper does not include a supplementary materials section or GitHub repository.

Hardware

Not applicable (literature review).

Paper Information

Citation: Mswahili, M. E., & Jeong, Y.-S. (2024). Transformer-based models for chemical SMILES representation: A comprehensive literature review. Heliyon, 10(20), e39038. https://doi.org/10.1016/j.heliyon.2024.e39038

@article{mswahili2024transformer,
  title={Transformer-based models for chemical {SMILES} representation: A comprehensive literature review},
  author={Mswahili, Medard Edmund and Jeong, Young-Seob},
  journal={Heliyon},
  volume={10},
  number={20},
  pages={e39038},
  year={2024},
  publisher={Elsevier},
  doi={10.1016/j.heliyon.2024.e39038}
}

Systematic Review of Deep Learning CLMs (2020-2024)

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Chemical Language Models for Molecular Generation

This paper is a Systematization that provides a comprehensive, PRISMA-guided systematic review of deep learning chemical language models (CLMs) used for de novo molecular generation. The primary contribution is a structured statistical analysis of 72 retrieved articles from 2020 to June 2024, comparing architectures (RNNs, transformers, VAEs, GANs, S4 models), molecular representations, biased generation strategies, and quality metrics from the MOSES and GuacaMol benchmarking platforms. The review addresses five research questions about architecture configuration effects, best-performing architectures, impactful hyperparameters, common molecular representations, and effective biased generation methods.

Motivation: Evaluating Four Years of Generative CLM Progress

Deep learning molecular generation has expanded rapidly since 2018, when Gomez-Bombarelli et al. and Segler et al. demonstrated that deep generative models could learn to produce novel molecules from SMILES representations. By 2020, multiple architectures (RNNs, transformers, VAEs, GANs) were being applied to chemical language modeling, and benchmarking platforms like MOSES and GuacaMol had been introduced to enable standardized evaluation.

Despite this growth, existing reviews largely focused on theoretical background or drug development applications rather than systematic statistical comparison of model performance. Few studies had examined how architecture choice, training dataset size, molecular representation format, and biased learning strategies interact to affect generation quality metrics like validity, uniqueness, and novelty. This review fills that gap by restricting the analysis to papers reporting MOSES or GuacaMol metrics, enabling quantitative cross-study comparison.

PRISMA-Based Systematic Review Methodology

The review follows the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines. Articles were retrieved from Scopus, Web of Science, and Google Scholar using six Boolean search queries combining terms like “Molecule Generation,” “Chemical Language Models,” “Deep Learning,” and specific architecture names. The search window covered January 2020 to June 2024.

Eligibility Criteria

Papers were included if they:

Were written in English
Explicitly presented at least two metrics of uniqueness, validity, or novelty
Defined these metrics consistent with MOSES or GuacaMol concepts
Used deep learning generative models for de novo molecule design
Used conventional (non-quantum) deep learning methods
Were published between January 2020 and June 2024

This yielded 48 articles from query-based search and 25 from citation search, totaling 72 articles. Of these, 62 used CLM approaches (string-based molecular representations) and 10 used graph-based representations.

Data Collection

For each article, the authors extracted: journal details, database name, training dataset size, molecular representation type (SMILES, SELFIES, InChI, DeepSMILES), architecture details (embedding length, layers, hidden units, trainable parameters, dropout, temperature, batch size, epochs, learning rate, optimizer), biased method usage (TL, RL, conditional learning), and generation metrics (validity, uniqueness, novelty, scaffold diversity, SNN, FCD).

Evaluation Metrics

The review focuses on three core MOSES metrics:

$$ \text{Validity}(V_m) = \frac{\text{Valid molecules}}{\text{Molecules produced}} $$

$$ \text{Uniqueness} = \frac{|\text{set}(V_m)|}{|V_m|} $$

$$ \text{Novelty} = 1 - \frac{|V_m \cap T_d|}{|V_m|} $$

where $V_m$ denotes the valid generated molecules, $T_d$ the training dataset, and $|\cdot|$ the number of molecules in a collection.
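
A minimal sketch of how these three metrics might be computed is shown below. It assumes molecules are already canonicalized strings, takes a caller-supplied `is_valid` predicate (in practice a cheminformatics parser; the parenthesis check used here is only a stand-in), and computes novelty over the unique valid molecules, which is how MOSES implements it and is an interpretive choice relative to the formulas above.

```python
def generation_metrics(generated, training, is_valid):
    """Return (validity, uniqueness, novelty) for a batch of generated strings."""
    if not generated:
        raise ValueError("no generated molecules to evaluate")
    valid = [m for m in generated if is_valid(m)]
    validity = len(valid) / len(generated)
    if not valid:
        return validity, 0.0, 0.0
    unique = set(valid)
    uniqueness = len(unique) / len(valid)
    # Novelty: fraction of unique valid molecules not seen in training.
    novelty = 1.0 - len(unique & set(training)) / len(unique)
    return validity, uniqueness, novelty


def toy_is_valid(smiles):
    # Stand-in validity check for illustration only: balanced parentheses.
    return smiles.count("(") == smiles.count(")")


generated = ["CCO", "CCO", "CCN", "C(C", "CCC"]
training = ["CCO", "CCCC"]
validity, uniqueness, novelty = generation_metrics(generated, training, toy_is_valid)
```

On this toy batch, four of five strings pass the check, three of the four are distinct, and one of those three appears in the training set.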

Architecture Distribution and Performance Comparison

Architecture Trends (2020-2024)

The review found that RNNs and transformers dominate CLM usage, with a growing trend toward transformers over time. The breakdown across 62 CLM articles: 24 RNN-based, 23 transformer-based, 16 VAE-based, 8 GAN-based, and 1 S4-based model. Among RNN variants, LSTM was the most common, followed by GRU, despite GRU having fewer trainable parameters.

The increase in transformer adoption is attributed to self-attention mechanisms enabling parallel computation and effective long-range dependency capture. Meanwhile, GANs and VAEs saw lower adoption rates, partly due to higher memory and time complexity and reduced ability to generate large molecules.

Molecular Representations and Databases

SMILES was used exclusively in 77.27% of CLM articles, reflecting its wide database availability and compact format. SELFIES, DeepSMILES, and InChI each appeared in smaller fractions. The dominant databases were ChEMBL and ZINC (27 articles each), followed by PubChem (4 articles). Approximately 71% of reviewed articles focused on drug discovery applications.

Database	Molecules (millions)	Representation	Articles
ChEMBL	2.4	SMILES, InChI	27
ZINC	750	SMILES	27
PubChem	115.3	SMILES, InChI	4
COCONUT	0.695	SMILES, InChI	1
DNA-Encoded Library	1,040	SMILES	1

Unbiased Model Performance

Validity: No statistically significant differences were observed across architecture families. Transformers generally achieved high validity through self-attention mechanisms that retain uncompressed sequence information. However, one transformer model (TransMol) achieved only 6.9% validity when using stochastic sampling with Gaussian noise to explore unseen chemical space. GANs showed high dispersion, with validity as low as 8.5% when learning from gene expression signatures rather than molecular structures directly.

Uniqueness: No significant differences in median uniqueness across architectures. Transformer-based models using masked self-attention achieved near-perfect uniqueness scores. Scaffold decoration and fragment-linking approaches sometimes compromised uniqueness due to overfit-driven redundancy.

Validity-Novelty Trade-off: The authors propose a “Valid/Sample” metric (Validity x Novelty) and find an inverse trend between validity and novelty (Spearman $\rho = -0.3575$, p-value = 0.0618). Only 17.9% of models achieved above-median values for both validity (95.6%) and novelty (96.5%) simultaneously. SELFIES-based models achieve 100% validity by construction, which can help address this trade-off.
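
The two quantities in this analysis can be sketched in a few lines. The rank-based correlation below is a plain-Python rendering of Spearman's $\rho$ with average ranks for ties; the per-model values are made-up illustrative numbers, not data from the review.

```python
import math


def ranks(values):
    """Assign 1-based ranks, averaging ranks across ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def valid_per_sample(validity, novelty):
    """The review's Valid/Sample metric: validity times novelty."""
    return validity * novelty


# Hypothetical per-model scores, for illustration only.
validities = [0.99, 0.97, 0.95, 0.90]
novelties = [0.80, 0.88, 0.93, 0.99]
rho = spearman(validities, novelties)
median_product = valid_per_sample(0.956, 0.965)
```

A model sitting exactly at both reported medians would score about 0.92 on the Valid/Sample metric.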

Biased Model Performance

The review examines three biased generation strategies:

Transfer Learning (TL): The most prevalent biased method, used across all architecture types. Fine-tuning transfers pre-trained parameters to a target model, requiring significantly fewer training molecules (median ~2,507 vs. ~1.1M for unbiased). TL does not significantly affect validity (p = 0.16) or novelty (p = 0.84), but uniqueness decreases significantly (median 90.2% vs. 97.9%, p = 0.014), likely due to overfitting on small target datasets.

Metric	Unbiased (median)	TL Target (median)	p-value
Training size	1,128,920	2,507	<0.0001
Validity	98.05%	95.5%	0.1602
Uniqueness	97.9%	90.2%	0.0144
Novelty	91.6%	96.0%	0.8438

Reinforcement Learning (RL): Applied only to RNNs and transformers in the reviewed set. 90.1% of RL implementations used policy gradient methods with scoring functions for properties like synthesizability, binding affinity, and membrane permeability. No significant effects on generation metrics were observed.

Metric	Unbiased (median)	RL Target (median)	p-value
Validity	91.1%	96.5%	0.1289
Uniqueness	99.9%	89.7%	0.0935
Novelty	91.5%	93.5%	0.2500

Conditional Learning (CL): Integrates domain-specific data (properties, bioactivities, functional groups) directly into training via constraint tokens or property embeddings. Used primarily with encoder-decoder architectures (ARAEs, VAEs, transformers). CL does not significantly degrade generation metrics relative to unbiased models.

Metric	Unbiased (median)	CL Target (median)	p-value
Validity	98.5%	96.8%	0.4648
Uniqueness	99.9%	97.5%	0.0753
Novelty	89.3%	99.6%	0.2945
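
One common way to realize the constraint-token variant of conditional learning is to prepend property tokens to the molecular sequence. The sketch below assumes a hypothetical `<name=value>` token format and `<bos>`/`<eos>` markers; none of these names come from the reviewed papers.

```python
def add_condition_tokens(smiles_tokens, conditions):
    """Prefix a tokenized molecule with one constraint token per property."""
    # Sort by property name so the prefix order is deterministic.
    prefix = [f"<{name}={value}>" for name, value in sorted(conditions.items())]
    return prefix + ["<bos>"] + list(smiles_tokens) + ["<eos>"]


sequence = add_condition_tokens(["C", "C", "O"], {"qed": "high", "logp": "low"})
```

At generation time, the same prefix would be supplied as the prompt so that sampling is steered toward molecules with the requested properties.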

Key Findings and Directions for Chemical Language Models

Main Conclusions

Transformers are overtaking RNNs as the dominant CLM architecture, driven by self-attention mechanisms that capture long-range dependencies without the vanishing-gradient issues of recurrent models.
SMILES remains dominant (77% of models) despite known limitations (non-uniqueness, syntax errors). SELFIES shows promise for improving the validity-novelty trade-off.
No architecture achieves both high validity and high novelty easily. Only 17.9% of unbiased models exceeded medians for both metrics simultaneously, highlighting a fundamental tension in generative chemistry.
Transfer learning requires only ~2,500 molecules to generate targeted compounds, compared to ~1.1M for unbiased training, but at the cost of reduced uniqueness.
Combining biased methods (e.g., TL + RL, CL + TL) shows promise for multi-objective optimization and exploring distant regions of chemical space.
S4 models were first applied to CLMs in 2023 and showed competitive performance, combining convolutional computation during training with recurrent generation.

Limitations

The review is restricted to papers reporting MOSES or GuacaMol metrics, which excludes many molecular generation studies that use alternative evaluation frameworks. The statistical comparisons rely on median values reported across different experimental settings, making direct architecture comparisons approximate. Graph-based approaches are included only for coarse comparison (10 of 72 articles) and are not the focus of the analysis.

Reproducibility Details

Data

This is a systematic review, so no new models were trained. The authors collected metadata from 72 published articles. No datasets were generated or analyzed beyond the literature corpus.

Algorithms

Statistical comparisons used Mann-Whitney U tests for paired samples. Spearman correlation was used to assess the validity-novelty relationship. Outlier identification used the Valid/Sample (Validity x Novelty) metric with box plot analysis.

Evaluation

The review evaluates models using MOSES metrics: validity, uniqueness, novelty, scaffold diversity, scaffold novelty, fragment similarity, SNN, internal diversity, and FCD. Statistical tests were applied to compare medians across architecture families and between biased and unbiased models.

Hardware

Not applicable (systematic review, no model training performed).

Paper Information

Citation: Flores-Hernandez, H., & Martínez-Ledesma, E. (2024). A systematic review of deep learning chemical language models in recent era. Journal of Cheminformatics, 16(1), 129. https://doi.org/10.1186/s13321-024-00916-y

@article{floreshernandez2024systematic,
  title={A systematic review of deep learning chemical language models in recent era},
  author={Flores-Hernandez, Hector and Mart{\'i}nez-Ledesma, Emmanuel},
  journal={Journal of Cheminformatics},
  volume={16},
  number={1},
  pages={129},
  year={2024},
  publisher={BioMed Central},
  doi={10.1186/s13321-024-00916-y}
}

Survey of Transformer Architectures in Molecular Science

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Transformer Architectures for Molecular Science

This paper is a Systematization review. It organizes and taxonomizes 12 families of transformer architectures that have been applied across molecular science, including chemistry, biology, and drug discovery. The primary contribution is not a new method or dataset, but a structured technical overview of the algorithmic internals of each transformer variant and their specific applications to molecular problems. The review covers 201 references and provides a unified treatment of how these architectures capture molecular patterns from sequential, graphical, and image-based data.

Bridging the Gap Between Transformer Variants and Molecular Applications

Transformer-based models have become widespread in molecular science, yet the authors identify a gap: there is no organized taxonomy linking these diverse techniques in the existing literature. Individual papers introduce specific architectures or applications, but practitioners lack a unified reference that explains the technical differences between GPT, BERT, BART, graph transformers, and other variants in the context of molecular data. The review aims to fill this gap by providing an in-depth investigation of the algorithmic components of each model family, explaining how their architectural innovations contribute to processing complex molecular data.

The authors note that the success of transformers in molecular science stems from several factors: the sequential nature of chemical and biological molecules (DNA, RNA, proteins, SMILES strings), the attention mechanism’s ability to capture long-range dependencies within molecular structures, and the capacity for transfer learning through pre-training on large chemical and biological datasets.

Twelve Transformer Families and Their Molecular Mechanisms

The review covers transformer preliminaries before diving into 12 specific architecture families. The core self-attention mechanism computes:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

where $d_k$ is the dimension of the key vectors. The position-wise feed-forward network is:

$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
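
For a single token vector, the feed-forward formula above reduces to two affine maps with a ReLU in between. The following sketch uses plain lists and toy weights chosen for illustration; the function name is an assumption.

```python
def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to one token vector."""
    # First affine map followed by ReLU; zip(*w1) iterates over columns of W1.
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*w1), b1)
    ]
    # Second affine map back to the output dimension.
    return [
        sum(h * w for h, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]


w1 = [[1.0, 0.0], [0.0, 1.0]]   # 2 x 2 identity
b1 = [0.0, 0.0]
w2 = [[2.0], [3.0]]             # 2 x 1 projection
b2 = [0.5]
out = position_wise_ffn([1.0, -2.0], w1, b1, w2, b2)
```

The negative input component is zeroed by the ReLU, so only the first hidden unit contributes to the output.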

The 12 architecture families covered are:

GPT (Generative Pre-trained Transformer): Uses the decoder part of the transformer for autoregressive generation. Applications include MolGPT for molecular generation, DrugGPT for protein-ligand binding, and cMolGPT for target-specific de novo molecular generation.
BERT (Bidirectional Encoder Representations from Transformers): Uses transformer encoders with masked language modeling and next-sentence prediction for pre-training. Molecular applications include FP-BERT for molecular property prediction using composite fingerprint representations, Graph-BERT for protein-protein interaction identification, SMILES-BERT, and Mol-BERT.
BART (Bidirectional and Auto-Regressive Transformers): Functions as a denoising autoencoder with both encoder and decoder. Molecular applications include Chemformer for sequence-to-sequence chemistry tasks, MS2Mol for mass spectrometry analysis, and MolBART for molecular feature learning.
Graph Transformer: Leverages self-attention on graph-structured data to capture global context. Applications include GraphSite for protein-DNA binding site prediction (using AlphaFold2 structure predictions), KPGT for knowledge-guided molecular graph pre-training, and PAGTN for establishing long-range dependencies in molecular graphs.
Transformer-XL: Incorporates relative positional encoding for modeling long sequences. Used for small molecule retention time prediction, drug design with ChEMBL data (1.27 million molecules), and Heck reaction generation.
T5 (Text-to-Text Transfer Transformer): Unifies NLP tasks into text-to-text mapping. T5Chem was pre-trained on 97 million molecules from PubChem and achieved 99.5% accuracy on reaction classification (USPTO 500 MT). C5T5 uses IUPAC naming for molecular optimization in drug discovery.
Vision Transformer (ViT): Applies transformer architecture to image patches. Used for organic molecule classification (97% accuracy with WGAN-generated data), bacterial identification via SERS, and molecular property prediction from mass spectrometry data (TransG-Net).
DETR (Detection Transformer): End-to-end object detection using transformers. Applied to cryo-EM particle picking (TransPicker), molecular structure image recognition (IMG2SMI), and cell segmentation (Cell-DETR).
Conformer: Integrates convolutional modules into the transformer structure. Used for DNA storage error correction (RRCC-DNN) and drug-target affinity prediction (NG-DTA with Davis and Kiba datasets).
CLIP (Contrastive Language-Image Pre-training): Multimodal learning linking text and images. Applied to peptide design (Cut&CLIP for protein degradation), gene identification (pathCLIP), and drug discovery (CLOOME for zero-shot transfer learning).
Sparse Transformers: Use sparse attention matrices to reduce complexity to $O(n\sqrt{n})$. Applied to drug-target interaction prediction with gated cross-attention mechanisms.
Mobile and Efficient Transformers: Compressed variants (TinyBERT, MobileBERT) for resource-constrained environments. Molormer uses ProbSparse self-attention for drug-drug interaction prediction. LOGO is a lightweight pre-trained language model for non-coding genome interpretation.
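
To make the $O(n\sqrt{n})$ figure for sparse transformers concrete, the sketch below builds one possible causal sparsity pattern in the spirit of strided sparse attention: each position attends to a local window and to every stride-th earlier position. This is an illustrative pattern under the assumption that the stride is about $\sqrt{n}$; it is not the exact pattern of any model named in the survey.

```python
import math


def strided_sparse_mask(n, stride=None):
    """Return an n x n boolean mask; True means position i may attend to j."""
    stride = stride or max(1, math.isqrt(n))
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                   # causal: only j <= i
            local = (i - j) < stride             # recent neighbours
            strided = (i - j) % stride == 0      # every stride-th earlier token
            mask[i][j] = local or strided
    return mask


mask = strided_sparse_mask(16)
per_row = [sum(row) for row in mask]
sparse_entries = sum(per_row)
dense_causal_entries = 16 * 17 // 2
```

With the stride near $\sqrt{n}$, each row holds on the order of $2\sqrt{n}$ entries, so the total grows as $n\sqrt{n}$ instead of $n^2$.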

Survey Organization and Coverage of Molecular Domains

As a survey paper, this work does not present new experiments. Instead, it catalogues existing applications across multiple molecular domains:

Drug Discovery and Design: GPT-based ligand design (DrugGPT), BART-based molecular generation (Chemformer, MolBART), graph transformer pre-training for molecular property prediction (KPGT), T5-based chemical reaction prediction (T5Chem), and sparse transformer methods for drug-target interactions.

Protein Science: BERT-based protein-protein interaction prediction (Graph-BERT), graph transformer methods for protein-DNA binding (GraphSite with AlphaFold2 integration), conformer-based drug-target affinity prediction (NG-DTA), and CLIP-based peptide design (Cut&CLIP).

Molecular Property Prediction: FP-BERT for fingerprint-based prediction, SMILES-BERT and Mol-BERT for end-to-end prediction from SMILES, KPGT for knowledge-guided graph pre-training, and Transformer-XL for property modeling with relative positional encoding.

Structural Biology: DETR-based cryo-EM particle picking (TransPicker), vision transformer applications in cell imaging, and Cell-DETR for instance segmentation in microscopy.

Genomics: Conformer-based DNA storage error correction (RRCC-DNN), LOGO for non-coding genome interpretation, and MetaTransformer for metagenomic sequencing analysis.

Future Directions and Limitations of the Survey

The review concludes with four future directions:

ChatGPT integration into molecular science: Using LLMs for data analysis, literature review, and hypothesis generation in chemistry and biology.
Multifunction transformers: Models that extract features across diverse molecular structures and sequences simultaneously.
Molecular-aware transformers: Architectures that handle multiple data types (text, sequence, structure, image, energy, molecular dynamics, function) in a unified framework.
Self-assessment transformers and superintelligence: Speculative discussion of models that learn from seemingly unrelated data sources.

The review has several limitations worth noting. The coverage is broad but shallow: each architecture family receives only 1-2 pages of discussion, and the paper largely describes existing work rather than critically evaluating it. The review does not systematically compare the architectures against each other on common benchmarks. The future directions section (particularly the superintelligence discussion) is speculative and lacks concrete proposals. The paper also focuses primarily on technical architecture descriptions rather than analyzing failure modes, scalability challenges, or reproducibility concerns across the surveyed methods. As a review article, no new data were created or analyzed.

Reproducibility Details

Data

This is a survey paper. No new datasets were created or used. The paper reviews applications involving datasets such as PubChem (97 million molecules for T5Chem), ChEMBL (1.27 million molecules for Transformer-XL drug design), USPTO 500 MT (reaction classification), ESOL (5,328 molecules for property prediction), and Davis/Kiba (drug-target affinity).

Algorithms

No new algorithms are introduced. The paper provides mathematical descriptions of the core transformer components (self-attention, positional encoding, feed-forward networks, layer normalization) and describes how 12 architecture families modify these components.

Models

No new models are presented. The paper surveys existing models including MolGPT, DrugGPT, FP-BERT, SMILES-BERT, Chemformer, MolBART, GraphSite, KPGT, T5Chem, TransPicker, Cell-DETR, CLOOME, and Molormer, among others.

Evaluation

No new evaluation is performed. Performance numbers cited from the literature include: T5Chem reaction classification accuracy of 99.5%, ViT organic molecule classification at 97%, Transformer-XL property prediction RMSE of 0.6 on ESOL, and Heck reaction generation feasibility rate of 47.76%.

Hardware

No hardware requirements are specified, as this is a survey paper.

Artifact	Type	License	Notes
Paper (open access)	Paper	CC-BY-NC-ND	Open access via Wiley

Paper Information

Citation: Jiang, J., Ke, L., Chen, L., Dou, B., Zhu, Y., Liu, J., Zhang, B., Zhou, T., & Wei, G.-W. (2024). Transformer technology in molecular science. WIREs Computational Molecular Science, 14(4), e1725. https://doi.org/10.1002/wcms.1725

@article{jiang2024transformer,
  title={Transformer technology in molecular science},
  author={Jiang, Jian and Ke, Lu and Chen, Long and Dou, Bozheng and Zhu, Yueying and Liu, Jie and Zhang, Bengong and Zhou, Tianshou and Wei, Guo-Wei},
  journal={WIREs Computational Molecular Science},
  volume={14},
  number={4},
  pages={e1725},
  year={2024},
  publisher={Wiley},
  doi={10.1002/wcms.1725}
}

RNNs vs Transformers for Molecular Generation Tasks

Thu, 26 Mar 2026 00:00:00 +0000

An Empirical Comparison of Sequence Architectures for Molecular Generation

This is an Empirical paper that systematically compares two dominant sequence modeling architectures, recurrent neural networks (RNNs) and the Transformer, for chemical language modeling. The primary contribution is a controlled experimental comparison across three generative tasks of increasing complexity, combined with an evaluation of two molecular string representations (SMILES and SELFIES). The paper does not propose a new method; instead, it provides practical guidance on when each architecture is more appropriate for molecular generation.

Why Compare RNNs and Transformers for Molecular Design?

Exploring unknown molecular space and designing molecules with target properties is a central goal in computational drug design. Language models trained on molecular string representations (SMILES, SELFIES) have shown the capacity to learn complex molecular distributions. RNN-based models, including LSTM and GRU variants, were the first widely adopted architectures for this task. Models like CharRNN, ReLeaSE, and conditional RNNs demonstrated success in generating focused molecular libraries. More recently, self-attention-based Transformer models (MolGPT, LigGPT) have gained popularity due to their parallelizability and ability to capture long-range dependencies.

Despite the widespread adoption of Transformers across NLP, it was not clear whether they uniformly outperform RNNs for molecular generation. Prior work by Dollar et al. showed that RNN-based models achieved higher validity than Transformer-based models in some settings. Flam-Shepherd et al. demonstrated that RNN language models could learn complex molecular distributions across challenging generative tasks. This paper extends that comparison by adding the Transformer architecture to the same set of challenging tasks and evaluating both SMILES and SELFIES representations.

Experimental Design: Three Tasks, Two Architectures, Two Representations

The core experimental design uses a 2x2 setup: two architectures (RNN and Transformer) crossed with two molecular representations (SMILES and SELFIES), yielding four model variants: SM-RNN, SF-RNN, SM-Transformer, and SF-Transformer.

Three generative tasks

The three tasks, drawn from Flam-Shepherd et al., are designed with increasing complexity:

Penalized LogP task: Generate molecules with high penalized LogP scores (LogP minus synthetic accessibility and long-cycle penalties). The dataset is built from ZINC15 molecules with penalized LogP > 4.0. Molecule sequences are relatively short (50-75 tokens).
Multidistribution task: Learn a multimodal molecular weight distribution constructed from four distinct subsets: GDB13 (MW <= 185), ZINC (185 <= MW <= 425), Harvard Clean Energy Project (460 <= MW <= 600), and POLYMERS (MW > 600). This tests the ability to capture multiple modes simultaneously.
Large-scale task: Generate large molecules from PubChem with more than 100 heavy atoms and MW ranging from 1250 to 5000. This tests long-sequence generation capability.

Model configuration

Models are compared at matched parameter counts, ranging from roughly 5.2-5.3M up to 36.4M parameters. Hyperparameter optimization uses random search over learning rate [0.0001, 0.001], hidden units (500-1000 for RNNs, 376-776 for Transformers), number of layers [3, 5], and dropout [0.0, 0.5]. A regex-based tokenizer replaces character-by-character tokenization, reducing sequence lengths for large molecules from 10,000 to under 3,000 tokens.
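
A random search over the ranges quoted above might be sketched as follows. Uniform sampling, integer-valued hidden units, and the function and key names are assumptions of this example; the paper's exact sampling scheme is not described here.

```python
import random


def sample_config(arch, rng):
    """Draw one hyperparameter configuration for 'rnn' or 'transformer'."""
    if arch not in ("rnn", "transformer"):
        raise ValueError(f"unknown architecture: {arch!r}")
    hidden_range = (500, 1000) if arch == "rnn" else (376, 776)
    return {
        "learning_rate": rng.uniform(1e-4, 1e-3),
        "hidden_units": rng.randint(*hidden_range),   # inclusive bounds
        "num_layers": rng.randint(3, 5),
        "dropout": rng.uniform(0.0, 0.5),
    }


rng = random.Random(0)
rnn_trials = [sample_config("rnn", rng) for _ in range(20)]
transformer_trials = [sample_config("transformer", rng) for _ in range(20)]
```

Each sampled configuration would then be trained and scored, with the best-performing one kept per architecture and task.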

Evaluation metrics

The evaluation covers multiple dimensions:

Standard metrics: validity, uniqueness, novelty
Molecular properties: FCD, LogP, SA, QED, Bertz complexity (BCT), natural product likeness (NP), molecular weight (MW)
Wasserstein distance: measures distributional similarity between generated and training molecules for each property
Tanimoto similarity: structural and scaffold similarity between generated and training molecules
Token length (TL): comparison of generated vs. training sequence lengths

For each task, 10,000 molecules are generated and evaluated.
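
Two of the metrics above have compact definitions that a short sketch can make explicit. The Wasserstein function below handles only the equal-sample-size case, where the 1D distance is the mean absolute difference of sorted values; the Tanimoto function treats fingerprints as sets of on-bit indices. Both function names and the toy inputs are assumptions for illustration.

```python
def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance between two equal-size empirical samples."""
    if len(a) != len(b) or not a:
        raise ValueError("this sketch requires two non-empty samples of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy property values (e.g. LogP) for generated vs. training molecules.
distance = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
# Toy fingerprints sharing two of four distinct bits.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

Lower Wasserstein distances indicate that the generated property distribution tracks the training distribution more closely, while higher Tanimoto values indicate closer structural resemblance.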

Key Results Across Tasks

Penalized LogP task

Model	FCD	LogP	SA	QED	BCT	NP	MW	TL
SM-RNN	0.56	0.12	0.02	0.01	16.61	0.09	5.90	0.43
SF-RNN	1.63	0.25	0.42	0.02	36.43	0.23	2.35	0.40
SM-Transformer	0.83	0.18	0.02	0.01	23.77	0.09	7.99	0.84
SF-Transformer	1.97	0.22	0.47	0.02	44.43	0.28	5.04	0.53

RNN-based models achieve smaller Wasserstein distances across most properties. The authors attribute this to LogP being computed as a sum of atomic contributions (a local property), which aligns with RNNs’ strength in capturing local structural features. RNNs also generated ring counts closer to the training distribution (4.10 for SM-RNN vs. 4.04 for SM-Transformer, with training data at 4.21). The Transformer performed better on global structural similarity (higher Tanimoto similarity to training data).

Multidistribution task

Model	FCD	LogP	SA	QED	BCT	NP	MW	TL
SM-RNN	0.16	0.07	0.03	0.01	18.34	0.02	7.07	0.81
SF-RNN	1.46	0.38	0.55	0.03	110.72	0.24	10.00	1.58
SM-Transformer	0.16	0.16	0.03	0.01	39.94	0.02	10.03	1.28
SF-Transformer	1.73	0.37	0.63	0.04	107.46	0.30	17.57	2.40

Both SMILES-based models captured all four modes of the MW distribution well. While RNNs had smaller overall Wasserstein distances, the Transformer fitted the higher-MW modes better. This aligns with the observation that longer molecular sequences (which correlate with higher MW) favor the Transformer’s global attention mechanism over the RNN’s sequential processing.

Large-scale task

Model	FCD	LogP	SA	QED	BCT	NP	MW	TL
SM-RNN	0.46	1.89	0.20	0.01	307.09	0.03	105.29	12.05
SF-RNN	1.65	1.78	0.43	0.01	456.98	0.14	100.79	15.26
SM-Transformer	0.36	1.64	0.07	0.01	172.93	0.02	59.04	7.41
SF-Transformer	1.91	2.82	0.47	0.01	464.75	0.18	92.91	11.57

The Transformer demonstrates a clear advantage on large molecules. SM-Transformer achieves substantially lower Wasserstein distances than SM-RNN across nearly all properties, with particularly large improvements in BCT (172.93 vs. 307.09) and MW (59.04 vs. 105.29). The Transformer also produces better Tanimoto similarity scores and more accurate token length distributions.

Standard metrics across all tasks

Task	Metric	SM-RNN	SF-RNN	SM-Transformer	SF-Transformer
LogP	Valid	0.90	1.00	0.89	1.00
LogP	Uniqueness	0.98	0.99	0.98	0.99
LogP	Novelty	0.75	0.71	0.71	0.71
Multi	Valid	0.95	1.00	0.97	1.00
Multi	Uniqueness	0.96	1.00	1.00	1.00
Multi	Novelty	0.91	0.98	0.91	0.98
Large	Valid	0.84	1.00	0.88	1.00
Large	Uniqueness	0.99	0.99	0.98	0.99
Large	Novelty	0.85	0.92	0.86	0.94

SELFIES achieves 100% validity across all tasks by construction, while SMILES validity drops for large molecules. Among SMILES-based models, the Transformer achieves slightly higher validity than the RNN on the multidistribution (0.97 vs. 0.95) and large-scale (0.88 vs. 0.84) tasks, while the RNN is marginally ahead on the LogP task (0.90 vs. 0.89).
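The three standard metrics reduce to a few set operations. The sketch below uses common definitions (validity over all samples, uniqueness over valid samples, novelty over unique valid samples); the exact denominators used in the paper are not stated in this summary, so they are assumptions. The `balanced` predicate is a hypothetical stand-in for a real parser such as RDKit.

```python
def standard_metrics(generated, training, is_valid):
    """Return (validity, uniqueness, novelty) for a list of generated strings.

    Assumed definitions (not taken from the paper):
      validity   = valid / generated
      uniqueness = unique valid / valid
      novelty    = unique valid not in training / unique valid
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training)
    validity = len(valid) / len(generated) if generated else 0.0
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = len(novel) / len(unique) if unique else 0.0
    return validity, uniqueness, novelty


def balanced(s):
    """Toy validity check: balanced parentheses only (hypothetical placeholder)."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


samples = ["CCO", "CCO", "CC(C)O", "CC(O", "CCN"]  # made-up generated strings
train = ["CCO"]                                    # made-up training set
metrics = standard_metrics(samples, train, balanced)
```

Here four of five strings pass the toy check, three of those four are distinct, and two of the three distinct strings are absent from the training set.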

Conclusions and Practical Guidelines

The central finding is that neither architecture universally dominates. The choice between RNNs and Transformers should depend on the characteristics of the molecular data:

RNNs are preferred when molecular properties depend on local structural features (e.g., LogP, ring counts) and when sequences are relatively short. They better capture local fragment distributions.
Transformers are preferred when dealing with large molecules (high MW, long sequences) where global attention can capture the overall distribution more effectively. RNNs suffer from information obliteration on long sequences.
SMILES outperforms SELFIES on property distribution metrics across nearly all tasks and models. While SELFIES guarantees 100% syntactic validity, its generated molecules show worse distributional fidelity to training data. The authors argue that validity is a less important concern than property fidelity, since invalid SMILES can be filtered easily.

The authors acknowledge that longer sequences remain challenging for both architectures. For Transformers, the quadratic growth of the attention matrix limits scalability. For RNNs, the vanishing gradient problem limits effective context length.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Task 1	ZINC15 (penalized LogP > 4.0)	Not specified	High penalized LogP molecules
Task 2	GDB-13 + ZINC + CEP + POLYMERS	~200K	Multimodal MW distribution
Task 3	PubChem (>100 heavy atoms)	Not specified	MW range 1250-5000

Data processing code is available at https://github.com/danielflamshep/genmoltasks (from the original Flam-Shepherd et al. study).

Algorithms

Tokenization: Regex-based tokenizer (not character-by-character)
Hyperparameter search: Random search over learning rate [0.0001, 0.001], hidden units, layers [3, 5], dropout [0.0, 0.5]
Selection: Top 20% by sum of valid + unique + novelty, then final selection on all indicators
Generation: 10K molecules per model per task
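A regex-based tokenizer keeps multi-character units such as `Cl`, `Br`, and bracket atoms together instead of splitting them into single characters. The pattern below is one commonly used in SMILES language modeling; the exact expression used by the authors is not given in this summary, so treat it as an assumption.

```python
import re

# Commonly used SMILES token pattern (an assumption; not taken from the paper).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


acetyl_chloride = tokenize("CC(=O)Cl")
nitrobenzene = tokenize("c1ccccc1[N+](=O)[O-]")
```

A character-level tokenizer would split `Cl` into `C` and `l` and would break `[N+]` into four symbols, which is the failure mode the regex approach avoids.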

Models

Model	Parameters	Architecture
RNN variants	5.2M - 36.4M	RNN (LSTM/GRU)
Transformer variants	5.3M - 36.4M	Transformer decoder

Evaluation

Wasserstein distance for property distributions (FCD, LogP, SA, QED, BCT, NP, MW, TL), Tanimoto similarity (molecular and scaffold), validity, uniqueness, novelty.
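For one-dimensional property distributions, the Wasserstein distance has a simple form when both samples have the same size. The sketch below is a minimal illustration with made-up LogP values, not the evaluation code from the paper; `scipy.stats.wasserstein_distance` covers the general case with unequal sample sizes.

```python
def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1D samples.

    For empirical distributions with the same number of points, the optimal
    transport plan matches sorted values, so the distance is the mean absolute
    difference between order statistics.
    """
    if len(a) != len(b) or not a:
        raise ValueError("samples must be non-empty and of equal size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


train_logp = [4.1, 4.5, 5.0, 5.2]      # made-up values for illustration
generated_logp = [4.0, 4.6, 5.1, 5.6]  # made-up values for illustration
distance = wasserstein_1d(train_logp, generated_logp)
```

A distance of zero means the two samples have identical sorted values; shifting every value by a constant raises the distance by exactly that constant.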

Hardware

Not specified in the paper.

Artifacts

Artifact	Type	License	Notes
trans_language	Code	Not specified	Transformer implementation by the authors
genmoltasks	Code/Data	Apache-2.0	Dataset construction from Flam-Shepherd et al.

Paper Information

Citation: Chen, Y., Wang, Z., Zeng, X., Li, Y., Li, P., Ye, X., & Sakurai, T. (2023). Molecular language models: RNNs or transformer? Briefings in Functional Genomics, 22(4), 392-400. https://doi.org/10.1093/bfgp/elad012

@article{chen2023molecular,
  title={Molecular language models: RNNs or transformer?},
  author={Chen, Yangyang and Wang, Zixu and Zeng, Xiangxiang and Li, Yayang and Li, Pengyong and Ye, Xiucai and Sakurai, Tetsuya},
  journal={Briefings in Functional Genomics},
  volume={22},
  number={4},
  pages={392--400},
  year={2023},
  publisher={Oxford University Press},
  doi={10.1093/bfgp/elad012}
}

Review: Deep Learning for Molecular Design (2019)

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Deep Generative Models for Molecular Design

This is a Systematization paper that organizes and compares the rapidly growing literature on deep generative modeling for molecules. Published in 2019, it catalogs 45 papers from the preceding two years, classifying them by architecture (RNNs, VAEs, GANs, reinforcement learning) and molecular representation (SMILES strings, context-free grammars, graph tensors, 3D voxels). The review provides mathematical foundations for each technique, identifies cross-cutting themes, and proposes a framework for reward function design that addresses diversity, novelty, stability, and synthesizability.

The Challenge of Navigating Vast Chemical Space

The space of potential drug-like molecules has been estimated to contain between $10^{23}$ and $10^{60}$ compounds, while only about $10^{8}$ have ever been synthesized. Traditional approaches to molecular design rely on combinatorial methods, mixing known scaffolds and functional groups, but these generate many unstable or unsynthesizable candidates. High-throughput screening (HTS) and high-throughput virtual screening (HTVS) help but remain computationally expensive. The average cost to bring a new drug to market exceeds one billion USD, with a 13-year average timeline from discovery to market.

By 2016, deep generative models had shown strong results in producing original images, music, and text. The “molecular autoencoder” of Gomez-Bombarelli et al. (2016/2018) first applied these techniques to molecular generation, triggering an explosion of follow-up work. By the time of this review, the landscape had grown complex enough, with many architectures, representation schemes, and no agreed-upon benchmarking standards, to warrant systematic organization.

Molecular Representations and Architecture Taxonomy

The review’s core organizational contribution is a two-axis taxonomy: molecular representations on one axis and deep learning architectures on the other.

Molecular Representations

The review categorizes representations into 3D and 2D graph-based schemes:

3D representations include raw voxels (placing nuclear charges on a grid), smoothed voxels (Gaussian blurring around nuclei), and tensor field networks. These capture full geometric information but suffer from high dimensionality, sparsity, and difficulty encoding rotation/translation invariance.

2D graph representations include:

SMILES strings: The dominant representation, encoding molecular graphs as ASCII character sequences via depth-first traversal. Non-unique (each molecule with $N$ heavy atoms has at least $N$ SMILES representations), but invertible and widely supported.
Canonical SMILES: Unique per molecule, but models trained on them may learn the grammar rules of canonicalization rather than the underlying chemical structure.
Context-free grammars (CFGs): Decompose SMILES into grammar rules to improve validity rates, though not to 100%.
Tensor representations: Store atom types in a vertex feature matrix $X \in \mathbb{R}^{N \times |\mathcal{A}|}$ and bond types in an adjacency tensor $A \in \mathbb{R}^{N \times N \times Y}$.
Graph operations: Directly build molecular graphs by adding atoms and bonds, guaranteeing 100% chemical validity.
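The tensor representation in the list above can be made concrete with a three-atom example. The sketch below uses plain nested lists and an illustrative atom and bond vocabulary; both vocabularies are assumptions chosen for brevity, not the ones used by any reviewed model.

```python
ATOM_TYPES = ["C", "N", "O"]                 # illustrative atom alphabet
BOND_TYPES = ["single", "double", "triple"]  # illustrative bond types, Y = 3


def to_tensors(atoms, bonds):
    """Build a one-hot vertex matrix (N x |A|) and adjacency tensor (N x N x Y)."""
    n = len(atoms)
    x = [[1.0 if a == t else 0.0 for t in ATOM_TYPES] for a in atoms]
    adj = [[[0.0] * len(BOND_TYPES) for _ in range(n)] for _ in range(n)]
    for i, j, kind in bonds:
        k = BOND_TYPES.index(kind)
        adj[i][j][k] = 1.0
        adj[j][i][k] = 1.0  # undirected graph: keep the tensor symmetric
    return x, adj


# Acetaldehyde heavy atoms (SMILES "CC=O"): C0-C1 single, C1=O2 double.
X, A = to_tensors(["C", "C", "O"], [(0, 1, "single"), (1, 2, "double")])
```

Each row of `X` is a one-hot atom type, and each `A[i][j]` is a one-hot bond type or all zeros when atoms `i` and `j` are not bonded.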

Deep Learning Architectures

Recurrent Neural Networks (RNNs) generate SMILES strings character by character, typically using LSTM or GRU units. Training uses maximum likelihood estimation (MLE) with teacher forcing:

$$ L^{\text{MLE}} = -\sum_{s \in \mathcal{X}} \sum_{t=2}^{T} \log \pi_{\theta}(s_{t} \mid S_{1:t-1}) $$

Thermal rescaling of the output distribution controls the diversity-validity tradeoff via a temperature parameter $T$ (not to be confused with the sequence length $T$ in the MLE sum above). RNNs achieved SMILES validity rates of 94-98%.
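Temperature rescaling is usually implemented by dividing the logits by the temperature before the softmax. The sketch below assumes that form and uses made-up next-token scores; it is an illustration, not code from any reviewed model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Rescale logits by 1/temperature before the softmax (temperature > 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up next-token scores
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

Low temperatures concentrate probability on the top-scoring token, which favors validity, while high temperatures flatten the distribution, which favors diversity.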

Variational Autoencoders (VAEs) learn a continuous latent space by maximizing the evidence lower bound (ELBO):

$$ \mathcal{L}_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_{\phi}(z|x)}[\log p_{\theta}(x|z)] - D_{\text{KL}}[q_{\phi}(z|x), p(z)] $$

The first term encourages accurate reconstruction while the KL divergence term regularizes the latent distribution toward a standard Gaussian prior $p(z) = \mathcal{N}(z; 0, I)$. Variants include grammar VAEs (GVAEs), syntax-directed VAEs, junction tree VAEs, and adversarial autoencoders (AAEs) that replace the KL term with adversarial training.

Generative Adversarial Networks (GANs) train a generator against a discriminator using the minimax objective:

$$ \min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{d}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] $$

The review shows that with an optimal discriminator, the generator objective reduces to minimizing the Jensen-Shannon divergence, which captures both forward and reverse KL divergence terms. This provides a more “balanced” training signal than MLE alone. The Wasserstein GAN (WGAN) uses the Earth mover’s distance for more stable training:

$$ W(p, q) = \inf_{\gamma \in \Pi(p,q)} \mathbb{E}_{(x,y) \sim \gamma} \lVert x - y \rVert $$

Reinforcement Learning recasts molecular generation as a sequential decision problem. The policy gradient (REINFORCE) update is:

$$ \nabla J(\theta) = \mathbb{E}\left[G_{t} \frac{\nabla_{\theta} \pi_{\theta}(a_{t} \mid y_{1:t-1})}{\pi_{\theta}(a_{t} \mid y_{1:t-1})}\right] $$

To prevent RL fine-tuning from causing the generator to “drift” away from viable chemical structures, an augmented reward function incorporates the prior likelihood:

$$ R'(S) = [\sigma R(S) + \log P_{\text{prior}}(S) - \log P_{\text{current}}(S)]^{2} $$

Cataloging 45 Models and Their Design Choices

Rather than running new experiments, the review’s methodology involves systematically cataloging and comparing 45 published models. Table 2 in the paper lists each model’s architecture, representation, training dataset, and dataset size. Key patterns include:

RNN-based models (16 entries): Almost exclusively use SMILES, trained on ZINC or ChEMBL datasets with 0.1M-1.7M molecules.
VAE variants (20 entries): The most diverse category, spanning SMILES VAEs, grammar VAEs, junction tree VAEs, graph-based VAEs, and 3D VAEs. Training sets range from 10K to 72M molecules.
GAN models (7 entries): Include ORGAN, RANC, ATNC, MolGAN, and CycleGAN approaches. Notably, GANs appear to work with fewer training samples.
Other approaches (2 entries): Pure RL methods from Zhou et al. and Stahl et al. that do not require pretraining on a dataset.

The review also catalogs 13 publicly available datasets (Table 3), ranging from QM9 (133K molecules with quantum chemical properties) to GDB-13 (977M combinatorially generated molecules) and ZINC15 (750M+ commercially available compounds).

Metrics and Reward Function Design

A significant contribution is the systematic treatment of reward functions. The review argues that generated molecules should satisfy six desiderata: diversity, novelty, stability, synthesizability, non-triviality, and good properties. Key metrics formalized include:

Diversity using Tanimoto similarity over fingerprints:

$$ r_{\text{diversity}} = 1 - \frac{1}{|\mathcal{G}|} \sum_{(x_{1}, x_{2}) \in \mathcal{G} \times \mathcal{G}} D(x_{1}, x_{2}) $$
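A minimal sketch of this diversity score follows, assuming $D$ denotes Tanimoto similarity so that higher scores mean a more diverse set. It averages over distinct pairs, which is an assumption that differs from the $1/|\mathcal{G}|$ prefactor as written, and the fingerprints are hypothetical sets of "on" bit indices.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def internal_diversity(fingerprints):
    """1 minus the mean Tanimoto similarity over all distinct pairs."""
    n = len(fingerprints)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += tanimoto(fingerprints[i], fingerprints[j])
    return 1.0 - total / (n * (n - 1) / 2)


# Hypothetical fingerprints, given as sets of "on" bit indices.
fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
score = internal_diversity(fps)
```

In practice the bit sets would come from a cheminformatics toolkit; only the pairwise averaging is shown here.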

Novelty measured as the fraction of generated molecules not appearing in a hold-out test set:

$$ r_{\text{novel}} = 1 - \frac{|\mathcal{G} \cap \mathcal{T}|}{|\mathcal{T}|} $$

Synthesizability primarily assessed via the SA score, sometimes augmented with ring penalties and medicinal chemistry filters.

The review also discusses the Fréchet ChemNet Distance as an analog of FID for molecular generation, and notes the emergence of standardized benchmarking platforms including MOSES, GuacaMol, and DiversityNet.

Key Findings and Future Directions

The review identifies several major trends and conclusions:

Shift from SMILES to graph-based representations. SMILES-based methods struggle with validity (the molecular autoencoder VAE achieved only 0.7-75% valid SMILES depending on sampling strategy). Methods that work directly on molecular graphs with chemistry-preserving operations achieve 100% validity, and the review predicts this trend will continue.

Advantages of adversarial and RL training over MLE. The mathematical analysis shows that MLE only optimizes forward KL divergence, which can lead to models that place probability mass where the data distribution is zero. GAN training optimizes the Jensen-Shannon divergence, which balances forward and reverse KL terms. RL approaches, particularly pure RL without pretraining, showed competitive performance with much less training data.
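The asymmetry between forward and reverse KL, and the symmetric behavior of the Jensen-Shannon divergence, can be checked on small discrete distributions. The sketch below uses made-up probabilities in which the model assigns mass to an outcome the data never produces.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; infinite if q misses mass of p."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint mixture."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


data = [0.5, 0.5, 0.0]    # made-up data distribution
model = [0.4, 0.4, 0.2]   # made-up model with mass where the data has none

forward = kl(data, model)   # the quantity MLE minimizes; stays finite here
reverse = kl(model, data)   # penalizes the spurious mass; infinite here
balanced = js(data, model)  # finite and bounded above by log 2
```

The forward KL barely registers the spurious mass, the reverse KL penalizes it without bound, and the Jensen-Shannon divergence sits between the two.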

Genetic algorithms remain competitive. The review notes that the latest genetic algorithm approaches (Grammatical Evolution) could match deep learning methods for molecular optimization under some metrics, and at 100x lower computational cost in some comparisons. This serves as an important baseline calibration.

Reward function design is underappreciated. Early models generated unstable molecules with labile groups (enamines, hemiaminals, enol ethers). Better reward functions that incorporate synthesizability, diversity, and stability constraints significantly improved practical utility.

Need for standardized benchmarks. The review identifies a lack of agreement on evaluation methodology as a major barrier to progress, noting that published comparisons are often subtly biased toward novel methods.

Limitations

As a review paper from early 2019, the work predates several important developments: transformer-based architectures (which would soon dominate), SELFIES representations, diffusion models for molecules, and large-scale pretrained chemical language models. The review focuses primarily on drug-like small molecules and does not deeply cover protein design or materials optimization.

Reproducibility Details

Data

This is a review paper that does not present new experimental results. The paper catalogs 13 publicly available datasets used across the reviewed works; six of them are listed here:

Purpose	Dataset	Size	Notes
Training/Eval	GDB-13	977M	Combinatorially generated library
Training/Eval	ZINC15	750M+	Commercially available compounds
Training/Eval	GDB-17	50M	Combinatorially generated library
Training/Eval	ChEMBL	2M	Curated bioactive molecules
Training/Eval	QM9	133,885	Small organic molecules with DFT properties
Training/Eval	PubChemQC	3.98M	PubChem compounds with DFT data

Algorithms

The review provides mathematical derivations for MLE training (Eq. 1), VAE ELBO (Eqs. 9-13), AAE objectives (Eqs. 15-16), GAN objectives (Eqs. 19-22), WGAN (Eq. 24), REINFORCE gradient (Eq. 7), and numerous reward function formulations (Eqs. 26-36).

Evaluation

Key evaluation frameworks discussed:

Fréchet ChemNet Distance (molecular analog of FID)
MOSES benchmarking platform
GuacaMol benchmarking suite
Validity rate, uniqueness, novelty, and internal diversity metrics

Paper Information

Citation: Elton, D. C., Boukouvalas, Z., Fuge, M. D., & Chung, P. W. (2019). Deep Learning for Molecular Design: A Review of the State of the Art. Molecular Systems Design & Engineering, 4(4), 828-849. https://doi.org/10.1039/C9ME00039A

@article{elton2019deep,
  title={Deep Learning for Molecular Design -- A Review of the State of the Art},
  author={Elton, Daniel C. and Boukouvalas, Zois and Fuge, Mark D. and Chung, Peter W.},
  journal={Molecular Systems Design \& Engineering},
  volume={4},
  number={4},
  pages={828--849},
  year={2019},
  publisher={Royal Society of Chemistry},
  doi={10.1039/C9ME00039A}
}

Inverse Molecular Design with ML Generative Models

Thu, 26 Mar 2026 00:00:00 +0000

A Foundational Systematization of Inverse Molecular Design

This paper is a Systematization of the nascent field of inverse molecular design using machine learning generative models. Published in Science in 2018, it organizes and contextualizes the rapidly emerging body of work on using deep generative models (variational autoencoders, generative adversarial networks, and reinforcement learning) to navigate chemical space and propose novel molecules with targeted properties. Rather than introducing a new method, the paper synthesizes the conceptual framework connecting molecular representations, generative architectures, and inverse design objectives, establishing a reference point for the field at a critical early stage.

The Challenge of Navigating Chemical Space

The core problem is the sheer scale of chemical space. For pharmacologically relevant small molecules alone, the number of possible structures is estimated at $10^{60}$. Traditional approaches to materials discovery rely on trial and error or high-throughput virtual screening (HTVS), both of which are fundamentally limited by the need to enumerate and evaluate candidates from a predefined library.

The conventional materials discovery pipeline, from concept to commercial product, historically takes 15 to 20 years, involving iterative cycles of simulation, synthesis, device integration, and characterization. Inverse design offers a conceptual alternative: start from a desired functionality and search for molecular structures that satisfy it. This inverts the standard paradigm where a molecule is proposed first and its properties are computed or measured afterward.

The key distinction the authors draw is between discriminative and generative models. A discriminative model learns $p(y|x)$, the conditional probability of properties $y$ given a molecule $x$. A generative model instead learns the joint distribution $p(x,y)$, which can be conditioned to yield either the direct design problem $p(y|x)$ or the inverse design problem $p(x|y)$.
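A toy joint distribution shows how one table yields both the direct and the inverse problem by conditioning on different variables. The molecule names and probabilities below are made up purely for illustration.

```python
# Toy joint distribution p(x, y) over two molecules and a binary property.
# All names and probabilities are made up for illustration.
joint = {
    ("mol_A", "active"): 0.30,
    ("mol_A", "inactive"): 0.20,
    ("mol_B", "active"): 0.10,
    ("mol_B", "inactive"): 0.40,
}


def condition(joint, axis, value):
    """Condition on joint-key position `axis` being equal to `value`.

    axis=0 conditions on the molecule and returns p(y | x);
    axis=1 conditions on the property and returns p(x | y).
    """
    kept = {k: p for k, p in joint.items() if k[axis] == value}
    norm = sum(kept.values())  # marginal probability of the conditioning event
    return {k[1 - axis]: p / norm for k, p in kept.items()}


p_y_given_x = condition(joint, 0, "mol_A")   # direct problem
p_x_given_y = condition(joint, 1, "active")  # inverse problem
```

Real generative models cannot enumerate the joint like this, but the same division by the marginal is what conditioning on a target property amounts to.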

Three Pillars: VAEs, GANs, and Reinforcement Learning

The review organizes inverse molecular design approaches around three generative paradigms and the molecular representations they operate on.

Molecular Representations

The paper surveys representations across three broad categories:

Discrete (text-based): SMILES strings encode molecular structure as 1D text following a grammar syntax. Their adoption has been driven by the availability of NLP deep learning tools.
Continuous (vectors/tensors): Coulomb matrices, bag of bonds, fingerprints, symmetry functions, and electronic density representations. These expose different physical symmetries (permutational, rotational, reflectional, translational invariance).
Weighted graphs: Molecules as undirected graphs where atoms are nodes and bonds are edges, with vectorized features on edges and nodes (bonding type, aromaticity, charge, distance).

An ideal representation for inverse design should be invertible, meaning it supports mapping back to a synthesizable molecular structure. SMILES strings and molecular graphs are invertible, while many continuous representations require lookup tables or auxiliary methods.

Variational Autoencoders (VAEs)

VAEs encode molecules into a continuous latent space and decode latent vectors back to molecular representations. The key insight is that by constraining the encoder to produce latent vectors following a Gaussian distribution, the model gains the ability to interpolate between molecules and sample novel structures. The latent space encodes a geometry: nearby points decode to similar molecules, and gradient-based optimization over this continuous space enables direct property optimization.

The VAE objective, the evidence lower bound (ELBO) that is maximized during training, combines a reconstruction term with a KL divergence regularizer:

$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \,\|\, p(z))$$

where $q(z|x)$ is the encoder (approximate posterior), $p(x|z)$ is the decoder, and $p(z)$ is the prior (typically Gaussian).

Semi-supervised variants jointly train on molecules and properties, reorganizing latent space so molecules with similar properties cluster together. Gomez-Bombarelli et al. demonstrated local and global optimization across generated distributions using Bayesian optimization over latent space.

The review traces the evolution from character-level SMILES VAEs to grammar-aware and syntax-directed variants that improve the generation of syntactically valid structures.

Generative Adversarial Networks (GANs)

GANs pit a generator against a discriminator in an adversarial training framework. The generator learns to produce synthetic molecules from noise, while the discriminator learns to distinguish synthetic from real molecules. GAN training is difficult to bring to convergence and is prone to mode collapse and generator-discriminator imbalance.

For molecular applications, dealing with discrete SMILES data introduces nondifferentiability, addressed through workarounds like SeqGAN’s policy gradient approach and boundary-seeking GANs.

Reinforcement Learning (RL)

RL treats molecule generation as a sequential decision process where an agent (the generator) takes actions (adding characters to a SMILES string) to maximize a reward (desired molecular properties). Since rewards can only be assigned after sequence completion, Monte Carlo Tree Search (MCTS) is used to simulate possible completions and weight paths based on their success.
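The rollout idea can be sketched without a full tree search. The example below implements only plain Monte Carlo rollouts (the simulation step of MCTS, with no tree, selection policy, or backpropagation), and the token set, sequence length, and reward are hypothetical toys.

```python
import random

ALPHABET = "CNO"  # toy token set, an assumption for illustration
MAX_LEN = 6       # toy fixed sequence length


def reward(seq):
    """Hypothetical end-of-sequence reward: fraction of carbon tokens."""
    return seq.count("C") / len(seq)


def rollout_value(prefix, n_rollouts, rng):
    """Estimate a prefix's value by averaging rewards of random completions."""
    missing = MAX_LEN - len(prefix)
    total = 0.0
    for _ in range(n_rollouts):
        tail = "".join(rng.choice(ALPHABET) for _ in range(missing))
        total += reward(prefix + tail)
    return total / n_rollouts


rng = random.Random(0)
value_c = rollout_value("CCC", 2000, rng)  # promising prefix under this reward
value_n = rollout_value("NNN", 2000, rng)  # poor prefix under this reward
```

A search procedure would use such estimates to decide which partial strings to extend, even though the reward itself is only defined for complete sequences.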

Applications include generation of drug-like molecules and retrosynthesis planning. Notable examples cited include RL for optimizing putative JAK2 inhibitors and molecules active against dopamine receptor type 2.

Hybrid Approaches

The review highlights that these paradigms are not exclusive. Examples include druGAN (adversarial autoencoder) and ORGANIC (combined GAN and RL), which leverage strengths of multiple frameworks.

Survey of Applications and Design Paradigms

Being a review paper, this work does not present new experiments but surveys existing applications across domains:

Drug Discovery: Most generative model applications at the time of writing targeted pharmaceutical properties, including solubility, melting temperature, synthesizability, and target activity. Popova et al. optimized for JAK2 inhibitors, and Olivecrona et al. targeted dopamine receptor type 2.

Materials Science: HTVS had been applied to organic photovoltaics (screening by frontier orbital energies and conversion efficiency), organic redox flow batteries (redox potential and solubility), organic LEDs (singlet-triplet gap), and inorganic materials via the Materials Project.

Chemical Space Exploration: Evolution strategies had been applied to map chemical space, with structured search procedures incorporating genotype representations and mutation operations. Bayesian sampling with sequential Monte Carlo and gradient-based optimization of properties with respect to molecular systems represented alternative inverse design strategies.

Graph-Based Generation: The paper notes the emerging extension of VAEs to molecular graphs (junction tree VAE) and message passing networks for incremental graph construction, though the graph isomorphism approximation problem remained a practical challenge.

Future Directions and Open Challenges

The authors identify several open directions for the field:

Closed-Loop Discovery: The ultimate goal is to concurrently propose, create, and characterize new materials with simultaneous data flow between components. At the time of writing, very few examples of successful closed-loop approaches existed.

Active Learning: Combining inverse design with Bayesian optimization enables models that adapt as they explore chemical space, expanding in regions of high uncertainty and discovering molecular regions with desirable properties as a function of composition.

Representation Learning: No single molecular representation works optimally for all properties. Graph and hierarchical representations were identified as areas needing further study. Representations that encode relevant physics tend to generalize better.

Improved Architectures: Memory-augmented sequence generation models, Riemannian optimization methods exploiting latent space geometry, multi-level VAEs for structured latent spaces, and inverse RL for learning reward functions were highlighted as promising research directions.

Integration into Education: The authors advocate for integrating ML into curricula across chemical, biochemical, medicinal, and materials sciences.

Limitations

As a review paper from 2018, this work captures the field at an early stage. Several limitations are worth noting:

The survey is dominated by SMILES-based approaches, reflecting the state of the field at the time. Graph-based and 3D-aware generative models were just emerging.
Quantitative benchmarking of generative models was not yet standardized. The review does not provide systematic comparisons across methods.
The synthesis feasibility of generated molecules receives limited attention. The gap between computationally generated candidates and experimentally realizable molecules was (and remains) a significant challenge.
Transformer-based architectures, which would come to dominate chemical language modeling, are not discussed, as the Transformer had only been published the year prior.

Reproducibility Details

As a review/perspective paper, this work does not introduce new models, datasets, or experiments. The reproducibility assessment applies to the cited primary works rather than the review itself.

Key Cited Methods and Their Resources

Method	Authors	Type	Availability
Automatic Chemical Design (VAE)	Gomez-Bombarelli et al.	Code + Data	Published in ACS Central Science
Grammar VAE	Kusner et al.	Code	arXiv:1703.01925
Junction Tree VAE	Jin et al.	Code	arXiv:1802.04364
ORGANIC	Sanchez-Lengeling et al.	Code	ChemRxiv preprint
SeqGAN	Yu et al.	Code	AAAI 2017
Neural Message Passing	Gilmer et al.	Code	arXiv:1704.01212

Paper Information

Citation: Sánchez-Lengeling, B., & Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: Generative models for matter engineering. Science, 361(6400), 360-365. https://doi.org/10.1126/science.aat2663

@article{sanchez-lengeling2018inverse,
  title={Inverse molecular design using machine learning: Generative models for matter engineering},
  author={S{\'a}nchez-Lengeling, Benjamin and Aspuru-Guzik, Al{\'a}n},
  journal={Science},
  volume={361},
  number={6400},
  pages={360--365},
  year={2018},
  publisher={American Association for the Advancement of Science},
  doi={10.1126/science.aat2663}
}

Generative AI Survey for De Novo Molecule and Protein Design

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Generative AI for Drug Design

This is a Systematization paper that provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.

The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.

The Challenge of Navigating De Novo Drug Design

The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery methods are computationally expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.

AI-driven generative methods have gained traction in recent years, with over 150 AI-focused biotech companies initiating small-molecule drugs in the discovery phase and 15 in clinical trials. The rate of AI-fueled drug design processes has expanded by almost 40% each year.

The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.

Unified Taxonomy: Two Themes, Seven Subtasks

The survey’s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.

Generative Model Architectures

The survey covers four main generative model families used across both molecule and protein generation:

Variational Autoencoders (VAEs) encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$

where the KL loss is:

$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log\left(\left(\sigma_k^{(i)}\right)^2\right) - \left(\mu_k^{(i)}\right)^2 - \left(\sigma_k^{(i)}\right)^2\right)$$
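The closed-form KL term is short enough to write out directly. The sketch below parameterizes the encoder output by the mean and log-variance per latent dimension, which is a common convention and an assumption here rather than something the survey specifies.

```python
import math


def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])  # encoder matches the prior
shifted = gaussian_kl([1.0], [0.0])             # unit mean shift, unit variance
```

The term vanishes when the encoder output equals the standard normal prior and grows as the mean moves away from zero or the variance away from one.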

Generative Adversarial Networks (GANs) use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:

$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$

Flow-Based Models generate data by applying an invertible function $f: z_0 \mapsto x$ to transform a simple latent distribution (Gaussian) to the target distribution. The log-likelihood is computed using the change-of-variable formula:

$$\log p(x) = \log p_0(z_0) - \log \left| \det \frac{\partial f}{\partial z_0} \right|, \qquad z_0 = f^{-1}(x)$$
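A one-dimensional affine flow makes the change-of-variables bookkeeping concrete. In the sketch below, $x = a z_0 + b$ with a standard normal base density, so the density of $x$ is the base density at $z_0$ divided by $|a|$; the result is cross-checked against the known density of $\mathcal{N}(b, a^2)$. The function names are illustrative.

```python
import math


def standard_normal_logpdf(z):
    """Log density of N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def affine_flow_logpdf(x, a, b):
    """log p(x) for x = f(z0) = a * z0 + b with z0 ~ N(0, 1) and a != 0."""
    z0 = (x - b) / a            # invert the flow
    log_det = math.log(abs(a))  # log |df/dz0|
    return standard_normal_logpdf(z0) - log_det


def normal_logpdf(x, mean, std):
    """Reference log density of N(mean, std^2), used only as a cross-check."""
    z = (x - mean) / std
    return -0.5 * (z * z + math.log(2.0 * math.pi)) - math.log(std)


flow_value = affine_flow_logpdf(1.3, 2.0, 0.5)
reference = normal_logpdf(1.3, 0.5, 2.0)
```

Stretching the latent variable by a factor $|a| > 1$ spreads probability mass over a wider range, so the log density drops by $\log|a|$.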

Diffusion Models gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:

$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The training loss minimizes the difference between the true noise and the predicted noise:

$$L_t = \mathbb{E}_{t \sim [1,T], x_0, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$
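The forward step and the noise-prediction loss can be sketched in a few lines. This is a single-step illustration with made-up coordinates and no neural network: the "predictions" are stand-ins, and practical implementations usually sample $x_t$ directly from $x_0$ using a cumulative noise schedule, which is omitted here.

```python
import math
import random


def forward_step(x, beta, rng):
    """One noising step: x_next = sqrt(1 - beta) * x + sqrt(beta) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    x_next = [
        math.sqrt(1.0 - beta) * xi + math.sqrt(beta) * ei
        for xi, ei in zip(x, eps)
    ]
    return x_next, eps


def noise_loss(eps_true, eps_pred):
    """Squared L2 distance between true and predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred))


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                 # made-up coordinates
x1, eps = forward_step(x0, 0.1, rng)
perfect = noise_loss(eps, eps)        # a perfect denoiser has zero loss
zero_guess = noise_loss(eps, [0.0] * len(eps))  # always predicting no noise
```

With `beta = 0` the step leaves the input unchanged, and as `beta` approaches 1 the output becomes pure Gaussian noise.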

Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with diffusion and flow-based models to handle 2D/3D molecular and protein inputs, while VAEs and GANs are typically used for 1D input.

Small Molecule Generation: Tasks, Datasets, and Models

Target-Agnostic Molecule Design

The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).

Datasets: QM9 (small stable molecules from GDB-17) and GEOM-Drugs (more complex, drug-like molecules).

The field has shifted from SMILES-based VAEs (CVAE, GVAE, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:

Model	Type	Atom Stab. (%)	Mol. Stab. (%)	Valid (%)	Valid & Unique (%)
MiDi	EGNN, Diffusion	99.8	97.5	97.9	97.6
MDM	EGNN, VAE, Diffusion	99.2	89.6	98.6	94.6
JODO	EGNN, Diffusion	99.2	93.4	99.0	96.0
GeoLDM	VAE, Diffusion	98.9	89.4	93.8	92.7
EDM	EGNN, Diffusion	98.7	82.0	91.9	90.7

EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a “relaxed” EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.

On the larger GEOM-Drugs dataset, performance drops for most models:

Model	At Stb. (%)	Mol Stb. (%)	Valid (%)	Val/Uniq. (%)
MiDi	99.8	91.6	77.8	77.8
MDM	–	62.2	99.5	99.0
GeoLDM	84.4	–	99.3	–
EDM	81.3	–	–	–

MiDi distinguishes itself for generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on more complex GEOM-Drugs molecules.

Target-Aware Molecule Design

Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.

Datasets: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.

Metrics: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).

Model	Type	Vina	Affinity (%)	QED	SA	Diversity
DiffSBDD	EGNN, Diffusion	-7.333	–	0.467	0.554	0.758
Luo et al.	SchNet	-6.344	29.09	0.525	0.657	0.720
TargetDiff	EGNN, Diffusion	-6.3	58.1	0.48	0.58	0.72
LiGAN	CNN, VAE	-6.144	21.1	0.39	0.59	0.66
Pocket2Mol	EGNN, MLP	-5.14	48.4	0.56	0.74	0.69

DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).
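The diversity metric listed above is based on Tanimoto similarity. A minimal sketch follows, assuming fingerprints are given as sets of "on" bit indices and that diversity is one minus the mean pairwise similarity. That is a common definition, but the text does not confirm it is the one used for the table.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Two empty fingerprints are treated as identical; this is a convention.
    return len(a & b) / union if union else 1.0


def diversity(fingerprints):
    """One minus the mean pairwise Tanimoto similarity of a molecule set."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    mean_sim = sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - mean_sim
```

With fingerprints `{1, 2, 3}`, `{2, 3, 4}`, and `{7, 8}`, only the first pair overlaps (similarity 0.5), so the diversity is 1 - 0.5/3, about 0.83.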

Molecular Conformation Generation

Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations “covered” within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).
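Both metrics reduce to row-wise minima of an RMSD matrix. The sketch below assumes the recall-style convention, in which each reference conformer is matched to its closest generated conformer, and assumes the RMSD values have already been computed after alignment. The function name and input layout are illustrative.

```python
def coverage_and_matching(rmsd, threshold):
    """COV (%) and MAT from rmsd[i][j] = RMSD(reference i, generated j)."""
    if not rmsd or any(len(row) == 0 for row in rmsd):
        raise ValueError("need at least one reference and one generated conformer")
    # Distance from each reference conformer to its closest generated one.
    closest = [min(row) for row in rmsd]
    cov = 100.0 * sum(d < threshold for d in closest) / len(closest)
    mat = sum(closest) / len(closest)
    return cov, mat
```

Because COV depends on the threshold while MAT does not, COV values computed at different thresholds are not directly comparable.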

Datasets: GEOM-QM9, GEOM-Drugs, ISO17.

Model	Type	GEOM-QM9 COV (%)	GEOM-QM9 MAT	GEOM-Drugs COV (%)	GEOM-Drugs MAT
Torsional Diff.	Diffusion	92.8	0.178	72.7*	0.582
DGSM	MPNN, Diffusion	91.49	0.2139	78.73	1.0154
GeoDiff	GFN, Diffusion	90.07	0.209	89.13	0.8629
ConfGF	GIN, Diffusion	88.49	0.2673	62.15	1.1629
GeoMol	MPNN	71.26	0.3731	67.16	1.0875

*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.

Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.

Protein Generation: From Sequence to Structure

Protein Representation Learning

Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman’s $\rho$).

Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.

Protein Structure Prediction

Given an amino acid sequence, models predict 3D point coordinates for each residue. They are evaluated using RMSD, GDT-TS, TM-score, and lDDT on the CASP14 and CAMEO benchmarks.

AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.

Model	Type	CAMEO RMSD	CAMEO TMScore	CAMEO GDT-TS	CAMEO lDDT	CASP14 TMScore
AlphaFold2	Transformer	3.30	0.87	0.86	0.90	0.38
ESMFold	Transformer	3.99	0.85	0.83	0.87	0.68
RoseTTAFold	Transformer	5.72	0.77	0.71	0.79	0.37
EigenFold	Diffusion	7.37	0.75	0.71	0.78	–

Sequence Generation (Inverse Folding)

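Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The search space is enormous: between $10^{65}$ and $10^{130}$ candidate sequences, which matches $20^{50}$ to $20^{100}$ (chains of roughly 50 to 100 residues over the 20 standard amino acids).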

Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i | x_1, x_2, \ldots x_{i-1})\right)$$
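Two of these metrics are simple enough to sketch directly. Perplexity is conventionally the exponential of the negative mean log-probability, so a model that guesses uniformly over the 20 standard amino acids has a perplexity of 20. The function names are illustrative, and AAR is assumed here to be the position-wise identity between a designed and a native sequence of equal length.

```python
import math


def perplexity(token_probs):
    """exp of the negative mean log-probability assigned to the true tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_log = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log)


def amino_acid_recovery(designed, native):
    """Percentage of positions where the designed residue matches the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return 100.0 * matches / len(native)
```

Lower perplexity means the model assigns higher probability to the native residues; a perfectly confident and correct model reaches the minimum of 1.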

ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.

Model	Type	AAR (%)	Div.	RMSD	Non.	Time (s)
ProteinMPNN	MPNN	48.7	0.168	1.019	1.061	112
ESM-IF1	Transformer	47.7	0.184	1.265	1.201	1980
GPD	Transformer	46.2	0.219	1.758	1.333	35
ABACUS-R	Transformer	45.7	0.124	1.482	0.968	233280
3D CNN	CNN	44.5	0.272	1.62	1.027	536544
PiFold	GNN	42.8	0.141	1.592	1.464	221
ProteinSolver	GNN	24.6	0.186	5.354	1.389	180

Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.

Backbone Design

Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for the backbone atoms (nitrogen, alpha-carbon, carbonyl carbon, and oxygen) and use external tools like Rosetta for side-chain packing.

Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).

ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.

Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using “self-conditioning” on predicted structures. Protpardelle co-designs sequence and structure by creating a “superposition” over possible sidechain states and collapsing them during each iterative diffusion step.

Model	Type	scTM (%)	Design. (%)	PPL	AAR (%)	RMSD
RFDiffusion	Diffusion	–	95.1	–	–	–
Protpardelle	Diffusion	85	–	–	–	–
FrameDiff	Diffusion	84	48.3	–	–	–
Genie	Diffusion	81.5	79.0	–	–	–
LatentDiff	EGNN, Diffusion	31.6	–	–	–	–
FoldingDiff	Diffusion	14.2	–	–	–	–
ProtDiff	EGNN, Diffusion	11.8	–	–	12.47*	8.01*

*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.

Antibody Design

The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.

For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) and sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method incorporating structure prediction, docking, and CDR generation into a single framework. MSAs cannot be used for antibody input, which makes general models like AlphaFold2 inefficient for antibody prediction. Specialized models like IgFold use sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.

Peptide Design

The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).

Current Trends, Challenges, and Future Directions

Current Trends

The survey identifies several parallel trends across molecule and protein generation:

Shift from sequence to structure: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.
Dominance of E(3) equivariant architectures: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.
Structure-based over ligand-based approaches: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.

Challenges

For small molecule generation:

Complexity: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.
Applicability: Generating molecules with high binding affinity to targets remains difficult.
Explainability: Methods are black-box, offering no insight into why generated molecules have desired properties.

For protein generation:

Benchmarking: Protein generative tasks lack a standard evaluative procedure, with variance between each model’s metrics and testing conditions.
Performance: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.

The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.

Future Directions

The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.

Reproducibility Details

As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.

Data

The survey catalogs the following key datasets across subtasks:

Subtask	Datasets	Notes
Target-agnostic molecule	QM9, GEOM-Drugs	QM9 from GDB-17; GEOM-Drugs for complex molecules
Target-aware molecule	CrossDocked2020, ZINC20, Binding MOAD	CrossDocked2020 most used (22.5M pairs)
Conformation generation	GEOM-QM9, GEOM-Drugs, ISO17	Conformer sets for molecules
Protein structure prediction	PDB, CASP14, CAMEO	CASP biennial blind evaluation
Protein sequence generation	PDB, UniRef, UniParc, CATH, TS500	CATH for domain classification
Backbone design	PDB, AlphaFoldDB, SCOP, CATH	AlphaFoldDB for expanded structural coverage
Antibody structure	SAbDab, RAB	SAbDab: all antibody structures from PDB
Antibody CDR generation	SAbDab, RAB, SKEMPI	SKEMPI for affinity optimization

Artifacts

Artifact	Type	License	Notes
GenAI4Drug	Code	Not specified	Organized repository of all covered sources

Paper Information

Citation: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., & Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. Briefings in Bioinformatics, 25(4), bbae338. https://doi.org/10.1093/bib/bbae338

Publication: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.

@article{tang2024survey,
  title={A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation},
  author={Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={4},
  pages={bbae338},
  year={2024},
  doi={10.1093/bib/bbae338}
}

Foundation Models in Chemistry: A 2025 Perspective

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Foundation Models for Chemistry

This is a Systematization paper. It organizes the rapidly growing landscape of foundation models in chemistry into a coherent taxonomy. The paper distinguishes between “small” foundation models (pretrained for a single application domain) and “big” foundation models (adaptable across multiple domains such as property prediction and inverse design). It covers models based on graph neural networks (GNNs) and language models, reviews pretraining strategies (self-supervised, multimodal, supervised), and maps approximately 40 models across four application domains.

Why a Foundation Model Perspective for Chemistry?

Foundation models have transformed NLP and computer vision through large-scale pretraining and transfer learning. In chemistry, however, several persistent challenges motivate the adoption of this paradigm:

Data scarcity: Chemical datasets are often small and expensive to generate (requiring experiments or quantum mechanical calculations), unlike the large annotated datasets available in NLP/CV.
Poor generalization: ML models in chemistry frequently need to extrapolate to out-of-domain compounds (e.g., novel drug candidates, unseen crystal structures), where conventional models struggle.
Limited transferability: Traditional ML interatomic potentials (MLIPs) are trained on system-specific datasets and cannot be easily transferred across different chemical systems.

Foundation models address these by learning general representations from large unlabeled datasets, which can then be adapted to specific downstream tasks via finetuning. The paper argues that summarizing this fast-moving field is timely, given the diversity of approaches emerging across molecular property prediction, MLIPs, inverse design, and multi-domain applications.

Small vs. Big Foundation Models: A Two-Tier Taxonomy

The paper’s central organizing framework distinguishes two scopes of foundation model:

Small foundation models are pretrained models adapted to various tasks within a single application domain. Examples include:

A model pretrained on large molecular databases that predicts multiple molecular properties (band gap, formation energy, etc.)
A universal MLIP that can simulate diverse chemical systems
A pretrained generative model adapted for inverse design of different target properties

Big foundation models span multiple application domains, handling both property prediction and inverse design within a single framework. These typically use multimodal learning (combining SMILES/graphs with text) or build on large language models.

Architectures

The paper reviews two primary architecture families:

Graph Neural Networks (GNNs) represent molecules and crystals as graphs $G = (V, E)$ with nodes (atoms) and edges (bonds). Node features are updated through message passing:

$$ m_{i}^{t+1} = \sum_{j \in N(i)} M_{t}(v_{i}^{t}, v_{j}^{t}, e_{ij}^{t}) $$

$$ v_{i}^{t+1} = U_{t}(v_{i}^{t}, m_{i}^{t+1}) $$

After $T$ message-passing steps, a readout function produces a graph-level feature:

$$ g = R(\{ v_{i}^{T} \mid i \in G \}) $$
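The message, update, and readout steps above can be sketched in a few lines. This is a toy illustration, not any specific model from the paper: `message_fn` and `update_fn` stand in for the learned functions $M_t$ and $U_t$, edges are stored as directed pairs (an undirected bond appears in both directions), and the readout $R$ is assumed to be a plain sum.

```python
def message_passing_step(node_feats, edges, message_fn, update_fn):
    """One round of message passing.

    node_feats: dict mapping node id -> list of floats
    edges: dict mapping (i, j) -> edge feature, meaning j is a neighbor of i
    """
    dim = len(next(iter(node_feats.values())))
    messages = {i: [0.0] * dim for i in node_feats}
    for (i, j), e_ij in edges.items():
        # m_i accumulates M(v_i, v_j, e_ij) over all neighbors j of i.
        m = message_fn(node_feats[i], node_feats[j], e_ij)
        messages[i] = [a + b for a, b in zip(messages[i], m)]
    # v_i is replaced by U(v_i, m_i), using only features from the previous step.
    return {i: update_fn(node_feats[i], messages[i]) for i in node_feats}


def sum_readout(node_feats):
    """Graph-level feature: element-wise sum over all node features."""
    dim = len(next(iter(node_feats.values())))
    return [sum(v[k] for v in node_feats.values()) for k in range(dim)]
```

On a three-atom chain with scalar features 1, 2, 3, unit edge weights, a message of `e_ij * v_j`, and an update of `v_i + m_i`, one step gives node features 3, 6, 5 and a readout of 14.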

Recent equivariant GNNs (e.g., NequIP, MACE, EquformerV2) use vectorial features that respect geometric symmetries, improving expressivity for tasks sensitive to 3D structure.

Language Models operate on string representations of molecules (SMILES, SELFIES) or crystal structures. Autoregressive models like GPT maximize:

$$ \prod_{t=1}^{T} P(x_{t} \mid x_{1}, x_{2}, \ldots, x_{t-1}) $$
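In practice this product is evaluated in log space, scoring each token given all tokens before it. The sketch below uses a hypothetical hand-written bigram table (`BIGRAM`) in place of a trained model; a real language model would condition on the whole prefix rather than only the previous token.

```python
import math


def sequence_log_likelihood(tokens, cond_prob):
    """Sum over positions of log P(token | all previous tokens)."""
    total = 0.0
    for t, tok in enumerate(tokens):
        p = cond_prob(tok, tokens[:t])
        if p <= 0.0:
            return float("-inf")  # the model considers this sequence impossible
        total += math.log(p)
    return total


# Hypothetical toy model: "^" marks the start, "$" the end of a string.
BIGRAM = {
    "^": {"C": 1.0},
    "C": {"C": 0.5, "O": 0.25, "$": 0.25},
    "O": {"$": 1.0},
}


def toy_cond_prob(token, prefix):
    """Toy conditional that only looks at the last token of the prefix."""
    prev = prefix[-1] if prefix else "^"
    return BIGRAM.get(prev, {}).get(token, 0.0)
```

For the token sequence `C, C, O, $` the conditionals are 1.0, 0.5, 0.25, and 1.0, so the sequence probability is 0.125.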

Transformers use self-attention:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V $$
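A minimal single-head sketch of this formula in plain Python, with matrices as lists of rows. It omits masking, batching, and the learned projections that produce $Q$, $K$, and $V$, and is meant only to make the arithmetic concrete.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-major lists of lists."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot product of this query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out
```

A zero query attends uniformly and returns the mean of the value rows, while a query strongly aligned with one key returns almost exactly that key's value row.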

Pretraining Strategies

The paper categorizes pretraining methods into three self-supervised learning (SSL) approaches plus supervised and multimodal strategies:

Strategy	Mechanism	Example Models
Contrastive learning	Maximize similarity between positive pairs, minimize for negatives	GraphCL, MolCLR, GraphMVP, CrysGNN
Predictive learning	Predict self-generated labels (node context, functional groups, space group)	GROVER, Hu et al., CrysGNN
Generative learning	Reconstruct masked nodes/edges or entire molecules/SMILES	SMILES-BERT, ChemBERTa-2, MoLFormer
Supervised pretraining	Train on energy, forces, stress from DFT databases	M3GNet, CHGNet, MACE-MP-0, MatterSim
Multimodal learning	Learn joint representations across SMILES/graph + text modalities	KV-PLM, MoMu, MoleculeSTM, SPMM

A common finding across studies is that combining local and global information (e.g., via contrastive learning between node-level and graph-level views, or supervised learning on both forces and total energy) produces more transferable representations.
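The contrastive row of the table can be made concrete with an InfoNCE-style loss. This is one common formulation, assumed here for illustration; the text does not say which exact objective each listed model uses. Embeddings are plain non-zero lists of floats, and the temperature value is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(sim(a, p)/T) / (exp(sim(a, p)/T) + sum_n exp(sim(a, n)/T)) )."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss is near zero when the anchor is close to its positive view and far from the negatives, and large when the roles are reversed, which is what pushes positive pairs together during pretraining.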

Survey of Models Across Four Domains

Property Prediction

The paper reviews 13 models for molecular and materials property prediction. Key findings:

Contrastive learning approaches (GraphCL, MolCLR, GraphMVP) achieve strong results by defining positive pairs through augmentation, 2D/3D structure views, or crystal system membership.
Language model approaches (SMILES-BERT, ChemBERTa-2, MoLFormer) show that transformers trained on SMILES via masked language modeling can compete with GNN-based approaches.
MoLFormer, pretrained on 1.1 billion SMILES from PubChem and ZINC, outperformed many baselines including GNNs on MoleculeNet and QM9 benchmarks. Its attention maps captured molecular structural features directly from SMILES strings.
For crystalline materials, CrysGNN combined contrastive, predictive, and generative learning, demonstrating improvements even on small experimental datasets.

Machine Learning Interatomic Potentials (MLIPs)

The paper surveys 10 universal MLIPs, all using supervised learning on DFT-calculated energies, forces, and stresses:

Model	Architecture	Training Data Size	Key Capability
M3GNet	GNN	187K (MP)	First universal MLIP
CHGNet	GNN	1.58M (MPtrj)	Predicts magnetic moments
MACE-MP-0	MACE	1.58M (MPtrj)	35 diverse applications
GNoME potential	NequIP	89M	Zero-shot comparable to trained MLIPs
MatterSim	M3GNet/Graphormer	17M	SOTA on Matbench Discovery
eqV2	EquformerV2	118M (OMat24)	Structural relaxation

The GNoME potential, trained on approximately 89 million data points, achieved zero-shot performance comparable to state-of-the-art MLIPs trained from scratch. MatterSim, trained on over 17 million entries across wide temperature (0 to 5000 K) and pressure (0 to 1000 GPa) ranges, achieved state-of-the-art on Matbench Discovery and accurately computed thermodynamic and lattice dynamic properties.

Inverse Design

Few pretrained generative models for inverse design exist. The paper highlights three:

MatterGen (Microsoft): Diffusion model pretrained on Alexandria/MP databases (607K structures), finetuned for conditional generation on band gap, elastic modulus, space group, and composition. Generated S.U.N. (stable, unique, novel) materials at rates more than 2x the previous state of the art.
GP-MoLFormer (IBM): MoLFormer pretrained on 1.1B SMILES, finetuned via pair-tuning for property-guided molecular optimization.
CrystalLLM: Finetuned LLaMA-2 70B for crystal generation with target space group and composition using string representations and prompting.

Multi-Domain Models

The paper covers two multi-domain categories:

Property prediction + MLIP: Denoising pretraining learns virtual forces that guide noisy configurations back to equilibrium, connecting to force prediction. Joint multi-domain pretraining (JMP) from Meta FAIR achieved state-of-the-art on 34 of 40 tasks spanning molecules, crystals, and MOFs by training simultaneously on diverse energy/force databases.

Property prediction + inverse design: Multimodal models (KV-PLM, MoMu, MoleculeSTM, MolFM, SPMM) learn joint representations from molecular structures and text, enabling text-based inverse design and property prediction in a single framework. LLM-based models (ChemDFM, nach0, finetuned GPT-3) can interact with humans and handle diverse chemistry tasks through instruction tuning.

Trends and Future Directions

Scope Expansion

The authors identify three axes for expanding foundation model scope:

Material types: Most models target molecules or a single material class. Foundation models that span molecules, crystals, surfaces, and MOFs could exploit shared chemistry across materials.
Modalities: Beyond SMILES, graphs, and text, additional modalities (images, spectral data like XRD patterns) remain underexplored.
Downstream tasks: Extending to new chemistry and tasks through emergent capabilities, analogous to the capabilities observed in LLMs at scale.

Performance and Scaling

Key scaling challenges include:

Data quality vs. quantity: Noisy DFT labels (e.g., HOMO-LUMO gaps with high uncertainty from different functionals/basis sets) can limit scalability and out-of-distribution performance.
GNN scalability: While transformers scale to hundreds of billions of parameters, GNNs have rarely been explored above one million parameters due to oversmoothing and the curse of dimensionality. Recent work by Sypetkowski et al. demonstrated scaling GNNs to 3 billion parameters with consistent improvements.
Database integration: Combining datasets from different DFT codes requires proper alignment (e.g., total energy alignment methods).

Efficiency

For MLIPs, efficiency is critical since MD simulations require millions of inference steps. Approaches include:

Knowledge distillation from expensive teacher models to lighter student models
Model compression techniques (quantization, pruning) adapted for GNNs
Investigating whether strict equivariance is always necessary

Interpretability

Foundation models can generate hallucinations or mode-collapsed outputs. The authors highlight recent interpretability advances (feature extraction from Claude 3, knowledge localization and editing in transformers) as promising directions for more reliable chemical applications.

Key Findings and Limitations

Key findings:

Combining local and global information in pretraining consistently improves downstream performance across all domains reviewed.
Self-supervised pretraining enables effective transfer learning even in low-data regimes, a critical advantage for chemistry.
Universal MLIPs have reached the point where zero-shot performance can be comparable to system-specific trained models.
Multimodal learning is the most promising approach for big foundation models capable of spanning property prediction and inverse design.

Limitations acknowledged by the authors:

The precise definition of “foundation model” in chemistry is not established and varies by scope.
Most surveyed models focus on molecules, with crystalline materials less explored.
Benchmarks for low-data regimes and out-of-distribution performance are insufficient.
The paper focuses on property prediction, MLIPs, inverse design, and their multi-domain combinations, and does not cover retrosynthesis, reaction prediction, or other chemical tasks in depth.

Reproducibility Details

Data

This is a perspective/review paper. No new data or models are introduced. The paper surveys existing models and their training datasets, summarized in Table 1 of the paper.

Algorithms

Not applicable (review paper). The paper describes pretraining strategies (contrastive, predictive, generative, supervised, multimodal) at a conceptual level with references to the original works.

Models

Not applicable (review paper). The paper catalogs approximately 40 foundation models across four domains. See Table 1 in the paper for the complete listing.

Evaluation

Not applicable (review paper). The paper references benchmark results from the original studies (MoleculeNet, QM9, Matbench, Matbench Discovery, JARVIS-DFT) but does not perform independent evaluation.

Hardware

Not applicable (review paper).

Paper Information

Citation: Choi, J., Nam, G., Choi, J., & Jung, Y. (2025). A Perspective on Foundation Models in Chemistry. JACS Au, 5(4), 1499-1518. https://doi.org/10.1021/jacsau.4c01160

@article{choi2025perspective,
  title={A Perspective on Foundation Models in Chemistry},
  author={Choi, Junyoung and Nam, Gunwook and Choi, Jaesik and Jung, Yousung},
  journal={JACS Au},
  volume={5},
  number={4},
  pages={1499--1518},
  year={2025},
  publisher={American Chemical Society},
  doi={10.1021/jacsau.4c01160}
}

Chemical Language Models for De Novo Drug Design Review

Thu, 26 Mar 2026 00:00:00 +0000

A Systematization of Chemical Language Models for Drug Design

This paper is a Systematization (minireview) that surveys the landscape of chemical language models (CLMs) for de novo drug design. It organizes the field along three axes: molecular string representations, deep learning architectures, and generation strategies (distribution learning, goal-directed, and conditional). The review also highlights experimental validations, current gaps, and future opportunities.

Why Chemical Language Models Matter for Drug Design

De novo drug design faces an enormous combinatorial challenge: the “chemical universe” is estimated to contain up to $10^{60}$ drug-like small molecules. Exhaustive enumeration is infeasible, and traditional design algorithms rely on hand-crafted assembly rules. Chemical language models address this by borrowing natural language processing techniques to learn the “chemical language,” generating molecules as string representations (SMILES, SELFIES, DeepSMILES) that satisfy both syntactic validity (chemically valid structures) and semantic correctness (desired pharmacological properties).

CLMs have gained traction because string representations are readily available for most molecular databases, generation is computationally cheap (each molecule is a single string sampled from a sequence model), and the same architecture can be applied to diverse tasks (property prediction, de novo generation, reaction prediction). At the time of this review, CLMs had produced experimentally validated bioactive molecules in several prospective studies, establishing them as practical tools for drug discovery.

Molecular String Representations: SMILES, DeepSMILES, and SELFIES

The review covers three main string representations used as input/output for CLMs:

SMILES (Simplified Molecular Input Line Entry System) converts hydrogen-depleted molecular graphs into strings where atoms are denoted by atomic symbols, bonds and branching by punctuation, and ring openings/closures by numbers. SMILES are non-univocal (multiple valid strings per molecule), and canonicalization algorithms are needed for unique representations. Multiple studies show that using randomized (non-canonical) SMILES for data augmentation improves CLM performance, with diminishing returns beyond 10- to 20-fold augmentation.

DeepSMILES modifies SMILES to improve machine-readability by replacing the paired ring-opening/closure digits with a count-based system and using closing parentheses only (no opening ones). This reduces the frequency of syntactically invalid strings but does not eliminate them entirely.

SELFIES (Self-Referencing Embedded Strings) use a formal grammar that guarantees 100% syntactic validity of decoded molecules. Every SELFIES string maps to a valid molecular graph. However, SELFIES can produce chemically unrealistic molecules (e.g., highly strained ring systems), and the mapping between string edits and molecular changes is less intuitive than for SMILES.

The review notes a key tradeoff: SMILES offer a richer, more interpretable language with well-studied augmentation strategies, while SELFIES guarantee validity at the cost of chemical realism and edit interpretability.

CLM Architectures and Training Strategies

Architectures

The review describes the main architectures used in CLMs:

Recurrent Neural Networks (RNNs), particularly LSTMs and GRUs, dominated early CLM work. These models process SMILES character-by-character and generate new strings autoregressively via next-token prediction. RNNs are computationally efficient and well-suited to the sequential nature of molecular strings.
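To make next-token prediction concrete, the sketch below turns a SMILES string into (prefix, next token) training pairs. The tokenizer is an assumption for illustration, not taken from the review: it is mostly character-level but keeps bracket atoms, `Cl`, `Br`, and two-digit ring closures as single tokens, and the start and end markers `^` and `$` are hypothetical.

```python
import re

# Bracket atoms, two-letter halogens, and %NN ring closures stay whole;
# everything else is split into single characters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)


def next_token_pairs(tokens, start="^", end="$"):
    """Build (prefix, target) pairs for autoregressive training."""
    seq = [start] + list(tokens) + [end]
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

For example, `tokenize("CC(=O)Cl")` yields `C, C, (, =, O, ), Cl`, and each training pair asks the model to predict one of these tokens from everything that precedes it.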

Variational Autoencoders (VAEs) encode molecules into a continuous latent space and decode them back into strings. This enables smooth interpolation between molecules and latent-space optimization, but generated strings may be syntactically invalid.

Generative Adversarial Networks (GANs) have been adapted for molecular string generation (e.g., ORGAN), though they face training instability and mode collapse challenges that limit their adoption.

Transformers have emerged as an increasingly popular alternative, offering parallelized training and the ability to capture long-range dependencies in molecular strings. The review notes the growing relevance of Transformer-based CLMs, particularly for large-scale pretraining.

Generation Strategies

The review organizes CLM generation into three categories:

Distribution learning: The model learns to reproduce the statistical distribution of a training set of molecules. No explicit scoring function is used during generation. The generated molecules are evaluated post-hoc by comparing their property distributions to the training set. This approach is end-to-end but provides no direct indication of individual molecule quality.
Goal-directed generation: A pretrained CLM is steered toward molecules optimizing a specified scoring function (e.g., predicted bioactivity, physicochemical properties). Common approaches include reinforcement learning (REINVENT and variants), hill-climbing, and Bayesian optimization. Scoring functions provide direct quality signals but can introduce biases, shortcuts, and limited structural diversity.
Conditional generation: An intermediate approach that learns a joint semantic space between molecular structures and desired properties. The desired property profile serves as an input “prompt” for generation (e.g., a protein target, gene expression signature, or 3D shape). This bypasses the need for external scoring functions but has seen limited experimental application.
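The goal-directed loop (sample, score, reinforce the best) can be caricatured with a hill-climbing sketch. Everything here is a toy assumption: the "model" is just a weight per candidate string rather than a trained CLM, "fine-tuning" is a weight increment, and string length stands in for a real scoring function.

```python
import random


def hill_climb(weights, score_fn, rounds=5, n_samples=200, top_k=10, seed=0):
    """Repeatedly sample from the model, keep the top-k by score, and reinforce them."""
    rng = random.Random(seed)
    weights = dict(weights)  # work on a copy; the caller's model is unchanged
    for _ in range(rounds):
        candidates = list(weights)
        samples = rng.choices(
            candidates, weights=[weights[c] for c in candidates], k=n_samples
        )
        # Keep the best-scoring samples and shift probability mass toward them.
        best = sorted(samples, key=score_fn, reverse=True)[:top_k]
        for s in best:
            weights[s] += 1.0
    return weights
```

Starting from uniform weights over `C`, `CC`, `CCC`, and `CCCC` with `len` as the score, the mass concentrates on `CCCC`. The same dynamic illustrates the bias noted above: the model quickly collapses onto whatever the scoring function rewards, at the cost of diversity.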

Transfer Learning and Chemical Space Exploration

Transfer learning is the dominant paradigm for CLM-driven chemical space exploration. A large-scale pretraining step (on $10^5$ to $10^6$ molecules via next-character prediction) is followed by fine-tuning on a smaller set of molecules with desired properties (often 10 to $10^2$ molecules). Key findings from the literature:

The minimum training set size depends on target molecule complexity and heterogeneity.
SMILES augmentation is most beneficial with small training sets (fewer than 10,000 molecules) and plateaus for large, structurally complex datasets.
Fine-tuning with as few as 10 to 100 molecules has produced experimentally validated bioactive designs.
Hyperparameter tuning has relatively little effect on overall CLM performance.

Evaluating CLM Designs and Experimental Validation

The review identifies evaluation as a critical gap. CLMs are often benchmarked on “toy” properties such as calculated logP, molecular weight, or QED (quantitative estimate of drug-likeness). These metrics capture the ability to satisfy predefined criteria but fail to reflect real-world drug discovery complexity and may lead to trivial solutions.

Existing benchmarks (GuacaMol, MOSES) enable comparability across independently developed approaches but do not fully address the quality of generated compounds. The review emphasizes that experimental validation is the ultimate test. At the time of writing, only a few prospective applications had been published:

Dual modulator of retinoid X and PPAR receptors (EC50 ranging from 0.06 to 2.3 μM)
Inhibitor of Pim1 kinase and CDK4 (manually modified from generated design)
Natural-product-inspired RORγ agonist (EC50 = 0.68 μM)
Molecules designed via combined generative AI and on-chip synthesis

The scarcity of experimental validations reflects the interdisciplinary expertise required and the time/cost of chemical synthesis.

Gaps, Limitations, and Future Directions

The review identifies several key gaps and opportunities:

Scoring function limitations: Current scoring functions struggle with activity cliffs and non-additive structure-activity relationships. Conditional generation methods may help overcome these limitations by learning direct structure-property mappings.

Structure-based design: Generating molecules that match electrostatic and shape features of protein binding pockets holds promise for addressing unexplored targets. However, prospective applications have been limited, potentially due to bias in existing protein-ligand affinity datasets.

Synthesizability: Improving the ability of CLMs to propose synthesizable molecules is expected to increase practical relevance. Automated synthesis platforms may help but could also limit accessible chemical space.

Few-shot learning: Large-scale pretrained CLMs combined with few-shot learning approaches are expected to boost prospective applications.

Extensions beyond small molecules: Extending chemical languages to more complex molecular entities (proteins with non-natural amino acids, crystals, supramolecular chemistry) is an open frontier.

Failure modes: Several studies have documented failure modes in goal-directed generation, including model shortcuts (exploiting scoring function artifacts), limited structural diversity, and generation of chemically unrealistic molecules.

Interdisciplinary collaboration: The review emphasizes that bridging deep learning, cheminformatics, and medicinal chemistry expertise is essential for translating CLM designs into real-world drug candidates.

Reproducibility Details

Data

This is a review paper and does not present novel experimental data. The paper surveys results from the literature.

Algorithms

No novel algorithms are introduced. The review categorizes existing approaches (RNNs, VAEs, GANs, Transformers) and generation strategies (distribution learning, goal-directed, conditional).

Models

No new models are presented. The paper references existing implementations including REINVENT, ORGAN, and various RNN-based and Transformer-based CLMs.

Evaluation

The review discusses existing benchmarks:

GuacaMol: Benchmarking suite for de novo molecular design
MOSES: Benchmarking platform for molecular generation models
QED: Quantitative estimate of drug-likeness
Various physicochemical property metrics (logP, molecular weight)

Hardware

Not applicable (review paper).

Paper Information

Citation: Grisoni, F. (2023). Chemical language models for de novo drug design: Challenges and opportunities. Current Opinion in Structural Biology, 79, 102527. https://doi.org/10.1016/j.sbi.2023.102527

Publication: Current Opinion in Structural Biology, Volume 79, April 2023

@article{grisoni2023chemical,
  title={Chemical language models for de novo drug design: Challenges and opportunities},
  author={Grisoni, Francesca},
  journal={Current Opinion in Structural Biology},
  volume={79},
  pages={102527},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.sbi.2023.102527}
}

MolGenSurvey: Systematic Survey of ML for Molecule Design

Mon, 23 Mar 2026 00:00:00 +0000

A Taxonomy for ML-Driven Molecule Design

This is a Systematization paper that reviews machine learning approaches for molecule design across all three major molecular representations (1D string, 2D graph, 3D geometry) and both deep generative and combinatorial optimization paradigms. Prior surveys (including Sánchez-Lengeling & Aspuru-Guzik, 2018; Elton et al., 2019; Xue et al., 2019; Vanhaelen et al., 2020; Alshehri et al., 2020; Jiménez-Luna et al., 2020; and Axelrod et al., 2022) each covered subsets of the literature (e.g., only generative methods, or only specific task types). MolGenSurvey extends these by unifying the field into a single taxonomy based on input type, output type, and generation goal, identifying eight distinct molecule generation tasks. It catalogs over 100 methods across these categories and provides a structured comparison of evaluation metrics, datasets, and experimental setups.

The chemical space of drug-like molecules is estimated to contain $10^{23}$ to $10^{60}$ molecules, making exhaustive enumeration computationally infeasible. Traditional high-throughput screening searches existing databases but is slow and expensive. ML-based generative approaches offer a way to explore this space more selectively, either by learning continuous latent representations (deep generative models) or by directly searching the discrete chemical space (combinatorial optimization methods).

Molecular Representations

The survey identifies three mainstream featurization approaches for molecules, each carrying different tradeoffs for generation tasks.

1D String Descriptions

SMILES and SELFIES are the two dominant string representations. SMILES encodes molecules as character strings following grammar rules for bonds, branches, and ring closures. Its main limitation is that arbitrary strings are often chemically invalid. SELFIES augments the encoding rules for branches and rings to achieve 100% validity by construction.

Other string representations exist (InChI, SMARTS) but are less commonly used for generation. Representation learning over strings has adopted CNNs, RNNs, and Transformers from NLP.

2D Molecular Graphs

Molecules naturally map to graphs where atoms are nodes and bonds are edges. Graph neural networks (GNNs), particularly those following the message-passing neural network (MPNN) framework, have become the standard representation method. Each message-passing layer updates a node’s representation by aggregating information from its immediate neighbors, so stacking $K$ layers lets each node see its $K$-hop neighborhood. Notable architectures include D-MPNN (directional message passing), PNA (diverse aggregation methods), AttentiveFP (attention-based), and Graphormer (transformer-based).
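
A minimal sketch of how repeated message passing widens a node's receptive field. It assumes scalar node features and a plain sum as both message and update function; real MPNNs use learned functions over feature vectors, and the graph here is a hypothetical four-atom chain.

```python
def message_passing_round(features, adjacency):
    """One round: each node adds the sum of its neighbors' features to its own."""
    updated = {}
    for node, value in features.items():
        neighbor_sum = sum(features[n] for n in adjacency[node])
        updated[node] = value + neighbor_sum
    return updated

# Path graph 0 - 1 - 2 - 3; only node 0 carries a signal at the start.
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}

h1 = message_passing_round(features, adjacency)
h2 = message_passing_round(h1, adjacency)
h3 = message_passing_round(h2, adjacency)
print(h1[3], h2[3], h3[3])  # node 3 is three hops from node 0
```

Node 3 stays at zero until the third round, which is the sense in which $K$ rounds cover a $K$-hop neighborhood.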

3D Molecular Geometry

Molecules are inherently 3D objects with conformations (3D structures at local energy minima) that determine function. Representing 3D geometry requires models that respect E(3) or SE(3) symmetry: invariant outputs (such as energies) must not change under rotation and translation, and equivariant outputs (such as atomic coordinates or forces) must transform along with the input. The survey catalogs architectures along this line including SchNet, DimeNet, EGNN, SphereNet, and PaiNN.

Additional featurization methods (molecular fingerprints/descriptors, 3D density maps, 3D surface meshes, and chemical images) are noted but have seen limited use in generation tasks.

Deep Generative Models

The survey covers six families of deep generative models applied to molecule design.

Autoregressive Models (ARs)

ARs factorize the joint distribution of a molecule as a product of conditional distributions over its subcomponents:

$$p(\boldsymbol{x}) = \prod_{i=1}^{d} p(\bar{x}_i \mid \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{i-1})$$

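For molecular graphs, this means sequentially predicting the next atom or bond conditioned on the partial structure built so far. RNNs and decoder-style Transformers implement this paradigm directly; the survey also groups BERT-style models here, although their masked-token objective is not strictly left-to-right.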
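
The factorization above can be made concrete with a toy example. This sketch assumes a hypothetical table of conditional probabilities that depends only on the previous token (a bigram simplification of the full history), with `^` and `$` as illustrative start and end markers.

```python
import math

# Hypothetical conditionals p(next | previous); each row sums to 1.
COND = {
    "^": {"C": 0.8, "O": 0.2},
    "C": {"C": 0.5, "O": 0.3, "$": 0.2},
    "O": {"C": 0.4, "$": 0.6},
}

def sequence_log_prob(tokens):
    """Sum of log p(x_i | x_{i-1}) over the sequence, including the end token."""
    log_p = 0.0
    previous = "^"
    for token in list(tokens) + ["$"]:
        log_p += math.log(COND[previous][token])
        previous = token
    return log_p

# p("CCO") = 0.8 * 0.5 * 0.3 * 0.6
print(math.exp(sequence_log_prob("CCO")))
```

A neural autoregressive model replaces the lookup table with a network that conditions on the whole prefix, but the likelihood is still the product of per-step conditionals.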

Variational Autoencoders (VAEs)

VAEs learn a continuous latent space by maximizing the evidence lower bound (ELBO):

$$\log p(\boldsymbol{x}) \geq \mathbb{E}_{q(\boldsymbol{z}|\boldsymbol{x})}[\log p(\boldsymbol{x}|\boldsymbol{z})] - D_{KL}(q(\boldsymbol{z}|\boldsymbol{x}) \,\|\, p(\boldsymbol{z}))$$

The first term is the reconstruction objective, and the second is a KL-divergence regularizer encouraging diverse, disentangled latent codes. Key molecular VAEs include ChemVAE (SMILES-based), JT-VAE (junction tree graphs), and GrammarVAE (grammar-constrained SMILES).
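
As a sketch of how the two ELBO terms combine into a training loss, the following assumes a Gaussian encoder with diagonal covariance and a standard normal prior. That is a common choice but an assumption here, not something the survey summary specifies. Under it, the KL term has a closed form.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))

def negative_elbo(reconstruction_log_prob, mu, log_var):
    """Loss to minimize: negated reconstruction term plus the KL regularizer."""
    return -reconstruction_log_prob + kl_to_standard_normal(mu, log_var)

# The KL term vanishes when the posterior equals the prior.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(negative_elbo(-2.0, [1.0], [0.0]))
```

The `reconstruction_log_prob` argument stands in for a single-sample estimate of the expectation in the first ELBO term.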

Normalizing Flows (NFs)

NFs model $p(\boldsymbol{x})$ via an invertible, deterministic mapping between data and latent space, using the change-of-variable formula with Jacobian determinants. Molecular applications include GraphNVP, MoFlow (one-shot graph generation), GraphAF (autoregressive flow), and GraphDF (discrete flow).
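
The change-of-variable formula can be checked on the simplest possible flow. This sketch assumes a one-dimensional affine map $x = \text{scale} \cdot z + \text{shift}$ with a standard normal base distribution; the molecular flows listed above use far richer invertible maps, so this is an illustration of the formula only.

```python
import math

def standard_normal_log_pdf(z):
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_log_prob(x, scale, shift):
    """log p(x) for x = scale * z + shift with z ~ N(0, 1)."""
    z = (x - shift) / scale                # inverse map from data to latent
    log_abs_det = -math.log(abs(scale))    # log |dz/dx|, the Jacobian term
    return standard_normal_log_pdf(z) + log_abs_det

# Agrees with the log-density of N(shift, scale^2) evaluated directly.
x, scale, shift = 1.5, 2.0, 0.5
direct = -0.5 * ((x - shift) / scale) ** 2 - math.log(scale) - 0.5 * math.log(2.0 * math.pi)
print(affine_flow_log_prob(x, scale, shift), direct)
```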

Generative Adversarial Networks (GANs)

GANs use a generator-discriminator game where the generator produces molecules and the discriminator distinguishes real from generated samples. Molecular GANs include MolGAN (graph-based with RL reward), ORGAN (SMILES-based with RL), and Mol-CycleGAN (molecule-to-molecule translation).

Diffusion Models

Diffusion models learn to reverse a gradual noising process. The forward process adds Gaussian noise over $T$ steps; a neural network learns to denoise at each step. The training objective reduces to predicting the noise added at each step:

$$\mathcal{L}_t = \mathbb{E}_{\boldsymbol{x}_0, \boldsymbol{\epsilon}_t}\left[\left\| \boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\,\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_t, t\right) \right\|^2\right]$$

Diffusion has been particularly successful for 3D conformation generation (ConfGF, GeoDiff, DGSM).
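
A minimal sketch of the quantities inside the noise-prediction objective. It assumes the standard DDPM convention that $\bar{\alpha}_t$ is the cumulative product of $(1 - \beta_s)$ over a noise schedule $\beta_1, \ldots, \beta_T$, which the summary above does not spell out. The function names are illustrative.

```python
import math

def alpha_bar_schedule(betas):
    """Cumulative products of (1 - beta_t), one entry per step t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, eps, alpha_bar_t):
    """Noised sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

print(alpha_bar_schedule([0.5, 0.5]))
print(q_sample([2.0], [0.0], 0.25))
```

In training, `eps_pred` would come from the denoising network applied to the output of `q_sample` and the step index.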

Energy-Based Models (EBMs)

EBMs define $p(\boldsymbol{x}) = \frac{\exp(-E_\theta(\boldsymbol{x}))}{A}$ where $E_\theta$ is a learned energy function. The challenge is computing the intractable partition function $A$, addressed via contrastive divergence, noise-contrastive estimation, or score matching.
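
To see why the partition function is the bottleneck, consider a state space small enough to enumerate. This sketch uses hypothetical state labels and energies; for molecules the sum over all states is intractable, which is what motivates the approximate training methods listed above.

```python
import math

def boltzmann_probabilities(energies):
    """Map {state: energy} to {state: probability} via exp(-E) / A."""
    weights = {state: math.exp(-energy) for state, energy in energies.items()}
    partition = sum(weights.values())  # A: requires visiting every state
    return {state: w / partition for state, w in weights.items()}

# Lower energy means higher probability.
probs = boltzmann_probabilities({"a": 0.0, "b": math.log(3.0)})
print(probs)
```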

Combinatorial Optimization Methods

Unlike DGMs that learn from data distributions, combinatorial optimization methods (COMs) search directly over discrete chemical space using oracle calls to evaluate candidate molecules.

Reinforcement Learning (RL)

RL formulates molecule generation as a Markov Decision Process: states are partial molecules, actions are adding/removing atoms or bonds, and rewards come from property oracles. Methods include GCPN (graph convolutional policy network), MolDQN (deep Q-network), RationaleRL (property-aware substructure assembly), and REINVENT (SMILES-based policy gradient).

Genetic Algorithms (GA)

GAs maintain a population of molecules and evolve them through mutation and crossover operations. GB-GA operates on molecular graphs, GA+D uses SELFIES with adversarial discriminator enhancement, and JANUS uses SELFIES with parallel exploration strategies.

Bayesian Optimization (BO)

BO builds a Gaussian process surrogate of the objective function and uses an acquisition function to decide which molecules to evaluate next. It is often combined with VAE latent spaces (Constrained-BO-VAE, MSO) to enable continuous optimization.
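
The summary does not name a specific acquisition function. As one common choice, the sketch below computes expected improvement (EI) for a maximization problem, assuming the surrogate returns a Gaussian posterior mean and standard deviation for each candidate.

```python
import math

def expected_improvement(mean, std, best_so_far):
    """EI for maximization under a Gaussian posterior N(mean, std^2)."""
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly.
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf

# Same predicted mean, different uncertainty: the less certain candidate scores higher.
print(expected_improvement(0.0, 1.0, 0.0), expected_improvement(0.0, 2.0, 0.0))
```

The candidate with the highest acquisition value would be sent to the oracle next, which is how BO trades off exploiting good predictions against exploring uncertain regions.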

Monte Carlo Tree Search (MCTS)

MCTS explores the molecular construction tree by branching and evaluating promising intermediates. ChemTS and MP-MCTS combine MCTS with autoregressive SMILES generators.

MCMC Sampling

MCMC methods (MIMOSA, MARS) formulate molecule optimization as sampling from a target distribution defined by multiple property objectives, using graph neural networks as proposal distributions.

Other Approaches

The survey also identifies two additional paradigms that do not fit neatly into either DGM or COM categories. Optimal Transport (OT) is used when matching between groups of molecules, particularly for conformation generation where each molecule has multiple associated 3D structures (e.g., GeoMol, EquiBind). Differentiable Learning formulates discrete molecules as differentiable objects, enabling gradient-based continuous optimization directly on molecular graphs (e.g., DST).

Task Taxonomy: Eight Molecule Generation Tasks

The survey’s central organizational contribution is a unified taxonomy of eight distinct molecule design tasks, defined by three axes: (1) whether generation is de novo (from scratch, no reference molecule) or conditioned on an input molecule, (2) whether the goal is generation (distribution learning, producing valid and diverse molecules) or optimization (goal-directed search for molecules with specific properties), and (3) the input/output data representation (1D string, 2D graph, 3D geometry). The paper’s Table 2 maps all combinations of these axes, showing that many are not meaningful (e.g., 1D string input to 2D graph output with no goal). Only eight combinations correspond to active research areas.

1D/2D Tasks

De novo 1D/2D molecule generation: Generate new molecules from scratch to match a training distribution. Methods span VAEs (ChemVAE, JT-VAE), flows (GraphNVP, MoFlow, GraphAF), GANs (MolGAN, ORGAN), ARs (MolecularRNN), and EBMs (GraphEBM).
De novo 1D/2D molecule optimization: Generate molecules with optimal properties from scratch, using oracle feedback. Methods include RL (GCPN, MolDQN), GA (GB-GA, JANUS), MCTS (ChemTS), and MCMC (MIMOSA, MARS).
1D/2D molecule optimization: Optimize properties of a given input molecule via local search. Methods include graph-to-graph translation (VJTNN, CORE, MOLER), VAE+BO (MSO, Constrained-BO-VAE), GANs (Mol-CycleGAN, LatentGAN), and differentiable approaches (DST).

3D Tasks

De novo 3D molecule generation: Generate novel 3D molecular structures from scratch, respecting geometric validity. Methods include ARs (G-SchNet, G-SphereNet), VAEs (3DMolNet), flows (E-NFs), and RL (MolGym).
De novo 3D conformation generation: Generate 3D conformations from given 2D molecular graphs. Methods include VAEs (CVGAE, ConfVAE), diffusion models (ConfGF, GeoDiff, DGSM), and optimal transport (GeoMol).
De novo binding-based 3D molecule generation: Design 3D molecules for specific protein binding pockets. Methods include density-based VAEs (liGAN), RL (DeepLigBuilder), and ARs (3DSBDD).
De novo binding-pose conformation generation: Find the appropriate 3D conformation of a given molecule for a given protein pocket. Methods include EBMs (DeepDock) and optimal transport (EquiBind).
3D molecule optimization: Optimize 3D molecular properties (scaffold replacement, conformation refinement). Methods include BO (BOA), ARs (3D-Scaffold, cG-SchNet), and VAEs (Coarse-GrainingVAE).

Evaluation Metrics

The survey organizes evaluation metrics into four categories.

Generation Evaluation

Basic metrics assess the quality of generated molecules:

Validity: fraction of chemically valid molecules among all generated molecules
Novelty: fraction of generated molecules absent from the training set
Uniqueness: fraction of distinct molecules among generated samples
Quality: fraction passing a predefined chemical rule filter
Diversity (internal/external): measured via pairwise similarity (Tanimoto, scaffold, or fragment) within generated set and between generated and training sets
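
The first three metrics and the Tanimoto similarity used for diversity can be sketched as follows. The code follows the definitions as listed above (all fractions are over the full generated set) and assumes molecules are already canonical strings. The validity check is a hypothetical stand-in that only tests parenthesis balance; a real pipeline would call a cheminformatics toolkit instead.

```python
def balanced_parentheses(smiles):
    """Hypothetical stand-in for a real validity check: branch parentheses must balance."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def generation_metrics(generated, training, is_valid):
    """Validity, uniqueness, and novelty as fractions of all generated samples."""
    total = len(generated)
    training_set = set(training)
    return {
        "validity": sum(1 for s in generated if is_valid(s)) / total,
        "uniqueness": len(set(generated)) / total,
        "novelty": sum(1 for s in generated if s not in training_set) / total,
    }

def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0

metrics = generation_metrics(["CCO", "CCO", "C(C", "CCN"], ["CCO"], balanced_parentheses)
print(metrics, tanimoto({1, 2, 3}, {2, 3, 4}))
```

Some benchmarks compute uniqueness and novelty over the valid subset only, so reported numbers are comparable only when the denominators match.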

Distribution Evaluation

Metrics measuring how well generated molecules capture the training distribution: KL divergence over physicochemical descriptors, Fréchet ChemNet Distance (FCD), and Maximum Mean Discrepancy (MMD).

Optimization Evaluation

Property oracles used as optimization targets: Synthetic Accessibility (SA), Quantitative Estimate of Drug-likeness (QED), LogP, kinase inhibition scores (GSK3-beta, JNK3), DRD2 activity, GuacaMol benchmark oracles, and Vina docking scores. Constrained optimization additionally considers structural similarity to reference molecules via Tanimoto, scaffold, or fragment similarity.

3D Evaluation

3D-specific metrics include stability (matching valence rules in 3D), RMSD and Kabsch-RMSD (conformation alignment), and Coverage/Matching scores for conformation ensembles.
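
A sketch of the ensemble metrics, using the definitions common in the conformer generation literature (assumed here, since the summary does not state them): Coverage is the fraction of reference conformers with at least one generated conformer within an RMSD threshold, and Matching is the mean, over reference conformers, of the smallest RMSD to any generated conformer. The RMSD below assumes matched atom order and skips the Kabsch alignment step.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two conformers with matched atom order and no alignment."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(coords_a))

def coverage_and_matching(reference, generated, threshold):
    """Return (coverage, matching) of a generated ensemble against a reference ensemble."""
    best = [min(rmsd(ref, gen) for gen in generated) for ref in reference]
    coverage = sum(1 for d in best if d < threshold) / len(best)
    matching = sum(best) / len(best)
    return coverage, matching

# Two hypothetical two-atom reference conformers, one generated conformer.
reference = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)]]
generated = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
print(coverage_and_matching(reference, generated, threshold=0.5))
```

Higher coverage and lower matching both indicate that the generated ensemble reproduces the reference conformers more closely.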

Datasets

The survey catalogs 12 major datasets spanning 1D/2D and 3D molecule generation:

Dataset	Scale	Dimensionality	Purpose
ZINC	250K	1D/2D	Virtual screening compounds
ChEMBL	2.1M	1D/2D	Bioactive molecules
MOSES	1.9M	1D/2D	Benchmarking generation
CEPDB	4.3M	1D/2D	Organic photovoltaics
GDB-13	970M	1D/2D	Enumerated small molecules
QM9	134K	1D/2D/3D	Quantum chemistry properties
GEOM	450K/37M	1D/2D/3D	Conformer ensembles
ISO17	200/431K	1D/2D/3D	Molecule-conformation pairs
Molecule3D	3.9M	1D/2D/3D	DFT ground-state geometries
CrossDock2020	22.5M	1D/2D/3D	Docked ligand poses
scPDB	16K	1D/2D/3D	Binding sites
DUD-E	23K	1D/2D/3D	Active compounds with decoys

Challenges and Opportunities

Challenges

Out-of-distribution generation: Most deep generative models imitate known molecule distributions and struggle to explore truly novel chemical space.
Unrealistic problem formulation: Many task setups do not respect real-world chemistry constraints.
Expensive oracle calls: Methods typically assume unlimited access to property evaluators, which is unrealistic in drug discovery.
Lack of interpretability: Few methods explain why generated molecules have desired properties. Quantitative interpretability evaluation remains an open problem.
No unified evaluation protocols: The field lacks consensus on what defines a “good” drug candidate and how to fairly compare methods.
Insufficient benchmarking: Despite the enormous chemical space ($10^{23}$ to $10^{60}$ drug-like molecules), available benchmarks use only small fractions of large databases.
Low-data regime: Many real-world applications have limited training data, and generating molecules under data scarcity remains difficult.

Opportunities

Extension to complex structured data: Techniques from small molecule generation may transfer to proteins, antibodies, genes, crystal structures, and polysaccharides.
Connection to later drug development phases: Bridging the gap between molecule design and preclinical/clinical trial outcomes could improve real-world impact.
Knowledge discovery: Generative models over molecular latent spaces could reveal chemical rules governing molecular properties, and graph structure learning could uncover implicit non-bonded interactions.

Limitations

The survey was published in March 2022, so it does not cover subsequent advances in diffusion models for molecules (e.g., EDM, DiffSBDD), large language models applied to chemistry, or flow matching approaches.
Coverage focuses on small molecules. Macromolecule design (proteins, nucleic acids) is noted as a future direction rather than surveyed.
The survey catalogs methods but does not provide head-to-head experimental comparisons across all 100+ methods. Empirical discussion relies on individual papers’ reported results.
1D string-based methods receive less detailed coverage than graph and geometry-based approaches, reflecting the field’s shift toward structured representations at the time of writing.
As a survey, this paper produces no code, models, or datasets. The surveyed methods’ individual repositories are referenced in their original publications but are not aggregated here.

Paper Information

Citation: Du, Y., Fu, T., Sun, J., & Liu, S. (2022). MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design. arXiv preprint arXiv:2203.14500.

Publication: arXiv preprint, March 2022. Note: This survey covers literature through early 2022.

Additional Resources:

arXiv: 2203.14500

@article{du2022molgensurvey,
  title={MolGenSurvey: A Systematic Survey in Machine Learning Models for Molecule Design},
  author={Du, Yuanqi and Fu, Tianfan and Sun, Jimeng and Liu, Shengchao},
  journal={arXiv preprint arXiv:2203.14500},
  year={2022}
}