A Systematization of Generative AI for Drug Design

This systematization paper provides a broad survey of generative AI methods applied to de novo drug design. The survey organizes the field into two overarching themes: small molecule generation and protein generation. Within each theme, the authors identify subtasks, catalog datasets and benchmarks, describe model architectures, and compare the performance of leading methods using standardized metrics. The paper covers over 200 references and provides 12 comparative benchmark tables.

The primary contribution is a unified organizational framework that allows both micro-level comparisons within each subtask and macro-level observations across the two application domains. The authors highlight parallel developments in both fields, particularly the shift from sequence-based to structure-based approaches and the growing dominance of diffusion models.

The Challenge of Navigating De Novo Drug Design

The drug design process requires creating ligands that interact with specific biological targets. These range from small molecules (tens of atoms) to large proteins (monoclonal antibodies). Traditional discovery methods are computationally expensive, with preclinical trials costing hundreds of millions of dollars and taking 3-6 years. The chemical space of potential drug-like compounds is estimated at $10^{23}$ to $10^{60}$, making brute-force exploration infeasible.

AI-driven generative methods have gained traction in recent years: AI-focused biotech companies now have over 150 small-molecule drug candidates in the discovery phase and 15 in clinical trials, and the pipeline of AI-fueled drug design programs has grown by almost 40% each year.

The rapid development of the field, combined with its inherent complexity, creates barriers for new researchers. Several prior surveys exist, but they focus on specific aspects: molecule generation, protein generation, antibody generation, or specific model architectures like diffusion models. This survey takes a broader approach, covering both molecule and protein generation under a single organizational framework.

Unified Taxonomy: Two Themes, Seven Subtasks

The survey’s core organizational insight is structuring de novo drug design into two themes with distinct subtasks, while identifying common architectural patterns across them.

Generative Model Architectures

The survey covers four main generative model families used across both molecule and protein generation:

Variational Autoencoders (VAEs) encode inputs into a latent distribution and decode from sampled points. The encoder maps input $x$ to a distribution parameterized by mean $\mu_\phi(x)$ and variance $\sigma^2_\phi(x)$. Training minimizes reconstruction loss plus KL divergence:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$$

where the KL loss is:

$$\mathcal{L}_{\text{KL}} = -\frac{1}{2} \sum_{k} \left(1 + \log\!\left(\left(\sigma_k^{(i)}\right)^{2}\right) - \left(\mu_k^{(i)}\right)^{2} - \left(\sigma_k^{(i)}\right)^{2}\right)$$
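A minimal NumPy sketch of this objective, assuming a diagonal Gaussian posterior and a squared-error reconstruction term (function names are illustrative, not from the survey):

```python
import numpy as np

def kl_divergence(mu, log_var):
    """KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior.

    mu, log_var: arrays of shape (latent_dim,) holding the encoder's
    predicted mean and log-variance for one input.
    """
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Reconstruction loss (squared error) plus beta-weighted KL term."""
    recon = np.sum((x - x_recon) ** 2)
    return recon + beta * kl_divergence(mu, log_var)
```

Setting `beta > 1` (as in beta-VAE) trades reconstruction fidelity for a smoother latent space, which matters when sampling novel molecules from the prior.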

Generative Adversarial Networks (GANs) use a generator-discriminator game. The generator $G$ creates instances from random noise $z$ sampled from a prior $p_z(z)$, while the discriminator $D$ distinguishes real from synthetic data:

$$\min_{G} \max_{D} \mathbb{E}_x[\log D(x; \theta_d)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$$
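The two sides of this minimax game can be written down directly. The sketch below takes discriminator outputs as probabilities and uses the non-saturating generator loss common in practice (names are illustrative):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Discriminator objective: maximize E[log D(x)] + E[log(1 - D(G(z)))].

    d_real, d_fake: arrays of discriminator outputs in (0, 1) for a
    batch of real and generated samples, respectively.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator objective: minimize -E[log D(G(z))].

    Used in practice instead of minimizing log(1 - D(G(z))), whose
    gradient vanishes early in training when D easily rejects fakes.
    """
    return -np.mean(np.log(d_fake))
```

At the theoretical equilibrium the discriminator outputs 0.5 everywhere and its objective equals $-2\log 2$.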

Flow-Based Models generate data by applying an invertible function $f: z \mapsto x$ that transforms a simple latent distribution (e.g., a Gaussian) into the target distribution. The log-likelihood follows from the change-of-variable formula, with $z = f^{-1}(x)$:

$$\log p(x) = \log p_0(z) - \log \left| \det \frac{\partial f}{\partial z} \right|$$
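A 1-D affine flow makes the formula concrete; note that the forward map's Jacobian enters the log-likelihood with a negative sign (equivalently, a positive sign for the inverse map's Jacobian). This is an illustrative sketch, not code from the survey:

```python
import numpy as np

def affine_flow_log_prob(x, a, b):
    """Log-density of x under a 1-D affine flow f(z) = a*z + b with a
    standard-normal base distribution, via the change-of-variable formula:
    log p(x) = log p0(z) - log|det df/dz|, where z = f^{-1}(x) = (x - b)/a.
    """
    z = (x - b) / a
    log_p0 = -0.5 * (z**2 + np.log(2.0 * np.pi))  # standard-normal log-density
    return log_p0 - np.log(np.abs(a))
```

Because $f$ is invertible, the resulting density integrates to one and its mean shifts to $b$, exactly as for a Gaussian $\mathcal{N}(b, a^2)$.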

Diffusion Models gradually add Gaussian noise over $T$ steps in a forward process and learn to reverse the noising via a denoising neural network. The forward step is:

$$x_{t+1} = \sqrt{1 - \beta_t} x_t + \sqrt{\beta_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The training loss minimizes the difference between the true noise and the predicted noise:

$$L_t = \mathbb{E}_{t \sim [1,T],\, x_0,\, \epsilon_t} \left[ \left\| \epsilon_t - \epsilon_\theta(x_t, t) \right\|^2 \right]$$
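The forward process and the noise-matching objective fit in a few lines of NumPy; the linear beta schedule here is an illustrative choice, not one prescribed by the survey:

```python
import numpy as np

def noise_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linear variance schedule beta_1 ... beta_T (illustrative choice)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x, beta, eps):
    """One forward noising step: x_{t+1} = sqrt(1 - beta_t) x_t + sqrt(beta_t) eps."""
    return np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps

def denoising_loss(eps_true, eps_pred):
    """MSE between the true noise and the network's prediction (L_t above)."""
    return np.mean((eps_true - eps_pred) ** 2)
```

Iterating `forward_step` shows why the forward process destroys signal: after $T$ steps the data component is scaled by $\sqrt{\prod_t (1-\beta_t)}$, which decays toward zero, leaving nearly pure Gaussian noise for the learned reverse process to invert.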

Graph neural networks (GNNs), particularly equivariant GNNs (EGNNs), are commonly paired with these generative families to handle molecular and protein inputs: diffusion and flow-based models typically operate on 2D/3D graph representations, while VAEs and GANs are more often applied to 1D sequence representations.

Small Molecule Generation: Tasks, Datasets, and Models

Target-Agnostic Molecule Design

The goal is to generate a set of novel, valid, and stable molecules without conditioning on any specific biological target. Models are evaluated on atom stability, molecule stability, validity, uniqueness, novelty, and QED (Quantitative Estimate of Drug-Likeness).
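These sample-level metrics reduce to set operations once a validity check is available. The sketch below works on SMILES strings and injects the validity predicate (in practice an RDKit parse/sanitization check) so it stays dependency-free; all names are illustrative:

```python
def generation_metrics(generated, training_set, is_valid):
    """Validity, uniqueness, and novelty for a batch of generated molecules.

    generated    : list of SMILES strings produced by the model
    training_set : collection of SMILES seen during training
    is_valid     : predicate for chemical validity (in practice an
                   RDKit parse/sanitization check)
    """
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

Note the conventional chaining: uniqueness is computed over valid molecules only, and novelty over the unique valid ones.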

Datasets: QM9 (small stable molecules from GDB-17) and GEOM-Drugs (more complex, drug-like molecules).

The field has shifted from SMILES-based VAEs (CVAE, GVAE, SD-VAE) to 2D graph methods (JTVAE) and then to 3D diffusion-based models. Current leading methods on QM9:

| Model | Type | At. Stb. (%) | Mol. Stb. (%) | Valid (%) | Val./Uniq. (%) |
|---|---|---|---|---|---|
| MiDi | EGNN, Diffusion | 99.8 | 97.5 | 97.9 | 97.6 |
| MDM | EGNN, VAE, Diffusion | 99.2 | 89.6 | 98.6 | 94.6 |
| JODO | EGNN, Diffusion | 99.2 | 93.4 | 99.0 | 96.0 |
| GeoLDM | VAE, Diffusion | 98.9 | 89.4 | 93.8 | 92.7 |
| EDM | EGNN, Diffusion | 98.7 | 82.0 | 91.9 | 90.7 |

EDM provided an initial baseline using diffusion with an equivariant GNN. GCDM introduced attention-based geometric message-passing. MDM separately handles covalent bond edges and Van der Waals forces, and also addresses diversity through an additional distribution-controlling noise variable. GeoLDM maps molecules to a lower-dimensional latent space for more efficient diffusion. MiDi uses a “relaxed” EGNN and jointly models 2D and 3D information through a graph representation capturing both spatial and connectivity data.

On the larger GEOM-Drugs dataset, performance drops for most models:

| Model | At. Stb. (%) | Mol. Stb. (%) | Valid (%) | Val./Uniq. (%) |
|---|---|---|---|---|
| MiDi | 99.8 | 91.6 | 77.8 | 77.8 |
| MDM | 62.2 | — | 99.5 | 99.0 |
| GeoLDM | 84.4 | — | 99.3 | — |
| EDM | 81.3 | — | — | — |

MiDi distinguishes itself for generating more stable complex molecules, though at the expense of validity. Models generally perform well on QM9 but show room for improvement on more complex GEOM-Drugs molecules.

Target-Aware Molecule Design

Target-aware generation produces molecules for specific protein targets, using either ligand-based (LBDD) or structure-based (SBDD) approaches. SBDD methods have become more prevalent as protein structure information becomes increasingly available.

Datasets: CrossDocked2020 (22.5M ligand-protein pairs), ZINC20, Binding MOAD.

Metrics: Vina Score (docking energy), High Affinity Percentage, QED, SA Score (synthetic accessibility), Diversity (Tanimoto similarity).
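Diversity is typically reported as the average pairwise Tanimoto distance over molecular fingerprints (e.g., RDKit Morgan fingerprints). A dependency-free sketch over fingerprints represented as bit-index sets, with illustrative names:

```python
from itertools import combinations

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints as bit-index sets:
    |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def diversity(fingerprints):
    """Average pairwise Tanimoto distance (1 - similarity) over a batch."""
    pairs = list(combinations(fingerprints, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)
```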

| Model | Type | Vina Score | High Affinity (%) | QED | SA | Diversity |
|---|---|---|---|---|---|---|
| DiffSBDD | EGNN, Diffusion | -7.333 | — | 0.467 | 0.554 | 0.758 |
| Luo et al. | SchNet | -6.344 | 29.09 | 0.525 | 0.657 | 0.720 |
| TargetDiff | EGNN, Diffusion | -6.30 | 58.1 | 0.48 | 0.58 | 0.72 |
| LiGAN | CNN, VAE | -6.144 | 21.1 | 0.39 | 0.59 | 0.66 |
| Pocket2Mol | EGNN, MLP | -5.14 | 48.4 | 0.56 | 0.74 | 0.69 |

DrugGPT is an LBDD autoregressive model using transformers on tokenized protein-ligand pairs. Among the SBDD models, LiGAN introduces a 3D CNN-VAE framework, Pocket2Mol emphasizes binding pocket geometry using an EGNN with geometric vector MLP layers, and Luo et al. model atomic probabilities in the binding site using SchNet. TargetDiff performs diffusion on an EGNN and optimizes binding affinity by reflecting low atom type entropy. DiffSBDD applies an inpainting approach by masking and replacing segments of ligand-protein complexes. DiffSBDD leads in Vina score and diversity, while TargetDiff leads in high affinity. Interestingly, diffusion-based methods are outperformed by Pocket2Mol on drug-likeness metrics (QED and SA).

Molecular Conformation Generation

Conformation generation involves producing 3D structures from 2D connectivity graphs. Models are evaluated on Coverage (COV, percentage of ground-truth conformations “covered” within an RMSD threshold) and Matching (MAT, average RMSD to closest ground-truth conformation).
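Given a precomputed RMSD matrix between generated and ground-truth conformers, both metrics reduce to a column-wise minimum. A sketch under that assumption (the recall-style variants; precision-style variants swap the axes):

```python
import numpy as np

def coverage_and_matching(rmsd, threshold=1.25):
    """COV and MAT from a pairwise RMSD matrix.

    rmsd      : array of shape (n_generated, n_reference); entry [i, j]
                is the RMSD between generated conformer i and
                ground-truth conformer j.
    threshold : RMSD cutoff in angstroms (1.25 Å is standard for
                GEOM-Drugs; GEOM-QM9 uses a smaller cutoff).

    COV: fraction of ground-truth conformers matched by at least one
         generated conformer within the threshold.
    MAT: mean RMSD from each ground-truth conformer to its closest
         generated conformer.
    """
    best = rmsd.min(axis=0)  # closest generated conformer per reference
    cov = float(np.mean(best < threshold))
    mat = float(np.mean(best))
    return cov, mat
```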

Datasets: GEOM-QM9, GEOM-Drugs, ISO17.

| Model | Type | GEOM-QM9 COV (%) | GEOM-QM9 MAT (Å) | GEOM-Drugs COV (%) | GEOM-Drugs MAT (Å) |
|---|---|---|---|---|---|
| Torsional Diff. | Diffusion | 92.8 | 0.178 | 72.7* | 0.582 |
| DGSM | MPNN, Diffusion | 91.49 | 0.2139 | 78.73 | 1.0154 |
| GeoDiff | GFN, Diffusion | 90.07 | 0.209 | 89.13 | 0.8629 |
| ConfGF | GIN, Diffusion | 88.49 | 0.2673 | 62.15 | 1.1629 |
| GeoMol | MPNN | 71.26 | 0.3731 | 67.16 | 1.0875 |

*Torsional Diffusion uses a 0.75 Å threshold instead of the standard 1.25 Å for GEOM-Drugs coverage, leading to a deflated score. It outperforms GeoDiff and GeoMol when evaluated at the same threshold.

Torsional Diffusion operates in the space of torsion angles rather than Cartesian coordinates, allowing for improved representation and fewer denoising steps. GeoDiff uses Euclidean-space diffusion, treating each atom as a particle and incorporating Markov kernels that preserve E(3) equivariance through a graph field network (GFN) layer.

Protein Generation: From Sequence to Structure

Protein Representation Learning

Representation learning creates embeddings for protein inputs to support downstream tasks. Models are evaluated on contact prediction, fold classification (at family, superfamily, and fold levels), and stability prediction (Spearman’s $\rho$).

Key models include: UniRep (mLSTM RNN), ProtBERT (BERT applied to amino acid sequences), ESM-1B (33-layer, 650M parameter transformer), MSA Transformer (pre-trained on MSA input), and GearNET (Geo-EGNN using 3D structure with directed edges). OntoProtein and KeAP incorporate knowledge graphs for direct knowledge injection.

Protein Structure Prediction

Given an amino acid sequence, models predict 3D point coordinates for each residue. Evaluated using RMSD, GDT-TS, TM-score, and LDDT on CASP14 and CAMEO benchmarks.
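For structures that are already superimposed, RMSD and TM-score are direct to compute. The sketch below omits the superposition/alignment search that full implementations perform, so it is illustrative only:

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """RMSD between two aligned (L, 3) coordinate arrays."""
    return float(np.sqrt(np.mean(np.sum((coords_a - coords_b) ** 2, axis=1))))

def tm_score(coords_pred, coords_true):
    """TM-score for aligned structures of equal length L:
    TM = (1/L) * sum_i 1 / (1 + (d_i / d0)^2), with the length-dependent
    scale d0 = 1.24 * (L - 15)^(1/3) - 1.8 for L > 21."""
    L = len(coords_true)
    d0 = 1.24 * (L - 15) ** (1.0 / 3.0) - 1.8 if L > 21 else 0.5
    d = np.linalg.norm(coords_pred - coords_true, axis=1)
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))
```

Unlike RMSD, which is length-sensitive and dominated by the worst-modeled regions, TM-score's $d_0$ normalization keeps scores comparable across protein sizes, which is why it is preferred for ranking predictions.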

AlphaFold2 is the landmark model, integrating MSA and pair representations through transformers with invariant point attention (IPA). ESMFold uses ESM-2 language model representations instead of MSAs, achieving faster processing. RoseTTAFold uses a three-track neural network learning from 1D sequence, 2D distance map, and 3D backbone coordinate information simultaneously. EigenFold uses diffusion, representing the protein as a system of harmonic oscillators.

| Model | Type | CAMEO RMSD | CAMEO TM-score | CAMEO GDT-TS | CAMEO lDDT | CASP14 TM-score |
|---|---|---|---|---|---|---|
| AlphaFold2 | Transformer | 3.30 | 0.87 | 0.86 | 0.90 | 0.38 |
| ESMFold | Transformer | 3.99 | 0.85 | 0.83 | 0.87 | 0.68 |
| RoseTTAFold | Transformer | 5.72 | 0.77 | 0.71 | 0.79 | 0.37 |
| EigenFold | Diffusion | 7.37 | 0.75 | 0.71 | 0.78 | — |

Sequence Generation (Inverse Folding)

Given a fixed protein backbone structure, models generate amino acid sequences that will fold into that structure. The space of valid sequences is between $10^{65}$ and $10^{130}$.

Evaluated using Amino Acid Recovery (AAR), diversity, RMSD, nonpolar loss, and perplexity (PPL):

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_1, x_2, \ldots, x_{i-1})\right)$$
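Perplexity is the exponentiated negative mean log-likelihood, so lower is better; a model assigning uniform probability over the 20 amino acids scores exactly PPL = 20. A minimal sketch:

```python
import numpy as np

def perplexity(log_probs):
    """Perplexity from per-token log-probabilities log P(x_i | x_<i):
    PPL = exp(-(1/N) * sum_i log P(x_i | x_<i))."""
    return float(np.exp(-np.mean(log_probs)))
```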

ProteinMPNN is the current top performer, generating the most accurate sequences and leading in AAR, RMSD, and nonpolar loss. It uses a message-passing neural network with a flexible, order-agnostic autoregressive approach.

| Model | Type | AAR (%) | Div. | RMSD | Non. | Time (s) |
|---|---|---|---|---|---|---|
| ProteinMPNN | MPNN | 48.7 | 0.168 | 1.019 | 1.061 | 112 |
| ESM-IF1 | Transformer | 47.7 | 0.184 | 1.265 | 1.201 | 1980 |
| GPD | Transformer | 46.2 | 0.219 | 1.758 | 1.333 | 35 |
| ABACUS-R | Transformer | 45.7 | 0.124 | 1.482 | 0.968 | 233280 |
| 3D CNN | CNN | 44.5 | 0.272 | 1.62 | 1.027 | 536544 |
| PiFold | GNN | 42.8 | 0.141 | 1.592 | 1.464 | 221 |
| ProteinSolver | GNN | 24.6 | 0.186 | 5.354 | 1.389 | 180 |

Results are from the independent benchmark by Yu et al. GPD remains the fastest method, generating sequences around three times faster than ProteinMPNN. Current SOTA models recover fewer than half of target amino acid residues, indicating room for improvement.

Backbone Design

Backbone design creates protein structures from scratch, representing the core of de novo protein design. Models generate coordinates for backbone atoms (nitrogen, alpha-carbon, carbonyl, oxygen) and use external tools like Rosetta for side-chain packing.

Two evaluation paradigms exist: context-free generation (evaluated by self-consistency TM, or scTM) and context-given generation (inpainting, evaluated by AAR, PPL, RMSD).

ProtDiff represents residues as 3D Cartesian coordinates and uses particle-filtering diffusion. FoldingDiff instead uses an angular representation (six angles per residue) with a BERT-based DDPM. LatentDiff embeds proteins into a latent space using an equivariant autoencoder, then applies equivariant diffusion, analogous to GeoLDM for molecules. These early models work well for short proteins (up to 128 residues) but struggle with longer structures.

Frame-based methods address this scaling limitation. Genie uses Frenet-Serret frames with paired residue representations and IPA for noise prediction. FrameDiff parameterizes backbone structures on the $SE(3)^N$ manifold of frames using a score-based generative model. RFDiffusion is the current leading model, combining RoseTTAFold structure prediction with diffusion. It fine-tunes RoseTTAFold weights on a masked input sequence and random noise coordinates, using “self-conditioning” on predicted structures. Protpardelle co-designs sequence and structure by creating a “superposition” over possible sidechain states and collapsing them during each iterative diffusion step.

| Model | Type | scTM (%) | Design. (%) | PPL | AAR (%) | RMSD |
|---|---|---|---|---|---|---|
| RFDiffusion | Diffusion | 95.1 | — | — | — | — |
| Protpardelle | Diffusion | 85 | — | — | — | — |
| FrameDiff | Diffusion | 84 | 48.3 | — | — | — |
| Genie | Diffusion | 81.5 | 79.0 | — | — | — |
| LatentDiff | EGNN, Diffusion | 31.6 | — | — | — | — |
| FoldingDiff | Diffusion | 14.2 | — | — | — | — |
| ProtDiff | EGNN, Diffusion | 11.8 | — | 12.47* | — | 8.01* |

*ProtDiff context-given results are tested only on beta-lactamase metalloproteins from PDB.

Antibody Design

The survey covers antibody structure prediction, representation learning, and CDR-H3 generation. Antibodies are Y-shaped proteins with complementarity-determining regions (CDRs), where CDR-H3 is the most variable and functionally important region.

For CDR-H3 generation, models have progressed from sequence-based (LSTM) to structure-based (RefineGNN) to sequence-structure co-design approaches (MEAN, AntiDesigner, DiffAb). dyMEAN is the current leading model, providing an end-to-end method that incorporates structure prediction, docking, and CDR generation in a single framework. For structure prediction, the hypervariable CDR loops lack the evolutionary conservation that multiple sequence alignments rely on, which makes MSA-dependent models like AlphaFold2 inefficient for antibodies. Specialized models like IgFold instead combine sequence embeddings from AntiBERTy with invariant point attention to achieve faster antibody structure prediction.

Peptide Design

The survey briefly covers peptide generation, including models for therapeutic peptide generation (MMCD), peptide-protein interaction prediction (PepGB), peptide representation learning (PepHarmony), peptide sequencing (AdaNovo), and signal peptide prediction (PEFT-SP).

Cross-Cutting Trends

The survey identifies several parallel trends across molecule and protein generation:

  1. Shift from sequence to structure: In molecule generation, graph-based diffusion models (GeoLDM, MiDi, TargetDiff) now dominate. In protein generation, structure-based representation learning (GearNET) and diffusion-based backbone design (RFDiffusion) have overtaken sequence-only methods.

  2. Dominance of E(3) equivariant architectures: EGNNs appear across nearly all subtasks, reflecting the physical requirement that molecular and protein properties should be invariant to rotation and translation.

  3. Structure-based over ligand-based approaches: In target-aware molecule design, SBDD methods that use 3D protein structures demonstrate clear advantages over LBDD approaches that operate on amino acid sequences alone.
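The equivariance requirement in point 2 is easy to check empirically: any feature built purely from pairwise distances is unchanged by rigid-body motions. EGNN layers exploit exactly this, mixing such invariant features into messages while updating coordinates equivariantly. A minimal invariance check (illustrative, not an EGNN implementation):

```python
import numpy as np

def distance_features(coords):
    """Pairwise-distance matrix for a point cloud of shape (n, 3).
    Distances depend only on relative geometry, so anything computed
    from them is invariant to global rotation and translation."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def rotation_z(theta):
    """3x3 rotation matrix about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
```

Applying any `rotation_z(theta)` plus a translation to the input coordinates leaves `distance_features` unchanged, which is the property the architectures above bake in by construction.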

Challenges

For small molecule generation:

  • Complexity: Models perform well on simple QM9 but struggle with complex GEOM-Drugs molecules.
  • Applicability: Generating molecules with high binding affinity to targets remains difficult.
  • Explainability: Methods are black-box, offering no insight into why generated molecules have desired properties.

For protein generation:

  • Benchmarking: Protein generative tasks lack a standard evaluative procedure, with variance between each model’s metrics and testing conditions.
  • Performance: SOTA models still struggle with fold classification, gene ontology, and antibody CDR-H3 generation.

The authors also note that many generative tasks are evaluated using predictive models (e.g., classifier networks for binding affinity or molecular properties). Improvements to these classification methods would lead to more precise alignment with real-world biological applications.

Future Directions

The authors identify increasing performance in existing tasks, defining more applicable tasks (especially in molecule-protein binding, antibody generation), and exploring entirely new areas of research as key future directions.


Reproducibility Details

As a survey paper, this work does not produce new models, datasets, or experimental results. All benchmark numbers reported are from the original papers cited.

Data

The survey catalogs the following key datasets across subtasks:

| Subtask | Datasets | Notes |
|---|---|---|
| Target-agnostic molecule | QM9, GEOM-Drugs | QM9 from GDB-17; GEOM-Drugs for complex molecules |
| Target-aware molecule | CrossDocked2020, ZINC20, Binding MOAD | CrossDocked2020 most used (22.5M pairs) |
| Conformation generation | GEOM-QM9, GEOM-Drugs, ISO17 | Conformer sets for molecules |
| Protein structure prediction | PDB, CASP14, CAMEO | CASP biennial blind evaluation |
| Protein sequence generation | PDB, UniRef, UniParc, CATH, TS500 | CATH for domain classification |
| Backbone design | PDB, AlphaFoldDB, SCOP, CATH | AlphaFoldDB for expanded structural coverage |
| Antibody structure | SAbDab, RAB | SAbDab: all antibody structures from PDB |
| Antibody CDR generation | SAbDab, RAB, SKEMPI | SKEMPI for affinity optimization |

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| GenAI4Drug | Code | Not specified | Organized repository of all covered sources |

Paper Information

Citation: Tang, X., Dai, H., Knight, E., Wu, F., Li, Y., Li, T., & Gerstein, M. (2024). A survey of generative AI for de novo drug design: New frontiers in molecule and protein generation. Briefings in Bioinformatics, 25(4), bbae338. https://doi.org/10.1093/bib/bbae338

Publication: Briefings in Bioinformatics, Volume 25, Issue 4, 2024.

Additional Resources:

@article{tang2024survey,
  title={A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation},
  author={Tang, Xiangru and Dai, Howard and Knight, Elizabeth and Wu, Fang and Li, Yunyang and Li, Tianxiao and Gerstein, Mark},
  journal={Briefings in Bioinformatics},
  volume={25},
  number={4},
  pages={bbae338},
  year={2024},
  doi={10.1093/bib/bbae338}
}