<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Target-Aware Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/</link><description>Recent content in Target-Aware Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 26 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/index.xml" rel="self" type="application/rss+xml"/><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys, used to scale the dot products. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
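<p>The attention and multihead equations above can be sketched directly in NumPy. This is an illustrative stand-in for the tensor2tensor implementation; all shapes and weights below are arbitrary, not the paper's:</p>

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multihead(Q, K, V, W_q, W_k, W_v, W_o):
    """Run h heads in parallel, concatenate, and project with W_o."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 4          # toy sizes, not the paper's
d_head = d_model // h
x = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
out = multihead(x, x, x, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8)
```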
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
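<p>A minimal sketch of the sinusoidal encoding above; the 2050 x 128 shape mirrors the paper's maximum protein length and model width, but the function itself is the standard Transformer formulation:</p>

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal positional encodings from the original Transformer."""
    pos = np.arange(length)[:, None]            # position index
    i = np.arange(d_model // 2)[None, :]        # dimension pair index
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions
    return pe

pe = positional_encoding(2050, 128)  # longest allowed protein, d_model = 128
print(pe.shape)  # (2050, 128)
```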
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
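<p>One way to realize such a similarity-constrained split is sketched below. The paper's exact Monte Carlo procedure may differ; this version simply samples a test set and discards any training protein within the identity threshold, with <code>seq_identity</code> standing in for Needleman-Wunsch global alignment and <code>toy_identity</code> a hypothetical stand-in for testing:</p>

```python
import random

def split_by_similarity(proteins, seq_identity, n_test, threshold=0.2, seed=0):
    """Sample a test set, then keep only training proteins whose
    pairwise identity to every test protein is below `threshold`."""
    rng = random.Random(seed)
    pool = proteins[:]
    rng.shuffle(pool)
    test = pool[:n_test]
    train = [p for p in pool[n_test:]
             if all(seq_identity(p, t) < threshold for t in test)]
    return train, test

# Toy identity for illustration: same "family" = same first letter.
def toy_identity(a, b):
    return 1.0 if a[0] == b[0] else 0.0

prots = ["AKT1", "AKT2", "EGFR", "VEGFR", "BRAF", "ESR1"]
train, test = split_by_similarity(prots, toy_identity, n_test=2)
print(len(test))  # 2
```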
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K steps on a single GPU in Google Colaboratory</li>

<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
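<p>The decoding procedure can be sketched with a generic beam search; in "one per one" mode only <code>beams[0]</code> is kept, while "ten per one" keeps all returned hypotheses. The <code>toy_lm</code> model and token names are illustrative stand-ins for the trained decoder's softmax over the 71-symbol vocabulary:</p>

```python
import math

def beam_search(step_logprobs, beam_size, max_len, bos="<s>", eos="</s>"):
    """Generic beam search: `step_logprobs` maps a prefix (tuple of
    tokens) to {next_token: log-probability}."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:           # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for tok, lp in step_logprobs(tuple(tokens)).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams

# Toy model: always prefers emitting "C" over stopping.
def toy_lm(prefix):
    return {"C": math.log(0.7), "</s>": math.log(0.3)}

beams = beam_search(toy_lm, beam_size=2, max_len=2)
print(beams[0][0])  # ['<s>', 'C', 'C'] -- the "one per one" result
```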
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
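<p>The validity and uniqueness metrics above can be computed as sketched below. The paper checks validity with RDKit (<code>Chem.MolFromSmiles</code>); the balanced-parentheses rule here is a toy stand-in so the example is self-contained, and computing uniqueness only among valid molecules is an assumption:</p>

```python
def generation_metrics(smiles_list, is_valid):
    """Percent valid SMILES and percent unique among the valid ones."""
    valid = [s for s in smiles_list if is_valid(s)]
    pct_valid = 100.0 * len(valid) / len(smiles_list)
    pct_unique = 100.0 * len(set(valid)) / len(valid) if valid else 0.0
    return pct_valid, pct_unique

# Toy validity rule for illustration only: balanced parentheses.
def toy_valid(s):
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

samples = ["CC(=O)O", "c1ccccc1", "CC(=O)O", "CC(C"]
pct_valid, pct_unique = generation_metrics(samples, toy_valid)
print(pct_valid, round(pct_unique, 1))  # 75.0 66.7
```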
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
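<p>The nearest-neighbor novelty measure can be sketched with set-based fingerprints; the sets of "on" bits below are toy stand-ins for the RDKit fingerprints used in the paper:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between fingerprint bit sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b) if fp_a | fp_b else 0.0

def mean_nn_similarity(generated, training):
    """Mean nearest-neighbor Tanimoto similarity of generated molecules
    to the training set (lower means more novel structures)."""
    nn = [max(tanimoto(g, t) for t in training) for g in generated]
    return sum(nn) / len(nn)

train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
gen_fps = [{1, 2, 3, 4}, {7, 8}]  # one memorized, one novel
print(mean_nn_similarity(gen_fps, train_fps))  # 0.5
```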
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K steps</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i', h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
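<p>The triplet loss above can be sketched directly; the property values below are illustrative, and summing (rather than averaging) over conditions is an assumption:</p>

```python
import numpy as np

def triplet_property_loss(c, c_hat, c_dot):
    """L_Pred = max((c_hat - c)^2 - (c_hat - c_dot)^2, 0), per condition.
    c: desired property values; c_hat: differentiable MLP prediction;
    c_dot: value computed (e.g. by RDKit) from the generated SMILES."""
    return np.maximum((c_hat - c) ** 2 - (c_hat - c_dot) ** 2, 0.0).sum()

c = np.array([0.8, 2.0])       # desired QED, LogP (illustrative)
c_hat = np.array([0.6, 2.5])   # MLP prediction
c_dot = np.array([0.65, 2.4])  # measured on the generated molecule
print(triplet_property_loss(c, c_hat, c_dot))  # 0.2775
```

<p>The loss is zero whenever the prediction $\hat{\mathbf{c}}$ is already closer to the measured value $\dot{\mathbf{c}}$ than to the target $\mathbf{c}$, so gradients only flow when the generated molecule misses the requested property.</p>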
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
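<p>The Diversity metric (average pairwise Tanimoto distance) can be sketched as follows; the bit-set fingerprints are toy stand-ins for the Morgan fingerprints computed with RDKit:</p>

```python
from itertools import combinations

def diversity(fps):
    """Average pairwise Tanimoto distance (1 - similarity) over a set
    of molecule fingerprints, represented as sets of on-bits."""
    def tanimoto(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

fps = [{1, 2, 3}, {3, 4, 5}, {6, 7}]
div = diversity(fps)
print(round(div, 3))  # 0.933
```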
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
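<p>The relation-matrix construction can be sketched with a first-order finite difference. The paper sums over the five property conditions ($i = 2, \ldots, 6$, excluding the pocket) with $\Delta = 1$; this sketch sums over all entries of a toy condition vector, and <code>toy_map</code> is a hypothetical stand-in for the model's prefix attention map:</p>

```python
import numpy as np

def relation_matrix(attention_map, c, delta=1.0):
    """Finite-difference estimate of R = sum_i |dA/dc_i|, where
    `attention_map(c)` returns the prefix attention map for condition
    values c."""
    A0 = attention_map(c)
    R = np.zeros_like(A0)
    for i in range(len(c)):
        c_pert = c.copy()
        c_pert[i] += delta
        R += np.abs((attention_map(c_pert) - A0) / delta)
    return R

# Toy attention map: outer product of a condition-dependent vector.
def toy_map(c):
    v = np.tanh(c)
    return np.outer(v, v)

c = np.array([0.5, -0.2, 1.0])
R = relation_matrix(toy_map, c, delta=1e-3)
print(R.shape)  # (3, 3)
```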
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
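<p>As a sketch of the local-coordinate half of this fusion, the following assumes the standard internal-coordinate (NeRF-style) construction for placing an atom from $(r, \theta, \phi)$ and three reference positions; the paper&rsquo;s exact frame conventions are not specified, so the frame below is an illustrative choice:</p>

```python
import numpy as np

def place_atom(root1, root2, root3, r, theta, phi):
    """Place a new atom from local spherical coordinates: bond length r
    to root1, bond angle theta at root1, and dihedral phi about the
    root1-root2 axis. Angles in radians. The orthonormal frame built
    here is an illustrative convention, not the paper's implementation."""
    b2 = root1 - root2
    b2u = b2 / np.linalg.norm(b2)
    n = np.cross(root2 - root3, b2)        # normal of the root plane
    n /= np.linalg.norm(n)
    m = np.cross(n, b2u)                   # completes the frame
    # displacement expressed in the (b2u, m, n) frame
    d = r * np.array([-np.cos(theta),
                      np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi)])
    return root1 + d[0] * b2u + d[1] * m + d[2] * n
```

<p>Inference would then score candidate global positions within $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$ of this local prediction against the 3D decoder&rsquo;s output.</p>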
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
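<p>A minimal NumPy sketch of this biased attention, assuming a single head and precomputed bias matrices $B_D$ and $B_J$:</p>

```python
import numpy as np

def biased_attention(Q, K, V, B_D, B_J):
    """Scaled dot-product attention with additive structural biases:
    the distance bias B_D and edge-vector bias B_J are added to the
    attention logits before the softmax, as in the equation above.
    Shapes: Q, K, V are (n, d_k); B_D, B_J are (n, n)."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + B_D + B_J
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```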
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
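<p>The filtering step itself is simple; a dependency-free sketch, assuming QED and SAS values have already been computed for each molecule (in practice via RDKit&rsquo;s QED module and the contrib SA scorer):</p>

```python
def druglike_filter(scored_mols, qed_min=0.3, sas_max=5.0):
    """Keep only drug-like candidates before computing binding metrics.
    `scored_mols` is an iterable of (smiles, qed, sas) tuples with the
    scores precomputed upstream."""
    return [smiles for smiles, qed, sas in scored_mols
            if qed >= qed_min and sas <= sas_max]
```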
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
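<p>The Jensen-Shannon divergence behind these distance-distribution comparisons can be computed directly from binned histograms; a minimal sketch:</p>

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two histograms,
    e.g. binned atom-atom distance distributions of generated vs.
    reference molecules. Inputs are normalized to sum to 1."""
    p = np.asarray(p, float); p = p / p.sum()
    q = np.asarray(q, float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```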
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces both the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which only the 1.4M drawn from public sources are released). Diversity of generated drug-like molecules is slightly lower than that of the baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
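<p>A sketch of the denoising corruption described above, assuming atoms are held as element symbols alongside an $(N, 3)$ coordinate array; the jitter distribution (uniform) and the replacement symbol for corrupted carbons are illustrative assumptions:</p>

```python
import numpy as np

def corrupt(atoms, coords, rng, p_del=0.25, sigma=0.5, p_elem=0.25):
    """Apply the paper's three perturbations for denoising pretraining:
    delete ~25% of atoms, jitter coordinates within +/-0.5 A (uniform
    here, an assumption), and corrupt ~25% of carbon element types
    ('X' is an illustrative placeholder symbol)."""
    atoms = np.asarray(atoms, dtype=object)
    coords = np.asarray(coords, dtype=float)
    keep = rng.random(len(atoms)) >= p_del                      # deletion
    atoms, coords = atoms[keep], coords[keep]
    coords = coords + rng.uniform(-sigma, sigma, coords.shape)  # jitter
    flip = (atoms == "C") & (rng.random(len(atoms)) < p_elem)   # corruption
    atoms[flip] = "X"
    return atoms.tolist(), coords
```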
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evolutionary Molecular Design via Deep Learning + GA</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</guid><description>Kwon et al. combine an RNN decoder for SMILES reconstruction with a genetic algorithm operating on ECFP fingerprints for goal-directed molecular design.</description><content:encoded><![CDATA[<h2 id="fingerprint-based-evolutionary-molecular-design">Fingerprint-Based Evolutionary Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces an evolutionary design methodology (EDM) for goal-directed molecular optimization. The primary contribution is a four-component framework where (1) molecules are encoded as <a href="https://en.wikipedia.org/wiki/Extended-connectivity_fingerprint">extended-connectivity fingerprint</a> (ECFP) vectors, (2) a genetic algorithm evolves these fingerprint vectors through mutation and crossover, (3) a recurrent neural network (RNN) decodes the evolved fingerprints back into valid SMILES strings, and (4) a deep neural network (DNN) evaluates molecular fitness. The key advantage over prior evolutionary approaches is that no hand-crafted chemical rules or fragment libraries are needed, as the RNN learns valid molecular reconstruction from data.</p>
<h2 id="challenges-in-evolutionary-molecular-optimization">Challenges in Evolutionary Molecular Optimization</h2>
<p>Evolutionary algorithms for molecular design face two core challenges. First, maintaining chemical validity of evolved molecules is difficult when operating on graph or string representations directly. Prior methods rely on predefined chemical rules and fragment libraries to constrain structural modifications (atom/bond additions, deletions, substitutions), but these introduce bias and risk convergence to local optima. Each new application domain requires specifying new chemical rules, which may not exist for emerging areas. Second, fitness evaluation must be both efficient and accurate. Simple evaluation methods like structural similarity indices or semi-empirical quantum chemistry calculations reduce computational cost but may not capture complex property relationships.</p>
<p>High-throughput computational screening (HTCS) is a common alternative, but it depends on the quality of predefined virtual chemical libraries and often requires multiple iterative enumerations, limiting its ability to explore novel chemical space.</p>
<h2 id="core-innovation-evolving-fingerprints-with-neural-decoding">Core Innovation: Evolving Fingerprints with Neural Decoding</h2>
<p>The key insight is to perform genetic operations in fingerprint space rather than in molecular graph or SMILES string space. The framework comprises three core functions, two of them learned:</p>
<p><strong>Encoding function</strong> $e(\cdot)$: Converts a SMILES string $\mathbf{m}$ into a 5000-dimensional ECFP vector $\mathbf{x}$ using Morgan fingerprints with a neighborhood radius of 6. This is a deterministic hash-based encoding (not learned).</p>
<p><strong>Decoding function</strong> $d(\cdot)$: An RNN with three hidden layers of 500 LSTM units that reconstructs a SMILES string from an ECFP vector. The RNN generates SMILES as a sequence of three-character substrings, conditioning each prediction on the current substring and the input ECFP vector:</p>
<p>$$d(\mathbf{x}) = \mathbf{m}, \quad \mathbf{m}_{t+1} \sim p(\mathbf{m}_{t+1} \mid \mathbf{m}_{t}, \mathbf{x})$$</p>
<p>The three-character substring approach reduces the ratio of invalid SMILES by imposing additional constraints on subsequent characters.</p>
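<p>The substring decomposition itself is straightforward; a sketch, where the padding symbol is an illustrative assumption rather than the paper&rsquo;s vocabulary:</p>

```python
def to_substrings(smiles, n=3, pad="E"):
    """Split a SMILES string into the fixed-length substrings the RNN
    decodes one at a time; the final chunk is padded. The pad symbol
    'E' is an illustrative assumption."""
    if len(smiles) % n:
        smiles += pad * (n - len(smiles) % n)
    return [smiles[i:i + n] for i in range(0, len(smiles), n)]
```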
<p><strong>Property prediction function</strong> $f(\cdot)$: A five-layer DNN with 250 hidden units per layer that predicts molecular properties from ECFP vectors:</p>
<p>$$\mathbf{t} = f(e(\mathbf{m}))$$</p>
<p>The RNN is trained by minimizing cross-entropy loss between the softmax output and the target SMILES string $\mathbf{m}_{i}$, learning the relationship $d(e(\mathbf{m}_{i})) = \mathbf{m}_{i}$. The DNN is trained by minimizing mean squared error between predicted and computed property values. Both use the Adam optimizer with mini-batch size 100, 500 training epochs, and dropout rate 0.5.</p>
<h3 id="genetic-algorithm-operations">Genetic Algorithm Operations</h3>
<p>The GA evolves ECFP vectors using the DEAP library with the following parameters:</p>
<ul>
<li><strong>Population size</strong>: 50</li>
<li><strong>Crossover rate</strong>: 0.7 (uniform crossover, mixing ratio 0.2)</li>
<li><strong>Mutation rate</strong>: 0.3 (Gaussian mutation, $N(0, 0.2^{2})$, applied to 1% of elements)</li>
<li><strong>Selection</strong>: Tournament selection with size 3, top 3 individuals as parents</li>
<li><strong>Termination</strong>: 500 generations or 30 consecutive generations without fitness improvement</li>
</ul>
<p>The evolutionary loop proceeds as follows: a seed molecule $\mathbf{m}_{0}$ is encoded to $\mathbf{x}_{0}$, mutated to generate a population $\mathbf{P}^{0} = \{\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{L}\}$, each vector is decoded via the RNN, validity is checked with RDKit, fitness is evaluated via the DNN, and the top parents produce the next generation through crossover and mutation.</p>
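<p>A dependency-free sketch of one generation of this loop with the stated operators (tournament size 3, uniform crossover with mixing ratio 0.2, Gaussian mutation $N(0, 0.2^{2})$ on 1% of elements); the paper uses DEAP for this machinery, and the decode/validity/DNN stages are abstracted into a single fitness callable here:</p>

```python
import random

def tournament(pop, fitnesses, k=3):
    """Tournament selection: best of k randomly drawn individuals."""
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fitnesses[i])]

def uniform_crossover(a, b, mix=0.2):
    """Swap each vector position between parents with probability `mix`."""
    a, b = a[:], b[:]
    for i in range(len(a)):
        if random.random() < mix:
            a[i], b[i] = b[i], a[i]
    return a, b

def gaussian_mutate(x, sigma=0.2, indpb=0.01):
    """Add N(0, sigma^2) noise to ~1% of the vector's elements."""
    return [xi + random.gauss(0.0, sigma) if random.random() < indpb else xi
            for xi in x]

def next_generation(pop, fitness_fn, cx_rate=0.7, mut_rate=0.3):
    """One generation: evaluate, select parents by tournament, then
    apply crossover and mutation at the stated rates."""
    fits = [fitness_fn(ind) for ind in pop]
    children = []
    while len(children) < len(pop):
        p1, p2 = tournament(pop, fits), tournament(pop, fits)
        if random.random() < cx_rate:
            p1, p2 = uniform_crossover(p1, p2)
        if random.random() < mut_rate:
            p1 = gaussian_mutate(p1)
        children.extend([p1, p2])
    return children[:len(pop)]
```

<p>In the full method the fitness callable would decode each vector with the RNN, reject RDKit-invalid SMILES, and score survivors with the DNN.</p>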
<h2 id="experimental-setup-light-absorbing-wavelength-optimization">Experimental Setup: Light-Absorbing Wavelength Optimization</h2>
<h3 id="training-data-and-deep-learning-performance">Training Data and Deep Learning Performance</h3>
<p>The models were trained on 10,000 to 100,000 molecules randomly sampled from PubChem (molecular weight 200-600 g/mol). Each molecule was labeled with DFT-computed excitation energy ($S_{1}$), <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO, and LUMO</a> energies using B3LYP/6-31G.</p>
<table>
  <thead>
      <tr>
          <th>Training Data</th>
          <th>Validity (%)</th>
          <th>Reconstructability (%)</th>
          <th>$S_{1}$ (R, MAE)</th>
          <th>HOMO (R, MAE)</th>
          <th>LUMO (R, MAE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100,000</td>
          <td>88.8</td>
          <td>62.4</td>
          <td>0.977, 0.185 eV</td>
          <td>0.948, 0.168 eV</td>
          <td>0.960, 0.195 eV</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>86.7</td>
          <td>60.1</td>
          <td>0.973, 0.198 eV</td>
          <td>0.945, 0.172 eV</td>
          <td>0.955, 0.209 eV</td>
      </tr>
      <tr>
          <td>30,000</td>
          <td>85.3</td>
          <td>59.8</td>
          <td>0.930, 0.228 eV</td>
          <td>0.934, 0.191 eV</td>
          <td>0.945, 0.224 eV</td>
      </tr>
      <tr>
          <td>10,000</td>
          <td>83.2</td>
          <td>55.7</td>
          <td>0.913, 0.278 eV</td>
          <td>0.885, 0.244 eV</td>
          <td>0.917, 0.287 eV</td>
      </tr>
  </tbody>
</table>
<p>Validity is the proportion of generated SMILES that pass RDKit inspection. Reconstructability measures how often the RNN reproduces the original molecule from its ECFP, judged by whether any of 10,000 generated strings matches the molecule&rsquo;s canonical SMILES (62.4% at 100k training samples).</p>
<h3 id="design-task-1-unconstrained-s1-modification">Design Task 1: Unconstrained S1 Modification</h3>
<p>Fifty seed molecules with $S_{1}$ values between 3.8 eV and 4.2 eV were evolved in both increasing and decreasing directions. With 50,000 training samples, $S_{1}$ increased by approximately 60% on average in the increasing direction and showed slightly lower rates of change in the decreasing direction. The asymmetry is attributed to the skewed $S_{1}$ distribution of training data (average $S_{1}$ of 4.3-4.4 eV, higher than the seed median of 4.0 eV). Performance saturated at approximately 50,000 training samples.</p>
<h3 id="design-task-2-s1-modification-with-homolumo-constraints">Design Task 2: S1 Modification with HOMO/LUMO Constraints</h3>
<p>The same 50 seeds were evolved with constraints: $-7.0 \text{ eV} &lt; \text{HOMO} &lt; -5.0 \text{ eV}$ and $\text{LUMO} &lt; 0.0 \text{ eV}$. In the increasing $S_{1}$ direction, constraints suppressed the rate of change because both HOMO and LUMO bounds limit the achievable HOMO-LUMO gap. In the decreasing direction, constraints had minimal effect because LUMO could freely decrease while HOMO had sufficient room to rise within the allowed range.</p>
<h3 id="design-task-3-extrapolation-beyond-training-data">Design Task 3: Extrapolation Beyond Training Data</h3>
<p>To generate molecules with $S_{1}$ values below 1.77 eV (outside the training distribution, which had mean $S_{1}$ of 4.91 eV), the authors introduced iterative &ldquo;phases&rdquo;: generate molecules, compute their properties via DFT, retrain the models, and repeat. Starting from the 30 lowest-$S_{1}$ seed molecules with 300 generation runs per phase:</p>
<ul>
<li>Phase 1: Average $S_{1}$ = 2.20 eV, 12 molecules below 1.77 eV</li>
<li>Phase 2: Average $S_{1}$ = 2.22 eV, 37 molecules below 1.77 eV</li>
<li>Phase 3: Average $S_{1}$ = 2.31 eV, 58 molecules below 1.77 eV</li>
</ul>
<p>While the average $S_{1}$ rose slightly across phases, variance decreased (from 1.40 to 1.36), indicating the model concentrated its outputs closer to the target range. This active-learning-like loop demonstrates the framework can extend beyond the training distribution.</p>
<h3 id="design-task-4-guacamol-benchmarks">Design Task 4: GuacaMol Benchmarks</h3>
<p>The method was evaluated on the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> goal-directed benchmark suite using the ChEMBL25 training dataset. The RNN model was retrained with three-character substrings.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a></th>
          <th>SMILES GA</th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a></th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph MCTS</a></th>
          <th>cRNN</th>
          <th>EDM (ours)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.607</td>
          <td>1.000</td>
          <td>0.378</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Troglitazone rediscovery</td>
          <td>0.419</td>
          <td>1.000</td>
          <td>0.558</td>
          <td>1.000</td>
          <td>0.312</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Thiothixene rediscovery</td>
          <td>0.456</td>
          <td>1.000</td>
          <td>0.495</td>
          <td>1.000</td>
          <td>0.308</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(-1.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.980</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(8.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.979</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>TPSA(150.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>CNS MPO</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.944</td>
          <td>0.948</td>
          <td>0.948</td>
      </tr>
  </tbody>
</table>
<p>The 256 highest-scoring molecules from the ChEMBL25 test set served as seeds, with 500 SMILES strings generated per seed. The EDM achieves maximum scores on all eight tasks, matching the cRNN baseline.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="results">Results</h3>
<p>The evolutionary design framework successfully evolved seed molecules toward target properties across all four design tasks. The RNN decoder maintained 88.8% chemical validity at 100k training samples, and the DNN property predictor achieved correlation coefficients above 0.94 for $S_{1}$, HOMO, and LUMO prediction. The iterative retraining procedure enabled exploration outside the training data distribution, generating 58 molecules with $S_{1}$ below 1.77 eV after three phases. On GuacaMol benchmarks, the method achieved maximum scores on all eight tasks, matching <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a>, <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, and cRNN baselines.</p>
<h3 id="limitations">Limitations</h3>
<p>Several limitations are worth noting:</p>
<ol>
<li><strong>Reconstructability ceiling</strong>: Only 62.4% of molecules could be reconstructed from their ECFP vectors, meaning the RNN decoder fails to recover the original molecule approximately 38% of the time. This information loss in the ECFP encoding is a fundamental bottleneck.</li>
<li><strong>Data dependence</strong>: Performance is sensitive to the training data distribution. The asymmetric evolution rates for increasing vs. decreasing $S_{1}$ directly reflect the skewed training data.</li>
<li><strong>Structural constraints</strong>: Three heuristic constraints (fused ring sizes, number of fused rings, alkyl chain lengths) were still needed to maintain reasonable molecular structures, partially undermining the claim of a fully data-driven approach.</li>
<li><strong>DFT reliance</strong>: The extrapolation experiment requires DFT calculations in the loop, which are computationally expensive and may limit scalability.</li>
<li><strong>Limited benchmark scope</strong>: Only eight GuacaMol tasks were tested, and all achieved perfect scores, making it difficult to differentiate from competing methods. The paper does not report on harder multi-objective benchmarks.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>PubChem random sample</td>
          <td>10,000-100,000 molecules</td>
          <td>MW 200-600 g/mol, labeled with DFT-computed $S_{1}$, HOMO, LUMO</td>
      </tr>
      <tr>
          <td>GuacaMol Benchmark</td>
          <td>ChEMBL25</td>
          <td>Standard split</td>
          <td>Used for retraining RNN; 256 top-scoring seeds</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Genetic algorithm</strong>: DEAP library; population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3</li>
<li><strong>RNN decoder</strong>: 3 hidden layers, 500 LSTM units each, three-character substring generation</li>
<li><strong>DNN predictor</strong>: 5 layers, 250 hidden units, sigmoid activations, linear output</li>
<li><strong>Training</strong>: Adam optimizer, mini-batch 100, 500 epochs, dropout 0.5</li>
</ul>
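<p>The paper implements this loop with the DEAP library; the following is a minimal pure-Python sketch using the same hyperparameters (population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3). The bitstring genome and sum-based fitness are toy stand-ins for the ECFP vector and the DNN property predictor, not the paper&rsquo;s actual setup:</p>

```python
import random

POP_SIZE, CX_RATE, MUT_RATE, TOURN_SIZE = 50, 0.7, 0.3, 3
GENOME_LEN = 32  # toy stand-in for an ECFP fingerprint vector


def fitness(ind):
    # Toy objective standing in for the DNN property predictor.
    return sum(ind)


def tournament(pop):
    # Tournament selection with size 3, as in the paper's GA settings.
    return max(random.sample(pop, TOURN_SIZE), key=fitness)


def crossover(a, b):
    # One-point crossover at a random cut.
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]


def mutate(ind):
    # Flip one random bit.
    ind = list(ind)
    i = random.randrange(GENOME_LEN)
    ind[i] ^= 1
    return ind


def evolve(generations=20, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
           for _ in range(POP_SIZE)]
    for _ in range(generations):
        nxt = []
        while len(nxt) < POP_SIZE:
            child = tournament(pop)
            if random.random() < CX_RATE:
                child = crossover(child, tournament(pop))
            if random.random() < MUT_RATE:
                child = mutate(child)
            nxt.append(list(child))
        pop = nxt
    return max(pop, key=fitness)
```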
<h3 id="models">Models</h3>
<p>All neural networks were implemented using Keras with the Theano backend (GPU-accelerated). No pre-trained model weights are publicly available.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>RNN validity</strong>: Proportion of chemically valid SMILES (RDKit check)</li>
<li><strong>Reconstructability</strong>: Fraction of seed molecules recoverable from ECFP (canonical SMILES match in 10,000 generated strings)</li>
<li><strong>DNN accuracy</strong>: Correlation coefficient (R) and MAE via 10-fold cross-validation</li>
<li><strong>Evolutionary performance</strong>: Average rate of $S_{1}$ change across 50 seeds; molecule count in target range</li>
<li><strong>GuacaMol</strong>: Standard rediscovery and property satisfaction benchmarks</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models, training times, or computational requirements for the evolutionary runs. DFT calculations used the Gaussian 09 program suite with B3LYP/6-31G.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained models are available. The paper is published under a CC-BY 4.0 license as open access in Scientific Reports.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.nature.com/articles/s41598-021-96812-8">Paper (Nature)</a></td>
          <td>Paper</td>
          <td>CC-BY 4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Partially Reproducible. The method is described in sufficient detail for reimplementation, but no code, trained models, or preprocessed datasets are released. The DFT calculations require Gaussian 09, a commercial software package.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kwon, Y., Kang, S., Choi, Y.-S., &amp; Kim, I. (2021). Evolutionary design of molecules based on deep learning and a genetic algorithm. <em>Scientific Reports</em>, 11, 17304. <a href="https://doi.org/10.1038/s41598-021-96812-8">https://doi.org/10.1038/s41598-021-96812-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kwon2021evolutionary,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evolutionary design of molecules based on deep learning and a genetic algorithm}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kwon, Youngchun and Kang, Seokho and Choi, Youn-Suk and Kim, Inkoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-021-96812-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds to generate 1000 valid molecules in the case of EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
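<p>A rough sketch of this sequence layout, assuming a two-decimal fixed-point coordinate encoding (the paper specifies the overall scheme and the 6-tokens-per-atom budget, but the exact token vocabulary below is an illustrative guess):</p>

```python
def tokenize_ligand(smiles, coords, precision=2):
    """Build a BindGPT-style token sequence: <LIGAND> + char-level SMILES
    + <XYZ> + 6 tokens per atom (integer and fractional parts of x, y, z).
    Token naming is an assumption; atom order matches the SMILES string,
    so atom symbols are not repeated in the coordinate section."""
    tokens = ["<LIGAND>"] + list(smiles) + ["<XYZ>"]
    for x, y, z in coords:
        for c in (x, y, z):
            sign = "-" if c < 0 else ""
            whole, frac = divmod(round(abs(c) * 10**precision), 10**precision)
            tokens.append(f"{sign}{whole}")          # integer-part token
            tokens.append(f".{frac:0{precision}d}")  # fractional-part token
    return tokens
```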
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
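<p>A sketch of the rotation augmentation, sampling a proper rotation via QR decomposition (one standard recipe; the paper does not say how the rotation matrix is drawn) and applying the same matrix to pocket and ligand so their relative geometry is preserved:</p>

```python
import numpy as np


def random_rotation(rng):
    """Sample a random 3D rotation: QR-factorize a Gaussian matrix,
    sign-correct so the factorization is unique, and flip a column
    if needed so det = +1 (a proper rotation, not a reflection)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q


def augment(pocket_xyz, ligand_xyz, rng):
    """Apply one shared rotation to both point clouds (N x 3 arrays)."""
    rot = random_rotation(rng)
    return pocket_xyz @ rot.T, ligand_xyz @ rot.T
```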
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
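<p>In an autodiff framework this objective is typically minimized through a surrogate loss whose gradient matches the REINFORCE estimate. A numpy sketch for one batch of sampled ligands, using a single-sample KL estimator (an assumption on our part; the paper does not specify the estimator):</p>

```python
import numpy as np


def reinforce_loss(logp, logp_sft, rewards, beta=0.1):
    """Surrogate loss for one batch of ligands x ~ pi_theta.
    `logp` and `logp_sft` are per-sequence log-probabilities under the
    current policy and the frozen SFT model; `rewards` are QVINA docking
    rewards R(x). The gradient of the first term w.r.t. the parameters
    producing `logp` is the score-function (REINFORCE) estimate; the
    second term estimates KL(pi_theta || pi_SFT) from the samples."""
    pg_term = -np.mean(rewards * logp)
    kl_term = np.mean(logp - logp_sft)
    return pg_term + beta * kl_term
```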
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters), with the authors finding this sufficient for current tasks but not exploring larger scales. The RL optimization uses only Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. BindGPT is the first model to explicitly generate hydrogens at scale, though validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
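<p>The Jensen-Shannon metrics compare histograms of geometric features (bond lengths, angles, dihedrals) between generated and reference molecules. A minimal numpy implementation of the divergence itself, in nats (the binning scheme used by the paper is not reproduced here):</p>

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    e.g. normalized histograms of bond lengths. Symmetric, bounded by
    ln 2, and zero iff the distributions match."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps guards against log(0) in empty histogram bins.
        return np.sum(a * np.log((a + eps) / (b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```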
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
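<p>The cross-attention $f_{ca}$ above is ordinary scaled dot-product attention with queries from the ligand decoder and keys/values from the encoder skip connection; a single-head numpy sketch (batch and head dimensions omitted for clarity):</p>

```python
import numpy as np


def cross_attention(Q_m, K_S, V_S):
    """f_ca(Q_m, K_S, V_S): ligand queries (L x d_k) attend over protein
    keys/values (N x d_k) passed through an encoder skip connection."""
    d_k = K_S.shape[-1]
    scores = Q_m @ K_S.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over protein positions
    return weights @ V_S
```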
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward of action $a$ (cumulative reward $W_a$ divided by visit count $N_a$) and $U(\tilde{C}, a) = c_{puct} \cdot P(a \mid \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus that weights the LT&rsquo;s predicted probability by the parent node&rsquo;s total visit count $N_t$ relative to the child&rsquo;s visit count $N_t(a)$.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
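<p>The select phase and the Q-normalization above can be sketched in plain Python (the tree statistics and priors below are made up for illustration; the real implementation tracks these per node):</p>

```python
import math

def puct_select(children, prior, c_puct=1.5):
    """Select the child maximizing Q + U (the paper's PUCT variant).

    children: action -> (W, N), cumulative reward and visit count.
    prior:    action -> P(a | context) from the Lmser Transformer.
    """
    n_parent = sum(n for _, n in children.values())
    best, best_score = None, -math.inf
    for a, (w, n) in children.items():
        q = w / n if n else 0.0                                # average reward
        u = c_puct * prior[a] * math.sqrt(n_parent) / (1 + n)  # exploration
        if q + u > best_score:
            best, best_score = a, q + u
    return best

def normalize_q(q, docking_scores):
    """Min-max normalize a Q-value against docking scores seen in the tree."""
    lo, hi = min(docking_scores), max(docking_scores)
    return (q - lo) / (hi - lo) if hi > lo else 0.0

# Toy statistics for three candidate next symbols of a SMILES string.
children = {"C": (3.0, 4), "N": (1.0, 2), "(": (0.5, 1)}
prior = {"C": 0.6, "N": 0.3, "(": 0.1}
assert puct_select(children, prior) == "C"
assert normalize_q(5.0, [0.0, 10.0]) == 0.5
```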
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
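<p>The lookup table amounts to memoizing the docking call on the (protein, molecule) pair, which pays off because MCTS rollouts frequently regenerate the same molecule. A minimal sketch, with a dummy stand-in for the actual SMINA invocation:</p>

```python
import functools

calls = 0

def run_smina(protein_id, smiles):
    """Stand-in for the expensive SMINA docking call (the real version
    would invoke the docking engine); here it just counts invocations."""
    global calls
    calls += 1
    return float(len(smiles))  # dummy score for illustration

@functools.lru_cache(maxsize=None)
def docking_score(protein_id, smiles):
    """Cached docking: repeated (protein, molecule) queries hit the table."""
    return run_smina(protein_id, smiles)

for _ in range(3):  # three rollouts that produce the same molecule
    docking_score("3gcs", "CCO")
assert calls == 1  # only the first call actually docks
```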
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
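<p>A toy sketch of the length-normalized next-token term for a single SMILES string (the per-token probabilities are made up; in practice they come from the decoder&rsquo;s softmax):</p>

```python
import math

def smiles_nll(token_log_probs):
    """Length-normalized negative log-likelihood of one SMILES string:
    -(1/M_y) * sum_i log P(y_i | y_<i), matching the pre-training term."""
    return -sum(token_log_probs) / len(token_log_probs)

# Four tokens with made-up next-token probabilities from the decoder.
probs = [0.9, 0.5, 0.8, 0.7]
loss = smiles_nll([math.log(p) for p in probs])
assert 0.0 < loss < 1.0  # a confident model drives this toward 0
```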
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
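<p>The $\rho$ augmentation can be sketched as centering followed by a random rotation and translation; because these are rigid motions, all pairwise residue distances are unchanged (toy numpy sketch, not the paper&rsquo;s code):</p>

```python
import numpy as np

def random_rototranslation(r, rng):
    """The rho augmentation: center the coordinates, then apply a random
    rotation (orthogonal Q from a QR decomposition) and translation."""
    centered = r - r.mean(axis=0, keepdims=True)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    t = rng.normal(size=3)
    return centered @ Q + t

rng = np.random.default_rng(0)
r = rng.normal(size=(5, 3))          # 5 residues, toy coordinates
r_aug = random_rototranslation(r, rng)

# Rigid motions preserve all pairwise distances.
d = np.linalg.norm(r[:, None] - r[None, :], axis=-1)
d_aug = np.linalg.norm(r_aug[:, None] - r_aug[None, :], axis=-1)
assert np.allclose(d, d_aug)
```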
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{\lVert r_i - r_j \rVert^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
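<p>A single-head numpy sketch of this mechanism (toy dimensions, random weights; the real encoder stacks multiple heads and layers):</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau=1.0):
    """One attention head where the raw score h_i^T W h_j is damped by
    exp(-||r_i - r_j||^2 / tau) before the softmax, so spatially close
    residues attend to each other more strongly."""
    d2 = ((r[:, None, :] - r[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    scores = np.exp(-d2 / tau) * (h @ W @ h.T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (h @ Wv)

rng = np.random.default_rng(0)
n, d = 6, 8  # toy pocket: 6 residues, hidden dimension 8
h = rng.normal(size=(n, d))
r = rng.normal(size=(n, 3))
W = rng.normal(size=(d, d)) * 0.1
Wv = rng.normal(size=(d, d)) * 0.1
out = distance_aware_attention(h, r, W, Wv)
assert out.shape == (n, d)
```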
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder infers the mean $\mu$ and standard deviation $\sigma$ of the latent variable $z$ for any (compound, protein) pair. During training, $z$ is sampled from this posterior and the decoder reconstructs the input compound; at application time, encoding a seed compound yields a latent from which refined variants are generated. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \mathcal{D}_{\text{KL}}(q(z \mid \mathbf{x}, \mathbf{y}) \| p(z))
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
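<p>For a diagonal Gaussian posterior, the KL term has a closed form, so the per-pair objective reduces to a reconstruction NLL plus a $\beta$-weighted penalty. A minimal sketch (short scalar lists stand in for latent vectors):</p>

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(q || N(0, I)) for a diagonal Gaussian posterior."""
    return sum(0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))

def vae_loss(recon_nll, mu, sigma, beta=0.1):
    """Reconstruction NLL plus beta-weighted KL, per (compound, protein)
    pair. beta = 0.1 matches one of the paper's reported settings."""
    return recon_nll + beta * kl_to_standard_normal(mu, sigma)

# When the posterior equals the prior (mu=0, sigma=1) the KL term vanishes.
assert abs(vae_loss(2.0, [0.0, 0.0], [1.0, 1.0]) - 2.0) < 1e-12
# A posterior far from the prior is penalized.
assert vae_loss(2.0, [3.0, 3.0], [1.0, 1.0]) > 2.0
```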
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
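<p>The diversity metric is typically computed as the average pairwise distance $1 - \text{Tanimoto}$ over generated molecules. A sketch using sets of on-bit indices in place of RDKit Morgan fingerprints (the bit sets are toy values):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit
    indices (Morgan fingerprints in the paper; toy bit sets here)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def diversity(fps):
    """Average pairwise (1 - Tanimoto) over a batch of molecules."""
    pairs = [(a, b) for i, a in enumerate(fps) for b in fps[i + 1:]]
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
assert tanimoto(fps[0], fps[1]) == 0.5      # 2 shared bits / 4 total bits
assert diversity(fps) == (0.5 + 1.0 + 1.0) / 3
```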
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when the geometric weighting factor $\exp(-\lVert r_i - r_j \rVert^2 / \tau)$ is removed from the attention scores.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>