<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Target-Aware Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/</link><description>Recent content in Target-Aware Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Thu, 26 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/index.xml" rel="self" type="application/rss+xml"/><item><title>Protein-to-Drug Molecule Translation via Transformer</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/</guid><description>A Transformer model frames protein-targeted drug generation as machine translation from amino acid sequences to SMILES molecular strings.</description><content:encoded><![CDATA[<h2 id="protein-targeted-drug-generation-as-machine-translation">Protein-Targeted Drug Generation as Machine Translation</h2>
<p>This is a <strong>Method</strong> paper that proposes using the Transformer neural network architecture for protein-specific de novo drug generation. The primary contribution is framing the problem of generating molecules that bind to a target protein as a machine translation task: translating from the &ldquo;language&rdquo; of amino acid sequences to the SMILES representation of candidate drug molecules. The model takes only a protein&rsquo;s amino acid sequence as input and generates novel molecules with predicted binding affinity, requiring no prior knowledge of active ligands, physicochemical descriptors, or the protein&rsquo;s three-dimensional structure.</p>
<h2 id="limitations-of-existing-generative-drug-design-approaches">Limitations of Existing Generative Drug Design Approaches</h2>
<p>Existing deep learning methods for de novo molecule generation suffer from several limitations. Most RNN-based approaches require a library of known active compounds against the target protein to fine-tune the generator or train a reward predictor for reinforcement learning. Structure-based drug design methods require the three-dimensional structure of the target protein, which can be costly and technically difficult to obtain through protein expression, purification, and crystallization. Autoencoder-based approaches (variational and adversarial) similarly depend on prior knowledge of protein binders or their physicochemical characteristics.</p>
<p>The estimated drug-like molecule space is on the order of $10^{60}$, while only around $10^{8}$ compounds have been synthesized. High-throughput screening is expensive and time-consuming, and virtual screening operates only on known molecules. Computational de novo design methods often generate molecules that are hard to synthesize or restrict accessible chemical space through coded rules. A method that requires only a protein&rsquo;s amino acid sequence would substantially simplify the initial stages of drug discovery, particularly for targets with limited or no information about inhibitors and 3D structure.</p>
<h2 id="sequence-to-sequence-translation-with-self-attention">Sequence-to-Sequence Translation with Self-Attention</h2>
<p>The core insight is to treat protein-targeted drug generation as a translation problem between two &ldquo;languages,&rdquo; applying the Transformer architecture that had demonstrated strong results in neural machine translation. The encoder maps a protein amino acid sequence $(a_1, \ldots, a_n)$ to continuous representations $\mathbf{z} = (z_1, \ldots, z_n)$, and the decoder autoregressively generates a SMILES string conditioned on $\mathbf{z}$.</p>
<p>The self-attention mechanism computes:</p>
<p>$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$</p>
<p>where $d_k$ is the dimensionality of the keys, used to scale the dot products. Multihead attention runs $h$ parallel attention heads:</p>
<p>$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$</p>
<p>$$
\text{Multihead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$</p>
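<p>The attention and multihead equations above can be sketched directly in NumPy. This is an illustrative stand-in for the tensor2tensor implementation; all shapes and weights below are arbitrary, not the paper's:</p>

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multihead(Q, K, V, W_q, W_k, W_v, W_o):
    """Run h heads in parallel, concatenate, and project with W_o."""
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 4          # toy sizes, not the paper's
d_head = d_model // h
x = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
out = multihead(x, x, x, W_q, W_k, W_v, W_o)
print(out.shape)  # (5, 8)
```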
<p>Positional encoding uses sinusoidal functions:</p>
<p>$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
<p>$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i / d_{model}}}\right)
$$</p>
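<p>A minimal sketch of the sinusoidal encoding above; the 2050 x 128 shape mirrors the paper's maximum protein length and model width, but the function itself is the standard Transformer formulation:</p>

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal positional encodings from the original Transformer."""
    pos = np.arange(length)[:, None]            # position index
    i = np.arange(d_model // 2)[None, :]        # dimension pair index
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)                 # even dimensions
    pe[:, 1::2] = np.cos(angle)                 # odd dimensions
    return pe

pe = positional_encoding(2050, 128)  # longest allowed protein, d_model = 128
print(pe.shape)  # (2050, 128)
```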
<p>The self-attention mechanism is particularly well-suited for this task for two reasons. First, protein sequences can be much longer than SMILES strings (dozens of times longer), making the ability to capture long-range dependencies essential. Second, three-dimensional structural features of the binding pocket may be formed by amino acid residues far apart in the linear sequence, and multihead attention can jointly attend to different positional aspects simultaneously.</p>
<h2 id="data-model-architecture-and-docking-evaluation">Data, Model Architecture, and Docking Evaluation</h2>
<h3 id="data">Data</h3>
<p>The training data was retrieved from BindingDB, filtering for interactions between proteins from Homo sapiens, Rattus norvegicus, Mus musculus, and Bos taurus with binding affinity below 100 nM (IC50, Kd, or EC50). After filtering for valid PubChem CIDs, SMILES representations, UniProt IDs, molecular weight under 1000 Da, and amino acid sequence lengths between 80 and 2050, the final dataset contained 238,147 records with 1,613 unique proteins and 154,924 unique ligand SMILES strings.</p>
<p>Five Monte Carlo cross-validation splits were created, with the constraint that test set proteins share less than 20% sequence similarity with training set proteins (measured via <a href="https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm">Needleman-Wunsch</a> global alignment).</p>
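<p>One way to realize such a similarity-constrained split is sketched below. The paper's exact Monte Carlo procedure may differ; this version simply samples a test set and discards any training protein within the identity threshold, with <code>seq_identity</code> standing in for Needleman-Wunsch global alignment and <code>toy_identity</code> a hypothetical stand-in for testing:</p>

```python
import random

def split_by_similarity(proteins, seq_identity, n_test, threshold=0.2, seed=0):
    """Sample a test set, then keep only training proteins whose
    pairwise identity to every test protein is below `threshold`."""
    rng = random.Random(seed)
    pool = proteins[:]
    rng.shuffle(pool)
    test = pool[:n_test]
    train = [p for p in pool[n_test:]
             if all(seq_identity(p, t) < threshold for t in test)]
    return train, test

# Toy identity for illustration: same "family" = same first letter.
def toy_identity(a, b):
    return 1.0 if a[0] == b[0] else 0.0

prots = ["AKT1", "AKT2", "EGFR", "VEGFR", "BRAF", "ESR1"]
train, test = split_by_similarity(prots, toy_identity, n_test=2)
print(len(test))  # 2
```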
<h3 id="model-configuration">Model Configuration</h3>
<p>The model uses the original Transformer implementation via the tensor2tensor library with:</p>
<ul>
<li>4 encoder/decoder layers of size 128</li>
<li>4 attention heads</li>
<li>Adam optimizer with learning rate decay from the original Transformer paper</li>
<li>Batch size of 4,096 tokens</li>
<li>Training for 600K steps on a single GPU in Google Colaboratory</li>

<li>Vocabulary of 71 symbols (character-level tokenization)</li>
</ul>
<p>Beam search decoding was used with two modes: beam size 4 keeping only the top-1 result (&ldquo;one per one&rdquo; mode) and beam size 10 keeping all 10 results (&ldquo;ten per one&rdquo; mode).</p>
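<p>The decoding procedure can be sketched with a generic beam search; in "one per one" mode only <code>beams[0]</code> is kept, while "ten per one" keeps all returned hypotheses. The <code>toy_lm</code> model and token names are illustrative stand-ins for the trained decoder's softmax over the 71-symbol vocabulary:</p>

```python
import math

def beam_search(step_logprobs, beam_size, max_len, bos="<s>", eos="</s>"):
    """Generic beam search: `step_logprobs` maps a prefix (tuple of
    tokens) to {next_token: log-probability}."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos:           # finished hypotheses carry over
                candidates.append((tokens, score))
                continue
            for tok, lp in step_logprobs(tuple(tokens)).items():
                candidates.append((tokens + [tok], score + lp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(t[-1] == eos for t, _ in beams):
            break
    return beams

# Toy model: always prefers emitting "C" over stopping.
def toy_lm(prefix):
    return {"C": math.log(0.7), "</s>": math.log(0.3)}

beams = beam_search(toy_lm, beam_size=2, max_len=2)
print(beams[0][0])  # ['<s>', 'C', 'C'] -- the "one per one" result
```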
<h3 id="chemical-validity-and-uniqueness">Chemical Validity and Uniqueness</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>One per One (avg)</th>
          <th>Ten per One (avg)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES (%)</td>
          <td>90.2</td>
          <td>82.6</td>
      </tr>
      <tr>
          <td>Unique SMILES (%)</td>
          <td>92.3</td>
          <td>81.7</td>
      </tr>
      <tr>
          <td>ZINC15 match (%)</td>
          <td>30.6</td>
          <td>17.1</td>
      </tr>
  </tbody>
</table>
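<p>The validity and uniqueness metrics above can be computed as sketched below. The paper checks validity with RDKit (<code>Chem.MolFromSmiles</code>); the balanced-parentheses rule here is a toy stand-in so the example is self-contained, and computing uniqueness only among valid molecules is an assumption:</p>

```python
def generation_metrics(smiles_list, is_valid):
    """Percent valid SMILES and percent unique among the valid ones."""
    valid = [s for s in smiles_list if is_valid(s)]
    pct_valid = 100.0 * len(valid) / len(smiles_list)
    pct_unique = 100.0 * len(set(valid)) / len(valid) if valid else 0.0
    return pct_valid, pct_unique

# Toy validity rule for illustration only: balanced parentheses.
def toy_valid(s):
    depth = 0
    for ch in s:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            return False
    return depth == 0

samples = ["CC(=O)O", "c1ccccc1", "CC(=O)O", "CC(C"]
pct_valid, pct_unique = generation_metrics(samples, toy_valid)
print(pct_valid, round(pct_unique, 1))  # 75.0 66.7
```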
<h3 id="docking-evaluation">Docking Evaluation</h3>
<p>To assess binding affinity, the authors selected two receptor tyrosine kinases from the test set (IGF-1R and VEGFR2) and performed molecular docking with <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a>. Four sets of ligands were compared: known binders, randomly selected compounds, molecules generated for the target protein, and molecules generated for other targets (cross-docking control).</p>
<p>ROC-AUC analysis showed that the docking tool classified generated molecules for the correct target as binders at rates comparable to known binders. For the best-discriminating structures (PDB 3O23 for IGF-1R, PDB 3BE2 for VEGFR2), Mann-Whitney U tests confirmed statistically significant differences between generated-for-target molecules and random compounds, while the difference between generated-for-target and known binders was not significant (p = 0.40 and 0.26 respectively), suggesting the model generates plausible binders.</p>
<h3 id="drug-likeness-properties">Drug-Likeness Properties</h3>
<p>Generated molecules were evaluated against <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a> and other drug-likeness criteria:</p>
<table>
  <thead>
      <tr>
          <th>Property</th>
          <th>Constraint</th>
          <th>One per One (%)</th>
          <th>Ten per One (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>&lt; 5</td>
          <td>84.4</td>
          <td>85.6</td>
      </tr>
      <tr>
          <td>Molecular weight</td>
          <td>&lt; 500 Da</td>
          <td>95.8</td>
          <td>88.9</td>
      </tr>
      <tr>
          <td>H-bond donors</td>
          <td>&lt; 5</td>
          <td>95.8</td>
          <td>91.9</td>
      </tr>
      <tr>
          <td>H-bond acceptors</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>Rotatable bonds</td>
          <td>&lt; 10</td>
          <td>97.9</td>
          <td>91.2</td>
      </tr>
      <tr>
          <td>TPSA</td>
          <td>&lt; 140</td>
          <td>98.0</td>
          <td>92.7</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>&lt; 6</td>
          <td>99.9</td>
          <td>100.0</td>
      </tr>
  </tbody>
</table>
<p>Mean QED values were 0.66 +/- 0.19 (one per one) and 0.58 +/- 0.21 (ten per one).</p>
<h3 id="structural-novelty">Structural Novelty</h3>
<p>Tanimoto similarity analysis showed that only 8% of generated structures had similarity above the threshold (&gt; 0.85) to training compounds. The majority (51%) had Tanimoto scores below 0.5. The mean nearest-neighbor Tanimoto similarity of generated molecules to the training set (0.54 +/- 0.17 in one-per-one mode) was substantially lower than the mean within-training-set similarity (0.74 +/- 0.14), indicating the model generates structurally diverse molecules outside the training distribution.</p>
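<p>The nearest-neighbor novelty measure can be sketched with set-based fingerprints; the sets of "on" bits below are toy stand-ins for the RDKit fingerprints used in the paper:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between fingerprint bit sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b) if fp_a | fp_b else 0.0

def mean_nn_similarity(generated, training):
    """Mean nearest-neighbor Tanimoto similarity of generated molecules
    to the training set (lower means more novel structures)."""
    nn = [max(tanimoto(g, t) for t in training) for g in generated]
    return sum(nn) / len(nn)

train_fps = [{1, 2, 3, 4}, {2, 3, 5}]
gen_fps = [{1, 2, 3, 4}, {7, 8}]  # one memorized, one novel
print(mean_nn_similarity(gen_fps, train_fps))  # 0.5
```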
<h2 id="generated-molecules-show-drug-like-properties-and-predicted-binding">Generated Molecules Show Drug-Like Properties and Predicted Binding</h2>
<p>The model generates roughly 90% chemically valid SMILES in one-per-one mode, with 92% uniqueness. Docking simulations on IGF-1R and VEGFR2 suggest that generated molecules for the correct target are statistically indistinguishable from known binders, while molecules generated for other targets behave more like random compounds. Drug-likeness properties fall within acceptable ranges for the vast majority of generated compounds.</p>
<p>The authors acknowledge several limitations:</p>
<ul>
<li>Only two protein targets were analyzed via docking due to computational constraints, and the analysis was limited to proteins with a single well-known druggable binding pocket.</li>
<li>Beam search produces molecules that differ only slightly; diverse beam search or coupling with variational/adversarial autoencoders could improve diversity.</li>
<li>The fraction of molecules matching the ZINC15 database (30.6% in one-per-one mode) could potentially be reduced by pretraining on a larger compound set (e.g., ChEMBL&rsquo;s 1.5 million molecules).</li>
<li>Model interpretability remains limited and is identified as important future work.</li>
<li>The approach is a proof of concept and requires further validation via in vitro assays across diverse protein targets.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data-1">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Test</td>
          <td>BindingDB (filtered)</td>
          <td>238,147 records</td>
          <td>1,613 unique proteins, 154,924 unique SMILES; IC50/Kd/EC50 &lt; 100 nM</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>11 (IGF-1R), 20 (VEGFR2)</td>
          <td>SMINA docking with default settings</td>
      </tr>
      <tr>
          <td>Database matching</td>
          <td>ZINC15</td>
          <td>N/A</td>
          <td>Used for novelty assessment</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer (encoder-decoder) via tensor2tensor library</li>
<li>Beam search decoding (beam sizes 4 and 10)</li>
<li>Needleman-Wunsch global alignment for protein sequence similarity (EMBOSS)</li>
<li>SMINA for molecular docking</li>
<li>RDKit for validity checking, property calculation, and canonicalization</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>4 layers, 128 hidden size, 4 attention heads</li>
<li>Character-level tokenization with 71-symbol vocabulary</li>
<li>5-fold Monte Carlo cross-validation with &lt; 20% sequence similarity between train/test proteins</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Valid SMILES</td>
          <td>90.2% (1-per-1), 82.6% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>Unique SMILES</td>
          <td>92.3% (1-per-1), 81.7% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>ZINC15 match</td>
          <td>30.6% (1-per-1), 17.1% (10-per-1)</td>
          <td>Averaged across 5 splits</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.66 +/- 0.19 (1-per-1), 0.58 +/- 0.21 (10-per-1)</td>
          <td>Drug-likeness score</td>
      </tr>
      <tr>
          <td>SAS compliance</td>
          <td>99.9% (1-per-1), 100% (10-per-1)</td>
          <td>SAS &lt; 6</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Google Colaboratory with one GPU</li>
<li>Training for 600K steps</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/dariagrechishnikova/molecule_structure_generation">molecule_structure_generation</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation using tensor2tensor</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Grechishnikova, D. (2021). Transformer neural network for protein-specific de novo drug generation as a machine translation problem. <em>Scientific Reports</em>, 11, 321. <a href="https://doi.org/10.1038/s41598-020-79682-4">https://doi.org/10.1038/s41598-020-79682-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{grechishnikova2021transformer,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Transformer neural network for protein-specific de novo drug generation as a machine translation problem}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Grechishnikova, Daria}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{321}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-020-79682-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PrefixMol: Prefix Embeddings for Drug Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/</guid><description>PrefixMol uses prefix embeddings in a GPT SMILES generator to jointly condition on protein pockets and chemical properties for drug design.</description><content:encoded><![CDATA[<h2 id="unified-multi-conditional-molecular-generation">Unified Multi-Conditional Molecular Generation</h2>
<p>PrefixMol is a <strong>Method</strong> paper that introduces a unified generative model for structure-based drug design that simultaneously conditions on protein binding pockets and multiple chemical properties. The primary contribution is a prefix-embedding mechanism, borrowed from NLP multi-task learning, that represents each condition (pocket geometry, Vina score, QED, SA, LogP, <a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski</a>) as a learnable feature vector prepended to the input sequence of a GPT-based <a href="/notes/computational-chemistry/molecular-representations/smiles-original-paper/">SMILES</a> generator. This allows a single model to handle customized multi-conditional generation without the negative transfer that typically arises from merging separate task-specific models.</p>
<h2 id="bridging-target-aware-and-chemistry-aware-molecular-design">Bridging Target-Aware and Chemistry-Aware Molecular Design</h2>
<p>Prior structure-based drug design methods (e.g., Pocket2Mol, GraphBP) generate molecules conditioned on protein binding pockets but impose no constraints on the chemical properties of the output. Conversely, controllable molecule generation methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/retmol-retrieval-molecule-generation/">RetMol</a>, CMG) can steer chemical properties but ignore protein-ligand interactions. Merging these two objectives into a single model is difficult for two reasons:</p>
<ol>
<li><strong>Data scarcity</strong>: Few datasets contain both protein-ligand binding affinity data and comprehensive molecular property annotations.</li>
<li><strong>Negative transfer</strong>: Treating each condition as a separate task in a multi-task framework can hurt overall performance when tasks conflict.</li>
</ol>
<p>PrefixMol addresses both problems by extending the CrossDocked dataset with molecular property labels and using a parameter-efficient prefix conditioning strategy that decouples task-specific knowledge from the shared generative backbone.</p>
<h2 id="prefix-conditioning-in-attention-layers">Prefix Conditioning in Attention Layers</h2>
<p>The core innovation adapts prefix-tuning from NLP to molecular generation. Given a GPT transformer that generates SMILES token-by-token, PrefixMol prepends $n_c$ learnable condition vectors $\mathbf{p}_{\phi} \in \mathbb{R}^{n_c \times d}$ to the left of the sequence embedding $\mathbf{x} \in \mathbb{R}^{l \times d}$, forming an extended input $\mathbf{x}' = [\text{PREFIX}; \mathbf{x}]$.</p>
<p>The output of each position is:</p>
<p>$$
h_i = \begin{cases} p_{\phi,i}, &amp; \text{if } i &lt; n_c \\ \text{LM}_\theta(x_i', h_{&lt;i}), &amp; \text{otherwise} \end{cases}
$$</p>
<p>Because the prefix features always sit to the left, the causal attention mask ensures they influence all subsequent token predictions. The key insight is that the attention mechanism decomposes into a weighted sum of self-attention and prefix attention:</p>
<p>$$
\begin{aligned}
\text{head} &amp;= (1 - \lambda(\mathbf{x})) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{c}\mathbf{W}_k, \mathbf{c}\mathbf{W}_v)}_{\text{self-attention}} \\
&amp;\quad + \lambda(\mathbf{x}) \underbrace{\text{Attn}(\mathbf{x}\mathbf{W}_q, \mathbf{p}_\phi\mathbf{W}_k, \mathbf{p}_\phi\mathbf{W}_v)}_{\text{prefix attention}}
\end{aligned}
$$</p>
<p>where $\lambda(\mathbf{x})$ is a scalar representing the normalized attention weight on the prefix positions. This decomposition shows that conditions modulate generation through an additive attention pathway, and the activation map $\text{softmax}(\mathbf{x}\mathbf{W}_q \mathbf{W}_k^\top \mathbf{p}_\phi^\top)$ directly reveals how each condition steers model behavior.</p>
<p><strong>Condition correlation</strong> is similarly revealed. For the prefix features themselves, the causal mask zeros out the cross-attention to the sequence, leaving only the prefix self-correlation term:</p>
<p>$$
\text{head} = \text{Attn}(\mathbf{p}_\phi \mathbf{W}_q, \mathbf{p}_\phi \mathbf{W}_k, \mathbf{p}_\phi \mathbf{W}_v)
$$</p>
<p>The attention map $\mathbf{A}(\mathbf{p}_\phi)$ from this term encodes how conditions relate to one another.</p>
<h3 id="condition-encoders">Condition Encoders</h3>
<p>Each condition has a dedicated encoder:</p>
<ul>
<li><strong>3D Pocket</strong>: A Geometric Vector Transformer (GVF) processes the binding pocket as a 3D graph with SE(3)-equivariant node and edge features. GVF extends GVP-GNN with a global attention module over geometric features. A position-aware attention mechanism with radial basis functions produces the pocket embedding.</li>
<li><strong>Chemical properties</strong>: Separate MLPs embed each scalar property (Vina, QED, SA, LogP, Lipinski) into the shared $d$-dimensional space.</li>
</ul>
<h3 id="training-objective">Training Objective</h3>
<p>PrefixMol is trained with two losses. The auto-regressive loss is:</p>
<p>$$
\mathcal{L}_{AT} = -\sum_{1 &lt; i \leq t} \log p_{\phi, \theta}(x_i \mid \mathbf{x}_{&lt;i}, \mathbf{p}_\phi)
$$</p>
<p>A triplet property prediction loss encourages generated molecules to match desired properties:</p>
<p>$$
\mathcal{L}_{Pred} = \max\left((\hat{\mathbf{c}} - \mathbf{c})^2 - (\hat{\mathbf{c}} - \dot{\mathbf{c}})^2, 0\right)
$$</p>
<p>where $\mathbf{c}$ is the input condition, $\hat{\mathbf{c}}$ is predicted by an MLP head, and $\dot{\mathbf{c}}$ is computed by RDKit from the generated SMILES (gradient is propagated through $\hat{\mathbf{c}}$ since RDKit is non-differentiable).</p>
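<p>The triplet loss above can be sketched directly; the property values below are illustrative, and summing (rather than averaging) over conditions is an assumption:</p>

```python
import numpy as np

def triplet_property_loss(c, c_hat, c_dot):
    """L_Pred = max((c_hat - c)^2 - (c_hat - c_dot)^2, 0), per condition.
    c: desired property values; c_hat: differentiable MLP prediction;
    c_dot: value computed (e.g. by RDKit) from the generated SMILES."""
    return np.maximum((c_hat - c) ** 2 - (c_hat - c_dot) ** 2, 0.0).sum()

c = np.array([0.8, 2.0])       # desired QED, LogP (illustrative)
c_hat = np.array([0.6, 2.5])   # MLP prediction
c_dot = np.array([0.65, 2.4])  # measured on the generated molecule
print(triplet_property_loss(c, c_hat, c_dot))  # 0.2775
```

<p>The loss is zero whenever the prediction $\hat{\mathbf{c}}$ is already closer to the measured value $\dot{\mathbf{c}}$ than to the target $\mathbf{c}$, so gradients only flow when the generated molecule misses the requested property.</p>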
<h2 id="experimental-setup-and-controllability-evaluation">Experimental Setup and Controllability Evaluation</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use the CrossDocked dataset (22.5 million protein-ligand structures) with chemical properties appended for each ligand. Data splitting and evaluation follow Pocket2Mol and Masuda et al.</p>
<h3 id="metrics">Metrics</h3>
<ul>
<li><strong>Vina score</strong> (binding affinity, computed by QVina after UFF refinement)</li>
<li><strong>QED</strong> (quantitative estimate of drug-likeness, 0-1)</li>
<li><strong>SA</strong> (synthetic accessibility, 0-1)</li>
<li><strong>LogP</strong> (octanol-water partition coefficient)</li>
<li><strong>Lipinski</strong> (rule-of-five compliance count)</li>
<li><strong>High Affinity</strong> (fraction of pockets where generated molecules match or exceed test set affinities)</li>
<li><strong>Diversity</strong> (average pairwise Tanimoto distance over Morgan fingerprints)</li>
<li><strong>Sim.Train</strong> (maximum Tanimoto similarity to training set)</li>
</ul>
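<p>The Diversity metric (average pairwise Tanimoto distance) can be sketched as follows; the bit-set fingerprints are toy stand-ins for the Morgan fingerprints computed with RDKit:</p>

```python
from itertools import combinations

def diversity(fps):
    """Average pairwise Tanimoto distance (1 - similarity) over a set
    of molecule fingerprints, represented as sets of on-bits."""
    def tanimoto(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    pairs = list(combinations(fps, 2))
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

fps = [{1, 2, 3}, {3, 4, 5}, {6, 7}]
div = diversity(fps)
print(round(div, 3))  # 0.933
```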
<h3 id="baselines">Baselines</h3>
<p>Unconditional comparison against CVAE, AR (Luo et al. 2021a), and Pocket2Mol.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>Unconditional generation</strong> (Table 1): PrefixMol without conditions achieves sub-optimal results on Vina (-6.532), QED (0.551), SA (0.750), and LogP (1.415) compared to Pocket2Mol. However, it substantially outperforms all baselines on diversity (0.856 vs. 0.688 for Pocket2Mol) and novelty (Sim.Train of 0.239 vs. 0.376), indicating it generates genuinely novel molecules rather than memorizing training data.</p>
<p><strong>Single-property control</strong> (Table 2): Molecular properties are positively correlated with conditional inputs across VINA, QED, SA, LogP, and Lipinski. With favorable control scales, PrefixMol surpasses Pocket2Mol on QED (0.767 vs. 0.563), SA (0.924 vs. 0.765), and LogP. The Vina score also improves when QED or LogP conditions are increased (e.g., -7.733 at QED control scale +2), revealing coupling between conditions.</p>
<p><strong>Multi-property control</strong> (Table 3): Jointly adjusting all five conditions shows consistent positive relationships. For example, at control scale +4, QED reaches 0.722, SA reaches 0.913, and Lipinski saturates at 5.0. Joint QED+SA control at +2.0 achieves Lipinski = 5.0, confirming that certain properties are coupled.</p>
<h3 id="condition-relation-analysis">Condition Relation Analysis</h3>
<p>By computing partial derivatives of the prefix attention map with respect to each condition, the authors construct a relation matrix $\mathbf{R} = \sum_{i=2}^{6} |\partial \mathbf{A} / \partial c_i|$. Key findings:</p>
<ul>
<li><strong>Vina is weakly self-controllable</strong> but strongly influenced by QED, LogP, and SA, explaining why multi-condition control improves binding affinity even when Vina alone responds poorly.</li>
<li><strong>LogP and QED</strong> are the most correlated property pair.</li>
<li><strong>Lipinski is coupled to QED and SA</strong>, saturating at 5.0 when both QED and SA control scales reach +2.</li>
</ul>
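<p>The relation-matrix construction can be sketched with a first-order finite difference. The paper sums over the five property conditions ($i = 2, \ldots, 6$, excluding the pocket) with $\Delta = 1$; this sketch sums over all entries of a toy condition vector, and <code>toy_map</code> is a hypothetical stand-in for the model's prefix attention map:</p>

```python
import numpy as np

def relation_matrix(attention_map, c, delta=1.0):
    """Finite-difference estimate of R = sum_i |dA/dc_i|, where
    `attention_map(c)` returns the prefix attention map for condition
    values c."""
    A0 = attention_map(c)
    R = np.zeros_like(A0)
    for i in range(len(c)):
        c_pert = c.copy()
        c_pert[i] += delta
        R += np.abs((attention_map(c_pert) - A0) / delta)
    return R

# Toy attention map: outer product of a condition-dependent vector.
def toy_map(c):
    v = np.tanh(c)
    return np.outer(v, v)

c = np.array([0.5, -0.2, 1.0])
R = relation_matrix(toy_map, c, delta=1e-3)
print(R.shape)  # (3, 3)
```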
<h2 id="key-findings-limitations-and-interpretability-insights">Key Findings, Limitations, and Interpretability Insights</h2>
<p>PrefixMol demonstrates that prefix embedding is an effective strategy for unifying target-aware and chemistry-aware molecular generation. The main findings are:</p>
<ol>
<li>A single prefix-conditioned GPT model can control multiple chemical properties simultaneously while targeting specific protein pockets.</li>
<li>Multi-conditional generation outperforms unconditional baselines in drug-likeness metrics, and the controllability enables PrefixMol to surpass Pocket2Mol on QED, SA, and LogP.</li>
<li>The attention mechanism provides interpretable coupling relationships between conditions, offering practical guidance (e.g., improving QED indirectly improves Vina).</li>
</ol>
<p><strong>Limitations</strong>: The paper does not report validity rates for generated SMILES. The unconditional model underperforms Pocket2Mol on binding affinity (Vina), suggesting that generating 2D SMILES strings and relying on post hoc 3D conformer generation may be less effective than direct atom-by-atom 3D generation for binding affinity optimization. The condition relation analysis uses a first-order finite difference approximation ($\Delta = 1$), which may not capture nonlinear interactions. No external validation on prospective drug discovery tasks is provided. Hardware and training time details are not reported.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training / Evaluation</td>
          <td>CrossDocked (extended)</td>
          <td>22.5M protein-ligand structures</td>
          <td>Extended with molecular properties (QED, SA, LogP, Lipinski, Vina)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>GPT-based auto-regressive SMILES generation with prefix conditioning</li>
<li>GVF (Geometric Vector Transformer) for 3D pocket encoding, extending GVP-GNN with global attention</li>
<li>Separate MLP encoders for each chemical property</li>
<li>Triplet property prediction loss with non-differentiable RDKit-computed properties</li>
<li>QVina for Vina score computation with UFF refinement</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>GPT transformer backbone for SMILES generation</li>
<li>6 prefix condition vectors ($n_c = 6$): Pocket, Vina, QED, SA, LogP, Lipinski</li>
<li>Specific architectural hyperparameters (hidden dimension, number of layers, heads) not reported in the paper</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PrefixMol (unconditional)</th>
          <th>Pocket2Mol</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Vina (kcal/mol)</td>
          <td>-6.532</td>
          <td>-7.288</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.551</td>
          <td>0.563</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>SA</td>
          <td>0.750</td>
          <td>0.765</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.856</td>
          <td>0.688</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Sim.Train</td>
          <td>0.239</td>
          <td>0.376</td>
          <td>Lower is better</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not reported in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/A4Bio/PrefixMol">PrefixMol</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official PyTorch implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gao, Z., Hu, Y., Tan, C., &amp; Li, S. Z. (2023). PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding. <em>arXiv preprint arXiv:2302.07120</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gao2023prefixmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PrefixMol: Target- and Chemistry-aware Molecule Design via Prefix Embedding}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Gao, Zhangyang and Hu, Yuqi and Tan, Cheng and Li, Stan Z.}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2302.07120}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Lingo3DMol: Language Model for 3D Molecule Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/lingo3dmol-3d-molecule-generation/</guid><description>Lingo3DMol combines language models with geometric deep learning for structure-based 3D molecule generation using a fragment-based SMILES representation.</description><content:encoded><![CDATA[<h2 id="a-language-model-approach-to-structure-based-drug-design">A Language Model Approach to Structure-Based Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces Lingo3DMol, a pocket-based 3D molecule generation model combining transformer language models with geometric deep learning. The primary contribution is threefold: (1) a new molecular representation called FSMILES (fragment-based SMILES) that encodes both 2D topology and 3D spatial coordinates, (2) a dual-decoder architecture that jointly predicts molecular topology and atomic positions, and (3) an auxiliary non-covalent interaction (NCI) predictor that guides molecule generation toward favorable binding modes.</p>
<h2 id="limitations-of-existing-3d-molecular-generative-models">Limitations of Existing 3D Molecular Generative Models</h2>
<p>Existing approaches to structure-based drug design fall into two categories, each with notable limitations. Graph-based autoregressive methods (e.g., Pocket2Mol) represent molecules as 3D graphs and use GNNs for generation, but frequently produce non-drug-like structures: large rings (seven or more atoms), honeycomb-like ring arrays, and molecules with either too many or too few rings. The autoregressive sampling process tends to get stuck in local optima early in generation and accumulates errors at each step. Diffusion-based methods (e.g., TargetDiff) avoid autoregressive generation but still produce a notable proportion of undesirable structures due to weak perception of molecular topology, since they do not directly encode or predict bonds. Both approaches struggle with metrics like QED (quantitative estimate of drug-likeness) and SAS (synthetic accessibility score), and neither reliably reproduces known active compounds when evaluated on protein pockets.</p>
<h2 id="fsmiles-fragment-based-smiles-with-dual-coordinate-systems">FSMILES: Fragment-Based SMILES with Dual Coordinate Systems</h2>
<p>The core innovation of Lingo3DMol is a new molecular sequence representation called FSMILES that addresses the topology problem inherent in atom-by-atom generation. FSMILES reorganizes a molecule into fragments using a ring-first, depth-first traversal. Each fragment is represented using standard SMILES syntax, and the full molecule is assembled by combining fragments with a specific connection syntax. Ring size information is encoded directly in atom tokens (e.g., <code>C_6</code> for a carbon in a six-membered ring), providing the autoregressive decoder with critical context about local topology before it needs to close the ring.</p>
<p>The model integrates two coordinate systems. Local spherical coordinates encode bond length ($r$), bond angle ($\theta$), and dihedral angle ($\phi$) relative to three reference atoms (root1, root2, root3). These are predicted using separate MLP heads:</p>
<p>$$r = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_1\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}\right]\right)\right)\right)$$</p>
<p>$$\theta = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_2\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}\right]\right)\right)\right)$$</p>
<p>$$\phi = \operatorname{argmax}\left(\operatorname{softmax}\left(\operatorname{MLP}_3\left(\left[E_{\text{type}}(\text{cur}), H_{\text{topo}}, h_{\text{root1}}, h_{\text{root2}}, h_{\text{root3}}\right]\right)\right)\right)$$</p>
<p>Global Euclidean coordinates ($x, y, z$) are predicted by a separate 3D decoder ($D_{\text{3D}}$). During inference, the model defines a search space around the predicted local coordinates ($r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$) and selects the global position with the highest joint probability within that space. This fusion strategy exploits the rigidity of bond lengths and angles (which makes local prediction easier) while maintaining global spatial awareness.</p>
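<p>As a sketch of the local-coordinate half of this fusion, the following assumes the standard internal-coordinate (NeRF-style) construction for placing an atom from $(r, \theta, \phi)$ and three reference positions; the paper&rsquo;s exact frame conventions are not specified, so the frame below is an illustrative choice:</p>

```python
import numpy as np

def place_atom(root1, root2, root3, r, theta, phi):
    """Place a new atom from local spherical coordinates: bond length r
    to root1, bond angle theta at root1, and dihedral phi about the
    root1-root2 axis. Angles in radians. The orthonormal frame built
    here is an illustrative convention, not the paper's implementation."""
    b2 = root1 - root2
    b2u = b2 / np.linalg.norm(b2)
    n = np.cross(root2 - root3, b2)        # normal of the root plane
    n /= np.linalg.norm(n)
    m = np.cross(n, b2u)                   # completes the frame
    # displacement expressed in the (b2u, m, n) frame
    d = r * np.array([-np.cos(theta),
                      np.sin(theta) * np.cos(phi),
                      np.sin(theta) * np.sin(phi)])
    return root1 + d[0] * b2u + d[1] * m + d[2] * n
```

<p>Inference would then score candidate global positions within $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$ of this local prediction against the 3D decoder&rsquo;s output.</p>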
<h3 id="ncianchor-prediction-model">NCI/Anchor Prediction Model</h3>
<p>A separately trained NCI/anchor prediction model identifies potential non-covalent interaction sites and anchor points in the protein pocket. This model shares the transformer architecture of the generation model and is initialized from pretrained parameters. It predicts whether each pocket atom will form hydrogen bonds, <a href="https://en.wikipedia.org/wiki/Halogen_bond">halogen bonds</a>, salt bridges, or <a href="https://en.wikipedia.org/wiki/Pi_stacking">pi-pi stacking</a> interactions with the ligand, and whether it lies within 4 A of any ligand atom (anchor points). The predicted NCI sites serve two purposes: they are incorporated as input features to the encoder, and they provide starting positions for molecule generation (the first atom is placed within 4.5 A of a sampled NCI site).</p>
<h3 id="pretraining-and-architecture">Pretraining and Architecture</h3>
<p>The model uses a denoising pretraining strategy inspired by BART. During pretraining on 12 million drug-like molecules, the model receives perturbed molecules (with 25% of atoms deleted, coordinates perturbed by $\pm 0.5$ A, and 25% of carbon element types corrupted) and learns to reconstruct the original structure. The architecture is transformer-based with graph structural information encoded through distance and edge vector bias terms in the attention mechanism:</p>
<p>$$A_{\text{biased}} = \operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}} + B_D + B_J\right)V$$</p>
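<p>A minimal NumPy sketch of this biased attention, assuming a single head and precomputed bias matrices $B_D$ and $B_J$:</p>

```python
import numpy as np

def biased_attention(Q, K, V, B_D, B_J):
    """Scaled dot-product attention with additive structural biases:
    the distance bias B_D and edge-vector bias B_J are added to the
    attention logits before the softmax, as in the equation above.
    Shapes: Q, K, V are (n, d_k); B_D, B_J are (n, n)."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k) + B_D + B_J
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V
```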
<p>The overall loss combines FSMILES token prediction, absolute coordinate prediction, and local coordinate predictions ($r$, $\theta$, $\phi$) with their auxiliary counterparts:</p>
<p>$$L = L_{\text{FSMILES}} + L_{\text{abs-coord}} + L_r + L_\theta + L_\phi + L_{r,\text{aux}} + L_{\theta,\text{aux}} + L_{\phi,\text{aux}}$$</p>
<p>Fine-tuning is performed on 11,800 protein-ligand complex samples from PDBbind 2020, with the first three encoder layers frozen to prevent overfitting.</p>
<h2 id="evaluation-on-dud-e-with-drug-likeness-filtering">Evaluation on DUD-E with Drug-Likeness Filtering</h2>
<p>The evaluation uses the DUD-E dataset (101 targets, 20,000+ active compounds), comparing Lingo3DMol against Pocket2Mol and TargetDiff. A key methodological contribution is the emphasis on filtering generated molecules for drug-likeness (QED &gt;= 0.3 and SAS &lt;= 5) before evaluating binding metrics, as the authors demonstrate that molecules with good docking scores can still be poor drug candidates.</p>
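<p>The filtering step itself is simple; a dependency-free sketch, assuming QED and SAS values have already been computed for each molecule (in practice via RDKit&rsquo;s QED module and the contrib SA scorer):</p>

```python
def druglike_filter(scored_mols, qed_min=0.3, sas_max=5.0):
    """Keep only drug-like candidates before computing binding metrics.
    `scored_mols` is an iterable of (smiles, qed, sas) tuples with the
    scores precomputed upstream."""
    return [smiles for smiles, qed, sas in scored_mols
            if qed >= qed_min and sas <= sas_max]
```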
<p><strong>Molecular properties and binding mode (Table 1, drug-like molecules only):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Pocket2Mol</th>
          <th>TargetDiff</th>
          <th>Lingo3DMol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (% of total)</td>
          <td>61%</td>
          <td>49%</td>
          <td><strong>82%</strong></td>
      </tr>
      <tr>
          <td>Mean QED</td>
          <td>0.56</td>
          <td>0.60</td>
          <td>0.59</td>
      </tr>
      <tr>
          <td>Mean SAS</td>
          <td>3.5</td>
          <td>4.0</td>
          <td><strong>3.1</strong></td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% of targets)</td>
          <td>8%</td>
          <td>3%</td>
          <td><strong>33%</strong></td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td>-6.7</td>
          <td>-6.2</td>
          <td><strong>-6.8</strong></td>
      </tr>
      <tr>
          <td>Mean GlideSP redocking</td>
          <td>-7.5</td>
          <td>-7.0</td>
          <td><strong>-7.8</strong></td>
      </tr>
      <tr>
          <td>Mean RMSD vs. low-energy conformer (A)</td>
          <td>1.1</td>
          <td>1.1</td>
          <td><strong>0.9</strong></td>
      </tr>
      <tr>
          <td>Diversity</td>
          <td>0.84</td>
          <td><strong>0.88</strong></td>
          <td>0.82</td>
      </tr>
  </tbody>
</table>
<p>Lingo3DMol generates substantially more drug-like molecules (82% vs. 61% and 49%) and finds similar-to-active compounds for 33% of targets compared to 8% (Pocket2Mol) and 3% (TargetDiff). The model also achieves the best min-in-place GlideSP scores and lowest RMSD versus low-energy conformers, indicating higher quality binding poses and more realistic 3D geometries.</p>
<p><strong>Molecular geometry:</strong> Lingo3DMol demonstrated the lowest Jensen-Shannon divergence for all atom-atom distance distributions and produced significantly fewer molecules with large rings (0.23% with 7-membered rings vs. 2.59% for Pocket2Mol and 11.70% for TargetDiff).</p>
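<p>The Jensen-Shannon divergence behind these distance-distribution comparisons can be computed directly from binned histograms; a minimal sketch:</p>

```python
import numpy as np

def jensen_shannon(p, q, eps=1e-12):
    """Jensen-Shannon divergence (in nats) between two histograms,
    e.g. binned atom-atom distance distributions of generated vs.
    reference molecules. Inputs are normalized to sum to 1."""
    p = np.asarray(p, float); p = p / p.sum()
    q = np.asarray(q, float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```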
<p><strong>Information leakage analysis:</strong> The authors controlled for information leakage by excluding proteins with &gt;30% sequence identity to DUD-E targets from training. When DUD-E targets were stratified by sequence identity to Pocket2Mol&rsquo;s training set, Lingo3DMol&rsquo;s advantage widened as leakage decreased, suggesting the performance gap is genuine rather than an artifact of training overlap.</p>
<p><strong>Ablation studies (Table 2):</strong></p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Standard</th>
          <th>Random NCI</th>
          <th>No Pretraining</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like (%)</td>
          <td><strong>82%</strong></td>
          <td>47%</td>
          <td>71%</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5</td>
          <td><strong>33%</strong></td>
          <td>6%</td>
          <td>3%</td>
      </tr>
      <tr>
          <td>Mean min-in-place GlideSP</td>
          <td><strong>-6.8</strong></td>
          <td>-5.8</td>
          <td>-4.9</td>
      </tr>
      <tr>
          <td>Dice score</td>
          <td><strong>0.25</strong></td>
          <td>0.15</td>
          <td>0.13</td>
      </tr>
  </tbody>
</table>
<p>Both pretraining and the NCI predictor are essential. Removing pretraining reduces both the number of valid molecules and binding quality. Replacing the trained NCI predictor with random NCI site selection severely degrades drug-likeness and the ability to generate active-like compounds.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>Lingo3DMol demonstrates that combining language model sequence generation with geometric deep learning can produce drug-like 3D molecules that outperform graph-based and diffusion-based alternatives in binding mode quality, drug-likeness, and similarity to known actives. The FSMILES representation successfully constrains generated molecules to realistic topologies by encoding ring size information and using fragment-level generation.</p>
<p>Several limitations are acknowledged. Capturing all non-covalent interactions within a single molecule remains difficult with autoregressive generation. The model does not enforce equivariance (SE(3) invariance is approximated via rotation/translation augmentation and invariant features rather than built into the architecture). The pretraining dataset is partially proprietary (12M molecules from a commercial library, of which only the 1.4M drawn from public sources are released). Diversity of generated drug-like molecules is slightly lower than that of the baselines, though the authors argue that baseline diversity explores chemical space away from known active regions. A comprehensive evaluation of drug-like properties beyond QED and SAS metrics is identified as an important next step.</p>
<p>Future directions include investigating electron density representations for molecular interactions, incorporating SE(3) equivariant architectures (e.g., GVP, Vector Neurons), and developing more systematic drug-likeness evaluation frameworks.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>In-house commercial library</td>
          <td>12M molecules (1.4M public)</td>
          <td>Filtered for drug-likeness; conformers via ConfGen</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>PDBbind 2020 (general set)</td>
          <td>11,800 samples (8,201 PDB IDs)</td>
          <td>Filtered for &lt;30% sequence identity to DUD-E targets</td>
      </tr>
      <tr>
          <td>NCI labels</td>
          <td>PDBbind 2020</td>
          <td>Same as fine-tuning</td>
          <td>Labeled using ODDT for H-bonds, halogen bonds, salt bridges, pi-pi stacking</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>DUD-E</td>
          <td>101 targets, 20,000+ active compounds</td>
          <td>Standard benchmark for structure-based drug design</td>
      </tr>
      <tr>
          <td>Geometry evaluation</td>
          <td>CrossDocked2020</td>
          <td>100 targets</td>
          <td>Used for bond length and atom distance distribution comparisons</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Transformer-based encoder-decoder with graph structural bias terms (distance matrix $B_D$, edge vector matrix $B_J$)</li>
<li>Denoising pretraining: 25% atom deletion, coordinate perturbation ($\pm 0.5$ A), 25% carbon element type corruption</li>
<li>Depth-first search sampling with reward function combining model confidence and anchor fulfillment</li>
<li>Fine-tuning: first three encoder layers frozen</li>
<li>Local-global coordinate fusion during inference with search space: $r \pm 0.1$ A, $\theta \pm 2°$, $\phi \pm 2°$</li>
</ul>
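<p>A sketch of the denoising corruption described above, assuming atoms are held as element symbols alongside an $(N, 3)$ coordinate array; the jitter distribution (uniform) and the replacement symbol for corrupted carbons are illustrative assumptions:</p>

```python
import numpy as np

def corrupt(atoms, coords, rng, p_del=0.25, sigma=0.5, p_elem=0.25):
    """Apply the paper's three perturbations for denoising pretraining:
    delete ~25% of atoms, jitter coordinates within +/-0.5 A (uniform
    here, an assumption), and corrupt ~25% of carbon element types
    ('X' is an illustrative placeholder symbol)."""
    atoms = np.asarray(atoms, dtype=object)
    coords = np.asarray(coords, dtype=float)
    keep = rng.random(len(atoms)) >= p_del                      # deletion
    atoms, coords = atoms[keep], coords[keep]
    coords = coords + rng.uniform(-sigma, sigma, coords.shape)  # jitter
    flip = (atoms == "C") & (rng.random(len(atoms)) < p_elem)   # corruption
    atoms[flip] = "X"
    return atoms.tolist(), coords
```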
<h3 id="models">Models</h3>
<ul>
<li>Generation model: transformer encoder-decoder with dual decoders ($D_{\text{2D}}$ for topology, $D_{\text{3D}}$ for global coordinates)</li>
<li>NCI/anchor prediction model: same architecture, initialized from pretrained parameters</li>
<li>Pretrained, fine-tuned, and NCI model checkpoints available on GitHub and figshare</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Lingo3DMol</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Drug-like molecules (%)</td>
          <td>82%</td>
          <td>61% (P2M)</td>
          <td>QED &gt;= 0.3, SAS &lt;= 5</td>
      </tr>
      <tr>
          <td>ECFP TS &gt; 0.5 (% targets)</td>
          <td>33%</td>
          <td>8% (P2M)</td>
          <td>Tanimoto similarity to known actives</td>
      </tr>
      <tr>
          <td>Min-in-place GlideSP</td>
          <td>-6.8</td>
          <td>-6.7 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>GlideSP redocking</td>
          <td>-7.8</td>
          <td>-7.5 (P2M)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>RMSD vs. low-energy conformer</td>
          <td>0.9 A</td>
          <td>1.1 A (both)</td>
          <td>Lower is better</td>
      </tr>
      <tr>
          <td>Generation speed (100 mol)</td>
          <td>874 +/- 401 s</td>
          <td>962 +/- 622 s (P2M)</td>
          <td>NVIDIA Tesla V100</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Inference benchmarked on NVIDIA Tesla V100 GPUs</li>
<li>Generation of 100 valid molecules per target: 874 +/- 401 seconds</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/stonewiseAIDrugDesign/Lingo3DMol">Lingo3DMol</a></td>
          <td>Code</td>
          <td>GPL-3.0</td>
          <td>Inference code and model architecture</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/software/Code_for_Lingo3DMo/24633084">Model checkpoints</a></td>
          <td>Model</td>
          <td>GPL-3.0</td>
          <td>Pretraining, fine-tuning, and NCI checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://figshare.com/articles/dataset/Data_for_Lingo3DMol/24550351">Training data</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Partial pretraining data (1.4M public molecules), fine-tuning complexes, evaluation molecules</td>
      </tr>
      <tr>
          <td><a href="https://sw3dmg.stonewise.cn">Online service</a></td>
          <td>Other</td>
          <td>N/A</td>
          <td>Web interface for molecule generation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Feng, W., Wang, L., Lin, Z., Zhu, Y., Wang, H., Dong, J., Bai, R., Wang, H., Zhou, J., Peng, W., Huang, B., &amp; Zhou, W. (2024). Generation of 3D molecules in pockets via a language model. <em>Nature Machine Intelligence</em>, 6(1), 62-73. <a href="https://doi.org/10.1038/s42256-023-00775-6">https://doi.org/10.1038/s42256-023-00775-6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{feng2024generation,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Generation of 3D molecules in pockets via a language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Feng, Wei and Wang, Lvwei and Lin, Zaiyun and Zhu, Yanhao and Wang, Han and Dong, Jianqiang and Bai, Rong and Wang, Huting and Zhou, Jielong and Peng, Wei and Huang, Bo and Zhou, Wenbiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Machine Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{62--73}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s42256-023-00775-6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Evolutionary Molecular Design via Deep Learning + GA</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/evolutionary-design-deep-learning-genetic-algorithm/</guid><description>Kwon et al. combine an RNN decoder for SMILES reconstruction with a genetic algorithm operating on ECFP fingerprints for goal-directed molecular design.</description><content:encoded><![CDATA[<h2 id="fingerprint-based-evolutionary-molecular-design">Fingerprint-Based Evolutionary Molecular Design</h2>
<p>This is a <strong>Method</strong> paper that introduces an evolutionary design methodology (EDM) for goal-directed molecular optimization. The primary contribution is a four-component framework where (1) molecules are encoded as <a href="https://en.wikipedia.org/wiki/Extended-connectivity_fingerprint">extended-connectivity fingerprint</a> (ECFP) vectors, (2) a genetic algorithm evolves these fingerprint vectors through mutation and crossover, (3) a recurrent neural network (RNN) decodes the evolved fingerprints back into valid SMILES strings, and (4) a deep neural network (DNN) evaluates molecular fitness. The key advantage over prior evolutionary approaches is that no hand-crafted chemical rules or fragment libraries are needed, as the RNN learns valid molecular reconstruction from data.</p>
<h2 id="challenges-in-evolutionary-molecular-optimization">Challenges in Evolutionary Molecular Optimization</h2>
<p>Evolutionary algorithms for molecular design face two core challenges. First, maintaining chemical validity of evolved molecules is difficult when operating on graph or string representations directly. Prior methods rely on predefined chemical rules and fragment libraries to constrain structural modifications (atom/bond additions, deletions, substitutions), but these introduce bias and risk convergence to local optima. Each new application domain requires specifying new chemical rules, which may not exist for emerging areas. Second, fitness evaluation must be both efficient and accurate. Simple evaluation methods like structural similarity indices or semi-empirical quantum chemistry calculations reduce computational cost but may not capture complex property relationships.</p>
<p>High-throughput computational screening (HTCS) is a common alternative, but it depends on the quality of predefined virtual chemical libraries and often requires multiple iterative enumerations, limiting its ability to explore novel chemical space.</p>
<h2 id="core-innovation-evolving-fingerprints-with-neural-decoding">Core Innovation: Evolving Fingerprints with Neural Decoding</h2>
<p>The key insight is to perform genetic operations in fingerprint space rather than in molecular graph or SMILES string space. The framework comprises three core functions, two of them learned:</p>
<p><strong>Encoding function</strong> $e(\cdot)$: Converts a SMILES string $\mathbf{m}$ into a 5000-dimensional ECFP vector $\mathbf{x}$ using Morgan fingerprints with a neighborhood radius of 6. This is a deterministic hash-based encoding (not learned).</p>
<p><strong>Decoding function</strong> $d(\cdot)$: An RNN with three hidden layers of 500 LSTM units that reconstructs a SMILES string from an ECFP vector. The RNN generates SMILES as a sequence of three-character substrings, conditioning each prediction on the current substring and the input ECFP vector:</p>
<p>$$d(\mathbf{x}) = \mathbf{m}, \quad \mathbf{m}_{t+1} \sim p(\mathbf{m}_{t+1} \mid \mathbf{m}_{t}, \mathbf{x})$$</p>
<p>The three-character substring approach reduces the ratio of invalid SMILES by imposing additional constraints on subsequent characters.</p>
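<p>The substring decomposition itself is straightforward; a sketch, where the padding symbol is an illustrative assumption rather than the paper&rsquo;s vocabulary:</p>

```python
def to_substrings(smiles, n=3, pad="E"):
    """Split a SMILES string into the fixed-length substrings the RNN
    decodes one at a time; the final chunk is padded. The pad symbol
    'E' is an illustrative assumption."""
    if len(smiles) % n:
        smiles += pad * (n - len(smiles) % n)
    return [smiles[i:i + n] for i in range(0, len(smiles), n)]
```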
<p><strong>Property prediction function</strong> $f(\cdot)$: A five-layer DNN with 250 hidden units per layer that predicts molecular properties from ECFP vectors:</p>
<p>$$\mathbf{t} = f(e(\mathbf{m}))$$</p>
<p>The RNN is trained by minimizing cross-entropy loss between the softmax output and the target SMILES string $\mathbf{m}_{i}$, learning the relationship $d(e(\mathbf{m}_{i})) = \mathbf{m}_{i}$. The DNN is trained by minimizing mean squared error between predicted and computed property values. Both use the Adam optimizer with mini-batch size 100, 500 training epochs, and dropout rate 0.5.</p>
<h3 id="genetic-algorithm-operations">Genetic Algorithm Operations</h3>
<p>The GA evolves ECFP vectors using the DEAP library with the following parameters:</p>
<ul>
<li><strong>Population size</strong>: 50</li>
<li><strong>Crossover rate</strong>: 0.7 (uniform crossover, mixing ratio 0.2)</li>
<li><strong>Mutation rate</strong>: 0.3 (Gaussian mutation, $N(0, 0.2^{2})$, applied to 1% of elements)</li>
<li><strong>Selection</strong>: Tournament selection with size 3, top 3 individuals as parents</li>
<li><strong>Termination</strong>: 500 generations or 30 consecutive generations without fitness improvement</li>
</ul>
<p>The evolutionary loop proceeds as follows: a seed molecule $\mathbf{m}_{0}$ is encoded to $\mathbf{x}_{0}$, mutated to generate a population $\mathbf{P}^{0} = \{\mathbf{z}_{1}, \mathbf{z}_{2}, \ldots, \mathbf{z}_{L}\}$, each vector is decoded via the RNN, validity is checked with RDKit, fitness is evaluated via the DNN, and the top parents produce the next generation through crossover and mutation.</p>
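<p>A dependency-free sketch of one generation of this loop with the stated operators (tournament size 3, uniform crossover with mixing ratio 0.2, Gaussian mutation $N(0, 0.2^{2})$ on 1% of elements); the paper uses DEAP for this machinery, and the decode/validity/DNN stages are abstracted into a single fitness callable here:</p>

```python
import random

def tournament(pop, fitnesses, k=3):
    """Tournament selection: best of k randomly drawn individuals."""
    picks = random.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: fitnesses[i])]

def uniform_crossover(a, b, mix=0.2):
    """Swap each vector position between parents with probability `mix`."""
    a, b = a[:], b[:]
    for i in range(len(a)):
        if random.random() < mix:
            a[i], b[i] = b[i], a[i]
    return a, b

def gaussian_mutate(x, sigma=0.2, indpb=0.01):
    """Add N(0, sigma^2) noise to ~1% of the vector's elements."""
    return [xi + random.gauss(0.0, sigma) if random.random() < indpb else xi
            for xi in x]

def next_generation(pop, fitness_fn, cx_rate=0.7, mut_rate=0.3):
    """One generation: evaluate, select parents by tournament, then
    apply crossover and mutation at the stated rates."""
    fits = [fitness_fn(ind) for ind in pop]
    children = []
    while len(children) < len(pop):
        p1, p2 = tournament(pop, fits), tournament(pop, fits)
        if random.random() < cx_rate:
            p1, p2 = uniform_crossover(p1, p2)
        if random.random() < mut_rate:
            p1 = gaussian_mutate(p1)
        children.extend([p1, p2])
    return children[:len(pop)]
```

<p>In the full method the fitness callable would decode each vector with the RNN, reject RDKit-invalid SMILES, and score survivors with the DNN.</p>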
<h2 id="experimental-setup-light-absorbing-wavelength-optimization">Experimental Setup: Light-Absorbing Wavelength Optimization</h2>
<h3 id="training-data-and-deep-learning-performance">Training Data and Deep Learning Performance</h3>
<p>The models were trained on 10,000 to 100,000 molecules randomly sampled from PubChem (molecular weight 200-600 g/mol). Each molecule was labeled with DFT-computed excitation energy ($S_{1}$), <a href="https://en.wikipedia.org/wiki/HOMO_and_LUMO">HOMO, and LUMO</a> energies using B3LYP/6-31G.</p>
<table>
  <thead>
      <tr>
          <th>Training Data</th>
          <th>Validity (%)</th>
          <th>Reconstructability (%)</th>
          <th>$S_{1}$ (R, MAE)</th>
          <th>HOMO (R, MAE)</th>
          <th>LUMO (R, MAE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>100,000</td>
          <td>88.8</td>
          <td>62.4</td>
          <td>0.977, 0.185 eV</td>
          <td>0.948, 0.168 eV</td>
          <td>0.960, 0.195 eV</td>
      </tr>
      <tr>
          <td>50,000</td>
          <td>86.7</td>
          <td>60.1</td>
          <td>0.973, 0.198 eV</td>
          <td>0.945, 0.172 eV</td>
          <td>0.955, 0.209 eV</td>
      </tr>
      <tr>
          <td>30,000</td>
          <td>85.3</td>
          <td>59.8</td>
          <td>0.930, 0.228 eV</td>
          <td>0.934, 0.191 eV</td>
          <td>0.945, 0.224 eV</td>
      </tr>
      <tr>
          <td>10,000</td>
          <td>83.2</td>
          <td>55.7</td>
          <td>0.913, 0.278 eV</td>
          <td>0.885, 0.244 eV</td>
          <td>0.917, 0.287 eV</td>
      </tr>
  </tbody>
</table>
<p>Validity is the proportion of generated SMILES that pass RDKit inspection. Reconstructability measures how often the RNN reproduces the original molecule from its ECFP, judged by whether any of 10,000 generated strings matches the molecule&rsquo;s canonical SMILES (62.4% at 100k training samples).</p>
<h3 id="design-task-1-unconstrained-s1-modification">Design Task 1: Unconstrained S1 Modification</h3>
<p>Fifty seed molecules with $S_{1}$ values between 3.8 eV and 4.2 eV were evolved in both increasing and decreasing directions. With 50,000 training samples, $S_{1}$ increased by approximately 60% on average in the increasing direction and showed slightly lower rates of change in the decreasing direction. The asymmetry is attributed to the skewed $S_{1}$ distribution of training data (average $S_{1}$ of 4.3-4.4 eV, higher than the seed median of 4.0 eV). Performance saturated at approximately 50,000 training samples.</p>
<h3 id="design-task-2-s1-modification-with-homolumo-constraints">Design Task 2: S1 Modification with HOMO/LUMO Constraints</h3>
<p>The same 50 seeds were evolved with constraints: $-7.0 \text{ eV} &lt; \text{HOMO} &lt; -5.0 \text{ eV}$ and $\text{LUMO} &lt; 0.0 \text{ eV}$. In the increasing $S_{1}$ direction, constraints suppressed the rate of change because both HOMO and LUMO bounds limit the achievable HOMO-LUMO gap. In the decreasing direction, constraints had minimal effect because LUMO could freely decrease while HOMO had sufficient room to rise within the allowed range.</p>
<h3 id="design-task-3-extrapolation-beyond-training-data">Design Task 3: Extrapolation Beyond Training Data</h3>
<p>To generate molecules with $S_{1}$ values below 1.77 eV (outside the training distribution, which had mean $S_{1}$ of 4.91 eV), the authors introduced iterative &ldquo;phases&rdquo;: generate molecules, compute their properties via DFT, retrain the models, and repeat. Starting from the 30 lowest-$S_{1}$ seed molecules with 300 generation runs per phase:</p>
<ul>
<li>Phase 1: Average $S_{1}$ = 2.20 eV, 12 molecules below 1.77 eV</li>
<li>Phase 2: Average $S_{1}$ = 2.22 eV, 37 molecules below 1.77 eV</li>
<li>Phase 3: Average $S_{1}$ = 2.31 eV, 58 molecules below 1.77 eV</li>
</ul>
<p>While the average $S_{1}$ rose slightly across phases, variance decreased (from 1.40 to 1.36), indicating the model concentrated its outputs closer to the target range. This active-learning-like loop demonstrates the framework can extend beyond the training distribution.</p>
<h3 id="design-task-4-guacamol-benchmarks">Design Task 4: GuacaMol Benchmarks</h3>
<p>The method was evaluated on the <a href="/notes/computational-chemistry/benchmark-problems/guacamol-benchmarking-de-novo-molecular-design/">GuacaMol</a> goal-directed benchmark suite using the ChEMBL25 training dataset. The RNN model was retrained with three-character substrings.</p>
<table>
  <thead>
      <tr>
          <th>Benchmark</th>
          <th>Best of Dataset</th>
          <th><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a></th>
          <th>SMILES GA</th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a></th>
          <th><a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph MCTS</a></th>
          <th>cRNN</th>
          <th>EDM (ours)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Celecoxib rediscovery</td>
          <td>0.505</td>
          <td>1.000</td>
          <td>0.607</td>
          <td>1.000</td>
          <td>0.378</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Troglitazone rediscovery</td>
          <td>0.419</td>
          <td>1.000</td>
          <td>0.558</td>
          <td>1.000</td>
          <td>0.312</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>Thiothixene rediscovery</td>
          <td>0.456</td>
          <td>1.000</td>
          <td>0.495</td>
          <td>1.000</td>
          <td>0.308</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(-1.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.980</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>LogP(8.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>0.979</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>TPSA(150.0)</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>CNS MPO</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
          <td>1.000</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.948</td>
          <td>0.944</td>
          <td>0.948</td>
          <td>0.948</td>
      </tr>
  </tbody>
</table>
<p>The 256 highest-scoring molecules from the ChEMBL25 test set served as seeds, with 500 SMILES strings generated per seed. The EDM achieves maximum scores on all eight tasks, matching the cRNN baseline.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<h3 id="results">Results</h3>
<p>The evolutionary design framework successfully evolved seed molecules toward target properties across all four design tasks. The RNN decoder maintained 88.8% chemical validity at 100k training samples, and the DNN property predictor achieved correlation coefficients above 0.94 for $S_{1}$, HOMO, and LUMO prediction. The iterative retraining procedure enabled exploration outside the training data distribution, generating 58 molecules with $S_{1}$ below 1.77 eV after three phases. On GuacaMol benchmarks, the method achieved maximum scores on all eight tasks, matching <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/lstm-drug-like-molecule-generation/">SMILES LSTM</a>, <a href="/notes/computational-chemistry/benchmark-problems/graph-based-genetic-algorithm-chemical-space/">Graph GA</a>, and cRNN baselines.</p>
<h3 id="limitations">Limitations</h3>
<p>Several limitations are worth noting:</p>
<ol>
<li><strong>Reconstructability ceiling</strong>: Only 62.4% of molecules could be reconstructed from their ECFP vectors, meaning the RNN decoder fails to recover the original molecule approximately 38% of the time. This information loss in the ECFP encoding is a fundamental bottleneck.</li>
<li><strong>Data dependence</strong>: Performance is sensitive to the training data distribution. The asymmetric evolution rates for increasing vs. decreasing $S_{1}$ directly reflect the skewed training data.</li>
<li><strong>Structural constraints</strong>: Three heuristic constraints (fused ring sizes, number of fused rings, alkyl chain lengths) were still needed to maintain reasonable molecular structures, partially undermining the claim of a fully data-driven approach.</li>
<li><strong>DFT reliance</strong>: The extrapolation experiment requires DFT calculations in the loop, which are computationally expensive and may limit scalability.</li>
<li><strong>Limited benchmark scope</strong>: Only eight GuacaMol tasks were tested, and all achieved perfect scores, making it difficult to differentiate from competing methods. The paper does not report on harder multi-objective benchmarks.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>PubChem random sample</td>
          <td>10,000-100,000 molecules</td>
          <td>MW 200-600 g/mol, labeled with DFT-computed $S_{1}$, HOMO, LUMO</td>
      </tr>
      <tr>
          <td>GuacaMol Benchmark</td>
          <td>ChEMBL25</td>
          <td>Standard split</td>
          <td>Used for retraining RNN; 256 top-scoring seeds</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Genetic algorithm</strong>: DEAP library; population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3</li>
<li><strong>RNN decoder</strong>: 3 hidden layers, 500 LSTM units each, three-character substring generation</li>
<li><strong>DNN predictor</strong>: 5 layers, 250 hidden units, sigmoid activations, linear output</li>
<li><strong>Training</strong>: Adam optimizer, mini-batch 100, 500 epochs, dropout 0.5</li>
</ul>
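<p>The paper implements this loop with the DEAP library; the following is a minimal pure-Python sketch using the same hyperparameters (population 50, crossover rate 0.7, mutation rate 0.3, tournament size 3). The bitstring genome and sum-based fitness are toy stand-ins for the ECFP vector and the DNN property predictor, not the paper&rsquo;s actual setup:</p>

```python
import random

POP_SIZE, CX_RATE, MUT_RATE, TOURN_SIZE = 50, 0.7, 0.3, 3
GENOME_LEN = 32  # toy stand-in for an ECFP fingerprint vector


def fitness(ind):
    # Toy objective standing in for the DNN property predictor.
    return sum(ind)


def tournament(pop):
    # Tournament selection with size 3, as in the paper's GA settings.
    return max(random.sample(pop, TOURN_SIZE), key=fitness)


def crossover(a, b):
    # One-point crossover at a random cut.
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]


def mutate(ind):
    # Flip one random bit.
    ind = list(ind)
    i = random.randrange(GENOME_LEN)
    ind[i] ^= 1
    return ind


def evolve(generations=20, seed=0):
    random.seed(seed)
    pop = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
           for _ in range(POP_SIZE)]
    for _ in range(generations):
        nxt = []
        while len(nxt) < POP_SIZE:
            child = tournament(pop)
            if random.random() < CX_RATE:
                child = crossover(child, tournament(pop))
            if random.random() < MUT_RATE:
                child = mutate(child)
            nxt.append(list(child))
        pop = nxt
    return max(pop, key=fitness)
```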
<h3 id="models">Models</h3>
<p>All neural networks were implemented using Keras with the Theano backend (GPU-accelerated). No pre-trained model weights are publicly available.</p>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>RNN validity</strong>: Proportion of chemically valid SMILES (RDKit check)</li>
<li><strong>Reconstructability</strong>: Fraction of seed molecules recoverable from ECFP (canonical SMILES match in 10,000 generated strings)</li>
<li><strong>DNN accuracy</strong>: Correlation coefficient (R) and MAE via 10-fold cross-validation</li>
<li><strong>Evolutionary performance</strong>: Average rate of $S_{1}$ change across 50 seeds; molecule count in target range</li>
<li><strong>GuacaMol</strong>: Standard rediscovery and property satisfaction benchmarks</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU models, training times, or computational requirements for the evolutionary runs. DFT calculations used the Gaussian 09 program suite with B3LYP/6-31G.</p>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained models are available. The paper is published under a CC-BY 4.0 license as open access in Scientific Reports.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://www.nature.com/articles/s41598-021-96812-8">Paper (Nature)</a></td>
          <td>Paper</td>
          <td>CC-BY 4.0</td>
          <td>Open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility classification</strong>: Partially Reproducible. The method is described in sufficient detail for reimplementation, but no code, trained models, or preprocessed datasets are released. The DFT calculations require Gaussian 09, a commercial software package.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kwon, Y., Kang, S., Choi, Y.-S., &amp; Kim, I. (2021). Evolutionary design of molecules based on deep learning and a genetic algorithm. <em>Scientific Reports</em>, 11, 17304. <a href="https://doi.org/10.1038/s41598-021-96812-8">https://doi.org/10.1038/s41598-021-96812-8</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{kwon2021evolutionary,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Evolutionary design of molecules based on deep learning and a genetic algorithm}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kwon, Youngchun and Kang, Seokho and Choi, Youn-Suk and Kim, Inkoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Scientific Reports}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17304}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41598-021-96812-8}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BindGPT: GPT for 3D Molecular Design and Docking</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/bindgpt-3d-molecular-design/</guid><description>BindGPT applies GPT-style language modeling to 3D molecular generation using SMILES+XYZ tokenization, pre-training, and RL-based docking optimization.</description><content:encoded><![CDATA[<h2 id="a-language-model-for-joint-3d-molecular-graph-and-conformation-generation">A Language Model for Joint 3D Molecular Graph and Conformation Generation</h2>
<p>BindGPT is a <strong>Method</strong> paper that introduces a GPT-based language model for generating 3D molecular structures. The primary contribution is a unified framework that jointly produces molecular graphs (via SMILES) and 3D coordinates (via XYZ tokens) within a single autoregressive model. This eliminates the need for external graph reconstruction tools like OpenBabel, which are error-prone when applied to noisy atom positions. The same pre-trained model serves as a 3D molecular generative model, a conformer generator conditioned on molecular graphs, and a pocket-conditioned 3D molecule generator.</p>
<h2 id="the-graph-reconstruction-problem-in-3d-molecular-generation">The Graph Reconstruction Problem in 3D Molecular Generation</h2>
<p>Most existing 3D molecular generators focus on predicting atom types and positions, relying on supplementary software (e.g., OpenBabel or RDKit) to reconstruct molecular bonds from predicted coordinates. This introduces a fragile dependency: small positional errors can drastically change the reconstructed molecular graph or produce disconnected structures. Additionally, while diffusion models and equivariant GNNs have shown strong results for 3D molecular generation, they often depend on SE(3) equivariance inductive biases and are computationally expensive at sampling time (up to $10^6$ seconds to generate 1000 valid molecules in the case of EDM). The pocket-conditioned generation task is further limited by the small size of available 3D binding pose datasets (e.g., CrossDocked), making it difficult for specialized models to generalize without large-scale pre-training.</p>
<h2 id="smilesxyz-tokenization-jointly-encoding-graphs-and-coordinates">SMILES+XYZ Tokenization: Jointly Encoding Graphs and Coordinates</h2>
<p>The core innovation in BindGPT is coupling SMILES notation with XYZ coordinate format in a single token sequence. The sequence starts with a <code>&lt;LIGAND&gt;</code> token, followed by character-level SMILES tokens encoding the molecular graph, then an <code>&lt;XYZ&gt;</code> token marking the transition to coordinate data. Each 3D atom position is encoded using 6 tokens (integer and fractional parts for each of the three coordinates). The atom ordering is synchronized between SMILES and XYZ, so atom symbols from SMILES are not repeated in the coordinate section.</p>
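<p>A rough sketch of this sequence layout, assuming a two-decimal fixed-point coordinate encoding (the paper specifies the overall scheme and the 6-tokens-per-atom budget, but the exact token vocabulary below is an illustrative guess):</p>

```python
def tokenize_ligand(smiles, coords, precision=2):
    """Build a BindGPT-style token sequence: <LIGAND> + char-level SMILES
    + <XYZ> + 6 tokens per atom (integer and fractional parts of x, y, z).
    Token naming is an assumption; atom order matches the SMILES string,
    so atom symbols are not repeated in the coordinate section."""
    tokens = ["<LIGAND>"] + list(smiles) + ["<XYZ>"]
    for x, y, z in coords:
        for c in (x, y, z):
            sign = "-" if c < 0 else ""
            whole, frac = divmod(round(abs(c) * 10**precision), 10**precision)
            tokens.append(f"{sign}{whole}")          # integer-part token
            tokens.append(f".{frac:0{precision}d}")  # fractional-part token
    return tokens
```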
<p>For protein pockets, sequences begin with a <code>&lt;POCKET&gt;</code> token followed by atom names and coordinates. Following AlphaFold&rsquo;s approach, only alpha-carbon coordinates are retained to keep pocket representations compact.</p>
<p>The model uses the GPT-NeoX architecture with rotary position embeddings (RoPE), which enables length generalization between pre-training and fine-tuning where sequence lengths differ substantially. The pre-trained model has 108M parameters with 15 layers, 12 attention heads, and a hidden dimension of 768.</p>
<h3 id="pre-training-on-large-scale-3d-data">Pre-training on Large-Scale 3D Data</h3>
<p>Pre-training uses the Uni-Mol dataset containing 208M conformations for 12M molecules and 3.2M protein pocket structures. Each training batch contains either ligand sequences or pocket sequences (not mixed within a sequence). Since pockets are far fewer than ligands, the training schedule runs 5 pocket epochs per ligand epoch, resulting in roughly 8% pocket tokens overall. Training uses large batches of 1.6M tokens per step with Flash Attention and DeepSpeed optimizations.</p>
<h3 id="supervised-fine-tuning-with-augmentation">Supervised Fine-Tuning with Augmentation</h3>
<p>For pocket-conditioned generation, BindGPT is fine-tuned on CrossDocked 2020, which contains aligned pocket-ligand pairs. Unlike prior work that subsamples less than 1% of the best pairs, BindGPT uses all intermediate ligand poses (including lower-quality ones), yielding approximately 27M pocket-ligand pairs. To combat overfitting on the limited diversity (14k unique molecules, 3k pockets), two augmentation strategies are applied:</p>
<ol>
<li><strong><a href="/notes/computational-chemistry/molecular-representations/randomized-smiles-generative-models/">SMILES randomization</a></strong>: Each molecule can yield 100-1000 different valid SMILES strings</li>
<li><strong>Random 3D rotation</strong>: The same rotation matrix is applied to both pocket and ligand coordinates</li>
</ol>
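<p>A sketch of the rotation augmentation, sampling a proper rotation via QR decomposition (one standard recipe; the paper does not say how the rotation matrix is drawn) and applying the same matrix to pocket and ligand so their relative geometry is preserved:</p>

```python
import numpy as np


def random_rotation(rng):
    """Sample a random 3D rotation: QR-factorize a Gaussian matrix,
    sign-correct so the factorization is unique, and flip a column
    if needed so det = +1 (a proper rotation, not a reflection)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q


def augment(pocket_xyz, ligand_xyz, rng):
    """Apply one shared rotation to both point clouds (N x 3 arrays)."""
    rot = random_rotation(rng)
    return pocket_xyz @ rot.T, ligand_xyz @ rot.T
```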
<p>During fine-tuning, the pocket token sequence is concatenated before the ligand token sequence. An optional variant conditions on binding energy scores from the CrossDocked dataset, enabling contrastive learning between good and bad binding examples.</p>
<h3 id="reinforcement-learning-with-docking-feedback">Reinforcement Learning with Docking Feedback</h3>
<p>BindGPT applies REINFORCE (not PPO or <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/reinvent-deep-rl-molecular-design/">REINVENT</a>, which were found less stable) to further optimize pocket-conditioned generation. On each RL step, the model generates 3D ligand structures for a batch of random protein pockets, computes binding energy rewards using QVINA docking software, and updates model parameters. A KL-penalty between the current model and the SFT initialization stabilizes training.</p>
<p>The RL objective can be written as:</p>
<p>$$\mathcal{L}_{\text{RL}} = -\mathbb{E}_{x \sim \pi_\theta}\left[ R(x) \right] + \beta \cdot D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{SFT}})$$</p>
<p>where $R(x)$ is the docking reward from QVINA and $\beta$ controls the strength of the KL regularization.</p>
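<p>In an autodiff framework this objective is typically minimized through a surrogate loss whose gradient matches the REINFORCE estimate. A numpy sketch for one batch of sampled ligands, using a single-sample KL estimator (an assumption on our part; the paper does not specify the estimator):</p>

```python
import numpy as np


def reinforce_loss(logp, logp_sft, rewards, beta=0.1):
    """Surrogate loss for one batch of ligands x ~ pi_theta.
    `logp` and `logp_sft` are per-sequence log-probabilities under the
    current policy and the frozen SFT model; `rewards` are QVINA docking
    rewards R(x). The gradient of the first term w.r.t. the parameters
    producing `logp` is the score-function (REINFORCE) estimate; the
    second term estimates KL(pi_theta || pi_SFT) from the samples."""
    pg_term = -np.mean(rewards * logp)
    kl_term = np.mean(logp - logp_sft)
    return pg_term + beta * kl_term
```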
<h2 id="experimental-evaluation-across-three-3d-generation-tasks">Experimental Evaluation Across Three 3D Generation Tasks</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations (12M molecules) + 3.2M pockets</td>
          <td>Large-scale 3D molecular dataset</td>
      </tr>
      <tr>
          <td>Fine-tuning (SFT)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>14k molecules x 3k pockets, includes all pose qualities</td>
      </tr>
      <tr>
          <td>Fine-tuning (conformer)</td>
          <td><a href="/notes/computational-chemistry/datasets/geom/">GEOM-DRUGS</a></td>
          <td>27M conformations for 300k molecules</td>
          <td>Standard benchmark for 3D conformer generation</td>
      </tr>
      <tr>
          <td>Evaluation (conformer)</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot evaluation holdout</td>
      </tr>
      <tr>
          <td>Evaluation (pocket)</td>
          <td>CrossDocked holdout</td>
          <td>100 pockets</td>
          <td>Held out from training</td>
      </tr>
  </tbody>
</table>
<h3 id="task-1-3d-molecule-generation-pre-training">Task 1: 3D Molecule Generation (Pre-training)</h3>
<p>Compared against XYZ-Transformer (the only other model capable of large-scale pre-training), BindGPT achieves 98.58% validity (vs. 12.87% for XYZ-TF without hydrogens), higher SA (0.77 vs. 0.21), QED (0.59 vs. 0.30), and Lipinski scores (4.86 vs. 4.79). BindGPT also produces conformations with RMSD of 0.89 (XYZ-TF&rsquo;s RMSD calculation failed to converge). Generation is 12x faster (13s vs. 165s for 1000 molecules).</p>
<h3 id="task-2-3d-molecule-generation-fine-tuned-on-geom-drugs">Task 2: 3D Molecule Generation (Fine-tuned on GEOM-DRUGS)</h3>
<p>Against EDM and MolDiff (diffusion baselines), BindGPT outperforms on nearly all 3D distributional metrics:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>EDM</th>
          <th>MolDiff</th>
          <th>BindGPT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>JS bond lengths</td>
          <td>0.246</td>
          <td>0.365</td>
          <td><strong>0.029</strong></td>
      </tr>
      <tr>
          <td>JS bond angles</td>
          <td>0.282</td>
          <td>0.155</td>
          <td><strong>0.075</strong></td>
      </tr>
      <tr>
          <td>JS dihedral angles</td>
          <td>0.328</td>
          <td>0.162</td>
          <td><strong>0.098</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond types</td>
          <td>0.378</td>
          <td>0.163</td>
          <td><strong>0.045</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond pairs</td>
          <td>0.396</td>
          <td>0.136</td>
          <td><strong>0.043</strong></td>
      </tr>
      <tr>
          <td>JS freq. bond triplets</td>
          <td>0.449</td>
          <td>0.125</td>
          <td><strong>0.042</strong></td>
      </tr>
      <tr>
          <td>Time (1000 molecules)</td>
          <td>1.4e6 s</td>
          <td>7500 s</td>
          <td><strong>200 s</strong></td>
      </tr>
  </tbody>
</table>
<p>BindGPT is two orders of magnitude faster than diffusion baselines while producing more accurate 3D geometries. MolDiff achieves better drug-likeness scores (QED, SA), but the authors argue 3D distributional metrics are more relevant for evaluating 3D structure fidelity.</p>
<h3 id="task-3-pocket-conditioned-molecule-generation">Task 3: Pocket-Conditioned Molecule Generation</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score</th>
          <th>SA</th>
          <th>QED</th>
          <th>Lipinski</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-7.15 +/- 4.89</td>
          <td>0.75</td>
          <td>0.57</td>
          <td>4.88</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-7.80 +/- 3.61</td>
          <td>0.58</td>
          <td>0.48</td>
          <td>4.51</td>
      </tr>
      <tr>
          <td>BindGPT-FT</td>
          <td>-5.44 +/- 2.09</td>
          <td>0.78</td>
          <td>0.50</td>
          <td>4.72</td>
      </tr>
      <tr>
          <td>BindGPT-RFT</td>
          <td>-7.24 +/- 1.68</td>
          <td>0.74</td>
          <td>0.48</td>
          <td>4.32</td>
      </tr>
      <tr>
          <td>BindGPT-RL</td>
          <td><strong>-8.60 +/- 1.90</strong></td>
          <td><strong>0.84</strong></td>
          <td>0.43</td>
          <td>4.81</td>
      </tr>
  </tbody>
</table>
<p>The RL-fine-tuned model achieves the best Vina binding scores (-8.60 vs. -7.80 for TargetDiff) with lower variance and the highest SA score (0.84). The SFT-only model (BindGPT-FT) underperforms baselines on binding score, demonstrating that RL is essential for strong pocket-conditioned generation. QED is lower for BindGPT-RL, but the authors note that QED could be included in the RL reward and was excluded for fair comparison.</p>
<h3 id="conformer-generation">Conformer Generation</h3>
<p>On the Platinum dataset (zero-shot), BindGPT matches the performance of Torsional Diffusion (the specialized state-of-the-art) when assisted by RDKit, with a small gap without RDKit assistance. Uni-Mol fails to generalize to this dataset despite pre-training on the same Uni-Mol data.</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BindGPT demonstrates that a simple autoregressive language model without equivariance inductive biases can match or surpass specialized diffusion models and GNNs across multiple 3D molecular generation tasks. The key findings include:</p>
<ol>
<li><strong>Joint SMILES+XYZ generation eliminates graph reconstruction errors</strong>, achieving 98.58% validity compared to 12.87% for XYZ-Transformer</li>
<li><strong>Large-scale pre-training is critical for pocket-conditioned generation</strong>, as none of the baselines use pre-training and instead rely on heavy inductive biases</li>
<li><strong>RL fine-tuning with docking feedback substantially improves binding affinity</strong> beyond what SFT alone achieves</li>
<li><strong>Sampling is two orders of magnitude faster</strong> than diffusion baselines (200s vs. 1.4M s for EDM)</li>
</ol>
<p>Limitations include the relatively modest model size (108M parameters), with the authors finding this sufficient for current tasks but not exploring larger scales. The RL optimization uses only Vina score as reward; multi-objective optimization incorporating SA, QED, and other properties is left as future work. The model also relies on character-level SMILES tokenization rather than more sophisticated chemical tokenizers. BindGPT is the first model to explicitly generate hydrogens at scale, though validity drops from 98.58% to 77.33% when hydrogens are included.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Uni-Mol 3D</td>
          <td>208M conformations, 12M molecules, 3.2M pockets</td>
          <td>From Zhou et al. (2023)</td>
      </tr>
      <tr>
          <td>SFT (pocket)</td>
          <td>CrossDocked 2020</td>
          <td>~27M pocket-ligand pairs</td>
          <td>Full version including low-quality poses</td>
      </tr>
      <tr>
          <td>SFT (conformer)</td>
          <td>GEOM-DRUGS</td>
          <td>27M conformations, 300k molecules</td>
          <td>Standard benchmark</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Platinum</td>
          <td>Experimentally validated conformations</td>
          <td>Zero-shot holdout</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Architecture</strong>: GPT-NeoX with rotary position embeddings (RoPE)</li>
<li><strong>Pre-training</strong>: Causal language modeling with 1.6M tokens per batch</li>
<li><strong>SFT augmentation</strong>: SMILES randomization + random 3D rotation</li>
<li><strong>RL</strong>: REINFORCE with KL-penalty from SFT initialization; QVINA docking as reward</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Size</strong>: 108M parameters, 15 layers, 12 heads, hidden size 768</li>
<li><strong>Vocabulary</strong>: Character-level SMILES tokens + special tokens (<code>&lt;LIGAND&gt;</code>, <code>&lt;POCKET&gt;</code>, <code>&lt;XYZ&gt;</code>) + coordinate tokens (6 per 3D position)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Validity, SA, QED, Lipinski</strong>: Standard drug-likeness metrics</li>
<li><strong>Jensen-Shannon divergences</strong>: Distribution-level 3D structural metrics (bond lengths, angles, dihedrals, bond types)</li>
<li><strong>RMSD</strong>: Alignment quality of generated conformations vs. RDKit reference</li>
<li><strong>RMSD-Coverage</strong>: CDF of RMSD between generated and reference conformers</li>
<li><strong>Vina score</strong>: Binding energy from QVINA docking software</li>
</ul>
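<p>The Jensen-Shannon metrics compare histograms of geometric features (bond lengths, angles, dihedrals) between generated and reference molecules. A minimal numpy implementation of the divergence itself, in nats (the binning scheme used by the paper is not reproduced here):</p>

```python
import numpy as np


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions,
    e.g. normalized histograms of bond lengths. Symmetric, bounded by
    ln 2, and zero iff the distributions match."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # eps guards against log(0) in empty histogram bins.
        return np.sum(a * np.log((a + eps) / (b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```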
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training and fine-tuning use Flash Attention and DeepSpeed for efficiency</li>
<li>Specific GPU counts and training times are described in Appendix G (not available in the main text)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://bindgpt.github.io/">Project Page</a></td>
          <td>Other</td>
          <td>Not specified</td>
          <td>Project website with additional details</td>
      </tr>
  </tbody>
</table>
<p>No public code repository or pre-trained model weights were identified. The project website exists but no source code has been released as of this writing.</p>
<p><strong>Reproducibility Status</strong>: Partially Reproducible. The paper provides detailed architecture specs and hyperparameters, but no public code or model weights are available. All training datasets (Uni-Mol, CrossDocked, GEOM-DRUGS) are publicly accessible.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zholus, A., Kuznetsov, M., Schutski, R., Shayakhmetov, R., Polykovskiy, D., Chandar, S., &amp; Zhavoronkov, A. (2025). BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning. <em>Proceedings of the AAAI Conference on Artificial Intelligence</em>, 39(24), 26083-26091. <a href="https://doi.org/10.1609/aaai.v39i24.34804">https://doi.org/10.1609/aaai.v39i24.34804</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zholus2025bindgpt,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zholus, Artem and Kuznetsov, Maksim and Schutski, Roman and Shayakhmetov, Rim and Polykovskiy, Daniil and Chandar, Sarath and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the AAAI Conference on Artificial Intelligence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{39}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{24}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{26083--26091}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1609/aaai.v39i24.34804}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>AlphaDrug: MCTS-Guided Target-Specific Drug Design</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/alphadrug-protein-target-molecular-generation/</guid><description>AlphaDrug combines a modified transformer with Monte Carlo tree search and docking rollouts for target-specific de novo molecular generation.</description><content:encoded><![CDATA[<h2 id="target-conditioned-molecular-generation-via-transformer-and-mcts">Target-Conditioned Molecular Generation via Transformer and MCTS</h2>
<p>AlphaDrug is a <strong>Method</strong> paper that proposes a target-specific de novo molecular generation framework. The primary contribution is the combination of two components: (1) an Lmser Transformer (LT) that embeds protein-ligand context through hierarchical skip connections from encoder to decoder, and (2) a Monte Carlo tree search (MCTS) procedure guided by both the LT&rsquo;s predicted probabilities and docking scores from the <a href="/notes/computational-chemistry/benchmark-problems/smina-docking-benchmark/">SMINA</a> program. The method generates SMILES strings autoregressively, with each symbol selection informed by look-ahead search over potential binding affinities.</p>
<h2 id="bridging-the-gap-between-molecular-generation-and-protein-targeting">Bridging the Gap Between Molecular Generation and Protein Targeting</h2>
<p>Most deep learning methods for de novo molecular generation optimize physicochemical properties (LogP, QED, SA) without conditioning on a specific protein target. Virtual screening approaches rely on existing compound databases and are computationally expensive. The few methods that do consider protein targets, such as LiGANN and the <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">transformer-based approach of Grechishnikova (2021)</a>, show limited docking performance. The core challenge is twofold: the search space of drug-like molecules is estimated at $10^{60}$ compounds, and learning protein-ligand interaction patterns from sequence data is difficult because proteins and ligands have very different structures and sequence lengths.</p>
<p>AlphaDrug addresses these gaps by proposing a method that jointly learns protein-ligand representations and uses docking-guided search to navigate the vast chemical space.</p>
<h2 id="lmser-transformer-and-docking-guided-mcts">Lmser Transformer and Docking-Guided MCTS</h2>
<p>The key innovations are the Lmser Transformer architecture and the MCTS search strategy.</p>
<h3 id="lmser-transformer-lt">Lmser Transformer (LT)</h3>
<p>The standard transformer for sequence-to-sequence tasks passes information from the encoder&rsquo;s top layer to the decoder through cross-attention. AlphaDrug identifies an information transfer bottleneck: deep protein features from the encoder&rsquo;s final layer must serve all decoder layers. Inspired by the Lmser (least mean squared error reconstruction) network, the authors add hierarchical skip connections from each encoder layer to the corresponding decoder layer.</p>
<p>Each decoder layer receives protein features at the matching level of abstraction through a cross-attention mechanism:</p>
<p>$$f_{ca}(Q_m, K_S, V_S) = \text{softmax}\left(\frac{Q_m K_S^T}{\sqrt{d_k}}\right) V_S$$</p>
<p>where $Q_m$ comes from the ligand molecule decoder and $(K_S, V_S)$ are passed through skip connections from the protein encoder. This allows different decoder layers to access different levels of protein features, rather than all layers sharing the same top-level encoding.</p>
<p>The multi-head attention follows the standard formulation:</p>
<p>$$\text{MultiHead}(Q, K, V) = \text{Concat}(H_1, \dots, H_h) W^O$$</p>
<p>$$H_i = f_{ca}(Q W_i^Q, K W_i^K, V W_i^V)$$</p>
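<p>The cross-attention $f_{ca}$ above is ordinary scaled dot-product attention with queries from the ligand decoder and keys/values from the encoder skip connection; a single-head numpy sketch (batch and head dimensions omitted for clarity):</p>

```python
import numpy as np


def cross_attention(Q_m, K_S, V_S):
    """f_ca(Q_m, K_S, V_S): ligand queries (L x d_k) attend over protein
    keys/values (N x d_k) passed through an encoder skip connection."""
    d_k = K_S.shape[-1]
    scores = Q_m @ K_S.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over protein positions
    return weights @ V_S
```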
<h3 id="mcts-for-molecular-generation">MCTS for Molecular Generation</h3>
<p>The molecular generation process models SMILES construction as a sequential decision problem. At each step $\tau$, the context $C_\tau = \{S, a_1 a_2 \cdots a_\tau\}$ consists of the protein sequence $S$ and the intermediate SMILES string. MCTS runs a fixed number of simulations per step, each consisting of four phases:</p>
<p><strong>Select</strong>: Starting from the current root node, child nodes are selected using a variant of the PUCT algorithm:</p>
<p>$$\tilde{a}_{\tau+t} = \underset{a \in A}{\arg\max}\left(Q(\tilde{C}_{\tau+t-1}, a) + U(\tilde{C}_{\tau+t-1}, a)\right)$$</p>
<p>where $Q(\tilde{C}, a) = W_a / N_a$ is the average reward of action $a$ (cumulative reward $W_a$ divided by visit count $N_a$) and $U(\tilde{C}, a) = c_{puct} \cdot P(a \mid \tilde{C}) \cdot \sqrt{N_t} / (1 + N_t(a))$ is an exploration bonus that weights the LT&rsquo;s predicted probability by the parent node&rsquo;s total visit count $N_t$ relative to the child&rsquo;s visit count $N_t(a)$.</p>
<p>The Q-values are normalized to $[0, 1]$ using the range of docking scores in the tree:</p>
<p>$$Q(\tilde{C}, a) \leftarrow \frac{Q(\tilde{C}, a) - \min_{m \in \mathcal{M}} f_d(S, m)}{\max_{m \in \mathcal{M}} f_d(S, m) - \min_{m \in \mathcal{M}} f_d(S, m)}$$</p>
<p><strong>Expand</strong>: At a leaf node, the LT computes next-symbol probabilities and adds child nodes to the tree.</p>
<p><strong>Rollout</strong>: A complete molecule is generated greedily using LT probabilities. Valid molecules are scored with SMINA docking; invalid molecules receive the minimum observed docking score.</p>
<p><strong>Backup</strong>: Docking values propagate back up the tree, updating visit counts and cumulative rewards.</p>
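<p>The select phase and the Q-normalization above can be sketched in plain Python (the tree statistics and priors below are made up for illustration; the real implementation tracks these per node):</p>

```python
import math

def puct_select(children, prior, c_puct=1.5):
    """Select the child maximizing Q + U (the paper's PUCT variant).

    children: action -> (W, N), cumulative reward and visit count.
    prior:    action -> P(a | context) from the Lmser Transformer.
    """
    n_parent = sum(n for _, n in children.values())
    best, best_score = None, -math.inf
    for a, (w, n) in children.items():
        q = w / n if n else 0.0                                # average reward
        u = c_puct * prior[a] * math.sqrt(n_parent) / (1 + n)  # exploration
        if q + u > best_score:
            best, best_score = a, q + u
    return best

def normalize_q(q, docking_scores):
    """Min-max normalize a Q-value against docking scores seen in the tree."""
    lo, hi = min(docking_scores), max(docking_scores)
    return (q - lo) / (hi - lo) if hi > lo else 0.0

# Toy statistics for three candidate next symbols of a SMILES string.
children = {"C": (3.0, 4), "N": (1.0, 2), "(": (0.5, 1)}
prior = {"C": 0.6, "N": 0.3, "(": 0.1}
assert puct_select(children, prior) == "C"
assert normalize_q(5.0, [0.0, 10.0]) == 0.5
```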
<h3 id="training-objective">Training Objective</h3>
<p>The LT is trained on known protein-ligand pairs using cross-entropy loss:</p>
<p>$$J(\Theta) = -\sum_{(S,m) \in \mathcal{D}} \sum_{\tau=1}^{L_m} \sum_{a \in \mathcal{A}} y_a \ln P(a \mid C_\tau(S, m))$$</p>
<p>MCTS is only activated during inference, not during training.</p>
<h2 id="experiments-on-diverse-protein-targets">Experiments on Diverse Protein Targets</h2>
<h3 id="dataset">Dataset</h3>
<p>The authors use BindingDB, filtered to 239,455 protein-ligand pairs across 981 unique proteins. Filtering criteria include: human proteins only, IC50 &lt; 100 nM, molecular weight &lt; 1000 Da, and single-chain targets. Proteins are clustered at 30% sequence identity using MMseqs2, with 25 clusters held out for testing (100 proteins), and the remainder split 90/10 for training (192,712 pairs) and validation (17,049 pairs).</p>
<h3 id="baselines">Baselines</h3>
<ul>
<li><strong>T+BS10</strong>: Standard transformer with beam search (K=10) from <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/transformer-protein-drug-generation/">Grechishnikova (2021)</a></li>
<li><strong>LT+BS10</strong>: The proposed Lmser Transformer with beam search</li>
<li><strong>LiGANN</strong>: 3D pocket-to-ligand shape generation via BicycleGAN</li>
<li><strong>SBMolGen</strong>: ChemTS-based method with docking constraints</li>
<li><strong>SBDD-3D</strong>: 3D autoregressive graph-based generation</li>
<li><strong>Decoys</strong>: Random compounds from ZINC database</li>
<li><strong>Known ligands</strong>: Original binding partners from the database</li>
</ul>
<h3 id="main-results">Main Results</h3>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Docking</th>
          <th>Uniqueness</th>
          <th>LogP</th>
          <th>QED</th>
          <th>SA</th>
          <th>NP</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Decoys</td>
          <td>7.3</td>
          <td>-</td>
          <td>2.4</td>
          <td>0.8</td>
          <td>2.4</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>Known ligands</td>
          <td>9.8</td>
          <td>-</td>
          <td>2.2</td>
          <td>0.5</td>
          <td>3.3</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>LiGANN</td>
          <td>6.7</td>
          <td>94.7%</td>
          <td>2.9</td>
          <td>0.6</td>
          <td>3.0</td>
          <td>-1.1</td>
      </tr>
      <tr>
          <td>SBMolGen</td>
          <td>7.7</td>
          <td>100%</td>
          <td>2.6</td>
          <td>0.7</td>
          <td>2.8</td>
          <td>-1.2</td>
      </tr>
      <tr>
          <td>SBDD-3D</td>
          <td>7.7</td>
          <td>99.3%</td>
          <td>1.5</td>
          <td>0.6</td>
          <td>4.0</td>
          <td>0.3</td>
      </tr>
      <tr>
          <td>T+BS10</td>
          <td>8.5</td>
          <td>90.6%</td>
          <td>3.8</td>
          <td>0.5</td>
          <td>2.8</td>
          <td>-0.8</td>
      </tr>
      <tr>
          <td>LT+BS10</td>
          <td>8.5</td>
          <td>98.1%</td>
          <td>4.0</td>
          <td>0.5</td>
          <td>2.7</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (freq)</td>
          <td>10.8</td>
          <td>99.5%</td>
          <td>4.9</td>
          <td>0.4</td>
          <td>2.9</td>
          <td>-1.0</td>
      </tr>
      <tr>
          <td>AlphaDrug (max)</td>
          <td>11.6</td>
          <td>100%</td>
          <td>5.2</td>
          <td>0.4</td>
          <td>2.7</td>
          <td>-0.8</td>
      </tr>
  </tbody>
</table>
<p>AlphaDrug (max) achieves the highest average docking score (11.6), surpassing known ligands (9.8). Statistical significance is confirmed with two-tailed t-test P-values below 0.01 for all comparisons.</p>
<h3 id="mcts-vs-beam-search-under-equal-compute">MCTS vs. Beam Search Under Equal Compute</h3>
<p>When constrained to the same number of docking evaluations, MCTS consistently outperforms beam search:</p>
<table>
  <thead>
      <tr>
          <th>Docking times (N)</th>
          <th>BS</th>
          <th>MCTS</th>
          <th>P-value</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>N = 105 (S=10)</td>
          <td>8.4 (10.9)</td>
          <td>10.9 (11.5)</td>
          <td>1.8e-34 (4.5e-3)</td>
      </tr>
      <tr>
          <td>N = 394 (S=50)</td>
          <td>8.3 (11.4)</td>
          <td>11.6 (12.2)</td>
          <td>1.4e-31 (1.8e-3)</td>
      </tr>
      <tr>
          <td>N = 1345 (S=500)</td>
          <td>8.4 (11.9)</td>
          <td>12.4 (13.2)</td>
          <td>2.2e-39 (8.2e-6)</td>
      </tr>
  </tbody>
</table>
<p>Values in parentheses are average top-1 scores per protein.</p>
<h3 id="ablation-effect-of-protein-sequence-input">Ablation: Effect of Protein Sequence Input</h3>
<p>Replacing the full transformer (T) or LT with a transformer encoder only (TE, no protein input) demonstrates that protein conditioning improves both uniqueness and docking score per symbol (SpS):</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Uniqueness</th>
          <th>SpS</th>
          <th>Molecular length</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>TE + MCTS (S=50)</td>
          <td>81.0%</td>
          <td>0.1926</td>
          <td>62.93</td>
      </tr>
      <tr>
          <td>T + MCTS (S=50)</td>
          <td>98.0%</td>
          <td>0.2149</td>
          <td>55.63</td>
      </tr>
      <tr>
          <td>LT + MCTS (S=50)</td>
          <td>100.0%</td>
          <td>0.2159</td>
          <td>56.54</td>
      </tr>
  </tbody>
</table>
<p>The SpS metric (docking score normalized by molecule length) isolates the quality improvement from the tendency of longer molecules to score higher.</p>
<h3 id="computational-efficiency">Computational Efficiency</h3>
<p>A docking lookup table caches previously computed protein-molecule docking scores, reducing actual docking calls by 81-86% compared to the theoretical maximum ($L \times S$ calls per molecule). With $S = 10$, AlphaDrug generates molecules in about 52 minutes per protein; with $S = 50$, about 197 minutes per protein.</p>
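<p>The lookup table amounts to memoizing the docking call on the (protein, molecule) pair, which pays off because MCTS rollouts frequently regenerate the same molecule. A minimal sketch, with a dummy stand-in for the actual SMINA invocation:</p>

```python
import functools

calls = 0

def run_smina(protein_id, smiles):
    """Stand-in for the expensive SMINA docking call (the real version
    would invoke the docking engine); here it just counts invocations."""
    global calls
    calls += 1
    return float(len(smiles))  # dummy score for illustration

@functools.lru_cache(maxsize=None)
def docking_score(protein_id, smiles):
    """Cached docking: repeated (protein, molecule) queries hit the table."""
    return run_smina(protein_id, smiles)

for _ in range(3):  # three rollouts that produce the same molecule
    docking_score("3gcs", "CCO")
assert calls == 1  # only the first call actually docks
```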
<h2 id="docking-gains-with-acknowledged-limitations">Docking Gains with Acknowledged Limitations</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>86% of AlphaDrug-generated molecules have higher docking scores than known ligands for their respective targets.</li>
<li>The LT architecture with hierarchical skip connections improves uniqueness (from 90.6% to 98.1% with beam search) and provides slight SpS gains over the vanilla transformer.</li>
<li>MCTS is the dominant factor in performance improvement: even with only 10 simulations, it boosts docking scores by 31.3% over greedy LT decoding.</li>
<li>Case studies on three proteins (3gcs, 3eig, 4o28) show that generated molecules share meaningful substructures with known ligands, suggesting chemical plausibility.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The authors identify three areas for improvement:</p>
<ol>
<li><strong>Sequence-only representation</strong>: AlphaDrug uses amino acid sequences rather than 3D protein structures. While it outperforms existing 3D methods (SBDD-3D), incorporating 3D pocket geometry could further improve performance.</li>
<li><strong>External docking as value function</strong>: SMINA docking calls are computationally expensive and become a bottleneck during MCTS. A learnable end-to-end value network would reduce this cost and allow joint policy-value training.</li>
<li><strong>Full rollout requirement</strong>: Every MCTS simulation requires generating a complete molecule for docking evaluation. Estimating binding affinity from partial molecules remains an open challenge.</li>
</ol>
<p>The physicochemical properties (QED, SA) of AlphaDrug&rsquo;s outputs are comparable to baselines but not explicitly optimized. LogP values trend toward the upper end of the Ghose filter range (4.9-5.2 vs. the 5.6 limit), which may indicate lipophilicity bias.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>BindingDB (filtered)</td>
          <td>192,712 protein-ligand pairs</td>
          <td>Human proteins, IC50 &lt; 100 nM, MW &lt; 1000 Da</td>
      </tr>
      <tr>
          <td>Validation</td>
          <td>BindingDB (filtered)</td>
          <td>17,049 pairs</td>
          <td>Same filtering criteria</td>
      </tr>
      <tr>
          <td>Testing</td>
          <td>BindingDB (filtered)</td>
          <td>100 proteins from 25 clusters</td>
          <td>Clustered at 30% sequence identity via MMseqs2</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>MCTS with PUCT selection criterion, $c_{puct} = 1.5$</li>
<li>$S = 50$ simulations per step (default), $S = 10$ for fast variant</li>
<li>Greedy rollout policy using LT probabilities</li>
<li>Docking lookup table for efficiency (caches SMINA results)</li>
<li>Two generation modes: max (deterministic, highest visit count) and freq (stochastic, proportional to visit counts)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Lmser Transformer with hierarchical encoder-to-decoder skip connections</li>
<li>Sinusoidal positional encoding</li>
<li>Multi-head cross-attention at each decoder layer</li>
<li>Detailed hyperparameters (embedding dimensions, number of layers/heads) are in the supplementary material (Table S1)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>AlphaDrug (max)</th>
          <th>Known ligands</th>
          <th>Best baseline (T+BS10)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Docking score</td>
          <td>11.6</td>
          <td>9.8</td>
          <td>8.5</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>100%</td>
          <td>-</td>
          <td>90.6%</td>
      </tr>
      <tr>
          <td>Validity</td>
          <td>100%</td>
          <td>-</td>
          <td>Not reported</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware specifications are not explicitly reported in the paper. Generation time is reported as approximately 52 minutes per protein ($S = 10$) and 197 minutes per protein ($S = 50$), with docking (via SMINA) being the dominant cost.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/CMACH508/AlphaDrug">CMACH508/AlphaDrug</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation, includes data processing and generation scripts</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Qian, H., Lin, C., Zhao, D., Tu, S., &amp; Xu, L. (2022). AlphaDrug: protein target specific de novo molecular generation. <em>PNAS Nexus</em>, 1(4), pgac227. <a href="https://doi.org/10.1093/pnasnexus/pgac227">https://doi.org/10.1093/pnasnexus/pgac227</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{qian2022alphadrug,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{AlphaDrug: protein target specific de novo molecular generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Qian, Hao and Lin, Cheng and Zhao, Dengwei and Tu, Shikui and Xu, Lei}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{PNAS Nexus}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{pgac227}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/pnasnexus/pgac227}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>TamGen: GPT-Based Target-Aware Drug Design and Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</link><pubDate>Wed, 25 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/tamgen-target-aware-molecule-generation/</guid><description>TamGen combines a GPT-like chemical language model with protein pocket encoding and VAE refinement to generate drug candidates with experimental validation.</description><content:encoded><![CDATA[<h2 id="a-method-for-target-conditioned-molecular-generation">A Method for Target-Conditioned Molecular Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces TamGen (Target-aware molecular generation), a three-module architecture for generating drug-like compounds conditioned on protein binding pocket structures. The primary contribution is a GPT-like chemical language model pre-trained on 10 million SMILES from PubChem, combined with a Transformer-based protein encoder and a VAE-based contextual encoder for compound refinement. The authors validate TamGen on the CrossDocked2020 benchmark and apply it through a Design-Refine-Test pipeline to discover 14 novel inhibitors of the Mycobacterium tuberculosis ClpP protease, with $\text{IC}_{50}$ values ranging from 1.88 to 35.2 $\mu$M.</p>
<h2 id="bridging-generative-ai-and-practical-drug-discovery">Bridging Generative AI and Practical Drug Discovery</h2>
<p>Target-based generative drug design aims to create novel compounds with desired pharmacological properties from scratch, exploring the estimated $10^{60}$ feasible compounds in chemical space rather than screening existing libraries of $10^{4}$ to $10^{8}$ molecules. Prior approaches using diffusion models, GANs, VAEs, and autoregressive models have demonstrated the feasibility of generating compounds conditioned on target proteins. However, most generated compounds lack satisfactory physicochemical properties for drug-likeness, and validations with biophysical or biochemical assays are largely missing.</p>
<p>The key limitations of existing 3D generation methods (TargetDiff, Pocket2Mol, ResGen, 3D-AR) include:</p>
<ul>
<li>Generated compounds frequently contain multiple fused rings, leading to poor synthetic accessibility</li>
<li>High cellular toxicity and decreased developability associated with excessive fused ring counts</li>
<li>Slow generation speeds (tens of minutes to hours per 100 compounds)</li>
<li>Limited real-world experimental validation of generated candidates</li>
</ul>
<p>TamGen addresses these issues by operating in 1D SMILES space rather than 3D coordinate space, leveraging pre-training on natural compound distributions to produce more drug-like molecules.</p>
<h2 id="three-module-architecture-with-pre-training-and-refinement">Three-Module Architecture with Pre-Training and Refinement</h2>
<p>TamGen consists of three components: a compound decoder, a protein encoder, and a contextual encoder.</p>
<h3 id="compound-decoder-chemical-language-model">Compound Decoder (Chemical Language Model)</h3>
<p>The compound decoder is a GPT-style autoregressive model pre-trained on 10 million SMILES randomly sampled from PubChem. The pre-training objective follows standard next-token prediction:</p>
<p>$$
\min -\sum_{y \in \mathcal{D}_0} \frac{1}{M_y} \sum_{i=1}^{M_y} \log P(y_i \mid y_{i-1}, y_{i-2}, \ldots, y_1)
$$</p>
<p>where $M_y$ is the SMILES sequence length. This enables both unconditional and conditional generation. The decoder uses 12 Transformer layers with hidden dimension 768.</p>
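<p>A toy sketch of the length-normalized next-token term for a single SMILES string (the per-token probabilities are made up; in practice they come from the decoder&rsquo;s softmax):</p>

```python
import math

def smiles_nll(token_log_probs):
    """Length-normalized negative log-likelihood of one SMILES string:
    -(1/M_y) * sum_i log P(y_i | y_<i), matching the pre-training term."""
    return -sum(token_log_probs) / len(token_log_probs)

# Four tokens with made-up next-token probabilities from the decoder.
probs = [0.9, 0.5, 0.8, 0.7]
loss = smiles_nll([math.log(p) for p in probs])
assert 0.0 < loss < 1.0  # a confident model drives this toward 0
```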
<h3 id="protein-encoder-with-distance-aware-attention">Protein Encoder with Distance-Aware Attention</h3>
<p>The protein encoder processes binding pocket residues using both sequential and geometric information. Given amino acids $\mathbf{a} = (a_1, \ldots, a_N)$ with 3D coordinates $\mathbf{r} = (r_1, \ldots, r_N)$, the input representation combines amino acid embeddings with coordinate embeddings:</p>
<p>$$
h_i^{(0)} = E_a a_i + E_r \rho\left(r_i - \frac{1}{N}\sum_{j=1}^{N} r_j\right)
$$</p>
<p>where $\rho$ denotes a random roto-translation operation applied as data augmentation, and coordinates are centered to the origin.</p>
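<p>The $\rho$ augmentation can be sketched as centering followed by a random rotation and translation; because these are rigid motions, all pairwise residue distances are unchanged (toy numpy sketch, not the paper&rsquo;s code):</p>

```python
import numpy as np

def random_rototranslation(r, rng):
    """The rho augmentation: center the coordinates, then apply a random
    rotation (orthogonal Q from a QR decomposition) and translation."""
    centered = r - r.mean(axis=0, keepdims=True)
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    t = rng.normal(size=3)
    return centered @ Q + t

rng = np.random.default_rng(0)
r = rng.normal(size=(5, 3))          # 5 residues, toy coordinates
r_aug = random_rototranslation(r, rng)

# Rigid motions preserve all pairwise distances.
d = np.linalg.norm(r[:, None] - r[None, :], axis=-1)
d_aug = np.linalg.norm(r_aug[:, None] - r_aug[None, :], axis=-1)
assert np.allclose(d, d_aug)
```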
<p>The encoder uses a distance-aware self-attention mechanism that weights attention scores by spatial proximity:</p>
<p>$$
\begin{aligned}
\hat{\alpha}_j &amp;= \exp\left(-\frac{\lVert r_i - r_j \rVert^2}{\tau}\right)(h_i^{(l)\top} W h_j^{(l)}) \\
\alpha_j &amp;= \frac{\exp \hat{\alpha}_j}{\sum_{k=1}^{N} \exp \hat{\alpha}_k} \\
\hat{\boldsymbol{h}}_i^{(l+1)} &amp;= \sum_{j=1}^{N} \alpha_j (W_v h_j^{(l)})
\end{aligned}
$$</p>
<p>where $\tau$ is a temperature hyperparameter and $W$, $W_v$ are learnable parameters. The encoder uses 4 layers with hidden dimension 256. Outputs are passed to the compound decoder via cross-attention.</p>
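<p>A single-head numpy sketch of this mechanism (toy dimensions, random weights; the real encoder stacks multiple heads and layers):</p>

```python
import numpy as np

def distance_aware_attention(h, r, W, Wv, tau=1.0):
    """One attention head where the raw score h_i^T W h_j is damped by
    exp(-||r_i - r_j||^2 / tau) before the softmax, so spatially close
    residues attend to each other more strongly."""
    d2 = ((r[:, None, :] - r[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    scores = np.exp(-d2 / tau) * (h @ W @ h.T)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (h @ Wv)

rng = np.random.default_rng(0)
n, d = 6, 8  # toy pocket: 6 residues, hidden dimension 8
h = rng.normal(size=(n, d))
r = rng.normal(size=(n, 3))
W = rng.normal(size=(d, d)) * 0.1
Wv = rng.normal(size=(d, d)) * 0.1
out = distance_aware_attention(h, r, W, Wv)
assert out.shape == (n, d)
```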
<h3 id="vae-based-contextual-encoder">VAE-Based Contextual Encoder</h3>
<p>A VAE-based contextual encoder infers the mean $\mu$ and standard deviation $\sigma$ of the latent variable $z$ for any (compound, protein) pair. During training, $z$ is sampled from this posterior and the decoder reconstructs the input compound; at application time, encoding a seed compound yields a latent from which refined variants are generated. The full training objective combines reconstruction loss with KL regularization:</p>
<p>$$
\min_{\Theta, q} \frac{1}{|\mathcal{D}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} -\log P(\mathbf{y} \mid \mathbf{x}, z; \Theta) + \beta \mathcal{D}_{\text{KL}}(q(z \mid \mathbf{x}, \mathbf{y}) \| p(z))
$$</p>
<p>where $\beta$ is a hyperparameter controlling the KL divergence weight, and $p(z)$ is a standard Gaussian prior.</p>
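<p>For a diagonal Gaussian posterior, the KL term has a closed form, so the per-pair objective reduces to a reconstruction NLL plus a $\beta$-weighted penalty. A minimal sketch (short scalar lists stand in for latent vectors):</p>

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(q || N(0, I)) for a diagonal Gaussian posterior."""
    return sum(0.5 * (m * m + s * s - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))

def vae_loss(recon_nll, mu, sigma, beta=0.1):
    """Reconstruction NLL plus beta-weighted KL, per (compound, protein)
    pair. beta = 0.1 matches one of the paper's reported settings."""
    return recon_nll + beta * kl_to_standard_normal(mu, sigma)

# When the posterior equals the prior (mu=0, sigma=1) the KL term vanishes.
assert abs(vae_loss(2.0, [0.0, 0.0], [1.0, 1.0]) - 2.0) < 1e-12
# A posterior far from the prior is penalized.
assert vae_loss(2.0, [3.0, 3.0], [1.0, 1.0]) > 2.0
```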
<h2 id="benchmark-evaluation-and-tuberculosis-drug-discovery">Benchmark Evaluation and Tuberculosis Drug Discovery</h2>
<h3 id="crossdocked2020-benchmark">CrossDocked2020 Benchmark</h3>
<p>TamGen was evaluated against five baselines (liGAN, 3D-AR, Pocket2Mol, ResGen, TargetDiff) on the CrossDocked2020 dataset (~100k drug-target pairs for training, 100 test binding pockets). For each target, 100 compounds were generated per method. Evaluation metrics included:</p>
<ul>
<li><strong>Docking score</strong> (AutoDock-Vina): binding affinity estimate</li>
<li><strong>QED</strong>: quantitative estimate of drug-likeness</li>
<li><strong><a href="https://en.wikipedia.org/wiki/Lipinski%27s_rule_of_five">Lipinski&rsquo;s Rule of Five</a></strong>: physicochemical property compliance</li>
<li><strong>SAS</strong>: synthetic accessibility score</li>
<li><strong>LogP</strong>: lipophilicity (optimal range 0-5 for oral administration)</li>
<li><strong>Molecular diversity</strong>: Tanimoto similarity between Morgan fingerprints</li>
</ul>
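<p>The diversity metric is typically computed as the average pairwise distance $1 - \text{Tanimoto}$ over generated molecules. A sketch using sets of on-bit indices in place of RDKit Morgan fingerprints (the bit sets are toy values):</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit
    indices (Morgan fingerprints in the paper; toy bit sets here)."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def diversity(fps):
    """Average pairwise (1 - Tanimoto) over a batch of molecules."""
    pairs = [(a, b) for i, a in enumerate(fps) for b in fps[i + 1:]]
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)

fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
assert tanimoto(fps[0], fps[1]) == 0.5      # 2 shared bits / 4 total bits
assert diversity(fps) == (0.5 + 1.0 + 1.0) / 3
```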
<p>TamGen ranked first or second on 5 of 6 metrics and achieved the best overall score using mean reciprocal rank (MRR) across all metrics. On synthetic accessibility for high-affinity compounds, TamGen performed best. The generated compounds averaged 1.78 fused rings, closely matching FDA-approved drugs, while competing 3D methods produced compounds with significantly more fused rings.</p>
<p>TamGen was also 85x to 394x faster than competing methods: generating 100 compounds per target in an average of 9 seconds on a single A6000 GPU, compared to tens of minutes or hours for the baselines.</p>
<h3 id="design-refine-test-pipeline-for-clpp-inhibitors">Design-Refine-Test Pipeline for ClpP Inhibitors</h3>
<p>The practical application targeted ClpP protease of Mycobacterium tuberculosis, an emerging antibiotic target with no documented advanced inhibitors beyond <a href="https://en.wikipedia.org/wiki/Bortezomib">Bortezomib</a>.</p>
<p><strong>Design stage</strong>: Using the ClpP binding pocket from PDB structure 5DZK, TamGen generated 2,612 unique compounds. Compounds were filtered by molecular docking (retaining those with better scores than Bortezomib) and Ligandformer phenotypic activity prediction. Peptidomimetic compounds were excluded for poor ADME properties. Four seed compounds were selected.</p>
<p><strong>Refine stage</strong>: Using the 4 seed compounds plus 3 weakly active compounds ($\text{IC}_{50}$ 100-200 $\mu$M) from prior experiments, TamGen generated 8,635 unique compounds conditioned on both the target and seeds. After filtering, 296 compounds were selected for testing.</p>
<p><strong>Test stage</strong>: From a 446k commercial compound library, 159 analogs (MCS similarity &gt; 0.55) were identified. Five analogs showed significant inhibitory effects. Dose-response experiments revealed $\text{IC}_{50}$ values below 20 $\mu$M for all five, with Analog-005 achieving $\text{IC}_{50}$ of 1.9 $\mu$M. Three additional novel compounds were synthesized for SAR analysis:</p>
<table>
  <thead>
      <tr>
          <th>Compound</th>
          <th>Series</th>
          <th>Source</th>
          <th>$\text{IC}_{50}$ ($\mu$M)</th>
          <th>Key Feature</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Analog-005</td>
          <td>II</td>
          <td>Commercial library</td>
          <td>1.9</td>
          <td>Most potent analog</td>
      </tr>
      <tr>
          <td>Analog-003</td>
          <td>I</td>
          <td>Commercial library</td>
          <td>&lt; 20</td>
          <td>Strongest single-dose inhibition</td>
      </tr>
      <tr>
          <td>Syn-A003-01</td>
          <td>I</td>
          <td>TamGen (synthesized)</td>
          <td>&lt; 20</td>
          <td>Diphenylurea scaffold</td>
      </tr>
  </tbody>
</table>
<p>Both compound series (diphenylurea and benzenesulfonamide scaffolds) represent novel ClpP inhibitor chemotypes distinct from Bortezomib. Additionally, 6 out of 8 directly synthesized TamGen compounds demonstrated $\text{IC}_{50}$ below 40 $\mu$M, confirming TamGen&rsquo;s ability to produce viable hits without the library search step.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Four ablation experiments clarified the contributions of TamGen&rsquo;s components:</p>
<ol>
<li><strong>Without pre-training</strong>: Significantly worse docking scores and simpler structures. The optimal decoder depth dropped from 12 to 4 layers without pre-training due to overfitting.</li>
<li><strong>Shuffled pocket-ligand pairs (TamGen-r)</strong>: Substantially worse docking scores, confirming TamGen learns meaningful pocket-ligand interactions rather than generic compound distributions.</li>
<li><strong>Without distance-aware attention</strong>: Significant decline in docking scores when the geometric weighting factor $\exp(-\lVert r_i - r_j \rVert^2 / \tau)$ is removed from the attention scores.</li>
<li><strong>Without coordinate augmentation</strong>: Performance degradation when removing the roto-translation augmentation $\rho$, highlighting the importance of geometric invariance.</li>
</ol>
<h2 id="validated-drug-like-generation-with-practical-limitations">Validated Drug-Like Generation with Practical Limitations</h2>
<p>TamGen demonstrates that 1D SMILES-based generation with pre-training on natural compounds produces molecules with better drug-likeness properties than 3D generation methods. The experimental validation against ClpP is a notable strength, as most generative drug design methods lack biochemical assay confirmation.</p>
<p>Key limitations acknowledged by the authors include:</p>
<ul>
<li><strong>Insufficient sensitivity to minor target differences</strong>: TamGen cannot reliably distinguish targets with point mutations or protein isoforms, limiting applicability for cancer-related proteins</li>
<li><strong>Requires known structure and pocket</strong>: As a structure-based method, TamGen needs the 3D structure of the target protein and binding pocket information</li>
<li><strong>Limited cellular validation</strong>: The study focuses on hit identification; cellular activities and toxicities of proposed compounds were not extensively tested</li>
<li><strong>1D generation trade-off</strong>: SMILES-based generation does not fully exploit 3D protein-ligand geometric interactions available in coordinate space</li>
</ul>
<p>Future directions include integrating insights from 3D autoregressive methods, using Monte Carlo Tree Search or reinforcement learning to guide generation for better docking scores and ADME/T properties, and property-guided generation as explored in <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/target-aware/prefixmol-target-chemistry-aware-generation/">PrefixMol</a>.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem (random sample)</td>
          <td>10M SMILES</td>
          <td>Compound decoder pre-training</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>CrossDocked2020</td>
          <td>~100k pairs</td>
          <td>Filtered pocket-ligand pairs</td>
      </tr>
      <tr>
          <td>Extended fine-tuning</td>
          <td>CrossDocked + PDB</td>
          <td>~300k pairs</td>
          <td>Used for TB compound generation</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>CrossDocked2020 test</td>
          <td>100 pockets</td>
          <td>Same split as TargetDiff/Pocket2Mol</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Compound decoder</strong>: 12-layer GPT with hidden dimension 768, pre-trained for 200k steps</li>
<li><strong>Protein encoder</strong>: 4-layer Transformer with hidden dimension 256, distance-aware attention</li>
<li><strong>VAE encoder</strong>: 4-layer standard Transformer encoder with hidden dimension 256</li>
<li><strong>Optimizer</strong>: Adam with initial learning rate $3 \times 10^{-5}$</li>
<li><strong>VAE $\beta$</strong>: 0.1 or 1.0 depending on generation stage</li>
<li><strong>Beam search</strong>: beam sizes of 4, 10, or 20 depending on stage</li>
<li><strong>Pocket definition</strong>: residues within 10 or 15 Angstrom distance cutoff from ligand center</li>
</ul>
<h3 id="models">Models</h3>
<p>Pre-trained model weights are available via Zenodo at <a href="https://doi.org/10.5281/zenodo.13751391">https://doi.org/10.5281/zenodo.13751391</a>.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>TamGen</th>
          <th>Best Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Overall MRR</td>
          <td>Best</td>
          <td>TargetDiff (2nd)</td>
          <td>Ranked across 6 metrics</td>
      </tr>
      <tr>
          <td>Fused rings (avg)</td>
          <td>1.78</td>
          <td>~3-5 (others)</td>
          <td>Matches FDA-approved drug average</td>
      </tr>
      <tr>
          <td>Generation speed</td>
          <td>9 sec/100 compounds</td>
          <td>~13 min (ResGen)</td>
          <td>Single A6000 GPU</td>
      </tr>
      <tr>
          <td>ClpP hit rate</td>
          <td>6/8 synthesized</td>
          <td>N/A</td>
          <td>$\text{IC}_{50}$ &lt; 40 $\mu$M</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x V100 GPUs for 200k steps</li>
<li>Inference benchmarking: 1x A6000 GPU</li>
<li>Generation time: ~9 seconds per 100 compounds per target</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/SigmaGenX/TamGen">TamGen code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.13751391">Model weights and data</a></td>
          <td>Model + Data</td>
          <td>CC-BY-4.0</td>
          <td>Pre-trained weights, source data</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Wu, K., Xia, Y., Deng, P., Liu, R., Zhang, Y., Guo, H., Cui, Y., Pei, Q., Wu, L., Xie, S., Chen, S., Lu, X., Hu, S., Wu, J., Chan, C.-K., Chen, S., Zhou, L., Yu, N., Chen, E., Liu, H., Guo, J., Qin, T., &amp; Liu, T.-Y. (2024). TamGen: drug design with target-aware molecule generation through a chemical language model. <em>Nature Communications</em>, 15, 9360. <a href="https://doi.org/10.1038/s41467-024-53632-4">https://doi.org/10.1038/s41467-024-53632-4</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{wu2024tamgen,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{TamGen: drug design with target-aware molecule generation through a chemical language model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Wu, Kehan and Xia, Yingce and Deng, Pan and Liu, Renhe and Zhang, Yuan and Guo, Han and Cui, Yumeng and Pei, Qizhi and Wu, Lijun and Xie, Shufang and Chen, Si and Lu, Xi and Hu, Song and Wu, Jinzhi and Chan, Chi-Kin and Chen, Shawn and Zhou, Liangliang and Yu, Nenghai and Chen, Enhong and Liu, Haiguang and Guo, Jinjiang and Qin, Tao and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{9360}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Nature Publishing Group}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-53632-4}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>