<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Latent-Space Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/</link><description>Recent content in Latent-Space Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/index.xml" rel="self" type="application/rss+xml"/><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
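<p>As a concrete reading of this schedule (the paper specifies only the endpoints, so the per-epoch geometric decay below is an assumption), the learning rate at each epoch can be sketched as:</p>

```python
def heteroencoder_lr(epoch: int, total_epochs: int = 100,
                     lr_start: float = 1e-3, lr_end: float = 1e-6,
                     flat_epochs: int = 50) -> float:
    """Constant lr for the first `flat_epochs`, then geometric decay
    that reaches `lr_end` exactly at `total_epochs`."""
    if epoch <= flat_epochs:
        return lr_start
    # Per-epoch decay ratio chosen so that lr(total_epochs) == lr_end.
    ratio = (lr_end / lr_start) ** (1.0 / (total_epochs - flat_epochs))
    return lr_start * ratio ** (epoch - flat_epochs)
```

<p>With the defaults, epochs 1 through 50 stay at $10^{-3}$ and epoch 100 lands on $10^{-6}$.</p>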
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
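<p>The interpolation sampling and penalty term can be illustrated numerically. The sketch below uses a toy <em>linear</em> critic (an assumption, standing in for the three-layer feed-forward critic) so the input gradient is available in closed form:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(x) = w @ x, whose gradient w.r.t. x is w everywhere.
w = rng.normal(size=8)

def critic(x):          # x: (batch, 8)
    return x @ w

def critic_grad(x):     # analytic input gradient of the linear critic
    return np.tile(w, (x.shape[0], 1))

def wgan_gp_critic_loss(real, fake, lam=10.0):
    # Sample x_hat uniformly along straight lines between real/fake pairs.
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_grad(x_hat), axis=1)
    penalty = lam * np.mean((grad_norm - 1.0) ** 2)
    return np.mean(critic(fake)) - np.mean(critic(real)) + penalty

real = rng.normal(size=(32, 8))   # encoded latent vectors
fake = rng.normal(size=(32, 8))   # generator outputs
loss = wgan_gp_critic_loss(real, fake)
```

<p>For this critic the penalty reduces to $\lambda(\|w\|_2 - 1)^2$, which makes the gradient-penalty mechanism easy to verify by hand.</p>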
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
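<p>The three-step pipeline can be sketched with placeholder networks (both stand-ins below are hypothetical; the real generator is the five-layer MLP and the real decoder is the heteroencoder&rsquo;s LSTM):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512

def generator(noise: np.ndarray) -> np.ndarray:
    return np.tanh(noise)            # placeholder for the trained MLP generator

def decode_smiles(latent: np.ndarray) -> str:
    return "CCO"                     # placeholder for the heteroencoder decoder

def sample_molecule() -> str:
    noise = rng.uniform(-1.0, 1.0, size=LATENT_DIM)   # (1) random vector
    latent = generator(noise)                          # (2) latent vector
    return decode_smiles(latent)                       # (3) SMILES string
```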
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
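<p>The Kappa column above is Cohen&rsquo;s kappa, which corrects raw classifier agreement for chance. A minimal implementation for binary labels:</p>

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: observed agreement corrected
    for the agreement expected from the marginal label frequencies."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    expected = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    return (observed - expected) / (1 - expected)
```

<p>Perfect agreement gives 1.0, chance-level agreement gives 0.0; the 0.90-0.91 values for HTR1A and S1PR1 indicate near-perfect classifiers.</p>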
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
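<p>That accounting rule matters: error is measured after canonicalization, so a different spelling of the same molecule counts as a success. A sketch of the bookkeeping, using a hypothetical stub canonicalizer (a real pipeline would use a cheminformatics toolkit such as RDKit):</p>

```python
def reconstruction_error_rate(pairs, canonical):
    """Fraction of (input, decoded) SMILES pairs that decode to a
    *different* molecule; a different SMILES spelling of the same
    molecule is not counted as an error."""
    errors = sum(canonical(inp) != canonical(out) for inp, out in pairs)
    return errors / len(pairs)

# Hypothetical canonicalizer stub mapping alternate spellings of ethanol.
ALIASES = {"OCC": "CCO", "C(C)O": "CCO"}
canon = lambda s: ALIASES.get(s, s)

pairs = [("CCO", "OCC"),        # same molecule, different spelling: OK
         ("CCO", "c1ccccc1")]   # different molecule: error
rate = reconstruction_error_rate(pairs, canon)   # 0.5
```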
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Frechet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both the compound and scaffold levels. A probabilistic analysis indicated that the RNN model would be unlikely ever to cover the LatentGAN output space, even with continued sampling. This suggests the two architectures can complement each other in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
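<p>The similarity thresholds above are Tanimoto similarities over binary fingerprints. Representing a fingerprint as its set of on-bit indices, the metric is:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    each given as a set of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```

<p>A compound falls in the &ldquo;novel&rdquo; bucket above when its maximum Tanimoto similarity to any training-set compound is below 0.4.</p>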
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Grammar VAE: Generating Valid Molecules via CFGs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</guid><description>The Grammar VAE encodes and decodes molecular parse trees from context-free grammars, guaranteeing syntactically valid SMILES outputs during generation.</description><content:encoded><![CDATA[<h2 id="a-grammar-constrained-vae-for-discrete-data-generation">A Grammar-Constrained VAE for Discrete Data Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Grammar Variational Autoencoder (GVAE), a variant of the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder</a> that operates directly on parse trees from context-free grammars (CFGs) rather than on raw character sequences. The primary contribution is a decoding mechanism that uses a stack and grammar-derived masks to restrict the output at every timestep to only syntactically valid production rules. This guarantees that every decoded output is a valid string under the grammar, addressing a fundamental limitation of character-level VAEs when applied to structured discrete data such as <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> molecular strings and arithmetic expressions.</p>
<h2 id="why-character-level-vaes-fail-on-structured-discrete-data">Why Character-Level VAEs Fail on Structured Discrete Data</h2>
<p>Generative models for continuous data (images, audio) had achieved impressive results by 2017, but generating structured discrete data remained difficult. The key challenge is that string representations of molecules and mathematical expressions are brittle: small perturbations to a character sequence often produce invalid outputs. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> demonstrated a character-level VAE (CVAE) for SMILES strings that could encode molecules into a continuous latent space and decode them back, enabling latent-space optimization for molecular design. However, the CVAE frequently decoded latent points into strings that were not valid SMILES, particularly when exploring regions of latent space far from training data.</p>
<p>The fundamental issue is that character-level decoders must implicitly learn the syntactic rules of the target language from data alone. For SMILES, this includes matching parentheses, valid atom types, proper bonding, and ring closure notation. The GVAE addresses this by giving the decoder explicit knowledge of the grammar, so it can focus entirely on learning the semantic structure of the data.</p>
<h2 id="core-innovation-stack-based-grammar-masking-in-the-decoder">Core Innovation: Stack-Based Grammar Masking in the Decoder</h2>
<p>The GVAE encodes and decodes sequences of production rules from a context-free grammar rather than sequences of characters.</p>
<p><strong>Encoding.</strong> Given an input string (e.g., a SMILES molecule), the encoder first parses it into a parse tree using the CFG, then performs a left-to-right pre-order traversal of the tree to extract an ordered sequence of production rules. Each rule is represented as a one-hot vector of dimension $K$ (total number of production rules in the grammar). The resulting $T(\mathbf{X}) \times K$ matrix is processed by a convolutional neural network to produce the mean and variance of a Gaussian posterior $q_{\phi}(\mathbf{z} \mid \mathbf{X})$.</p>
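<p>The encoding step can be sketched with a toy arithmetic grammar (an assumption standing in for the much larger SMILES grammar). Given the pre-order rule sequence for the parse of <code>x + 1</code>, the one-hot matrix fed to the CNN encoder is:</p>

```python
import numpy as np

# Toy grammar with K = 4 production rules.
RULES = ["S -> S '+' T", "S -> T", "T -> 'x'", "T -> '1'"]
RULE_INDEX = {r: i for i, r in enumerate(RULES)}

def one_hot_rule_sequence(rules):
    """Pre-order rule sequence -> T(X) x K one-hot matrix."""
    X = np.zeros((len(rules), len(RULES)))
    for t, r in enumerate(rules):
        X[t, RULE_INDEX[r]] = 1.0
    return X

# Pre-order traversal of the parse tree of "x + 1":
seq = ["S -> S '+' T", "S -> T", "T -> 'x'", "T -> '1'"]
X = one_hot_rule_sequence(seq)   # shape (4, 4), exactly one 1 per row
```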
<p><strong>Decoding with grammar masks.</strong> The decoder maps a latent vector $\mathbf{z}$ through an RNN to produce a matrix of logits $\mathbf{F} \in \mathbb{R}^{T_{max} \times K}$. The key innovation is a last-in first-out (LIFO) stack that tracks the current parsing state. At each timestep $t$, the decoder:</p>
<ol>
<li>Pops the top non-terminal $\alpha$ from the stack</li>
<li>Applies a fixed binary mask $\mathbf{m}_{\alpha} \in \{0, 1\}^K$ that zeros out all production rules whose left-hand side is not $\alpha$</li>
<li>Samples a production rule from the masked softmax distribution:</li>
</ol>
<p>$$
p(\mathbf{x}_{t} = k \mid \alpha, \mathbf{z}) = \frac{m_{\alpha,k} \exp(f_{tk})}{\sum_{j=1}^{K} m_{\alpha,j} \exp(f_{tj})}
$$</p>
<ol start="4">
<li>Pushes the right-hand-side non-terminals of the selected rule onto the stack (right-to-left, so the leftmost is on top)</li>
</ol>
<p>This process continues until the stack is empty or $T_{max}$ timesteps are reached. Because the mask restricts selection to only those rules applicable to the current non-terminal, every generated sequence of production rules is guaranteed to be a valid derivation under the grammar.</p>
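<p>The stack-and-mask loop above can be sketched end to end with the same toy arithmetic grammar (the grammar, logits, and sampler below are illustrative assumptions, not the paper&rsquo;s implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CFG: each rule is (LHS non-terminal, RHS symbols); lowercase/'+'
# symbols are terminals.
RULES = [("S", ["S", "+", "T"]), ("S", ["T"]), ("T", ["x"]), ("T", ["1"])]
K = len(RULES)
NONTERMINALS = {"S", "T"}

def masked_decode(logits, start="S", t_max=20):
    """Turn a (t_max, K) logit matrix into a guaranteed-valid derivation
    using a LIFO stack and per-non-terminal masks."""
    stack, derivation = [start], []
    for t in range(t_max):
        if not stack:
            return derivation              # stack empty: complete derivation
        alpha = stack.pop()                # 1. pop top non-terminal
        mask = np.array([1.0 if lhs == alpha else 0.0 for lhs, _ in RULES])
        probs = mask * np.exp(logits[t])   # 2-3. masked softmax
        probs /= probs.sum()
        k = rng.choice(K, p=probs)
        derivation.append(k)
        # 4. push RHS non-terminals right-to-left so the leftmost is on top
        for sym in reversed(RULES[k][1]):
            if sym in NONTERMINALS:
                stack.append(sym)
    return None                            # stack not emptied: invalid

logits = rng.normal(size=(20, K))
seq = masked_decode(logits)
```

<p>Because the mask admits only rules whose left-hand side matches the popped non-terminal, every returned sequence is a valid derivation; only hitting <code>t_max</code> with a non-empty stack yields an invalid output.</p>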
<p><strong>Training.</strong> The model is trained by maximizing the ELBO:</p>
<p>$$
\mathcal{L}(\phi, \theta; \mathbf{X}) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X})} \left[ \log p_{\theta}(\mathbf{X}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} \mid \mathbf{X}) \right]
$$</p>
<p>where the likelihood factorizes as:</p>
<p>$$
p(\mathbf{X} \mid \mathbf{z}) = \prod_{t=1}^{T(\mathbf{X})} p(\mathbf{x}_{t} \mid \mathbf{z})
$$</p>
<p>During training, the masks at each timestep are determined by the ground-truth production rule sequence, so no stack simulation is needed. The stack-based decoding is only required at generation time.</p>
<p><strong>Syntactic vs. semantic validity.</strong> The grammar guarantees syntactic validity but not semantic validity. The GVAE can still produce chemically implausible molecules (e.g., an oxygen atom with three bonds) because such constraints are not context-free. SMILES ring-bond digit matching is also not context-free, so the grammar cannot enforce it. Additionally, sequences that have not emptied the stack by $T_{max}$ are marked invalid.</p>
<h2 id="experiments-on-symbolic-regression-and-molecular-optimization">Experiments on Symbolic Regression and Molecular Optimization</h2>
<p>The authors evaluate the GVAE on two domains: arithmetic expressions and molecules. Both use Bayesian optimization (BO) over the learned latent space.</p>
<p><strong>Setup.</strong> After training each VAE, the authors encode training data into latent vectors and train a sparse Gaussian process (SGP) with 500 inducing points to predict properties from latent representations. They then run batch BO with expected improvement, selecting 50 candidates per iteration.</p>
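<p>The expected-improvement acquisition used to score candidate latent points has a closed form under the GP&rsquo;s Gaussian posterior. A minimal sketch (maximization form; the batch and Kriging Believer machinery are omitted):</p>

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI acquisition at a candidate point with GP posterior N(mu, sigma^2),
    relative to the incumbent best value f_best (maximization form)."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - f_best) * cdf + sigma * pdf
```

<p>Candidates are ranked by EI and the top batch is kept (50 per iteration in the paper).</p>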
<h3 id="arithmetic-expressions">Arithmetic Expressions</h3>
<ul>
<li><strong>Data</strong>: 100,000 randomly generated univariate expressions from a simple grammar (3 binary operators, 2 unary operators, 3 constants), each with at most 15 production rules</li>
<li><strong>Target</strong>: Find an expression minimizing $\log(1 + \text{MSE})$ against the true function $1/3 + x + \sin(x \cdot x)$</li>
<li><strong>BO iterations</strong>: 5, averaged over 10 repetitions</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.99 +/- 0.01</td>
          <td>3.47 +/- 0.24</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.86 +/- 0.06</td>
          <td>4.75 +/- 0.25</td>
      </tr>
  </tbody>
</table>
<p>The GVAE&rsquo;s best expression ($x/1 + \sin(3) + \sin(x \cdot x)$, score 0.04) nearly exactly recovers the true function, while the CVAE&rsquo;s best ($x \cdot 1 + \sin(3) + \sin(3/1)$, score 0.39) misses the sinusoidal component.</p>
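<p>The reported scores are easy to reproduce for the GVAE expression, since its error against the true function is a constant offset $\sin(3) - 1/3$ (the evaluation grid below is an assumption, but the MSE is grid-independent for a constant offset):</p>

```python
import math

def score(expr_fn, n=1000):
    """log(1 + MSE) against the true function 1/3 + x + sin(x*x),
    evaluated on an assumed grid over [-10, 10]."""
    xs = [-10.0 + 20.0 * i / (n - 1) for i in range(n)]
    true = lambda x: 1.0 / 3.0 + x + math.sin(x * x)
    mse = sum((expr_fn(x) - true(x)) ** 2 for x in xs) / n
    return math.log(1.0 + mse)

gvae_best = lambda x: x / 1 + math.sin(3) + math.sin(x * x)
cvae_best = lambda x: x * 1 + math.sin(3) + math.sin(3 / 1)

round(score(gvae_best), 2)   # 0.04, matching the reported GVAE score
```

<p>The CVAE expression&rsquo;s score is dominated by the missing $\sin(x \cdot x)$ term, which contributes roughly 0.5 to the MSE on this grid.</p>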
<h3 id="molecular-optimization">Molecular Optimization</h3>
<ul>
<li><strong>Data</strong>: 250,000 SMILES strings from the ZINC database</li>
<li><strong>Target</strong>: Maximize penalized logP (water-octanol partition coefficient penalized for ring size and synthetic accessibility)</li>
<li><strong>BO iterations</strong>: 10, averaged over 5 trials</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.31 +/- 0.07</td>
          <td>-9.57 +/- 1.77</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.17 +/- 0.05</td>
          <td>-54.66 +/- 2.66</td>
      </tr>
  </tbody>
</table>
<p>The GVAE produces roughly twice as many valid molecules as the CVAE and finds molecules with substantially better penalized logP scores (best: 2.94 vs. 1.98).</p>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>Interpolation experiments show that the GVAE produces valid outputs at every intermediate point when linearly interpolating between two encoded expressions, while the CVAE passes through invalid strings. Grid searches around encoded molecules in the GVAE latent space show smooth transitions where neighboring points differ by single atoms.</p>
<h3 id="predictive-performance">Predictive Performance</h3>
<p>Sparse GP models trained on GVAE latent features achieve better test RMSE and log-likelihood than those trained on CVAE features for both expressions and molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE (Expressions)</th>
          <th>CVAE (Expressions)</th>
          <th>GVAE (Molecules)</th>
          <th>CVAE (Molecules)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test LL</td>
          <td>-1.320 +/- 0.001</td>
          <td>-1.397 +/- 0.003</td>
          <td>-1.739 +/- 0.004</td>
          <td>-1.812 +/- 0.004</td>
      </tr>
      <tr>
          <td>Test RMSE</td>
          <td>0.884 +/- 0.002</td>
          <td>0.975 +/- 0.004</td>
          <td>1.404 +/- 0.006</td>
          <td>1.504 +/- 0.006</td>
      </tr>
  </tbody>
</table>
<h3 id="reconstruction-and-prior-sampling">Reconstruction and Prior Sampling</h3>
<p>On held-out molecules, the GVAE achieves 53.7% reconstruction accuracy vs. 44.6% for the CVAE. When sampling from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, 7.2% of GVAE samples are valid molecules vs. 0.7% for the CVAE.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<p><strong>Key findings.</strong> Incorporating grammar structure into the VAE decoder consistently improves validity rates, latent space smoothness, downstream predictive performance, and Bayesian optimization outcomes across both domains. The approach is general: any domain with a context-free grammar can benefit.</p>
<p><strong>Limitations acknowledged by the authors.</strong></p>
<ul>
<li>The GVAE guarantees syntactic but not semantic validity. For molecules, invalid ring-bond patterns and chemically implausible structures can still be generated.</li>
<li>The molecular validity rate during BO (31%) is substantially higher than the CVAE (17%) but still means most decoded molecules are invalid, largely due to non-context-free constraints in SMILES.</li>
<li>The approach requires a context-free grammar for the target domain, which limits applicability to well-defined formal languages.</li>
<li>Sequences that do not complete parsing within $T_{max}$ timesteps are discarded as invalid.</li>
</ul>
<p><strong>Impact.</strong> The GVAE was an influential early contribution to constrained molecular generation. It directly inspired the Syntax-Directed VAE (SD-VAE) by Dai et al. (2018), which uses attribute grammars for tighter semantic constraints, and contributed to the broader movement toward structured molecular generation methods including graph-based approaches. The paper demonstrated that encoding domain knowledge into the decoder architecture is more effective than relying on the model to learn structural constraints from data alone.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (expressions)</td>
          <td>Generated arithmetic expressions</td>
          <td>100,000</td>
          <td>Up to 15 production rules each</td>
      </tr>
      <tr>
          <td>Training (molecules)</td>
          <td>ZINC database subset</td>
          <td>250,000 SMILES</td>
          <td>Same subset as <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 1D convolutional neural network over one-hot rule sequences</li>
<li>Decoder: RNN with stack-based grammar masking</li>
<li>Latent space: 56 dimensions (molecules), isotropic Gaussian prior</li>
<li>Property predictor: Sparse Gaussian process with 500 inducing points</li>
<li>Optimization: Batch Bayesian optimization with expected improvement, 50 candidates per iteration, Kriging Believer for batch selection</li>
</ul>
<h3 id="models">Models</h3>
<p>Architecture details follow <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> with modifications for grammar-based encoding/decoding. Specific layer sizes and hyperparameters are described in the supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE</th>
          <th>CVAE</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid (expressions)</td>
          <td>0.99</td>
          <td>0.86</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Fraction valid (molecules)</td>
          <td>0.31</td>
          <td>0.17</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Best penalized logP</td>
          <td>2.94</td>
          <td>1.98</td>
          <td>Best molecule found</td>
      </tr>
      <tr>
          <td>Reconstruction accuracy</td>
          <td>53.7%</td>
          <td>44.6%</td>
          <td>On held-out molecules</td>
      </tr>
      <tr>
          <td>Prior validity</td>
          <td>7.2%</td>
          <td>0.7%</td>
          <td>Sampling from N(0,I)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mkusner/grammarVAE">grammarVAE</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kusner, M. J., Paige, B., &amp; Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. <em>Proceedings of the 34th International Conference on Machine Learning (ICML)</em>, 1945-1954.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kusner2017grammar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Grammar Variational Autoencoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kusner, Matt J. and Paige, Brooks and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 34th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1945--1954}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VAE for Automatic Chemical Design (2018 Seminal)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</guid><description>A variational autoencoder maps SMILES strings to a continuous latent space, enabling gradient-based optimization for molecular design and generation.</description><content:encoded><![CDATA[<h2 id="a-foundational-method-for-continuous-molecular-representation">A Foundational Method for Continuous Molecular Representation</h2>
<p>This is a <strong>Method</strong> paper that introduces a variational autoencoder (VAE) framework for mapping discrete molecular representations (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings) into a continuous latent space. The primary contribution is demonstrating that this continuous representation enables three key capabilities: (1) automatic generation of novel molecules by decoding random or perturbed latent vectors, (2) smooth interpolation between molecules in latent space, and (3) gradient-based optimization of molecular properties using a jointly trained property predictor. This work is widely regarded as one of the earliest and most influential applications of deep generative models to molecular design.</p>
<h2 id="the-challenge-of-searching-discrete-chemical-space">The Challenge of Searching Discrete Chemical Space</h2>
<p>Molecular design is fundamentally an optimization problem: identify molecules that maximize some set of desirable properties. The search space is enormous (estimated $10^{23}$ to $10^{60}$ drug-like molecules) and discrete, making systematic exploration difficult. Prior approaches fell into two categories, each with significant limitations:</p>
<ol>
<li><strong>Virtual screening</strong> over fixed libraries: effective, but limited to compounds that have already been enumerated, costly to scale, and reliant on hand-crafted rules to avoid impractical chemistries.</li>
<li><strong>Discrete local search</strong> (e.g., genetic algorithms): requires manual specification of mutation and crossover heuristics, and cannot leverage gradient information to guide the search.</li>
</ol>
<p>The core insight is that mapping molecules into a continuous vector space sidesteps these problems entirely. In a continuous space, new compounds can be generated by vector perturbation (no hand-crafted mutation rules), optimization can follow property gradients (enabling larger and more directed jumps), and large unlabeled chemical databases can be leveraged through unsupervised representation learning.</p>
<h2 id="a-vae-architecture-for-smiles-strings-with-joint-property-prediction">A VAE Architecture for SMILES Strings with Joint Property Prediction</h2>
<p>The architecture consists of three coupled neural networks trained jointly:</p>
<ol>
<li>
<p><strong>Encoder</strong>: Converts SMILES character strings into fixed-dimensional continuous vectors (the latent representation). Uses three 1D convolutional layers followed by a fully connected layer. For ZINC molecules, the latent space has 196 dimensions; for QM9, 156 dimensions.</p>
</li>
<li>
<p><strong>Decoder</strong>: Converts latent vectors back into SMILES strings character by character using three layers of gated recurrent units (GRUs). The output is stochastic, as each character is sampled from a probability distribution over the SMILES alphabet.</p>
</li>
<li>
<p><strong>Property Predictor</strong>: A multilayer perceptron that predicts molecular properties directly from the latent representation. Joint training with the autoencoder reconstruction loss organizes the latent space so that molecules with similar properties cluster together.</p>
</li>
</ol>
<h3 id="the-vae-objective">The VAE Objective</h3>
<p>The model uses the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder framework of Kingma and Welling</a>. The training objective combines three terms:</p>
<p>$$\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z)) + \lambda \cdot \mathcal{L}_{prop}$$</p>
<p>where $\mathcal{L}_{recon}$ is the reconstruction loss (cross-entropy over SMILES characters), $D_{KL}$ is the KL divergence regularizer that encourages the latent distribution $q(z|x)$ to match a standard Gaussian prior $p(z)$, and $\mathcal{L}_{prop}$ is the property prediction regression loss. Both the variational loss and the property prediction loss are annealed in using a sigmoid schedule after 29 epochs over a total of 120 epochs of training.</p>
<p>The KL regularization is critical: it forces the decoder to handle a wider variety of latent points, preventing &ldquo;dead areas&rdquo; in latent space that would decode to invalid molecules.</p>
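<p>The KL term has a closed form for a diagonal-Gaussian posterior against the standard normal prior. A minimal numpy sketch of the three-term objective (the weights <code>beta</code> and <code>lam</code> here are illustrative constants, not the paper's annealed schedules):</p>

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over latent dimensions, averaged over the batch."""
    return float(np.mean(np.sum(
        0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar), axis=1)))

def total_loss(recon, kl, prop, beta=1.0, lam=1.0):
    # L = L_recon + beta * D_KL + lambda * L_prop
    return recon + beta * kl + lam * prop

mu = np.zeros((4, 196))      # a posterior that exactly matches the prior...
logvar = np.zeros((4, 196))  # ...contributes zero KL
assert kl_to_standard_normal(mu, logvar) == 0.0
```

<p>When the annealing weight on the KL term is near zero early in training, the model behaves like a plain autoencoder; ramping it up pulls the aggregate posterior toward the Gaussian prior.</p>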
<h3 id="gradient-based-optimization">Gradient-Based Optimization</h3>
<p>After training, a Gaussian process (GP) surrogate model is fit on top of the latent representations to predict the target property. Optimization proceeds by:</p>
<ol>
<li>Encoding a seed molecule into the latent space</li>
<li>Using the GP model to define a smooth property surface over the latent space</li>
<li>Optimizing the latent vector $z$ to maximize the predicted property via gradient ascent</li>
<li>Decoding the optimized $z$ back into a SMILES string</li>
</ol>
<p>The objective used for demonstration is $5 \times \text{QED} - \text{SAS}$, balancing drug-likeness (QED) against synthetic accessibility (SAS).</p>
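<p>Steps 2 and 3 can be sketched with the standard RBF-kernel GP posterior mean and its analytic gradient. Everything here (the toy 2-D latent points, the synthetic property, the step size) is invented for illustration; the paper's GP operates on the 196-dimensional learned latent space:</p>

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel between row vectors in a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))               # toy "latent" training points
y = Z[:, 0] + 0.01 * rng.normal(size=50)   # synthetic property to maximize

K = rbf(Z, Z) + 1e-2 * np.eye(len(Z))      # GP with small observation noise
alpha = np.linalg.solve(K, y)              # posterior-mean weights

def mean(z):
    return float((rbf(z[None], Z) @ alpha)[0])

def grad_mean(z, ls=1.0):
    # d/dz of sum_i alpha_i * exp(-||z - z_i||^2 / (2 ls^2))
    k = rbf(z[None], Z)[0]
    return ((Z - z) * (alpha * k)[:, None]).sum(0) / ls**2

z = np.zeros(2)                            # step 1: encoded seed molecule
z0_val = mean(z)
for _ in range(200):                       # step 3: gradient ascent
    g = grad_mean(z)
    step = 0.1 * g / (np.linalg.norm(g) + 1e-12)
    if mean(z + step) <= mean(z):
        break                              # a fixed-size step no longer helps
    z = z + step
# step 4 would decode the optimized z back into a SMILES string
```

<p>Because the GP mean is smooth in $z$, its gradient is available in closed form, which is exactly what discrete search methods lack.</p>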
<h2 id="experiments-on-zinc-and-qm9-datasets">Experiments on ZINC and QM9 Datasets</h2>
<p>Two autoencoder systems were trained:</p>
<ul>
<li><strong>ZINC</strong>: 250,000 drug-like molecules from the ZINC database, with a 196-dimensional latent space. Properties predicted: logP, QED, SAS.</li>
<li><strong>QM9</strong>: 108,000 molecules with up to 9 heavy atoms, with a 156-dimensional latent space. Properties predicted: HOMO energy, LUMO energy, electronic spatial extent ($\langle R^2 \rangle$).</li>
</ul>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>The encoded latent dimensions follow approximately normal distributions as enforced by the variational regularizer. Decoding is stochastic: sampling the same latent point multiple times yields different SMILES strings, with the most frequent decoding tending to be closest to the original point in latent space. Decoding validity rates are 73-79% for points near known molecules but only 4% for randomly selected latent points.</p>
<p>Spherical interpolation (slerp) between molecules in latent space produces smooth structural transitions, accounting for the geometry of high-dimensional Gaussian distributions where linear interpolation would pass through low-probability regions.</p>
<h3 id="molecular-generation-comparison">Molecular Generation Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Dataset</th>
          <th>Samples</th>
          <th>logP</th>
          <th>SAS</th>
          <th>QED</th>
          <th>% in ZINC</th>
          <th>% in eMolecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>ZINC</td>
          <td>249k</td>
          <td>2.46 (1.43)</td>
          <td>3.05 (0.83)</td>
          <td>0.73 (0.14)</td>
          <td>100</td>
          <td>12.9</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>ZINC</td>
          <td>5303</td>
          <td>2.84 (1.86)</td>
          <td>3.80 (1.01)</td>
          <td>0.57 (0.20)</td>
          <td>6.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>ZINC</td>
          <td>8728</td>
          <td>2.67 (1.46)</td>
          <td>3.18 (0.86)</td>
          <td>0.70 (0.14)</td>
          <td>5.8</td>
          <td>7.0</td>
      </tr>
      <tr>
          <td>Data</td>
          <td>QM9</td>
          <td>134k</td>
          <td>0.30 (1.00)</td>
          <td>4.25 (0.94)</td>
          <td>0.48 (0.07)</td>
          <td>0.0</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>QM9</td>
          <td>5470</td>
          <td>0.96 (1.53)</td>
          <td>4.47 (1.01)</td>
          <td>0.53 (0.13)</td>
          <td>0.018</td>
          <td>3.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>QM9</td>
          <td>2839</td>
          <td>0.30 (0.97)</td>
          <td>4.34 (0.98)</td>
          <td>0.47 (0.08)</td>
          <td>0.0</td>
          <td>8.9</td>
      </tr>
  </tbody>
</table>
<p>The VAE generates molecules whose property distributions closely match the training data, outperforming a genetic algorithm baseline that biases toward higher chemical complexity and decreased drug-likeness. Only 5.8% of VAE-generated ZINC molecules were found in the original ZINC database, indicating genuine novelty.</p>
<h3 id="property-prediction">Property Prediction</h3>
<table>
  <thead>
      <tr>
          <th>Dataset/Property</th>
          <th>Mean Baseline</th>
          <th>ECFP</th>
          <th>Graph Conv.</th>
          <th>1-hot SMILES</th>
          <th>Encoder Only</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC/logP</td>
          <td>1.14</td>
          <td>0.38</td>
          <td>0.05</td>
          <td>0.16</td>
          <td>0.13</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>ZINC/QED</td>
          <td>0.112</td>
          <td>0.045</td>
          <td>0.017</td>
          <td>0.041</td>
          <td>0.037</td>
          <td>0.054</td>
      </tr>
      <tr>
          <td>QM9/HOMO (eV)</td>
          <td>0.44</td>
          <td>0.20</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/LUMO (eV)</td>
          <td>1.05</td>
          <td>0.20</td>
          <td>0.15</td>
          <td>0.11</td>
          <td>0.14</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/Gap (eV)</td>
          <td>1.07</td>
          <td>0.30</td>
          <td>0.18</td>
          <td>0.16</td>
          <td>0.18</td>
          <td>0.21</td>
      </tr>
  </tbody>
</table>
<p>The VAE latent representation achieves property prediction accuracy comparable to graph convolutions for some properties, though graph convolutions generally perform best. The primary purpose of joint training is not to maximize prediction accuracy but to organize the latent space for optimization.</p>
<h3 id="optimization-results">Optimization Results</h3>
<p>Bayesian optimization with a GP model on the jointly trained latent space consistently produces molecules with higher percentile scores on the $5 \times \text{QED} - \text{SAS}$ objective compared to both random Gaussian search and genetic algorithm baselines. Starting from molecules in the bottom 10th percentile of the ZINC dataset, the optimizer reliably discovers molecules in regions of high objective value. Training the GP with 1000 molecules (vs. 2000) produces a wider diversity of solutions by optimizing to multiple local optima rather than a single global optimum.</p>
<h2 id="key-findings-limitations-and-legacy">Key Findings, Limitations, and Legacy</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>A continuous latent representation of molecules enables gradient-based search through chemical space, a qualitatively different approach from discrete enumeration or genetic algorithms.</li>
<li>Joint training with property prediction organizes the latent space by property values, creating smooth gradients that optimization can follow.</li>
<li>The VAE generates novel molecules with realistic property distributions, and the latent space encodes an estimated 7.5 million molecules despite training on only 250,000.</li>
</ul>
<h3 id="acknowledged-limitations">Acknowledged Limitations</h3>
<ul>
<li>The SMILES-based decoder sometimes produces formally valid but chemically undesirable molecules (acid chlorides, anhydrides, cyclopentadienes, aziridines, etc.) because the grammar of valid SMILES does not capture all synthetic or stability constraints.</li>
<li>Character-level SMILES generation is fragile: the decoder must implicitly learn which strings are valid SMILES, making the learning problem harder than necessary.</li>
<li>Decoding validity drops to only 4% for random latent points far from training data, limiting the ability to explore truly novel regions of chemical space.</li>
</ul>
<h3 id="directions-identified">Directions Identified</h3>
<p>The authors point to several extensions that were already underway at the time of publication:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></strong>: Using an explicitly defined SMILES grammar instead of forcing the model to learn one (Kusner et al., 2017).</li>
<li><strong>Graph-based decoders</strong>: Directly outputting molecular graphs to avoid the SMILES validity problem.</li>
<li><strong>Adversarial training</strong>: Using GANs for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN, ORGANIC</a>).</li>
<li><strong>LSTM/RNN generators</strong>: Applying recurrent networks directly to SMILES for generation and reaction prediction.</li>
</ul>
<p>This paper has been cited over 2,900 times and launched a large body of follow-up work in VAE-based, GAN-based, and reinforcement learning-based molecular generation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ZINC (drug-like subset)</td>
          <td>250,000 molecules</td>
          <td>Randomly sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>QM9</td>
          <td>108,000 molecules</td>
<td>Molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ZINC held-out set</td>
          <td>5,000 molecules</td>
          <td>For latent space analysis</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Encoder</strong>: 3 x 1D convolutional layers (ZINC: filters 9,9,10 with kernels 9,9,11; QM9: filters 2,2,1 with kernels 5,5,4), followed by a fully connected layer</li>
<li><strong>Decoder</strong>: 3 x GRU layers (ZINC: hidden dim 488; QM9: hidden dim 500), trained with teacher forcing</li>
<li><strong>Property Predictor</strong>: 2 fully connected layers of 1000 neurons (dropout 0.20) for prediction; smaller 3-layer MLP of 67 neurons (dropout 0.15) for latent space shaping</li>
<li><strong>Variational loss annealing</strong>: Sigmoid schedule after 29 epochs, total 120 epochs</li>
<li><strong>SMILES validation</strong>: Post-hoc filtering with RDKit; invalid outputs discarded</li>
<li><strong>Optimization</strong>: Gaussian process surrogate model trained on 2000 maximally diverse molecules from latent space</li>
</ul>
<h3 id="models">Models</h3>
<p>Built with Keras and TensorFlow. Latent dimensions: 196 (ZINC), 156 (QM9). SMILES alphabet: 35 characters (ZINC), 22 characters (QM9). Maximum string length: 120 (ZINC), 34 (QM9). Only canonicalized SMILES were used for training.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>Water-octanol partition coefficient</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimation of Drug-likeness (0-1)</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>Synthetic Accessibility Score</td>
      </tr>
      <tr>
          <td>HOMO/LUMO (eV)</td>
          <td>Frontier orbital energies (QM9)</td>
      </tr>
      <tr>
          <td>Decoding validity</td>
          <td>Fraction of latent points producing valid SMILES</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on the Harvard FAS Odyssey Cluster. Specific GPU types and training times are not reported. The Gaussian process optimization requires only minutes to train on a few thousand molecules.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/chemical_vae">chemical_vae</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with training scripts and pre-trained models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., &amp; Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <em>ACS Central Science</em>, 4(2), 268-276. <a href="https://doi.org/10.1021/acscentsci.7b00572">https://doi.org/10.1021/acscentsci.7b00572</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gomez2018automatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{G{\&#39;o}mez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and S{\&#39;a}nchez-Lengeling, Benjam{\&#39;i}n and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{268--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acscentsci.7b00572}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PASITHEA: Gradient-Based Molecular Design via Dreaming</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</guid><description>PASITHEA applies inceptionism to molecular design, using gradient-based optimization on SELFIES representations to generate molecules with target properties.</description><content:encoded><![CDATA[<h2 id="inceptionism-applied-to-molecular-inverse-design">Inceptionism Applied to Molecular Inverse Design</h2>
<p>This is a <strong>Method</strong> paper that introduces PASITHEA, a gradient-based approach to de novo molecular design inspired by inceptionism (deep dreaming) techniques from computer vision. The core contribution is a direct optimization framework that modifies molecular structures by backpropagating through a trained property-prediction network, with the molecular input (rather than weights) serving as the optimizable variable. PASITHEA is enabled by SELFIES, a surjective molecular string representation that guarantees 100% validity of generated molecules.</p>
<h2 id="the-need-for-direct-gradient-based-molecular-optimization">The Need for Direct Gradient-Based Molecular Optimization</h2>
<p>Existing inverse molecular design methods, including variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), and genetic algorithms (GAs), share a common characteristic: they optimize molecules indirectly. VAEs and GANs learn distributions and scan latent spaces. RL agents learn policies from environmental rewards. GAs iteratively apply mutations and selections. None of these approaches directly maximize an objective function in a gradient-based manner with respect to the molecular representation itself.</p>
<p>This indirection has several consequences. VAE-based methods require learning a latent space, and the optimization happens in that space rather than directly on molecular structures. RL and GA methods require expensive function evaluations for each candidate molecule. The authors identify an opportunity to exploit gradients more directly by reversing the learning process of a neural network trained to predict molecular properties, thereby sidestepping latent spaces, policies, and population-based search entirely.</p>
<p>A second motivation is interpretability. By operating directly on the molecular representation (rather than a learned latent space), PASITHEA can reveal what a regression network has learned about structure-property relationships, a capability the authors frame as analogous to how deep dreaming reveals what image classifiers have learned about visual features.</p>
<h2 id="core-innovation-inverting-regression-networks-on-selfies">Core Innovation: Inverting Regression Networks on SELFIES</h2>
<p>PASITHEA&rsquo;s key insight is a two-phase training procedure that repurposes the standard neural network training loop for molecule generation.</p>
<p><strong>Phase 1: Prediction training.</strong> A fully connected neural network is trained to predict a real-valued chemical property (logP) from one-hot encoded SELFIES strings. The standard feedforward and backpropagation process updates the network weights to minimize mean squared error between predicted and ground-truth property values:</p>
<p>$$
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} (f_{\theta}(\mathbf{x}_i) - y_i)^2
$$</p>
<p>where $f_{\theta}$ is the neural network with parameters $\theta$, $\mathbf{x}_i$ is the one-hot encoded SELFIES input, and $y_i$ is the target logP value.</p>
<p><strong>Phase 2: Inverse training (deep dreaming).</strong> The network weights $\theta$ are frozen. For a given input molecule $\mathbf{x}$ and a desired target property value $y_{\text{target}}$, the gradients are computed with respect to the input representation rather than the weights:</p>
<p>$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} \mathcal{L}(f_{\theta}(\mathbf{x}), y_{\text{target}})
$$</p>
<p>This gradient descent on the input incrementally modifies the one-hot encoding of the molecular string, transforming it toward a structure whose predicted property matches the target value. At each step, the argmax function converts the continuous one-hot encoding back to a discrete SELFIES string, which always maps to a valid molecular graph due to the surjective property of SELFIES.</p>
<p><strong>The role of SELFIES.</strong> The surjective mapping from strings to molecular graphs is essential. With SMILES, intermediate strings during optimization can become syntactically invalid (e.g., an unclosed ring like &ldquo;CCCC1CCCCC&rdquo;), producing no valid molecule. SELFIES enforces constraints that guarantee every string maps to a valid molecular graph, making the continuous gradient-based optimization feasible.</p>
<p><strong>Input noise injection.</strong> Because inverse training transforms a one-hot encoding from binary values to real numbers, the discrete-to-continuous transition can cause convergence problems. The authors address this by initializing the input with noise: every zero in the one-hot encoding is replaced by a random number in $[0, k]$, where $k$ is a hyperparameter between 0.5 and 0.95. This smooths the optimization landscape and enables incremental molecular modifications rather than abrupt changes.</p>
<h2 id="experimental-setup-on-qm9-with-logp-optimization">Experimental Setup on QM9 with LogP Optimization</h2>
<h3 id="dataset-and-property">Dataset and Property</h3>
<p>The experiments use a random subset of 10,000 molecules from the QM9 dataset. The target property is the logarithm of the partition coefficient (logP), computed using RDKit. LogP measures lipophilicity, an important drug-likeness indicator that follows an approximately normal distribution in QM9 and has a nearly continuous range, making it suitable for gradient-based optimization.</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>PASITHEA uses a fully connected neural network with four layers, each containing 500 nodes with ReLU activation. The loss function is mean squared error. Data is split 85%/15% for training/testing. The prediction model trains for approximately 1,500 epochs with an Adam optimizer and a learning rate of $1 \times 10^{-6}$.</p>
<p>For inverse training, the authors select a noise upper-bound of 0.9 and a learning rate of 0.01, chosen from hyperparameter tuning experiments that evaluate the percentage of molecules optimized toward the target property.</p>
<h3 id="optimization-targets">Optimization Targets</h3>
<p>Two extreme logP targets are used: $+6$ (high lipophilicity) and $-6$ (low lipophilicity). These values exceed the range of logP values in the QM9 dataset (minimum: $-2.19$, maximum: $3.08$), testing whether the model can extrapolate beyond the training distribution.</p>
<h2 id="distribution-shifts-and-interpretable-molecular-transformations">Distribution Shifts and Interpretable Molecular Transformations</h2>
<h3 id="distribution-level-results">Distribution-Level Results</h3>
<p>Applying deep dreaming to the full set of 10,000 molecules produces a clear shift in the logP distribution:</p>
<table>
  <thead>
      <tr>
          <th>Statistic</th>
          <th>QM9 Original</th>
          <th>Optimized (target +6)</th>
          <th>Optimized (target -6)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean logP</td>
          <td>0.3909</td>
          <td>1.8172</td>
          <td>-0.3360</td>
      </tr>
      <tr>
          <td>Min logP</td>
          <td>-2.1903</td>
          <td>-0.8240</td>
          <td>-2.452</td>
      </tr>
      <tr>
          <td>Max logP</td>
          <td>3.0786</td>
          <td>4.2442</td>
          <td>0.9018</td>
      </tr>
  </tbody>
</table>
<p>The optimized distributions extend beyond the original dataset&rsquo;s property range. The right-shifted distribution (target +6) produces molecules with logP values up to 4.24, exceeding the original maximum of 3.08. The left-shifted distribution (target -6) reaches -2.45, below the original minimum. This indicates that PASITHEA can generate molecules with properties outside the training data bounds.</p>
<p>Additionally, 97.2% of the generated molecules do not exist in the original training set, indicating that the network is not memorizing data but rather using structural features to guide optimization. Some generated molecules contain more heavy atoms than the QM9 maximum of 9, since the SELFIES string length allows for larger structures.</p>
<h3 id="molecule-level-interpretability">Molecule-Level Interpretability</h3>
<p>The stepwise molecular transformations reveal interpretable &ldquo;strategies&rdquo; the network employs:</p>
<ol>
<li>
<p><strong>Nitrogen appendage</strong>: When optimizing for lower logP, the network repeatedly appends nitrogen atoms to the molecule. The authors observe this as a consistent pattern across multiple test molecules, reflecting the known relationship between nitrogen content and reduced lipophilicity.</p>
</li>
<li>
<p><strong>Length modulation</strong>: When optimizing for higher logP, the network tends to increase molecular chain length (e.g., extending a carbon chain). When optimizing for lower logP, it shortens chains. This captures the intuition that larger, more carbon-heavy molecules tend to be more lipophilic.</p>
</li>
<li>
<p><strong>Bond order changes</strong>: The network replaces single bonds with double or triple bonds during optimization, demonstrating an understanding of the relationship between bonding patterns and logP.</p>
</li>
<li>
<p><strong>Consistency across trials</strong>: Because the input initialization includes random noise, repeated trials with the same molecule produce different transformation sequences. Despite this stochasticity, the network applies consistent strategies across trials (e.g., always shortening chains for negative optimization), validating that it has learned genuine structure-property relationships.</p>
</li>
</ol>
<h3 id="thermodynamic-stability">Thermodynamic Stability</h3>
<p>The authors assess synthesizability by computing heats of formation using MOPAC2016 at the PM7 level of theory. Some optimization trajectories move toward thermodynamically stable molecules (negative heats of formation), while others produce less stable structures. The authors acknowledge this limitation and propose multi-objective optimization incorporating stability as a future direction.</p>
<h3 id="comparison-to-vaes">Comparison to VAEs</h3>
<p>The key distinction from VAEs is where gradient computation occurs. In VAEs, a latent space is learned through encoding and decoding, and property optimization happens in that latent space. In PASITHEA, gradients are computed directly with respect to the molecular representation (SELFIES one-hot encoding). The authors argue this makes the approach more interpretable, since we can probe what the network learned about molecular structure without the &ldquo;detour&rdquo; through a latent space.</p>
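<p>The contrast can be made concrete with a toy version of inverse training. In this sketch the frozen &ldquo;predictor&rdquo; is just a fixed linear map <code>W</code> (a stand-in for the trained MLP), and gradient descent updates the continuous input encoding <code>x</code> itself, not a latent code, toward a target property value. All names and values are illustrative assumptions:</p>

```python
import random

rng = random.Random(0)
W = [rng.gauss(0, 1) for _ in range(12)]      # frozen "predictor" weights
x = [rng.uniform(0, 0.9) for _ in range(12)]  # noise-initialized input encoding
target, lr = 6.0, 0.01

def predict(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

# Gradient of (predict(W, x) - target)^2 w.r.t. x is 2 * err * W,
# so each step nudges the input representation directly.
for _ in range(500):
    err = predict(W, x) - target
    x = [xi - lr * 2 * err * wi for xi, wi in zip(x, W)]

final_pred = predict(W, x)
```

Because the gradient lands on the input representation, each intermediate <code>x</code> can be inspected (after discretization back to SELFIES) as an interpretable step of the transformation.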
<h3 id="limitations">Limitations</h3>
<p>The authors are forthright about the preliminary nature of these results:</p>
<ul>
<li>The method is demonstrated only on a small subset of QM9 with a single, computationally inexpensive property (logP).</li>
<li>The simple four-layer architecture may not scale to larger molecular spaces or more complex properties.</li>
<li>Generated molecules are not always thermodynamically stable, requiring additional optimization objectives.</li>
<li>The approach has not been benchmarked against established methods (VAEs, GANs, RL) on standard generative benchmarks.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>QM9 (random subset)</td>
          <td>10,000 molecules</td>
          <td>logP values computed via RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prediction training</strong>: 4-layer fully connected NN, 500 nodes/layer, ReLU activation, MSE loss, Adam optimizer, LR $1 \times 10^{-6}$, ~1,500 epochs, 85/15 train/test split</li>
<li><strong>Inverse training</strong>: Frozen weights, Adam optimizer, LR 0.01, noise upper-bound 0.9, logP targets of +6 and -6</li>
<li><strong>Heats of formation</strong>: MOPAC2016, PM7 level, geometry optimization with eigenvector following (EF)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a simple 4-layer MLP. No pre-trained weights are distributed, but the full code is available.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Novel molecules</td>
          <td>97.2%</td>
          <td>Generated molecules not in training set</td>
      </tr>
      <tr>
          <td>Max logP (target +6)</td>
          <td>4.2442</td>
          <td>Exceeds QM9 max of 3.0786</td>
      </tr>
      <tr>
          <td>Min logP (target -6)</td>
          <td>-2.452</td>
          <td>Below QM9 min of -2.1903</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Pasithea">Pasithea</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <em>Machine Learning: Science and Technology</em>, 2(3), 03LT02. <a href="https://doi.org/10.1088/2632-2153/ac09d6">https://doi.org/10.1088/2632-2153/ac09d6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2021deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Cynthia and Krenn, Mario and Eppel, Sagi and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{03LT02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/ac09d6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD, takes over a decade, and succeeds less than 10% of the time. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
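<p>The unit convention matters here: pIC50 is the negative base-10 logarithm of the IC50 expressed in molar concentration, so a hypothetical 10 nM binder scores a pIC50 of 8. A one-line helper (illustrative, not from the paper&rsquo;s code):</p>

```python
import math

def pic50(ic50_nanomolar):
    """pIC50 = -log10(IC50 in molar); the input is taken in nanomolar here."""
    return -math.log10(ic50_nanomolar * 1e-9)
```

Stronger binders (lower IC50) therefore receive higher pIC50 scores, which is the direction the regression model predicts.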
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes&rsquo; rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.</p>
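<p>The acceptance rule can be sketched with toy stand-ins: a unit Gaussian in place of the fitted mixture $Q_\xi(\mathbf{z})$ and sigmoid scores over latent coordinates in place of the trained attribute classifiers. Every name here is an illustrative assumption, not the paper&rsquo;s implementation:</p>

```python
import math
import random

rng = random.Random(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

accepted = []
for _ in range(5000):
    z = (rng.gauss(0, 1), rng.gauss(0, 1))   # stand-in for a draw from Q_xi(z)
    scores = (sigmoid(z[0]), sigmoid(z[1]))  # per-attribute classifier scores
    p_accept = scores[0] * scores[1]         # product over independent attributes
    if rng.random() < p_accept:
        accepted.append(z)                   # decode accepted z into molecules
```

Accepted samples are biased toward regions where every attribute classifier is confident, which is exactly the enrichment effect reported in the experiments.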
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
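<p>The formula transcribes directly (hypothetical values, pIC50-style units assumed):</p>

```python
def selectivity(ba_target, ba_offtargets):
    """Sel_{T,m} = BA(T, m) minus the mean affinity over k off-targets."""
    return ba_target - sum(ba_offtargets) / len(ba_offtargets)

# A molecule binding the target at pIC50 7.5 but three random
# off-targets at 5.0, 6.0, and 4.0 is selective by 2.5 log units.
sel = selectivity(7.5, [5.0, 6.0, 4.0])
```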
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds, as confirmed by a high Fréchet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were toxic in 0-1 endpoints out of 13, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LIMO: Latent Inceptionism for Targeted Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</guid><description>LIMO uses gradient-based optimization through a VAE latent space and stacked property predictor to generate drug-like molecules with high binding affinity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eckmann, P., Sun, K., Zhao, B., Feng, M., Gilson, M. K., &amp; Yu, R. (2022). LIMO: Latent Inceptionism for Targeted Molecule Generation. <em>Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</em>, PMLR 162, 5777&ndash;5792.</p>
<p><strong>Publication</strong>: ICML 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Rose-STL-Lab/LIMO">GitHub: Rose-STL-Lab/LIMO</a></li>
<li><a href="https://arxiv.org/abs/2206.09010">arXiv: 2206.09010</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{eckmann2022limo,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LIMO: Latent Inceptionism for Targeted Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5777--5792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">organization</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="gradient-based-reverse-optimization-in-molecular-latent-space">Gradient-Based Reverse Optimization in Molecular Latent Space</h2>
<p>This is a <strong>Method</strong> paper that introduces LIMO, a framework for generating molecules with desired properties using gradient-based optimization on a VAE latent space. The key innovation is a stacked architecture where a property predictor operates on the decoded molecular representation rather than directly on the latent space, combined with an inceptionism-like technique that backpropagates through the frozen decoder and predictor to optimize the latent code. This approach is 6-8x faster than RL baselines and 12x faster than sampling-based approaches while producing molecules with higher binding affinities.</p>
<h2 id="slow-property-optimization-in-existing-methods">Slow Property Optimization in Existing Methods</h2>
<p>Generating molecules with high binding affinity to target proteins is a central goal of early drug discovery, but existing computational approaches are slow when optimizing for properties that are expensive to evaluate (such as docking-based binding affinity). RL-based methods require many calls to the property function during training. Sampling-based approaches like MARS need hundreds of iterations. Latent optimization methods that predict properties directly from the latent space suffer from poor prediction accuracy because the mapping from latent space to molecular properties is difficult to learn.</p>
<h2 id="the-limo-framework">The LIMO Framework</h2>
<p>LIMO consists of three components: a VAE for learning a molecular latent space, a property predictor with a novel stacked architecture, and a gradient-based reverse optimization procedure.</p>
<h3 id="selfies-based-vae">SELFIES-Based VAE</h3>
<p>The VAE encodes molecules represented as SELFIES strings into a latent space $\mathbf{z} \in \mathbb{R}^m$ with $m = 1024$ and decodes to probability distributions over SELFIES symbols. Since every SELFIES string corresponds to a valid molecule, this guarantees 100% chemical validity. The output molecule is obtained by taking the argmax at each position:</p>
<p>$$\hat{x}_i = s_{d_i^*}, \quad d_i^* = \operatorname{argmax}_{d} \, y_{i,d}$$</p>
<p>The VAE uses fully-connected layers (not recurrent), with a 64-dimensional embedding layer, four batch-normalized linear layers (2000-dimensional first layer, 1000-dimensional for the rest) with ReLU activation, and is trained with ELBO loss (0.9 weight on reconstruction, 0.1 on KL divergence).</p>
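<p>The per-position argmax decoding amounts to picking the highest-probability symbol at each position. A toy illustration with an invented four-symbol alphabet (a stand-in for the real SELFIES symbol set):</p>

```python
alphabet = ["[C]", "[N]", "[O]", "[Branch1]"]  # illustrative, not the full SELFIES alphabet
probs = [
    [0.70, 0.10, 0.15, 0.05],  # position 1 -> [C]
    [0.10, 0.05, 0.80, 0.05],  # position 2 -> [O]
    [0.05, 0.85, 0.05, 0.05],  # position 3 -> [N]
]
# d_i^* = argmax_d y_{i,d}; x_hat_i is the symbol at that index
decoded = "".join(alphabet[max(range(len(row)), key=row.__getitem__)] for row in probs)
```

Because any symbol sequence over the SELFIES alphabet decodes to a valid molecule, this hard argmax step never produces syntax errors.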
<h3 id="stacked-property-predictor">Stacked Property Predictor</h3>
<p>The critical architectural choice: the property predictor $g_\theta$ takes the decoded molecular representation $\hat{\mathbf{x}}$ as input rather than the latent code $\mathbf{z}$. The predictor is trained after the VAE is frozen by minimizing MSE on VAE-generated molecules:</p>
<p>$$\ell_0(\theta) = \left\| g_\theta\left(f_{\text{dec}}(\mathbf{z})\right) - \pi\left(f_{\text{dec}}(\mathbf{z})\right) \right\|^2$$</p>
<p>where $\pi$ is the ground-truth property function. This stacking improves prediction accuracy from $r^2 = 0.04$ (predicting from $\mathbf{z}$) to $r^2 = 0.38$ (predicting from $\hat{\mathbf{x}}$) on an unseen test set. The improvement comes because the mapping from molecular space to property is easier to learn than the mapping from latent space to property.</p>
<h3 id="reverse-optimization-inceptionism">Reverse Optimization (Inceptionism)</h3>
<p>After training, the decoder and predictor weights are frozen and $\mathbf{z}$ becomes the trainable parameter. For multiple properties with weights $(w_1, \ldots, w_k)$, the optimization minimizes:</p>
<p>$$\ell_1(\mathbf{z}) = -\sum_{i=1}^{k} w_i \cdot g^i\left(f_{\text{dec}}(\mathbf{z})\right)$$</p>
<p>Since both the decoder and predictor are neural networks, gradients flow through the entire chain, enabling efficient optimization with Adam. This is analogous to the &ldquo;inceptionism&rdquo; (DeepDream) technique from computer vision, where network inputs are optimized to maximize specific outputs.</p>
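<p>Once everything is frozen, the reverse-optimization step reduces to a few lines of PyTorch. The tiny linear networks below stand in for the trained decoder and two property predictors (an assumption for illustration); only the latent code $\mathbf{z}$ receives gradient updates:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim = 8

# Stand-ins for the frozen decoder and two frozen property predictors.
decoder = nn.Linear(latent_dim, 12)
g1 = nn.Linear(12, 1)
g2 = nn.Linear(12, 1)
for m in (decoder, g1, g2):
    for p in m.parameters():
        p.requires_grad_(False)

w = (1.0, 0.5)                                       # property weights (w_1, w_2)
z = torch.randn(1, latent_dim, requires_grad=True)   # z is the trainable parameter
opt = torch.optim.Adam([z], lr=0.1)                  # lr 0.1, as in the paper's setup

def ell_1(z):
    x_hat = decoder(z)
    return -(w[0] * g1(x_hat) + w[1] * g2(x_hat)).sum()

before = ell_1(z).item()
for _ in range(100):
    opt.zero_grad()
    loss = ell_1(z)
    loss.backward()    # gradients flow through predictor and decoder into z
    opt.step()
after = ell_1(z).item()
print(after < before)  # the weighted property objective improves
```

<p>Freezing the parameters does not block gradients flowing <em>through</em> the networks back to $\mathbf{z}$; it only prevents the weights themselves from updating, which is exactly the inceptionism setup.</p>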
<h3 id="substructure-constrained-optimization">Substructure-Constrained Optimization</h3>
<p>For lead optimization, LIMO can fix a molecular substructure during optimization by adding a regularization term:</p>
<p>$$\ell_2(\mathbf{z}) = \lambda \sum_{i=1}^{n} \sum_{j=1}^{d} \left(M_{i,j} \cdot \left(f_{\text{dec}}(\mathbf{z})_{i,j} - (\hat{\mathbf{x}}_{\text{start}})_{i,j}\right)\right)^2$$</p>
<p>where $M$ is a binary mask specifying which SELFIES positions must remain unchanged and $\lambda = 1000$. This capability is enabled by the intermediate decoded representation, which most VAE-based methods lack.</p>
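<p>A NumPy sketch of the masked penalty, using toy arrays in place of real decoder outputs (the shapes and the mask pattern here are illustrative assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 6  # toy: n SELFIES positions, d vocabulary symbols

x_start = rng.random((n, d))                          # starting molecule (toy)
x_dec = x_start + 0.1 * rng.standard_normal((n, d))   # current decoder output

# Binary mask M: 1 where the substructure must stay fixed (here: first 2 rows).
M = np.zeros((n, d))
M[:2, :] = 1.0

lam = 1000.0  # lambda = 1000, as in the paper
ell_2 = lam * np.sum((M * (x_dec - x_start)) ** 2)

print(ell_2)
```

<p>Positions outside the mask contribute nothing, so the optimizer is free to change them while deviations inside the fixed substructure are heavily penalized.</p>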
<h2 id="experiments-and-results">Experiments and Results</h2>
<h3 id="benchmark-tasks-qed-and-penalized-logp">Benchmark Tasks (QED and Penalized LogP)</h3>
<p>LIMO achieves results competitive with deep generative and RL-based models in 1 hour, compared to 8&ndash;24 hours for baselines. Top QED score: 0.947 (out of a maximum possible 0.948). Top penalized LogP: 10.5 (among length-limited models; MolDQN reaches 11.8).</p>
<p>The ablation study (&ldquo;LIMO on z&rdquo;) confirms the value of the stacked predictor architecture: predicting from $\hat{\mathbf{x}}$ yields a top p-logP of 10.5, versus 6.52 when predicting directly from $\mathbf{z}$.</p>
<h3 id="binding-affinity-maximization">Binding Affinity Maximization</h3>
<p>This is the paper&rsquo;s primary contribution: LIMO generates molecules with substantially higher computed binding affinities (lower $K_D$) than baselines against two protein targets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>ESR1 best $K_D$ (nM)</th>
          <th>ACAA1 best $K_D$ (nM)</th>
          <th>Time (hrs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>6.4</td>
          <td>75</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MolDQN</td>
          <td>373</td>
          <td>240</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>17</td>
          <td>163</td>
          <td>6</td>
      </tr>
      <tr>
          <td>GraphDF</td>
          <td>25</td>
          <td>370</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>37</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p>For ESR1, LIMO&rsquo;s best molecule has a $K_D$ of 0.72 nM from docking, nearly 10x better than the next method (GCPN at 6.4 nM). When corroborated with more rigorous absolute binding free energy (ABFE) calculations, one LIMO compound achieved a predicted $K_D$ of $6 \times 10^{-14}$ M (0.00006 nM), far exceeding the affinities of approved drugs tamoxifen ($K_D$ = 1.5 nM) and raloxifene ($K_D$ = 0.03 nM).</p>
<h3 id="multi-objective-optimization">Multi-Objective Optimization</h3>
<p>Single-objective optimization produces molecules with high affinity but problematic structures (polyenes, large rings). Multi-objective optimization simultaneously targeting binding affinity, QED ($&gt;$ 0.4), and SA ($&lt;$ 5.5) produces drug-like, synthesizable molecules that still have nanomolar binding affinities. Generated molecules satisfy Lipinski&rsquo;s rule of 5 with zero PAINS alerts.</p>
<h2 id="limitations">Limitations</h2>
<p>The LIMO property predictor achieves only moderate prediction accuracy ($r^2$ = 0.38), so the optimization relies on the gradient direction being correct rather than on the absolute predictions being accurate. AutoDock-GPU docking scores do not correlate well with the more rigorous ABFE results, a known limitation of docking. The fully-connected VAE architecture limits molecular diversity compared to recurrent or attention-based alternatives (an LSTM decoder produced a max QED of only 0.3). The greedy fine-tuning step (replacing carbons with heteroatoms) is a heuristic rather than a learned procedure.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Rose-STL-Lab/LIMO">Rose-STL-Lab/LIMO</a></td>
          <td>Code</td>
          <td>UC San Diego Custom (non-commercial)</td>
          <td>Full training, optimization, and evaluation code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k dataset for optimization tasks. MOSES dataset for random generation evaluation. Binding affinities computed with AutoDock-GPU.</p>
<p><strong>Hardware</strong>: Two GTX 1080 Ti GPUs (one for PyTorch, one for AutoDock-GPU), 4 CPU cores, 32 GB memory.</p>
<p><strong>Training</strong>: VAE trained for 18 epochs with learning rate 0.0001. Property predictor uses 3 layers of 1000 units, trained for 5 epochs. Reverse optimization uses learning rate 0.1 for 10 epochs.</p>
<p><strong>Targets</strong>: Human estrogen receptor (ESR1, PDB 1ERR) and human peroxisomal acetyl-CoA acyl transferase 1 (ACAA1, PDB 2IIK).</p>
]]></content:encoded></item></channel></rss>