<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Latent-Space Generation on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/</link><description>Recent content in Latent-Space Generation on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 05 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/index.xml" rel="self" type="application/rss+xml"/><item><title>LatentGAN: Latent-Space GAN for Molecular Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/latentgan-de-novo-molecular-generation/</guid><description>LatentGAN combines a SMILES heteroencoder with a Wasserstein GAN to generate novel drug-like molecules in latent space, avoiding SMILES syntax issues.</description><content:encoded><![CDATA[<h2 id="a-gan-operating-in-learned-latent-space-for-molecular-design">A GAN Operating in Learned Latent Space for Molecular Design</h2>
<p>LatentGAN is a <strong>Method</strong> paper that introduces a two-stage architecture for de novo molecular generation. The first stage trains a heteroencoder to map SMILES strings into a continuous latent vector space. The second stage trains a Wasserstein GAN with gradient penalty (WGAN-GP) to generate new latent vectors that, when decoded, produce valid and novel molecular structures. The key contribution is decoupling the GAN from direct SMILES string generation, allowing the adversarial training to focus on learning the distribution of molecular latent representations rather than character-level sequence generation.</p>
<h2 id="limitations-of-direct-smiles-generation-with-gans">Limitations of Direct SMILES Generation with GANs</h2>
<p>Prior GAN-based molecular generation methods such as ORGAN and ORGANIC operated directly on SMILES strings. This created a fundamental challenge: the generator had to simultaneously learn valid SMILES syntax and the distribution of chemically meaningful molecules. ORGAN struggled with optimizing discrete molecular properties like Lipinski&rsquo;s Rule of Five, while ORGANIC showed limited success beyond the QED drug-likeness score. Other approaches (RANC, ATNC) substituted more advanced recurrent architectures but still operated in the discrete SMILES space.</p>
<p>Meanwhile, variational autoencoders (VAEs) demonstrated that working in continuous latent space could enable molecular generation, but they relied on forcing the latent distribution to match a Gaussian prior through KL divergence. This assumption is not necessarily appropriate for chemical space, which is inherently discontinuous.</p>
<p>RNN-based methods with transfer learning offered an alternative for target-biased generation, but the authors hypothesized that combining GANs with learned latent representations could produce complementary chemical space coverage.</p>
<h2 id="heteroencoder-plus-wasserstein-gan-architecture">Heteroencoder Plus Wasserstein GAN Architecture</h2>
<p>The core innovation of LatentGAN is separating molecular representation learning from adversarial generation through a two-component pipeline.</p>
<h3 id="heteroencoder">Heteroencoder</h3>
<p>The heteroencoder is an autoencoder trained on pairs of different non-canonical (randomized) SMILES representations of the same molecule. This is distinct from a standard autoencoder because the input and target SMILES are different representations of the same structure.</p>
<p>The encoder uses a two-layer bidirectional LSTM with 512 units per layer (256 forward, 256 backward). The concatenated output feeds into a 512-dimensional feed-forward layer. During training, zero-centered Gaussian noise with $\sigma = 0.1$ is added to the latent vector as regularization. The decoder is a four-layer unidirectional LSTM with a softmax output layer. Batch normalization with momentum 0.9 is applied to all hidden layers except the noise layer.</p>
<p>Training uses teacher forcing with categorical cross-entropy loss for 100 epochs. The learning rate starts at $10^{-3}$ for the first 50 epochs and decays exponentially to $10^{-6}$ by the final epoch. After training, the noise layer is deactivated for deterministic encoding and decoding.</p>
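<p>As a concrete reading of this schedule (the paper specifies only the endpoints, so the per-epoch geometric decay below is an assumption), the learning rate at each epoch can be sketched as:</p>

```python
def heteroencoder_lr(epoch: int, total_epochs: int = 100,
                     lr_start: float = 1e-3, lr_end: float = 1e-6,
                     flat_epochs: int = 50) -> float:
    """Constant lr for the first `flat_epochs`, then geometric decay
    that reaches `lr_end` exactly at `total_epochs`."""
    if epoch <= flat_epochs:
        return lr_start
    # Per-epoch decay ratio chosen so that lr(total_epochs) == lr_end.
    ratio = (lr_end / lr_start) ** (1.0 / (total_epochs - flat_epochs))
    return lr_start * ratio ** (epoch - flat_epochs)
```

<p>With the defaults, epochs 1 through 50 stay at $10^{-3}$ and epoch 100 lands on $10^{-6}$.</p>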
<p>An important design choice is that the heteroencoder makes no assumption about the latent space distribution (unlike VAEs with their KL divergence term). The latent space is shaped purely by reconstruction loss, and the GAN later learns to sample from this unconstrained distribution.</p>
<h3 id="wasserstein-gan-with-gradient-penalty">Wasserstein GAN with Gradient Penalty</h3>
<p>The GAN uses the WGAN-GP formulation. The critic (discriminator) consists of three feed-forward layers of 256 dimensions each with leaky ReLU activations (no activation on the final layer). The generator has five feed-forward layers of 256 dimensions each with batch normalization and leaky ReLU between layers.</p>
<p>The training ratio is 5:1, with five critic updates for every generator update. The generator takes random vectors sampled from a uniform distribution and learns to produce latent vectors indistinguishable from the real encoded molecular latent vectors.</p>
<p>The WGAN-GP loss for the critic is:</p>
<p>$$L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})] - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)] + \lambda \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2]$$</p>
<p>where $\lambda$ is the gradient penalty coefficient, $\mathbb{P}_r$ is the real data distribution (encoded latent vectors), $\mathbb{P}_g$ is the generator distribution, and $\mathbb{P}_{\hat{x}}$ samples uniformly along straight lines between pairs of real and generated points.</p>
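<p>The interpolation sampling and penalty term can be illustrated numerically. The sketch below uses a toy <em>linear</em> critic (an assumption, standing in for the three-layer feed-forward critic) so the input gradient is available in closed form:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic D(x) = w @ x, whose gradient w.r.t. x is w everywhere.
w = rng.normal(size=8)

def critic(x):          # x: (batch, 8)
    return x @ w

def critic_grad(x):     # analytic input gradient of the linear critic
    return np.tile(w, (x.shape[0], 1))

def wgan_gp_critic_loss(real, fake, lam=10.0):
    # Sample x_hat uniformly along straight lines between real/fake pairs.
    eps = rng.uniform(size=(real.shape[0], 1))
    x_hat = eps * real + (1.0 - eps) * fake
    grad_norm = np.linalg.norm(critic_grad(x_hat), axis=1)
    penalty = lam * np.mean((grad_norm - 1.0) ** 2)
    return np.mean(critic(fake)) - np.mean(critic(real)) + penalty

real = rng.normal(size=(32, 8))   # encoded latent vectors
fake = rng.normal(size=(32, 8))   # generator outputs
loss = wgan_gp_critic_loss(real, fake)
```

<p>For this critic the penalty reduces to $\lambda(\|w\|_2 - 1)^2$, which makes the gradient-penalty mechanism easy to verify by hand.</p>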
<h3 id="generation-pipeline">Generation Pipeline</h3>
<p>At inference time, the full pipeline operates as: (1) sample a random vector, (2) pass through the trained generator to produce a latent vector, (3) decode the latent vector into a SMILES string using the pretrained heteroencoder decoder.</p>
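<p>The three-step pipeline can be sketched with placeholder networks (both stand-ins below are hypothetical; the real generator is the five-layer MLP and the real decoder is the heteroencoder&rsquo;s LSTM):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 512

def generator(noise: np.ndarray) -> np.ndarray:
    return np.tanh(noise)            # placeholder for the trained MLP generator

def decode_smiles(latent: np.ndarray) -> str:
    return "CCO"                     # placeholder for the heteroencoder decoder

def sample_molecule() -> str:
    noise = rng.uniform(-1.0, 1.0, size=LATENT_DIM)   # (1) random vector
    latent = generator(noise)                          # (2) latent vector
    return decode_smiles(latent)                       # (3) SMILES string
```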
<h2 id="experiments-on-drug-like-and-target-biased-generation">Experiments on Drug-Like and Target-Biased Generation</h2>
<h3 id="datasets">Datasets</h3>
<p>The heteroencoder was trained on 1,347,173 SMILES from ChEMBL 25, standardized with MolVS and restricted to molecules with atoms from {H, C, N, O, S, Cl, Br} and at most 50 heavy atoms.</p>
<p>For general drug-like generation, a random subset of 100,000 ChEMBL compounds was used to train the GAN model for 30,000 epochs.</p>
<p>For target-biased generation, three datasets were extracted from ExCAPE-DB for EGFR, HTR1A, and S1PR1 targets. These were clustered into training and test sets to ensure chemical series were not split across sets.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Training Set</th>
          <th>Test Set</th>
          <th>SVM ROC-AUC</th>
          <th>SVM Kappa</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>2,949</td>
          <td>2,326</td>
          <td>0.850</td>
          <td>0.56</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>48,283</td>
          <td>23,048</td>
          <td>0.993</td>
          <td>0.90</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>49,381</td>
          <td>23,745</td>
          <td>0.995</td>
          <td>0.91</td>
      </tr>
  </tbody>
</table>
<p>SVM target prediction models using 2048-bit FCFP6 fingerprints were built with scikit-learn to evaluate generated compounds.</p>
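<p>The Kappa column above is Cohen&rsquo;s kappa, which corrects raw classifier agreement for chance. A minimal implementation for binary labels:</p>

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for binary labels: observed agreement corrected
    for the agreement expected from the marginal label frequencies."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_true1 = sum(y_true) / n
    p_pred1 = sum(y_pred) / n
    expected = p_true1 * p_pred1 + (1 - p_true1) * (1 - p_pred1)
    return (observed - expected) / (1 - expected)
```

<p>Perfect agreement gives 1.0, chance-level agreement gives 0.0; the 0.90-0.91 values for HTR1A and S1PR1 indicate near-perfect classifiers.</p>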
<h3 id="baselines">Baselines</h3>
<p>RNN-based generative models with transfer learning served as the primary baseline. A prior RNN model was trained on the same ChEMBL set, then fine-tuned on each target dataset. The LatentGAN was also benchmarked on the MOSES platform against VAE, JTN-VAE, and AAE architectures.</p>
<h3 id="heteroencoder-performance">Heteroencoder Performance</h3>
<p>The heteroencoder achieved 99% valid SMILES on the training set and 98% on the test set. Reconstruction error (decoding to a different molecule) was 18% on training and 20% on test. Notably, decoding to a different valid SMILES of the same molecule is not counted as an error.</p>
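<p>That accounting rule matters: error is measured after canonicalization, so a different spelling of the same molecule counts as a success. A sketch of the bookkeeping, using a hypothetical stub canonicalizer (a real pipeline would use a cheminformatics toolkit such as RDKit):</p>

```python
def reconstruction_error_rate(pairs, canonical):
    """Fraction of (input, decoded) SMILES pairs that decode to a
    *different* molecule; a different SMILES spelling of the same
    molecule is not counted as an error."""
    errors = sum(canonical(inp) != canonical(out) for inp, out in pairs)
    return errors / len(pairs)

# Hypothetical canonicalizer stub mapping alternate spellings of ethanol.
ALIASES = {"OCC": "CCO", "C(C)O": "CCO"}
canon = lambda s: ALIASES.get(s, s)

pairs = [("CCO", "OCC"),        # same molecule, different spelling: OK
         ("CCO", "c1ccccc1")]   # different molecule: error
rate = reconstruction_error_rate(pairs, canon)   # 0.5
```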
<h3 id="target-biased-generation-results">Target-Biased Generation Results</h3>
<p>From 50,000 sampled SMILES per target model:</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>Arch.</th>
          <th>Valid (%)</th>
          <th>Unique (%)</th>
          <th>Novel (%)</th>
          <th>Active (%)</th>
          <th>Recovered Actives (%)</th>
          <th>Recovered Neighbors</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>EGFR</td>
          <td>GAN</td>
          <td>86</td>
          <td>56</td>
          <td>97</td>
          <td>71</td>
          <td>5.26</td>
          <td>196</td>
      </tr>
      <tr>
          <td>EGFR</td>
          <td>RNN</td>
          <td>96</td>
          <td>46</td>
          <td>95</td>
          <td>65</td>
          <td>7.74</td>
          <td>238</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>GAN</td>
          <td>86</td>
          <td>66</td>
          <td>95</td>
          <td>71</td>
          <td>5.05</td>
          <td>284</td>
      </tr>
      <tr>
          <td>HTR1A</td>
          <td>RNN</td>
          <td>96</td>
          <td>50</td>
          <td>90</td>
          <td>81</td>
          <td>7.28</td>
          <td>384</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>GAN</td>
          <td>89</td>
          <td>31</td>
          <td>98</td>
          <td>44</td>
          <td>0.93</td>
          <td>24</td>
      </tr>
      <tr>
          <td>S1PR1</td>
          <td>RNN</td>
          <td>97</td>
          <td>35</td>
          <td>97</td>
          <td>65</td>
          <td>3.72</td>
          <td>43</td>
      </tr>
  </tbody>
</table>
<h3 id="moses-benchmark">MOSES Benchmark</h3>
<p>On the MOSES benchmark (trained on a ZINC subset of 1,584,663 compounds, sampled 30,000 SMILES), LatentGAN showed comparable or better results than JTN-VAE and AAE on Frechet ChemNet Distance (FCD), Fragment similarity, and Scaffold similarity, while producing slightly worse nearest-neighbor cosine similarity (SNN). The standard VAE showed signs of mode collapse with high test metric overlap and low novelty.</p>
<h2 id="complementary-generation-and-drug-likeness-preservation">Complementary Generation and Drug-Likeness Preservation</h2>
<h3 id="key-findings">Key Findings</h3>
<p><strong>Validity and novelty</strong>: LatentGAN achieved 86-89% validity on target-biased tasks (lower than RNN&rsquo;s 96-97%) but produced higher uniqueness on two of three targets and comparable or higher novelty (95-98%).</p>
<p><strong>Complementary chemical space</strong>: The overlap between LatentGAN-generated and RNN-generated active compounds was very small at both the compound and scaffold levels. A probabilistic analysis indicated that the RNN model would be unlikely ever to cover the LatentGAN output space, even with continued sampling. This suggests the two architectures can complement each other in de novo design campaigns.</p>
<p><strong>Drug-likeness</strong>: QED score distributions of LatentGAN-generated compounds closely matched training set distributions across all three targets, with training compounds showing only slightly higher drug-likeness. SA score distributions were similarly well-preserved.</p>
<p><strong>Chemical space coverage</strong>: PCA analysis using MQN fingerprints confirmed that generated compounds occupy most of the chemical space of the training sets. Some regions of the PCA plots contained compounds predicted as inactive, which corresponded to non-drug-like outliers in the training data.</p>
<p><strong>Novel scaffolds</strong>: About 14% of scaffolds in the sampled sets had similarity below 0.4 to the training set across all three targets, indicating LatentGAN can generate genuinely novel chemical scaffolds. Around 5% of generated compounds were identical to training set compounds, while 21-25% had Tanimoto similarity below 0.4.</p>
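<p>The similarity thresholds above are Tanimoto similarities over binary fingerprints. Representing a fingerprint as its set of on-bit indices, the metric is:</p>

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two binary fingerprints,
    each given as a set of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)
```

<p>A compound falls in the &ldquo;novel&rdquo; bucket above when its maximum Tanimoto similarity to any training-set compound is below 0.4.</p>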
<h3 id="limitations">Limitations</h3>
<p>The paper acknowledges several limitations. The 18-20% heteroencoder reconstruction error means a non-trivial fraction of encoded molecules decode to different structures. Validity rates (86-89%) are lower than RNN baselines (96-97%). The S1PR1 target showed notably lower uniqueness (31%) and predicted activity (44%) compared to the other targets, possibly due to the smaller effective training set of active compounds. The paper does not report specific hardware requirements or training times. No wet-lab experimental validation of generated compounds was performed.</p>
<h3 id="future-directions">Future Directions</h3>
<p>The authors envision LatentGAN as a complementary tool to existing RNN-based generative models, with the two architectures covering different regions of chemical space. The approach of operating in learned latent space rather than directly on SMILES strings offers a general framework that could be extended to other molecular representations or generation objectives.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Heteroencoder training</td>
          <td>ChEMBL 25 (subset)</td>
          <td>1,347,173 SMILES</td>
          <td>Standardized with MolVS; atoms restricted to H, C, N, O, S, Cl, Br; max 50 heavy atoms</td>
      </tr>
      <tr>
          <td>General GAN training</td>
          <td>ChEMBL 25 (random subset)</td>
          <td>100,000</td>
          <td>Subset of heteroencoder training set</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (EGFR)</td>
          <td>2,949 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (HTR1A)</td>
          <td>48,283 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Target-biased training</td>
          <td>ExCAPE-DB (S1PR1)</td>
          <td>49,381 actives</td>
          <td>Clustered train/test split</td>
      </tr>
      <tr>
          <td>Benchmarking</td>
          <td>ZINC (MOSES subset)</td>
          <td>1,584,663</td>
          <td>Canonical SMILES</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Heteroencoder</strong>: Bidirectional LSTM encoder (2 layers, 512 units) + unidirectional LSTM decoder (4 layers), trained with teacher forcing and categorical cross-entropy for 100 epochs</li>
<li><strong>GAN</strong>: WGAN-GP with 5:1 critic-to-generator training ratio. General model trained 30,000 epochs; target models trained 10,000 epochs</li>
<li><strong>Evaluation</strong>: SVM classifiers with FCFP6 fingerprints (2048 bits) for activity prediction; MQN fingerprints for PCA-based chemical space analysis; Murcko scaffolds for scaffold-level analysis</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Heteroencoder: 512-dim latent space, bidirectional LSTM encoder, unidirectional LSTM decoder</li>
<li>Generator: 5 feed-forward layers of 256 dims with batch norm and leaky ReLU</li>
<li>Critic: 3 feed-forward layers of 256 dims with leaky ReLU</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>LatentGAN (EGFR)</th>
          <th>RNN Baseline (EGFR)</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>86%</td>
          <td>96%</td>
          <td>Percent valid SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>56%</td>
          <td>46%</td>
          <td>Percent unique among valid</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>97%</td>
          <td>95%</td>
          <td>Not in training set</td>
      </tr>
      <tr>
          <td>Predicted active</td>
          <td>71%</td>
          <td>65%</td>
          <td>By SVM model</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Dierme/latent-gan">LatentGAN source code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Includes trained heteroencoder model and training sets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Prykhodko, O., Johansson, S.V., Kotsias, P.-C., Arús-Pous, J., Bjerrum, E.J., Engkvist, O., &amp; Chen, H. (2019). A de novo molecular generation method using latent vector based generative adversarial network. <em>Journal of Cheminformatics</em>, 11(1), 74. <a href="https://doi.org/10.1186/s13321-019-0397-9">https://doi.org/10.1186/s13321-019-0397-9</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{prykhodko2019latentgan,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A de novo molecular generation method using latent vector based generative adversarial network}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Prykhodko, Oleksii and Johansson, Simon Viet and Kotsias, Panagiotis-Christos and Ar{\&#39;u}s-Pous, Josep and Bjerrum, Esben Jannik and Engkvist, Ola and Chen, Hongming}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Journal of Cheminformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{11}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{1}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{74}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Springer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1186/s13321-019-0397-9}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Grammar VAE: Generating Valid Molecules via CFGs</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/</guid><description>The Grammar VAE encodes and decodes molecular parse trees from context-free grammars, guaranteeing syntactically valid SMILES outputs during generation.</description><content:encoded><![CDATA[<h2 id="a-grammar-constrained-vae-for-discrete-data-generation">A Grammar-Constrained VAE for Discrete Data Generation</h2>
<p>This is a <strong>Method</strong> paper that introduces the Grammar Variational Autoencoder (GVAE), a variant of the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder</a> that operates directly on parse trees from context-free grammars (CFGs) rather than on raw character sequences. The primary contribution is a decoding mechanism that uses a stack and grammar-derived masks to restrict the output at every timestep to only syntactically valid production rules. This guarantees that every decoded output is a valid string under the grammar, addressing a fundamental limitation of character-level VAEs when applied to structured discrete data such as <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> molecular strings and arithmetic expressions.</p>
<h2 id="why-character-level-vaes-fail-on-structured-discrete-data">Why Character-Level VAEs Fail on Structured Discrete Data</h2>
<p>Generative models for continuous data (images, audio) had achieved impressive results by 2017, but generating structured discrete data remained difficult. The key challenge is that string representations of molecules and mathematical expressions are brittle: small perturbations to a character sequence often produce invalid outputs. <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> demonstrated a character-level VAE (CVAE) for SMILES strings that could encode molecules into a continuous latent space and decode them back, enabling latent-space optimization for molecular design. However, the CVAE frequently decoded latent points into strings that were not valid SMILES, particularly when exploring regions of latent space far from training data.</p>
<p>The fundamental issue is that character-level decoders must implicitly learn the syntactic rules of the target language from data alone. For SMILES, this includes matching parentheses, valid atom types, proper bonding, and ring closure notation. The GVAE addresses this by giving the decoder explicit knowledge of the grammar, so it can focus entirely on learning the semantic structure of the data.</p>
<h2 id="core-innovation-stack-based-grammar-masking-in-the-decoder">Core Innovation: Stack-Based Grammar Masking in the Decoder</h2>
<p>The GVAE encodes and decodes sequences of production rules from a context-free grammar rather than sequences of characters.</p>
<p><strong>Encoding.</strong> Given an input string (e.g., a SMILES molecule), the encoder first parses it into a parse tree using the CFG, then performs a left-to-right pre-order traversal of the tree to extract an ordered sequence of production rules. Each rule is represented as a one-hot vector of dimension $K$ (total number of production rules in the grammar). The resulting $T(\mathbf{X}) \times K$ matrix is processed by a convolutional neural network to produce the mean and variance of a Gaussian posterior $q_{\phi}(\mathbf{z} \mid \mathbf{X})$.</p>
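<p>The encoding step can be sketched with a toy arithmetic grammar (an assumption standing in for the much larger SMILES grammar). Given the pre-order rule sequence for the parse of <code>x + 1</code>, the one-hot matrix fed to the CNN encoder is:</p>

```python
import numpy as np

# Toy grammar with K = 4 production rules.
RULES = ["S -> S '+' T", "S -> T", "T -> 'x'", "T -> '1'"]
RULE_INDEX = {r: i for i, r in enumerate(RULES)}

def one_hot_rule_sequence(rules):
    """Pre-order rule sequence -> T(X) x K one-hot matrix."""
    X = np.zeros((len(rules), len(RULES)))
    for t, r in enumerate(rules):
        X[t, RULE_INDEX[r]] = 1.0
    return X

# Pre-order traversal of the parse tree of "x + 1":
seq = ["S -> S '+' T", "S -> T", "T -> 'x'", "T -> '1'"]
X = one_hot_rule_sequence(seq)   # shape (4, 4), exactly one 1 per row
```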
<p><strong>Decoding with grammar masks.</strong> The decoder maps a latent vector $\mathbf{z}$ through an RNN to produce a matrix of logits $\mathbf{F} \in \mathbb{R}^{T_{max} \times K}$. The key innovation is a last-in first-out (LIFO) stack that tracks the current parsing state. At each timestep $t$, the decoder:</p>
<ol>
<li>Pops the top non-terminal $\alpha$ from the stack</li>
<li>Applies a fixed binary mask $\mathbf{m}_{\alpha} \in \{0, 1\}^K$ that zeros out all production rules whose left-hand side is not $\alpha$</li>
<li>Samples a production rule from the masked softmax distribution:</li>
</ol>
<p>$$
p(\mathbf{x}_{t} = k \mid \alpha, \mathbf{z}) = \frac{m_{\alpha,k} \exp(f_{tk})}{\sum_{j=1}^{K} m_{\alpha,j} \exp(f_{tj})}
$$</p>
<ol start="4">
<li>Pushes the right-hand-side non-terminals of the selected rule onto the stack (right-to-left, so the leftmost is on top)</li>
</ol>
<p>This process continues until the stack is empty or $T_{max}$ timesteps are reached. Because the mask restricts selection to only those rules applicable to the current non-terminal, every generated sequence of production rules is guaranteed to be a valid derivation under the grammar.</p>
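<p>The stack-and-mask loop above can be sketched end to end with the same toy arithmetic grammar (the grammar, logits, and sampler below are illustrative assumptions, not the paper&rsquo;s implementation):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CFG: each rule is (LHS non-terminal, RHS symbols); lowercase/'+'
# symbols are terminals.
RULES = [("S", ["S", "+", "T"]), ("S", ["T"]), ("T", ["x"]), ("T", ["1"])]
K = len(RULES)
NONTERMINALS = {"S", "T"}

def masked_decode(logits, start="S", t_max=20):
    """Turn a (t_max, K) logit matrix into a guaranteed-valid derivation
    using a LIFO stack and per-non-terminal masks."""
    stack, derivation = [start], []
    for t in range(t_max):
        if not stack:
            return derivation              # stack empty: complete derivation
        alpha = stack.pop()                # 1. pop top non-terminal
        mask = np.array([1.0 if lhs == alpha else 0.0 for lhs, _ in RULES])
        probs = mask * np.exp(logits[t])   # 2-3. masked softmax
        probs /= probs.sum()
        k = rng.choice(K, p=probs)
        derivation.append(k)
        # 4. push RHS non-terminals right-to-left so the leftmost is on top
        for sym in reversed(RULES[k][1]):
            if sym in NONTERMINALS:
                stack.append(sym)
    return None                            # stack not emptied: invalid

logits = rng.normal(size=(20, K))
seq = masked_decode(logits)
```

<p>Because the mask admits only rules whose left-hand side matches the popped non-terminal, every returned sequence is a valid derivation; only hitting <code>t_max</code> with a non-empty stack yields an invalid output.</p>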
<p><strong>Training.</strong> The model is trained by maximizing the ELBO:</p>
<p>$$
\mathcal{L}(\phi, \theta; \mathbf{X}) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{X})} \left[ \log p_{\theta}(\mathbf{X}, \mathbf{z}) - \log q_{\phi}(\mathbf{z} \mid \mathbf{X}) \right]
$$</p>
<p>where the likelihood factorizes as:</p>
<p>$$
p(\mathbf{X} \mid \mathbf{z}) = \prod_{t=1}^{T(\mathbf{X})} p(\mathbf{x}_{t} \mid \mathbf{z})
$$</p>
<p>During training, the masks at each timestep are determined by the ground-truth production rule sequence, so no stack simulation is needed. The stack-based decoding is only required at generation time.</p>
<p><strong>Syntactic vs. semantic validity.</strong> The grammar guarantees syntactic validity but not semantic validity. The GVAE can still produce chemically implausible molecules (e.g., an oxygen atom with three bonds) because such constraints are not context-free. SMILES ring-bond digit matching is also not context-free, so the grammar cannot enforce it. Additionally, sequences that have not emptied the stack by $T_{max}$ are marked invalid.</p>
<h2 id="experiments-on-symbolic-regression-and-molecular-optimization">Experiments on Symbolic Regression and Molecular Optimization</h2>
<p>The authors evaluate the GVAE on two domains: arithmetic expressions and molecules. Both use Bayesian optimization (BO) over the learned latent space.</p>
<p><strong>Setup.</strong> After training each VAE, the authors encode training data into latent vectors and train a sparse Gaussian process (SGP) with 500 inducing points to predict properties from latent representations. They then run batch BO with expected improvement, selecting 50 candidates per iteration.</p>
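<p>The expected-improvement acquisition used to score candidate latent points has a closed form under the GP&rsquo;s Gaussian posterior. A minimal sketch (maximization form; the batch and Kriging Believer machinery are omitted):</p>

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI acquisition at a candidate point with GP posterior N(mu, sigma^2),
    relative to the incumbent best value f_best (maximization form)."""
    if sigma <= 0.0:
        return max(mu - f_best, 0.0)
    z = (mu - f_best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - f_best) * cdf + sigma * pdf
```

<p>Candidates are ranked by EI and the top batch is kept (50 per iteration in the paper).</p>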
<h3 id="arithmetic-expressions">Arithmetic Expressions</h3>
<ul>
<li><strong>Data</strong>: 100,000 randomly generated univariate expressions from a simple grammar (3 binary operators, 2 unary operators, 3 constants), each with at most 15 production rules</li>
<li><strong>Target</strong>: Find an expression minimizing $\log(1 + \text{MSE})$ against the true function $1/3 + x + \sin(x \cdot x)$</li>
<li><strong>BO iterations</strong>: 5, averaged over 10 repetitions</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.99 +/- 0.01</td>
          <td>3.47 +/- 0.24</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.86 +/- 0.06</td>
          <td>4.75 +/- 0.25</td>
      </tr>
  </tbody>
</table>
<p>The GVAE&rsquo;s best expression ($x/1 + \sin(3) + \sin(x \cdot x)$, score 0.04) nearly exactly recovers the true function, while the CVAE&rsquo;s best ($x \cdot 1 + \sin(3) + \sin(3/1)$, score 0.39) misses the sinusoidal component.</p>
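<p>The reported scores are easy to reproduce for the GVAE expression, since its error against the true function is a constant offset $\sin(3) - 1/3$ (the evaluation grid below is an assumption, but the MSE is grid-independent for a constant offset):</p>

```python
import math

def score(expr_fn, n=1000):
    """log(1 + MSE) against the true function 1/3 + x + sin(x*x),
    evaluated on an assumed grid over [-10, 10]."""
    xs = [-10.0 + 20.0 * i / (n - 1) for i in range(n)]
    true = lambda x: 1.0 / 3.0 + x + math.sin(x * x)
    mse = sum((expr_fn(x) - true(x)) ** 2 for x in xs) / n
    return math.log(1.0 + mse)

gvae_best = lambda x: x / 1 + math.sin(3) + math.sin(x * x)
cvae_best = lambda x: x * 1 + math.sin(3) + math.sin(3 / 1)

round(score(gvae_best), 2)   # 0.04, matching the reported GVAE score
```

<p>The CVAE expression&rsquo;s score is dominated by the missing $\sin(x \cdot x)$ term, which contributes roughly 0.5 to the MSE on this grid.</p>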
<h3 id="molecular-optimization">Molecular Optimization</h3>
<ul>
<li><strong>Data</strong>: 250,000 SMILES strings from the ZINC database</li>
<li><strong>Target</strong>: Maximize penalized logP (water-octanol partition coefficient penalized for ring size and synthetic accessibility)</li>
<li><strong>BO iterations</strong>: 10, averaged over 5 trials</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Fraction Valid</th>
          <th>Average Score</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GVAE</td>
          <td>0.31 +/- 0.07</td>
          <td>-9.57 +/- 1.77</td>
      </tr>
      <tr>
          <td>CVAE</td>
          <td>0.17 +/- 0.05</td>
          <td>-54.66 +/- 2.66</td>
      </tr>
  </tbody>
</table>
<p>The GVAE produces roughly twice as many valid molecules as the CVAE and finds molecules with substantially better penalized logP scores (best: 2.94 vs. 1.98).</p>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>Interpolation experiments show that the GVAE produces valid outputs at every intermediate point when linearly interpolating between two encoded expressions, while the CVAE passes through invalid strings. Grid searches around encoded molecules in the GVAE latent space show smooth transitions where neighboring points differ by single atoms.</p>
<h3 id="predictive-performance">Predictive Performance</h3>
<p>Sparse GP models trained on GVAE latent features achieve better test RMSE and log-likelihood than those trained on CVAE features for both expressions and molecules:</p>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE (Expressions)</th>
          <th>CVAE (Expressions)</th>
          <th>GVAE (Molecules)</th>
          <th>CVAE (Molecules)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Test LL</td>
          <td>-1.320 +/- 0.001</td>
          <td>-1.397 +/- 0.003</td>
          <td>-1.739 +/- 0.004</td>
          <td>-1.812 +/- 0.004</td>
      </tr>
      <tr>
          <td>Test RMSE</td>
          <td>0.884 +/- 0.002</td>
          <td>0.975 +/- 0.004</td>
          <td>1.404 +/- 0.006</td>
          <td>1.504 +/- 0.006</td>
      </tr>
  </tbody>
</table>
<h3 id="reconstruction-and-prior-sampling">Reconstruction and Prior Sampling</h3>
<p>On held-out molecules, the GVAE achieves 53.7% reconstruction accuracy vs. 44.6% for the CVAE. When sampling from the prior $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, 7.2% of GVAE samples are valid molecules vs. 0.7% for the CVAE.</p>
<h2 id="key-findings-limitations-and-impact">Key Findings, Limitations, and Impact</h2>
<p><strong>Key findings.</strong> Incorporating grammar structure into the VAE decoder consistently improves validity rates, latent space smoothness, downstream predictive performance, and Bayesian optimization outcomes across both domains. The approach is general: any domain with a context-free grammar can benefit.</p>
<p><strong>Limitations acknowledged by the authors.</strong></p>
<ul>
<li>The GVAE guarantees syntactic but not semantic validity. For molecules, invalid ring-bond patterns and chemically implausible structures can still be generated.</li>
<li>The molecular validity rate during BO (31%) is substantially higher than the CVAE (17%) but still means most decoded molecules are invalid, largely due to non-context-free constraints in SMILES.</li>
<li>The approach requires a context-free grammar for the target domain, which limits applicability to well-defined formal languages.</li>
<li>Sequences that do not complete parsing within $T_{max}$ timesteps are discarded as invalid.</li>
</ul>
<p><strong>Impact.</strong> The GVAE was an influential early contribution to constrained molecular generation. It directly inspired the Syntax-Directed VAE (SD-VAE) by Dai et al. (2018), which uses attribute grammars for tighter semantic constraints, and contributed to the broader movement toward structured molecular generation methods including graph-based approaches. The paper demonstrated that encoding domain knowledge into the decoder architecture is more effective than relying on the model to learn structural constraints from data alone.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (expressions)</td>
          <td>Generated arithmetic expressions</td>
          <td>100,000</td>
          <td>Up to 15 production rules each</td>
      </tr>
      <tr>
          <td>Training (molecules)</td>
          <td>ZINC database subset</td>
          <td>250,000 SMILES</td>
          <td>Same subset as <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a></td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Encoder: 1D convolutional neural network over one-hot rule sequences</li>
<li>Decoder: RNN with stack-based grammar masking</li>
<li>Latent space: 56 dimensions (molecules), isotropic Gaussian prior</li>
<li>Property predictor: Sparse Gaussian process with 500 inducing points</li>
<li>Optimization: Batch Bayesian optimization with expected improvement, 50 candidates per iteration, Kriging Believer for batch selection</li>
</ul>
<h3 id="models">Models</h3>
<p>Architecture details follow <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/">Gomez-Bombarelli et al. (2016)</a> with modifications for grammar-based encoding/decoding. Specific layer sizes and hyperparameters are described in the supplementary material.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>GVAE</th>
          <th>CVAE</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Fraction valid (expressions)</td>
          <td>0.99</td>
          <td>0.86</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Fraction valid (molecules)</td>
          <td>0.31</td>
          <td>0.17</td>
          <td>During BO</td>
      </tr>
      <tr>
          <td>Best penalized logP</td>
          <td>2.94</td>
          <td>1.98</td>
          <td>Best molecule found</td>
      </tr>
      <tr>
          <td>Reconstruction accuracy</td>
          <td>53.7%</td>
          <td>44.6%</td>
          <td>On held-out molecules</td>
      </tr>
      <tr>
          <td>Prior validity</td>
          <td>7.2%</td>
          <td>0.7%</td>
          <td>Sampling from N(0,I)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/mkusner/grammarVAE">grammarVAE</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kusner, M. J., Paige, B., &amp; Hernández-Lobato, J. M. (2017). Grammar Variational Autoencoder. <em>Proceedings of the 34th International Conference on Machine Learning (ICML)</em>, 1945-1954.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kusner2017grammar,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Grammar Variational Autoencoder}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kusner, Matt J. and Paige, Brooks and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 34th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1945--1954}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2017}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>VAE for Automatic Chemical Design (2018 Seminal)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/automatic-chemical-design-vae/</guid><description>A variational autoencoder maps SMILES strings to a continuous latent space, enabling gradient-based optimization for molecular design and generation.</description><content:encoded><![CDATA[<h2 id="a-foundational-method-for-continuous-molecular-representation">A Foundational Method for Continuous Molecular Representation</h2>
<p>This is a <strong>Method</strong> paper that introduces a variational autoencoder (VAE) framework for mapping discrete molecular representations (<a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings) into a continuous latent space. The primary contribution is demonstrating that this continuous representation enables three key capabilities: (1) automatic generation of novel molecules by decoding random or perturbed latent vectors, (2) smooth interpolation between molecules in latent space, and (3) gradient-based optimization of molecular properties using a jointly trained property predictor. This work is widely regarded as one of the earliest and most influential applications of deep generative models to molecular design.</p>
<h2 id="the-challenge-of-searching-discrete-chemical-space">The Challenge of Searching Discrete Chemical Space</h2>
<p>Molecular design is fundamentally an optimization problem: identify molecules that maximize some set of desirable properties. The search space is enormous (estimated $10^{23}$ to $10^{60}$ drug-like molecules) and discrete, making systematic exploration difficult. Prior approaches fell into two categories, each with significant limitations:</p>
<ol>
<li><strong>Virtual screening</strong> over fixed libraries: effective, but limited to compounds that have already been enumerated, costly to scale, and reliant on hand-crafted rules to avoid impractical chemistries.</li>
<li><strong>Discrete local search</strong> (e.g., genetic algorithms): requires manual specification of mutation and crossover heuristics, and cannot leverage gradient information to guide the search.</li>
</ol>
<p>The core insight is that mapping molecules into a continuous vector space sidesteps these problems entirely. In a continuous space, new compounds can be generated by vector perturbation (no hand-crafted mutation rules), optimization can follow property gradients (enabling larger and more directed jumps), and large unlabeled chemical databases can be leveraged through unsupervised representation learning.</p>
<h2 id="a-vae-architecture-for-smiles-strings-with-joint-property-prediction">A VAE Architecture for SMILES Strings with Joint Property Prediction</h2>
<p>The architecture consists of three coupled neural networks trained jointly:</p>
<ol>
<li>
<p><strong>Encoder</strong>: Converts SMILES character strings into fixed-dimensional continuous vectors (the latent representation). Uses three 1D convolutional layers followed by a fully connected layer. For ZINC molecules, the latent space has 196 dimensions; for QM9, 156 dimensions.</p>
</li>
<li>
<p><strong>Decoder</strong>: Converts latent vectors back into SMILES strings character by character using three layers of gated recurrent units (GRUs). The output is stochastic, as each character is sampled from a probability distribution over the SMILES alphabet.</p>
</li>
<li>
<p><strong>Property Predictor</strong>: A multilayer perceptron that predicts molecular properties directly from the latent representation. Joint training with the autoencoder reconstruction loss organizes the latent space so that molecules with similar properties cluster together.</p>
</li>
</ol>
<h3 id="the-vae-objective">The VAE Objective</h3>
<p>The model uses the <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder framework of Kingma and Welling</a>. The training objective combines three terms:</p>
<p>$$\mathcal{L} = \mathcal{L}_{recon} + \beta \cdot D_{KL}(q(z|x) \| p(z)) + \lambda \cdot \mathcal{L}_{prop}$$</p>
<p>where $\mathcal{L}_{recon}$ is the reconstruction loss (cross-entropy over SMILES characters), $D_{KL}$ is the KL divergence regularizer that encourages the latent distribution $q(z|x)$ to match a standard Gaussian prior $p(z)$, and $\mathcal{L}_{prop}$ is the property prediction regression loss. Both the variational loss and the property prediction loss are annealed in using a sigmoid schedule after 29 epochs over a total of 120 epochs of training.</p>
<p>The KL regularization is critical: it forces the decoder to handle a wider variety of latent points, preventing &ldquo;dead areas&rdquo; in latent space that would decode to invalid molecules.</p>
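<p>The KL term has a closed form for a diagonal-Gaussian posterior against the standard normal prior. A minimal numpy sketch of the three-term objective (the weights <code>beta</code> and <code>lam</code> here are illustrative constants, not the paper's annealed schedules):</p>

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ),
    summed over latent dimensions, averaged over the batch."""
    return float(np.mean(np.sum(
        0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar), axis=1)))

def total_loss(recon, kl, prop, beta=1.0, lam=1.0):
    # L = L_recon + beta * D_KL + lambda * L_prop
    return recon + beta * kl + lam * prop

mu = np.zeros((4, 196))      # a posterior that exactly matches the prior...
logvar = np.zeros((4, 196))  # ...contributes zero KL
assert kl_to_standard_normal(mu, logvar) == 0.0
```

<p>When the annealing weight on the KL term is near zero early in training, the model behaves like a plain autoencoder; ramping it up pulls the aggregate posterior toward the Gaussian prior.</p>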
<h3 id="gradient-based-optimization">Gradient-Based Optimization</h3>
<p>After training, a Gaussian process (GP) surrogate model is fit on top of the latent representations to predict the target property. Optimization proceeds by:</p>
<ol>
<li>Encoding a seed molecule into the latent space</li>
<li>Using the GP model to define a smooth property surface over the latent space</li>
<li>Optimizing the latent vector $z$ to maximize the predicted property via gradient ascent</li>
<li>Decoding the optimized $z$ back into a SMILES string</li>
</ol>
<p>The objective used for demonstration is $5 \times \text{QED} - \text{SAS}$, balancing drug-likeness (QED) against synthetic accessibility (SAS).</p>
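<p>Steps 2 and 3 can be sketched with the standard RBF-kernel GP posterior mean and its analytic gradient. Everything here (the toy 2-D latent points, the synthetic property, the step size) is invented for illustration; the paper's GP operates on the 196-dimensional learned latent space:</p>

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel between row vectors in a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 2))               # toy "latent" training points
y = Z[:, 0] + 0.01 * rng.normal(size=50)   # synthetic property to maximize

K = rbf(Z, Z) + 1e-2 * np.eye(len(Z))      # GP with small observation noise
alpha = np.linalg.solve(K, y)              # posterior-mean weights

def mean(z):
    return float((rbf(z[None], Z) @ alpha)[0])

def grad_mean(z, ls=1.0):
    # d/dz of sum_i alpha_i * exp(-||z - z_i||^2 / (2 ls^2))
    k = rbf(z[None], Z)[0]
    return ((Z - z) * (alpha * k)[:, None]).sum(0) / ls**2

z = np.zeros(2)                            # step 1: encoded seed molecule
z0_val = mean(z)
for _ in range(200):                       # step 3: gradient ascent
    g = grad_mean(z)
    step = 0.1 * g / (np.linalg.norm(g) + 1e-12)
    if mean(z + step) <= mean(z):
        break                              # a fixed-size step no longer helps
    z = z + step
# step 4 would decode the optimized z back into a SMILES string
```

<p>Because the GP mean is smooth in $z$, its gradient is available in closed form, which is exactly what discrete search methods lack.</p>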
<h2 id="experiments-on-zinc-and-qm9-datasets">Experiments on ZINC and QM9 Datasets</h2>
<p>Two autoencoder systems were trained:</p>
<ul>
<li><strong>ZINC</strong>: 250,000 drug-like molecules from the ZINC database, with a 196-dimensional latent space. Properties predicted: logP, QED, SAS.</li>
<li><strong>QM9</strong>: 108,000 molecules with up to 9 heavy atoms, with a 156-dimensional latent space. Properties predicted: HOMO energy, LUMO energy, electronic spatial extent ($\langle R^2 \rangle$).</li>
</ul>
<h3 id="latent-space-quality">Latent Space Quality</h3>
<p>The encoded latent dimensions follow approximately normal distributions as enforced by the variational regularizer. Decoding is stochastic: sampling the same latent point multiple times yields different SMILES strings, with the most frequent decoding tending to be closest to the original point in latent space. Decoding validity rates are 73-79% for points near known molecules but only 4% for randomly selected latent points.</p>
<p>Spherical interpolation (slerp) between molecules in latent space produces smooth structural transitions, accounting for the geometry of high-dimensional Gaussian distributions where linear interpolation would pass through low-probability regions.</p>
<h3 id="molecular-generation-comparison">Molecular Generation Comparison</h3>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Dataset</th>
          <th>Samples</th>
          <th>logP</th>
          <th>SAS</th>
          <th>QED</th>
          <th>% in ZINC</th>
          <th>% in eMolecules</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Data</td>
          <td>ZINC</td>
          <td>249k</td>
          <td>2.46 (1.43)</td>
          <td>3.05 (0.83)</td>
          <td>0.73 (0.14)</td>
          <td>100</td>
          <td>12.9</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>ZINC</td>
          <td>5303</td>
          <td>2.84 (1.86)</td>
          <td>3.80 (1.01)</td>
          <td>0.57 (0.20)</td>
          <td>6.5</td>
          <td>4.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>ZINC</td>
          <td>8728</td>
          <td>2.67 (1.46)</td>
          <td>3.18 (0.86)</td>
          <td>0.70 (0.14)</td>
          <td>5.8</td>
          <td>7.0</td>
      </tr>
      <tr>
          <td>Data</td>
          <td>QM9</td>
          <td>134k</td>
          <td>0.30 (1.00)</td>
          <td>4.25 (0.94)</td>
          <td>0.48 (0.07)</td>
          <td>0.0</td>
          <td>8.6</td>
      </tr>
      <tr>
          <td>GA</td>
          <td>QM9</td>
          <td>5470</td>
          <td>0.96 (1.53)</td>
          <td>4.47 (1.01)</td>
          <td>0.53 (0.13)</td>
          <td>0.018</td>
          <td>3.8</td>
      </tr>
      <tr>
          <td>VAE</td>
          <td>QM9</td>
          <td>2839</td>
          <td>0.30 (0.97)</td>
          <td>4.34 (0.98)</td>
          <td>0.47 (0.08)</td>
          <td>0.0</td>
          <td>8.9</td>
      </tr>
  </tbody>
</table>
<p>The VAE generates molecules whose property distributions closely match the training data, outperforming a genetic algorithm baseline that biases toward higher chemical complexity and decreased drug-likeness. Only 5.8% of VAE-generated ZINC molecules were found in the original ZINC database, indicating genuine novelty.</p>
<h3 id="property-prediction">Property Prediction</h3>
<table>
  <thead>
      <tr>
          <th>Dataset/Property</th>
          <th>Mean Baseline</th>
          <th>ECFP</th>
          <th>Graph Conv.</th>
          <th>1-hot SMILES</th>
          <th>Encoder Only</th>
          <th>VAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ZINC/logP</td>
          <td>1.14</td>
          <td>0.38</td>
          <td>0.05</td>
          <td>0.16</td>
          <td>0.13</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>ZINC/QED</td>
          <td>0.112</td>
          <td>0.045</td>
          <td>0.017</td>
          <td>0.041</td>
          <td>0.037</td>
          <td>0.054</td>
      </tr>
      <tr>
          <td>QM9/HOMO (eV)</td>
          <td>0.44</td>
          <td>0.20</td>
          <td>0.12</td>
          <td>0.12</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/LUMO (eV)</td>
          <td>1.05</td>
          <td>0.20</td>
          <td>0.15</td>
          <td>0.11</td>
          <td>0.14</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>QM9/Gap (eV)</td>
          <td>1.07</td>
          <td>0.30</td>
          <td>0.18</td>
          <td>0.16</td>
          <td>0.18</td>
          <td>0.21</td>
      </tr>
  </tbody>
</table>
<p>The VAE latent representation achieves property prediction accuracy comparable to graph convolutions for some properties, though graph convolutions generally perform best. The primary purpose of joint training is not to maximize prediction accuracy but to organize the latent space for optimization.</p>
<h3 id="optimization-results">Optimization Results</h3>
<p>Bayesian optimization with a GP model on the jointly trained latent space consistently produces molecules with higher percentile scores on the $5 \times \text{QED} - \text{SAS}$ objective compared to both random Gaussian search and genetic algorithm baselines. Starting from molecules in the bottom 10th percentile of the ZINC dataset, the optimizer reliably discovers molecules in regions of high objective value. Training the GP with 1000 molecules (vs. 2000) produces a wider diversity of solutions by optimizing to multiple local optima rather than a single global optimum.</p>
<h2 id="key-findings-limitations-and-legacy">Key Findings, Limitations, and Legacy</h2>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li>A continuous latent representation of molecules enables gradient-based search through chemical space, a qualitatively different approach from discrete enumeration or genetic algorithms.</li>
<li>Joint training with property prediction organizes the latent space by property values, creating smooth gradients that optimization can follow.</li>
<li>The VAE generates novel molecules with realistic property distributions, and the latent space encodes an estimated 7.5 million molecules despite training on only 250,000.</li>
</ul>
<h3 id="acknowledged-limitations">Acknowledged Limitations</h3>
<ul>
<li>The SMILES-based decoder sometimes produces formally valid but chemically undesirable molecules (acid chlorides, anhydrides, cyclopentadienes, aziridines, etc.) because the grammar of valid SMILES does not capture all synthetic or stability constraints.</li>
<li>Character-level SMILES generation is fragile: the decoder must implicitly learn which strings are valid SMILES, making the learning problem harder than necessary.</li>
<li>Decoding validity drops to only 4% for random latent points far from training data, limiting the ability to explore truly novel regions of chemical space.</li>
</ul>
<h3 id="directions-identified">Directions Identified</h3>
<p>The authors point to several extensions that were already underway at the time of publication:</p>
<ul>
<li><strong><a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/grammar-variational-autoencoder/">Grammar VAE</a></strong>: Using an explicitly defined SMILES grammar instead of forcing the model to learn one (Kusner et al., 2017).</li>
<li><strong>Graph-based decoders</strong>: Directly outputting molecular graphs to avoid the SMILES validity problem.</li>
<li><strong>Adversarial training</strong>: Using GANs for molecular generation (<a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/rl-tuned/organ-objective-reinforced-gan/">ORGAN, ORGANIC</a>).</li>
<li><strong>LSTM/RNN generators</strong>: Applying recurrent networks directly to SMILES for generation and reaction prediction.</li>
</ul>
<p>This paper has been cited over 2,900 times and launched a large body of follow-up work in VAE-based, GAN-based, and reinforcement learning-based molecular generation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>ZINC (drug-like subset)</td>
          <td>250,000 molecules</td>
          <td>Randomly sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Training</td>
          <td>QM9</td>
          <td>108,000 molecules</td>
<td>Molecules with up to 9 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ZINC held-out set</td>
          <td>5,000 molecules</td>
          <td>For latent space analysis</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Encoder</strong>: 3 x 1D convolutional layers (ZINC: filters 9,9,10 with kernels 9,9,11; QM9: filters 2,2,1 with kernels 5,5,4), followed by a fully connected layer</li>
<li><strong>Decoder</strong>: 3 x GRU layers (ZINC: hidden dim 488; QM9: hidden dim 500), trained with teacher forcing</li>
<li><strong>Property Predictor</strong>: 2 fully connected layers of 1000 neurons (dropout 0.20) for prediction; smaller 3-layer MLP of 67 neurons (dropout 0.15) for latent space shaping</li>
<li><strong>Variational loss annealing</strong>: Sigmoid schedule after 29 epochs, total 120 epochs</li>
<li><strong>SMILES validation</strong>: Post-hoc filtering with RDKit; invalid outputs discarded</li>
<li><strong>Optimization</strong>: Gaussian process surrogate model trained on 2000 maximally diverse molecules from latent space</li>
</ul>
<h3 id="models">Models</h3>
<p>Built with Keras and TensorFlow. Latent dimensions: 196 (ZINC), 156 (QM9). SMILES alphabet: 35 characters (ZINC), 22 characters (QM9). Maximum string length: 120 (ZINC), 34 (QM9). Only canonicalized SMILES were used for training.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Description</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>logP</td>
          <td>Water-octanol partition coefficient</td>
      </tr>
      <tr>
          <td>QED</td>
          <td>Quantitative Estimation of Drug-likeness (0-1)</td>
      </tr>
      <tr>
          <td>SAS</td>
          <td>Synthetic Accessibility Score</td>
      </tr>
      <tr>
          <td>HOMO/LUMO (eV)</td>
          <td>Frontier orbital energies (QM9)</td>
      </tr>
      <tr>
          <td>Decoding validity</td>
          <td>Fraction of latent points producing valid SMILES</td>
      </tr>
      <tr>
          <td>Novelty</td>
          <td>Fraction of generated molecules not in training set</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Training was performed on the Harvard FAS Odyssey Cluster. Specific GPU types and training times are not reported. The Gaussian process optimization requires only minutes to train on a few thousand molecules.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/chemical_vae">chemical_vae</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with training scripts and pre-trained models</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Gómez-Bombarelli, R., Wei, J. N., Duvenaud, D., Hernández-Lobato, J. M., Sánchez-Lengeling, B., Sheberla, D., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., &amp; Aspuru-Guzik, A. (2018). Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. <em>ACS Central Science</em>, 4(2), 268-276. <a href="https://doi.org/10.1021/acscentsci.7b00572">https://doi.org/10.1021/acscentsci.7b00572</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{gomez2018automatic,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{G{\&#39;o}mez-Bombarelli, Rafael and Wei, Jennifer N. and Duvenaud, David and Hern{\&#39;a}ndez-Lobato, Jos{\&#39;e} Miguel and S{\&#39;a}nchez-Lengeling, Benjam{\&#39;i}n and Sheberla, Dennis and Aguilera-Iparraguirre, Jorge and Hirzel, Timothy D. and Adams, Ryan P. and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ACS Central Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{4}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{268--276}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{American Chemical Society}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1021/acscentsci.7b00572}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PASITHEA: Gradient-Based Molecular Design via Dreaming</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/deep-molecular-dreaming-pasithea/</guid><description>PASITHEA applies inceptionism to molecular design, using gradient-based optimization on SELFIES representations to generate molecules with target properties.</description><content:encoded><![CDATA[<h2 id="inceptionism-applied-to-molecular-inverse-design">Inceptionism Applied to Molecular Inverse Design</h2>
<p>This is a <strong>Method</strong> paper that introduces PASITHEA, a gradient-based approach to de novo molecular design inspired by inceptionism (deep dreaming) techniques from computer vision. The core contribution is a direct optimization framework that modifies molecular structures by backpropagating through a trained property-prediction network, with the molecular input (rather than weights) serving as the optimizable variable. PASITHEA is enabled by SELFIES, a surjective molecular string representation that guarantees 100% validity of generated molecules.</p>
<h2 id="the-need-for-direct-gradient-based-molecular-optimization">The Need for Direct Gradient-Based Molecular Optimization</h2>
<p>Existing inverse molecular design methods, including variational autoencoders (VAEs), generative adversarial networks (GANs), reinforcement learning (RL), and genetic algorithms (GAs), share a common characteristic: they optimize molecules indirectly. VAEs and GANs learn distributions and scan latent spaces. RL agents learn policies from environmental rewards. GAs iteratively apply mutations and selections. None of these approaches directly maximize an objective function in a gradient-based manner with respect to the molecular representation itself.</p>
<p>This indirection has several consequences. VAE-based methods require learning a latent space, and the optimization happens in that space rather than directly on molecular structures. RL and GA methods require expensive function evaluations for each candidate molecule. The authors identify an opportunity to exploit gradients more directly by reversing the learning process of a neural network trained to predict molecular properties, thereby sidestepping latent spaces, policies, and population-based search entirely.</p>
<p>A second motivation is interpretability. By operating directly on the molecular representation (rather than a learned latent space), PASITHEA can reveal what a regression network has learned about structure-property relationships, a capability the authors frame as analogous to how deep dreaming reveals what image classifiers have learned about visual features.</p>
<h2 id="core-innovation-inverting-regression-networks-on-selfies">Core Innovation: Inverting Regression Networks on SELFIES</h2>
<p>PASITHEA&rsquo;s key insight is a two-phase training procedure that repurposes the standard neural network training loop for molecule generation.</p>
<p><strong>Phase 1: Prediction training.</strong> A fully connected neural network is trained to predict a real-valued chemical property (logP) from one-hot encoded SELFIES strings. The standard feedforward and backpropagation process updates the network weights to minimize mean squared error between predicted and ground-truth property values:</p>
<p>$$
\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} (f_{\theta}(\mathbf{x}_i) - y_i)^2
$$</p>
<p>where $f_{\theta}$ is the neural network with parameters $\theta$, $\mathbf{x}_i$ is the one-hot encoded SELFIES input, and $y_i$ is the target logP value.</p>
<p><strong>Phase 2: Inverse training (deep dreaming).</strong> The network weights $\theta$ are frozen. For a given input molecule $\mathbf{x}$ and a desired target property value $y_{\text{target}}$, the gradients are computed with respect to the input representation rather than the weights:</p>
<p>$$
\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla_{\mathbf{x}} \mathcal{L}(f_{\theta}(\mathbf{x}), y_{\text{target}})
$$</p>
<p>This gradient descent on the input incrementally modifies the one-hot encoding of the molecular string, transforming it toward a structure whose predicted property matches the target value. At each step, the argmax function converts the continuous one-hot encoding back to a discrete SELFIES string, which always maps to a valid molecular graph due to the surjective property of SELFIES.</p>
<p><strong>The role of SELFIES.</strong> The surjective mapping from strings to molecular graphs is essential. With SMILES, intermediate strings during optimization can become syntactically invalid (e.g., an unclosed ring like &ldquo;CCCC1CCCCC&rdquo;), producing no valid molecule. SELFIES enforces constraints that guarantee every string maps to a valid molecular graph, making the continuous gradient-based optimization feasible.</p>
<p><strong>Input noise injection.</strong> Because inverse training transforms a one-hot encoding from binary values to real numbers, the discrete-to-continuous transition can cause convergence problems. The authors address this by initializing the input with noise: every zero in the one-hot encoding is replaced by a random number in $[0, k]$, where $k$ is a hyperparameter between 0.5 and 0.95. This smooths the optimization landscape and enables incremental molecular modifications rather than abrupt changes.</p>
<h2 id="experimental-setup-on-qm9-with-logp-optimization">Experimental Setup on QM9 with LogP Optimization</h2>
<h3 id="dataset-and-property">Dataset and Property</h3>
<p>The experiments use a random subset of 10,000 molecules from the QM9 dataset. The target property is the logarithm of the partition coefficient (logP), computed using RDKit. LogP measures lipophilicity, an important drug-likeness indicator that follows an approximately normal distribution in QM9 and has a nearly continuous range, making it suitable for gradient-based optimization.</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>PASITHEA uses a fully connected neural network with four layers, each containing 500 nodes with ReLU activation. The loss function is mean squared error. Data is split 85%/15% for training/testing. The prediction model trains for approximately 1,500 epochs with an Adam optimizer and a learning rate of $1 \times 10^{-6}$.</p>
<p>For inverse training, the authors select a noise upper-bound of 0.9 and a learning rate of 0.01, chosen from hyperparameter tuning experiments that evaluate the percentage of molecules optimized toward the target property.</p>
<h3 id="optimization-targets">Optimization Targets</h3>
<p>Two extreme logP targets are used: $+6$ (high lipophilicity) and $-6$ (low lipophilicity). These values exceed the range of logP values in the QM9 dataset (minimum: $-2.19$, maximum: $3.08$), testing whether the model can extrapolate beyond the training distribution.</p>
<h2 id="distribution-shifts-and-interpretable-molecular-transformations">Distribution Shifts and Interpretable Molecular Transformations</h2>
<h3 id="distribution-level-results">Distribution-Level Results</h3>
<p>Applying deep dreaming to the full set of 10,000 molecules produces a clear shift in the logP distribution:</p>
<table>
  <thead>
      <tr>
          <th>Statistic</th>
          <th>QM9 Original</th>
          <th>Optimized (target +6)</th>
          <th>Optimized (target -6)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Mean logP</td>
          <td>0.3909</td>
          <td>1.8172</td>
          <td>-0.3360</td>
      </tr>
      <tr>
          <td>Min logP</td>
          <td>-2.1903</td>
          <td>-0.8240</td>
          <td>-2.452</td>
      </tr>
      <tr>
          <td>Max logP</td>
          <td>3.0786</td>
          <td>4.2442</td>
          <td>0.9018</td>
      </tr>
  </tbody>
</table>
<p>The optimized distributions extend beyond the original dataset&rsquo;s property range. The right-shifted distribution (target +6) produces molecules with logP values up to 4.24, exceeding the original maximum of 3.08. The left-shifted distribution (target -6) reaches -2.45, below the original minimum. This indicates that PASITHEA can generate molecules with properties outside the training data bounds.</p>
<p>Additionally, 97.2% of the generated molecules do not exist in the original training set, indicating that the network is not memorizing data but rather using structural features to guide optimization. Some generated molecules contain more heavy atoms than the QM9 maximum of 9, since the SELFIES string length allows for larger structures.</p>
<h3 id="molecule-level-interpretability">Molecule-Level Interpretability</h3>
<p>The stepwise molecular transformations reveal interpretable &ldquo;strategies&rdquo; the network employs:</p>
<ol>
<li>
<p><strong>Nitrogen appendage</strong>: When optimizing for lower logP, the network repeatedly appends nitrogen atoms to the molecule. The authors observe this as a consistent pattern across multiple test molecules, reflecting the known relationship between nitrogen content and reduced lipophilicity.</p>
</li>
<li>
<p><strong>Length modulation</strong>: When optimizing for higher logP, the network tends to increase molecular chain length (e.g., extending a carbon chain). When optimizing for lower logP, it shortens chains. This captures the intuition that larger, more carbon-heavy molecules tend to be more lipophilic.</p>
</li>
<li>
<p><strong>Bond order changes</strong>: The network replaces single bonds with double or triple bonds during optimization, demonstrating an understanding of the relationship between bonding patterns and logP.</p>
</li>
<li>
<p><strong>Consistency across trials</strong>: Because the input initialization includes random noise, repeated trials with the same molecule produce different transformation sequences. Despite this stochasticity, the network applies consistent strategies across trials (e.g., always shortening chains for negative optimization), validating that it has learned genuine structure-property relationships.</p>
</li>
</ol>
<h3 id="thermodynamic-stability">Thermodynamic Stability</h3>
<p>The authors assess synthesizability by computing heats of formation using MOPAC2016 at the PM7 level of theory. Some optimization trajectories move toward thermodynamically stable molecules (negative heats of formation), while others produce less stable structures. The authors acknowledge this limitation and propose multi-objective optimization incorporating stability as a future direction.</p>
<h3 id="comparison-to-vaes">Comparison to VAEs</h3>
<p>The key distinction from VAEs is where gradient computation occurs. In VAEs, a latent space is learned through encoding and decoding, and property optimization happens in that latent space. In PASITHEA, gradients are computed directly with respect to the molecular representation (SELFIES one-hot encoding). The authors argue this makes the approach more interpretable, since we can probe what the network learned about molecular structure without the &ldquo;detour&rdquo; through a latent space.</p>
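<p>The contrast can be made concrete with a toy version of inverse training. In this sketch the frozen &ldquo;predictor&rdquo; is just a fixed linear map <code>W</code> (a stand-in for the trained MLP), and gradient descent updates the continuous input encoding <code>x</code> itself, not a latent code, toward a target property value. All names and values are illustrative assumptions:</p>

```python
import random

rng = random.Random(0)
W = [rng.gauss(0, 1) for _ in range(12)]      # frozen "predictor" weights
x = [rng.uniform(0, 0.9) for _ in range(12)]  # noise-initialized input encoding
target, lr = 6.0, 0.01

def predict(w, v):
    return sum(wi * vi for wi, vi in zip(w, v))

# Gradient of (predict(W, x) - target)^2 w.r.t. x is 2 * err * W,
# so each step nudges the input representation directly.
for _ in range(500):
    err = predict(W, x) - target
    x = [xi - lr * 2 * err * wi for xi, wi in zip(x, W)]

final_pred = predict(W, x)
```

Because the gradient lands on the input representation, each intermediate <code>x</code> can be inspected (after discretization back to SELFIES) as an interpretable step of the transformation.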
<h3 id="limitations">Limitations</h3>
<p>The authors are forthright about the preliminary nature of these results:</p>
<ul>
<li>The method is demonstrated only on a small subset of QM9 with a single, computationally inexpensive property (logP).</li>
<li>The simple four-layer architecture may not scale to larger molecular spaces or more complex properties.</li>
<li>Generated molecules are not always thermodynamically stable, requiring additional optimization objectives.</li>
<li>The approach has not been benchmarked against established methods (VAEs, GANs, RL) on standard generative benchmarks.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>QM9 (random subset)</td>
          <td>10,000 molecules</td>
          <td>logP values computed via RDKit</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Prediction training</strong>: 4-layer fully connected NN, 500 nodes/layer, ReLU activation, MSE loss, Adam optimizer, LR $1 \times 10^{-6}$, ~1,500 epochs, 85/15 train/test split</li>
<li><strong>Inverse training</strong>: Frozen weights, Adam optimizer, LR 0.01, noise upper-bound 0.9, logP targets of +6 and -6</li>
<li><strong>Heats of formation</strong>: MOPAC2016, PM7 level, geometry optimization with eigenvector following (EF)</li>
</ul>
<h3 id="models">Models</h3>
<p>The architecture is a simple 4-layer MLP. No pre-trained weights are distributed, but the full code is available.</p>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Novel molecules</td>
          <td>97.2%</td>
          <td>Generated molecules not in training set</td>
      </tr>
      <tr>
          <td>Max logP (target +6)</td>
          <td>4.2442</td>
          <td>Exceeds QM9 max of 3.0786</td>
      </tr>
      <tr>
          <td>Min logP (target -6)</td>
          <td>-2.452</td>
          <td>Below QM9 min of -2.1903</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/aspuru-guzik-group/Pasithea">Pasithea</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Shen, C., Krenn, M., Eppel, S., &amp; Aspuru-Guzik, A. (2021). Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations. <em>Machine Learning: Science and Technology</em>, 2(3), 03LT02. <a href="https://doi.org/10.1088/2632-2153/ac09d6">https://doi.org/10.1088/2632-2153/ac09d6</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{shen2021deep,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Shen, Cynthia and Krenn, Mario and Eppel, Sagi and Aspuru-Guzik, Al{\&#39;a}n}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Machine Learning: Science and Technology}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{2}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{3}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{03LT02}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{IOP Publishing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1088/2632-2153/ac09d6}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>CogMol: Controlled Molecule Generation for COVID-19</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/cogmol-target-specific-drug-design/</guid><description>CogMol combines a SMILES VAE with controlled latent space sampling to generate drug-like molecules with target specificity for novel viral proteins.</description><content:encoded><![CDATA[<h2 id="a-controlled-generation-framework-for-target-specific-drug-design">A Controlled Generation Framework for Target-Specific Drug Design</h2>
<p>This is a <strong>Method</strong> paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.</p>
<h2 id="multi-constraint-drug-design-for-novel-viral-targets">Multi-Constraint Drug Design for Novel Viral Targets</h2>
<p>Traditional drug discovery costs 2-3 billion USD, takes over a decade, and succeeds less than 10% of the time. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.</p>
<p>The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.</p>
<h2 id="controlled-latent-space-sampling-with-pre-trained-protein-embeddings">Controlled Latent Space Sampling with Pre-trained Protein Embeddings</h2>
<p>CogMol&rsquo;s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:</p>
<p><strong>1. SMILES VAE with adaptive pre-training.</strong> A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:</p>
<p>$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z)) \right\}$$</p>
<p>where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.</p>
<p><strong>2. Protein-molecule binding affinity predictor.</strong> A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.</p>
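<p>The unit convention matters here: pIC50 is the negative base-10 logarithm of the IC50 expressed in molar concentration, so a hypothetical 10 nM binder scores a pIC50 of 8. A one-line helper (illustrative, not from the paper&rsquo;s code):</p>

```python
import math

def pic50(ic50_nanomolar):
    """pIC50 = -log10(IC50 in molar); the input is taken in nanomolar here."""
    return -math.log10(ic50_nanomolar * 1e-9)
```

Stronger binders (lower IC50) therefore receive higher pIC50 scores, which is the direction the regression model predicts.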
<p><strong>3. CLaSS controlled sampling.</strong> The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:</p>
<p>$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) \, p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) \, p_\theta(\mathbf{x} | \mathbf{z})]$$</p>
<p>where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes&rsquo; rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.</p>
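<p>The acceptance rule can be sketched with toy stand-ins: a unit Gaussian in place of the fitted mixture $Q_\xi(\mathbf{z})$ and sigmoid scores over latent coordinates in place of the trained attribute classifiers. Every name here is an illustrative assumption, not the paper&rsquo;s implementation:</p>

```python
import math
import random

rng = random.Random(0)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

accepted = []
for _ in range(5000):
    z = (rng.gauss(0, 1), rng.gauss(0, 1))   # stand-in for a draw from Q_xi(z)
    scores = (sigmoid(z[0]), sigmoid(z[1]))  # per-attribute classifier scores
    p_accept = scores[0] * scores[1]         # product over independent attributes
    if rng.random() < p_accept:
        accepted.append(z)                   # decode accepted z into molecules
```

Accepted samples are biased toward regions where every attribute classifier is confident, which is exactly the enrichment effect reported in the experiments.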
<p><strong>Selectivity modeling.</strong> Off-target selectivity for a molecule $m$ against target $T$ is defined as:</p>
<p>$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$</p>
<p>where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.</p>
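<p>The formula transcribes directly (hypothetical values, pIC50-style units assumed):</p>

```python
def selectivity(ba_target, ba_offtargets):
    """Sel_{T,m} = BA(T, m) minus the mean affinity over k off-targets."""
    return ba_target - sum(ba_offtargets) / len(ba_offtargets)

# A molecule binding the target at pIC50 7.5 but three random
# off-targets at 5.0, 6.0, and 4.0 is selective by 2.5 log units.
sel = selectivity(7.5, [5.0, 6.0, 4.0])
```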
<h2 id="experimental-setup-covid-19-targets-and-in-silico-screening">Experimental Setup: COVID-19 Targets and In Silico Screening</h2>
<p><strong>Target proteins.</strong> CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.</p>
<p><strong>Training data.</strong> The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.</p>
<p><strong>CLaSS controlled generation.</strong> Molecules were generated with simultaneous constraints on binding affinity (&gt; 0.5 normalized), QED (&gt; 0.8 normalized), and selectivity (&gt; 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.</p>
<p><strong>In silico screening pipeline.</strong> Generated molecules underwent:</p>
<ul>
<li>Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure</li>
<li>Binding affinity rescoring with a higher-accuracy SMILES-level predictor</li>
<li>Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures</li>
<li>Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data</li>
</ul>
<p><strong>Baselines.</strong> VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.</p>
<h3 id="key-results">Key Results</h3>
<p><strong>CLaSS enrichment (Table 1).</strong> CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity &gt; 0.5, QED &gt; 0.8, selectivity &gt; 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.</p>
<table>
  <thead>
      <tr>
          <th>Target</th>
          <th>CLaSS (Aff+QED+Sel)</th>
          <th>Random (Aff+QED+Sel)</th>
          <th>Enrichment</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>NSP9</td>
          <td>6.9%</td>
          <td>0.7%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>RBD</td>
          <td>9.0%</td>
          <td>0.9%</td>
          <td>~10x</td>
      </tr>
      <tr>
          <td>Mpro</td>
          <td>10.4%</td>
          <td>1.1%</td>
          <td>~9.5x</td>
      </tr>
  </tbody>
</table>
<p><strong>Docking results (Table 3).</strong> 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.</p>
<p><strong>Novelty.</strong> The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds, as confirmed by a high Fréchet ChemNet Distance.</p>
<p><strong>Synthesizability.</strong> Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.</p>
<p><strong>Toxicity.</strong> Approximately 70% of generated parent molecules and ~80% of predicted metabolites were toxic in 0-1 endpoints out of 13, comparable to FDA-approved drugs.</p>
<h2 id="generated-molecules-show-favorable-binding-and-drug-like-properties">Generated Molecules Show Favorable Binding and Drug-Like Properties</h2>
<p>CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:</p>
<ol>
<li>CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).</li>
<li>Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.</li>
<li>Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.</li>
<li>The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.</li>
</ol>
<p><strong>Limitations.</strong> The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor&rsquo;s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.</p>
<p><strong>Future directions.</strong> The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VAE pre-training</td>
          <td>MOSES/ZINC</td>
          <td>1.6M train, 176K test</td>
          <td>Publicly available benchmark</td>
      </tr>
      <tr>
          <td>VAE adaptive training</td>
          <td>BindingDB (DeepAffinity split)</td>
          <td>~27K protein-ligand pairs</td>
          <td>Curated IC50 data</td>
      </tr>
      <tr>
          <td>Protein embeddings</td>
          <td>UniRef50 via UniRep</td>
          <td>24M sequences</td>
          <td>Pre-trained, publicly available</td>
      </tr>
      <tr>
          <td>Toxicity prediction</td>
          <td>Tox21 + ClinTox</td>
          <td>12 in vitro + clinical endpoints</td>
          <td>Public benchmark datasets</td>
      </tr>
      <tr>
          <td>Docking validation</td>
          <td>PDB structures</td>
          <td>3 SARS-CoV-2 targets</td>
          <td>Public crystal structures</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors</li>
<li>CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers</li>
<li>Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings</li>
<li>Selectivity: excess binding affinity over average of $k$ random off-targets</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>SMILES VAE with adaptive pre-training (ZINC then BindingDB)</li>
<li>Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints</li>
<li>Binding affinity predictor (latent-level for generation, SMILES-level for screening)</li>
<li>Retrosynthetic predictor based on Molecular Transformer</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Validity</td>
          <td>90%</td>
          <td>-</td>
          <td>Generated SMILES</td>
      </tr>
      <tr>
          <td>Uniqueness</td>
          <td>99%</td>
          <td>-</td>
          <td>Among valid molecules</td>
      </tr>
      <tr>
          <td>Filter pass</td>
          <td>95%</td>
          <td>-</td>
          <td>Relevant chemical filters</td>
      </tr>
      <tr>
          <td>Docking BFE &lt; -6 kcal/mol</td>
          <td>87-95%</td>
          <td>-</td>
          <td>Varies by target</td>
      </tr>
      <tr>
          <td>Synthetic feasibility</td>
          <td>85-90%</td>
          <td>78% (FDA drugs)</td>
          <td>COVID-19 targets</td>
      </tr>
      <tr>
          <td>Low toxicity (0-1 endpoints)</td>
          <td>~70% parent, ~80% metabolite</td>
          <td>Comparable to FDA drugs</td>
          <td>MT-DNN prediction</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify GPU types or training times. The work was funded internally by IBM Research.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">CogMol (GitHub)</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation</td>
      </tr>
      <tr>
          <td><a href="https://github.com/IBM/CogMol">~3500 generated molecules</a></td>
          <td>Dataset</td>
          <td>Open license</td>
          <td>For three SARS-CoV-2 targets</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., &amp; Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. <em>Advances in Neural Information Processing Systems</em>, 33, 4320-4332.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{chenthamarakshan2020cogmol,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\&#39;c}, Aleksandra}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Advances in Neural Information Processing Systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{33}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{4320--4332}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>LIMO: Latent Inceptionism for Targeted Molecule Generation</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</link><pubDate>Sun, 22 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/molecular-generation/latent-space/limo-latent-inceptionism/</guid><description>LIMO uses gradient-based optimization through a VAE latent space and stacked property predictor to generate drug-like molecules with high binding affinity.</description><content:encoded><![CDATA[<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Eckmann, P., Sun, K., Zhao, B., Feng, M., Gilson, M. K., &amp; Yu, R. (2022). LIMO: Latent Inceptionism for Targeted Molecule Generation. <em>Proceedings of the 39th International Conference on Machine Learning (ICML 2022)</em>, PMLR 162, 5777&ndash;5792.</p>
<p><strong>Publication</strong>: ICML 2022</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/Rose-STL-Lab/LIMO">GitHub: Rose-STL-Lab/LIMO</a></li>
<li><a href="https://arxiv.org/abs/2206.09010">arXiv: 2206.09010</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{eckmann2022limo,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{LIMO: Latent Inceptionism for Targeted Molecule Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Eckmann, Peter and Sun, Kunyang and Zhao, Bo and Feng, Mudong and Gilson, Michael K and Yu, Rose}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{5777--5792}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">organization</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><h2 id="gradient-based-reverse-optimization-in-molecular-latent-space">Gradient-Based Reverse Optimization in Molecular Latent Space</h2>
<p>This is a <strong>Method</strong> paper that introduces LIMO, a framework for generating molecules with desired properties using gradient-based optimization on a VAE latent space. The key innovation is a stacked architecture where a property predictor operates on the decoded molecular representation rather than directly on the latent space, combined with an inceptionism-like technique that backpropagates through the frozen decoder and predictor to optimize the latent code. This approach is 6-8x faster than RL baselines and 12x faster than sampling-based approaches while producing molecules with higher binding affinities.</p>
<h2 id="slow-property-optimization-in-existing-methods">Slow Property Optimization in Existing Methods</h2>
<p>Generating molecules with high binding affinity to target proteins is a central goal of early drug discovery, but existing computational approaches are slow when optimizing for properties that are expensive to evaluate (such as docking-based binding affinity). RL-based methods require many calls to the property function during training. Sampling-based approaches like MARS need hundreds of iterations. Latent optimization methods that predict properties directly from the latent space suffer from poor prediction accuracy because the mapping from latent space to molecular properties is difficult to learn.</p>
<h2 id="the-limo-framework">The LIMO Framework</h2>
<p>LIMO consists of three components: a VAE for learning a molecular latent space, a property predictor with a novel stacked architecture, and a gradient-based reverse optimization procedure.</p>
<h3 id="selfies-based-vae">SELFIES-Based VAE</h3>
<p>The VAE encodes molecules represented as SELFIES strings into a latent space $\mathbf{z} \in \mathbb{R}^m$ with $m = 1024$ and decodes to probability distributions over SELFIES symbols. Since every SELFIES string corresponds to a valid molecule, this guarantees 100% chemical validity. The output molecule is obtained by taking the argmax at each position:</p>
<p>$$\hat{x}_i = s_{d_i^*}, \quad d_i^* = \operatorname{argmax}_{d} \, y_{i,d}$$</p>
<p>The VAE uses fully-connected layers (not recurrent), with a 64-dimensional embedding layer, four batch-normalized linear layers (2000-dimensional first layer, 1000-dimensional for the rest) with ReLU activation, and is trained with ELBO loss (0.9 weight on reconstruction, 0.1 on KL divergence).</p>
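<p>The per-position argmax decoding amounts to picking the highest-probability symbol at each position. A toy illustration with an invented four-symbol alphabet (a stand-in for the real SELFIES symbol set):</p>

```python
alphabet = ["[C]", "[N]", "[O]", "[Branch1]"]  # illustrative, not the full SELFIES alphabet
probs = [
    [0.70, 0.10, 0.15, 0.05],  # position 1 -> [C]
    [0.10, 0.05, 0.80, 0.05],  # position 2 -> [O]
    [0.05, 0.85, 0.05, 0.05],  # position 3 -> [N]
]
# d_i^* = argmax_d y_{i,d}; x_hat_i is the symbol at that index
decoded = "".join(alphabet[max(range(len(row)), key=row.__getitem__)] for row in probs)
```

Because any symbol sequence over the SELFIES alphabet decodes to a valid molecule, this hard argmax step never produces syntax errors.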
<h3 id="stacked-property-predictor">Stacked Property Predictor</h3>
<p>The critical architectural choice: the property predictor $g_\theta$ takes the decoded molecular representation $\hat{\mathbf{x}}$ as input rather than the latent code $\mathbf{z}$. The predictor is trained after the VAE is frozen by minimizing MSE on VAE-generated molecules:</p>
<p>$$\ell_0(\theta) = \left\| g_\theta\left(f_{\text{dec}}(\mathbf{z})\right) - \pi\left(f_{\text{dec}}(\mathbf{z})\right) \right\|^2$$</p>
<p>where $\pi$ is the ground-truth property function. This stacking improves prediction accuracy from $r^2 = 0.04$ (predicting from $\mathbf{z}$) to $r^2 = 0.38$ (predicting from $\hat{\mathbf{x}}$) on an unseen test set. The improvement comes because the mapping from molecular space to property is easier to learn than the mapping from latent space to property.</p>
<h3 id="reverse-optimization-inceptionism">Reverse Optimization (Inceptionism)</h3>
<p>After training, the decoder and predictor weights are frozen and $\mathbf{z}$ becomes the trainable parameter. For multiple properties with weights $(w_1, \ldots, w_k)$, the optimization minimizes:</p>
<p>$$\ell_1(\mathbf{z}) = -\sum_{i=1}^{k} w_i \cdot g^i\left(f_{\text{dec}}(\mathbf{z})\right)$$</p>
<p>Since both the decoder and predictor are neural networks, gradients flow through the entire chain, enabling efficient optimization with Adam. This is analogous to the &ldquo;inceptionism&rdquo; (DeepDream) technique from computer vision, where network inputs are optimized to maximize specific outputs.</p>
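<p>Once everything is frozen, the reverse-optimization step reduces to a few lines of PyTorch. The tiny linear networks below stand in for the trained decoder and two property predictors (an assumption for illustration); only the latent code $\mathbf{z}$ receives gradient updates:</p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
latent_dim = 8

# Stand-ins for the frozen decoder and two frozen property predictors.
decoder = nn.Linear(latent_dim, 12)
g1 = nn.Linear(12, 1)
g2 = nn.Linear(12, 1)
for m in (decoder, g1, g2):
    for p in m.parameters():
        p.requires_grad_(False)

w = (1.0, 0.5)                                       # property weights (w_1, w_2)
z = torch.randn(1, latent_dim, requires_grad=True)   # z is the trainable parameter
opt = torch.optim.Adam([z], lr=0.1)                  # lr 0.1, as in the paper's setup

def ell_1(z):
    x_hat = decoder(z)
    return -(w[0] * g1(x_hat) + w[1] * g2(x_hat)).sum()

before = ell_1(z).item()
for _ in range(100):
    opt.zero_grad()
    loss = ell_1(z)
    loss.backward()    # gradients flow through predictor and decoder into z
    opt.step()
after = ell_1(z).item()
print(after < before)  # the weighted property objective improves
```

<p>Freezing the parameters does not block gradients flowing <em>through</em> the networks back to $\mathbf{z}$; it only prevents the weights themselves from updating, which is exactly the inceptionism setup.</p>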
<h3 id="substructure-constrained-optimization">Substructure-Constrained Optimization</h3>
<p>For lead optimization, LIMO can fix a molecular substructure during optimization by adding a regularization term:</p>
<p>$$\ell_2(\mathbf{z}) = \lambda \sum_{i=1}^{n} \sum_{j=1}^{d} \left(M_{i,j} \cdot \left(f_{\text{dec}}(\mathbf{z})_{i,j} - (\hat{\mathbf{x}}_{\text{start}})_{i,j}\right)\right)^2$$</p>
<p>where $M$ is a binary mask specifying which SELFIES positions must remain unchanged and $\lambda = 1000$. This capability is enabled by the intermediate decoded representation, which most VAE-based methods lack.</p>
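<p>A NumPy sketch of the masked penalty, using toy arrays in place of real decoder outputs (the shapes and the mask pattern here are illustrative assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 6  # toy: n SELFIES positions, d vocabulary symbols

x_start = rng.random((n, d))                          # starting molecule (toy)
x_dec = x_start + 0.1 * rng.standard_normal((n, d))   # current decoder output

# Binary mask M: 1 where the substructure must stay fixed (here: first 2 rows).
M = np.zeros((n, d))
M[:2, :] = 1.0

lam = 1000.0  # lambda = 1000, as in the paper
ell_2 = lam * np.sum((M * (x_dec - x_start)) ** 2)

print(ell_2)
```

<p>Positions outside the mask contribute nothing, so the optimizer is free to change them while deviations inside the fixed substructure are heavily penalized.</p>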
<h2 id="experiments-and-results">Experiments and Results</h2>
<h3 id="benchmark-tasks-qed-and-penalized-logp">Benchmark Tasks (QED and Penalized LogP)</h3>
<p>LIMO achieves results competitive with deep generative and RL-based models in 1 hour, compared to 8&ndash;24 hours for baselines. Top QED score: 0.947 (out of a maximum possible 0.948). Top penalized LogP: 10.5 (among length-limited models; MolDQN reaches 11.8).</p>
<p>The ablation study (&ldquo;LIMO on z&rdquo;) confirms the value of the stacked predictor architecture: predicting from $\hat{\mathbf{x}}$ yields a top p-logP of 10.5, versus 6.52 when predicting directly from $\mathbf{z}$.</p>
<h3 id="binding-affinity-maximization">Binding Affinity Maximization</h3>
<p>This is the paper&rsquo;s primary contribution: LIMO generates molecules with substantially higher computed binding affinities (lower $K_D$) than baselines against two protein targets:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>ESR1 best $K_D$ (nM)</th>
          <th>ACAA1 best $K_D$ (nM)</th>
          <th>Time (hrs)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GCPN</td>
          <td>6.4</td>
          <td>75</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MolDQN</td>
          <td>373</td>
          <td>240</td>
          <td>6</td>
      </tr>
      <tr>
          <td>MARS</td>
          <td>17</td>
          <td>163</td>
          <td>6</td>
      </tr>
      <tr>
          <td>GraphDF</td>
          <td>25</td>
          <td>370</td>
          <td>12</td>
      </tr>
      <tr>
          <td>LIMO</td>
          <td>0.72</td>
          <td>37</td>
          <td>1</td>
      </tr>
  </tbody>
</table>
<p>For ESR1, LIMO&rsquo;s best molecule has a $K_D$ of 0.72 nM from docking, nearly 10x better than the next method (GCPN at 6.4 nM). When corroborated with more rigorous absolute binding free energy (ABFE) calculations, one LIMO compound achieved a predicted $K_D$ of $6 \times 10^{-14}$ M (0.00006 nM), far exceeding the affinities of approved drugs tamoxifen ($K_D$ = 1.5 nM) and raloxifene ($K_D$ = 0.03 nM).</p>
<h3 id="multi-objective-optimization">Multi-Objective Optimization</h3>
<p>Single-objective optimization produces molecules with high affinity but problematic structures (polyenes, large rings). Multi-objective optimization simultaneously targeting binding affinity, QED ($&gt;$ 0.4), and SA ($&lt;$ 5.5) produces drug-like, synthesizable molecules that still have nanomolar binding affinities. Generated molecules satisfy Lipinski&rsquo;s rule of 5 with zero PAINS alerts.</p>
<h2 id="limitations">Limitations</h2>
<p>The LIMO property predictor achieves only moderate prediction accuracy ($r^2$ = 0.38), so the optimization relies on the gradient direction being correct rather than on the absolute predictions being accurate. AutoDock-GPU docking scores do not correlate well with the more rigorous ABFE results, a known limitation of docking. The fully-connected VAE architecture limits molecular diversity compared to recurrent or attention-based alternatives (an LSTM decoder produced a max QED of only 0.3). The greedy fine-tuning step (replacing carbons with heteroatoms) is a heuristic rather than a learned procedure.</p>
<h2 id="reproducibility">Reproducibility</h2>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Rose-STL-Lab/LIMO">Rose-STL-Lab/LIMO</a></td>
          <td>Code</td>
          <td>UC San Diego Custom (non-commercial)</td>
          <td>Full training, optimization, and evaluation code</td>
      </tr>
  </tbody>
</table>
<p><strong>Data</strong>: ZINC250k dataset for optimization tasks. MOSES dataset for random generation evaluation. Binding affinities computed with AutoDock-GPU.</p>
<p><strong>Hardware</strong>: Two GTX 1080 Ti GPUs (one for PyTorch, one for AutoDock-GPU), 4 CPU cores, 32 GB memory.</p>
<p><strong>Training</strong>: VAE trained for 18 epochs with learning rate 0.0001. Property predictor uses 3 layers of 1000 units, trained for 5 epochs. Reverse optimization uses learning rate 0.1 for 10 epochs.</p>
<p><strong>Targets</strong>: Human estrogen receptor (ESR1, PDB 1ERR) and human peroxisomal acetyl-CoA acyl transferase 1 (ACAA1, PDB 2IIK).</p>
]]></content:encoded></item></channel></rss>