CogMol: Controlled Molecule Generation for COVID-19

A Controlled Generation Framework for Target-Specific Drug Design

This is a Method paper that introduces CogMol (Controlled Generation of Molecules), an end-to-end framework for de novo drug design. The primary contribution is a pipeline that combines a SMILES-based Variational Autoencoder (VAE) with multi-attribute controlled latent space sampling (CLaSS) to generate novel drug-like molecules with high binding affinity to specified protein targets, off-target selectivity, and favorable drug-likeness properties. The framework operates on protein sequence embeddings, allowing it to generalize to unseen target proteins without model retraining.

Multi-Constraint Drug Design for Novel Viral Targets

Traditional drug discovery costs 2-3 billion USD and takes over a decade with less than 10% success rate. Generating drug molecules requires satisfying multiple competing objectives simultaneously: target binding affinity, off-target selectivity, synthetic accessibility, drug-likeness, and low toxicity. Prior generative approaches using reinforcement learning or Bayesian optimization are computationally expensive and typically require fine-tuning on target-specific ligand libraries, making them unable to generalize to unseen protein targets.

The emergence of SARS-CoV-2 in 2020 created an urgent need for antiviral drug candidates targeting novel viral proteins. Because no binding affinity data existed for these new targets, and the viral proteins were not closely related to proteins in existing databases like BindingDB, existing target-specific generative frameworks could not be directly applied. CogMol addresses this by using pre-trained protein sequence embeddings from UniRep (trained on 24 million UniRef50 sequences) rather than learning protein representations from the limited BindingDB training set.

Controlled Latent Space Sampling with Pre-trained Protein Embeddings

CogMol’s core innovation is a three-component architecture that enables multi-constraint molecule generation for unseen targets:

1. SMILES VAE with adaptive pre-training. A Variational Autoencoder is first trained unsupervised on the MOSES/ZINC dataset (1.6M molecules), then jointly fine-tuned with QED and SA property predictors on BindingDB molecules. The standard VAE objective is:

$$\mathcal{L}_{\text{VAE}}(\theta, \phi) = \mathbb{E}_{p(x)} \left\{ \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) | p(z)) \right\}$$

where $q_\phi(z|x) = \mathcal{N}(z; \mu(x), \Sigma(x))$ specifies a diagonal Gaussian encoder distribution.

2. Protein-molecule binding affinity predictor. A regression model takes pre-trained UniRep protein sequence embeddings and molecule latent embeddings $z$ as input and predicts pIC50 binding affinity ($= -\log(\text{IC50})$). Because UniRep embeddings capture sequence, structural, and functional relationships from a large unsupervised corpus, the predictor can estimate binding affinity for novel target sequences not present in the training data.

3. CLaSS controlled sampling. The Conditional Latent attribute Space Sampling scheme generates molecules satisfying multiple constraints (affinity, QED, selectivity) through rejection sampling in the VAE latent space:

$$p(\mathbf{x} | \mathbf{a}) = \mathbb{E}_{\mathbf{z}} [p(\mathbf{z} | \mathbf{a}) , p(\mathbf{x} | \mathbf{z})] \approx \mathbb{E}_{\mathbf{z}} [\hat{p}_\xi(\mathbf{z} | \mathbf{a}) , p_\theta(\mathbf{x} | \mathbf{z})]$$

where $\mathbf{a} = [a_1, a_2, \ldots, a_n]$ is a set of independent attribute constraints. The conditional density $\hat{p}_\xi(\mathbf{z} | \mathbf{a})$ is approximated using a Gaussian mixture model $Q_\xi(\mathbf{z})$ and per-attribute classifiers $q_\xi(a_i | \mathbf{z})$, with Bayes’ rule and conditional independence assumptions. The acceptance probability equals the product of all attribute predictor scores, enabling efficient multi-constraint sampling without surrogate model or policy learning.

Selectivity modeling. Off-target selectivity for a molecule $m$ against target $T$ is defined as:

$$\text{Sel}_{T,m} = \text{BA}(T, m) - \frac{1}{k} \sum_{i=1}^{k} \text{BA}(T_i, m)$$

where $\text{BA}(T, m)$ is binding affinity to the target and $T_i$ are $k$ randomly selected off-targets. This selectivity score is incorporated as a control attribute during CLaSS sampling.

Experimental Setup: COVID-19 Targets and In Silico Screening

Target proteins. CogMol was applied to three SARS-CoV-2 targets not present in BindingDB: NSP9 Replicase dimer, Main Protease (Mpro), and the Receptor-Binding Domain (RBD) of the spike protein. A cancer target (human HDAC1) with low ligand coverage in the training data was also evaluated.

Training data. The SMILES VAE was trained on the MOSES benchmark (1.6M molecules from ZINC). The binding affinity predictor used curated IC50 data from BindingDB as reported in DeepAffinity, with all protein classes included in training.

CLaSS controlled generation. Molecules were generated with simultaneous constraints on binding affinity (> 0.5 normalized), QED (> 0.8 normalized), and selectivity (> 0.5 normalized). Approximately 1000 molecules per target were selected for downstream evaluation.

In silico screening pipeline. Generated molecules underwent:

Toxicity prediction via a multi-task deep neural network (MT-DNN) on 12 Tox21 in vitro endpoints and ClinTox clinical trial failure
Binding affinity rescoring with a higher-accuracy SMILES-level predictor
Blind docking (5 independent runs per molecule) using AutoDock Vina against target protein structures
Synthetic feasibility assessment using a retrosynthetic algorithm based on the Molecular Transformer trained on patent reaction data

Baselines. VAE performance was benchmarked against models from the MOSES platform. CLaSS-accepted molecules were compared against randomly sampled molecules from the latent space. Generated molecules were compared against FDA-approved drugs for toxicity and synthesizability.

Key Results

CLaSS enrichment (Table 1). CLaSS consistently produced higher fractions of molecules meeting all criteria compared to random sampling. For the triple constraint (affinity > 0.5, QED > 0.8, selectivity > 0.5), the enrichment was substantial: 6.9% vs. 0.7% for NSP9, 9.0% vs. 0.9% for RBD, and 10.4% vs. 1.1% for Mpro.

Target	CLaSS (Aff+QED+Sel)	Random (Aff+QED+Sel)	Enrichment
NSP9	6.9%	0.7%	~10x
RBD	9.0%	0.9%	~10x
Mpro	10.4%	1.1%	~9.5x

Docking results (Table 3). 87-95% of high-affinity generated molecules showed docking binding free energy (BFE) below -6 kcal/mol, with minimum BFEs reaching -8.6 to -9.5 kcal/mol depending on the target.

Novelty. The likelihood of generating an exact duplicate of a training molecule was 2% or less. Against the full PubChem database (~103M molecules), exact matches ranged from 3.7% to 9.5%. Generated molecules also showed novel chemical scaffolds as confirmed by high Frechet ChemNet Distance.

Synthesizability. Generated molecules for COVID-19 targets showed 85-90% synthetic feasibility using retrosynthetic analysis, exceeding the ~78% rate of FDA-approved drugs.

Toxicity. Approximately 70% of generated parent molecules and ~80% of predicted metabolites were toxic in 0-1 endpoints out of 13, comparable to FDA-approved drugs.

Generated Molecules Show Favorable Binding and Drug-Like Properties

CogMol demonstrates that controlled latent space sampling with pre-trained protein embeddings can generate novel, drug-like molecules for unseen viral targets. The key findings are:

CLaSS provides roughly 10x enrichment over random latent space sampling for molecules satisfying all three constraints (affinity, QED, selectivity).
Generated molecules bind favorably to druggable pockets in target protein 3D structures, even though the generation model uses only 1D sequence information.
Some generated SMILES matched existing PubChem molecules with known biological activity, suggesting the model identifies chemically relevant regions of molecular space.
The framework generalizes across targets of varying novelty, with Mpro (more similar to training proteins) yielding easier generation than NSP9 or RBD.

Limitations. The authors note that no wet-lab validation was performed on generated candidates. There may be divergence between ML-predicted properties and experimental measurements. The binding affinity predictor’s accuracy is bounded by the quality and coverage of BindingDB training data. Selectivity modeling uses a random sample of off-targets rather than a pharmacologically curated panel.

Future directions. The authors propose incorporating additional contexts beyond target protein (e.g., metabolic pathways), adding more pharmacologically relevant controls, and weighting objectives by relative importance.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
VAE pre-training	MOSES/ZINC	1.6M train, 176K test	Publicly available benchmark
VAE adaptive training	BindingDB (DeepAffinity split)	~27K protein-ligand pairs	Curated IC50 data
Protein embeddings	UniRef50 via UniRep	24M sequences	Pre-trained, publicly available
Toxicity prediction	Tox21 + ClinTox	12 in vitro + clinical endpoints	Public benchmark datasets
Docking validation	PDB structures	3 SARS-CoV-2 targets	Public crystal structures

Algorithms

VAE architecture: SMILES encoder-decoder with diagonal Gaussian latent space, jointly trained with QED and SA regressors
CLaSS: rejection sampling from Gaussian mixture model of latent space with per-attribute classifiers
Binding affinity: regression on concatenated UniRep protein embeddings and VAE molecule embeddings
Selectivity: excess binding affinity over average of $k$ random off-targets

Models

SMILES VAE with adaptive pre-training (ZINC then BindingDB)
Multi-task toxicity classifier (MT-DNN) for Tox21 and ClinTox endpoints
Binding affinity predictor (latent-level for generation, SMILES-level for screening)
Retrosynthetic predictor based on Molecular Transformer

Evaluation

Metric	Value	Baseline	Notes
Validity	90%	-	Generated SMILES
Uniqueness	99%	-	Among valid molecules
Filter pass	95%	-	Relevant chemical filters
Docking BFE < -6 kcal/mol	87-95%	-	Varies by target
Synthetic feasibility	85-90%	78% (FDA drugs)	COVID-19 targets
Low toxicity (0-1 endpoints)	~70% parent, ~80% metabolite	Comparable to FDA drugs	MT-DNN prediction

Hardware

The paper does not specify GPU types or training times. The work was funded internally by IBM Research.

Artifacts

Artifact	Type	License	Notes
CogMol (GitHub)	Code	Apache-2.0	Official implementation
~3500 generated molecules	Dataset	Open license	For three SARS-CoV-2 targets

Paper Information

Citation: Chenthamarakshan, V., Das, P., Hoffman, S. C., Strobelt, H., Padhi, I., Lim, K. W., Hoover, B., Manica, M., Born, J., Laino, T., & Mojsilovic, A. (2020). CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. Advances in Neural Information Processing Systems, 33, 4320-4332.

@inproceedings{chenthamarakshan2020cogmol,
  title={CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models},
  author={Chenthamarakshan, Vijil and Das, Payel and Hoffman, Samuel C. and Strobelt, Hendrik and Padhi, Inkit and Lim, Kar Wai and Hoover, Benjamin and Manica, Matteo and Born, Jannis and Laino, Teodoro and Mojsilovi{\'c}, Aleksandra},
  booktitle={Advances in Neural Information Processing Systems},
  volume={33},
  pages={4320--4332},
  year={2020}
}

A Controlled Generation Framework for Target-Specific Drug Design#

Multi-Constraint Drug Design for Novel Viral Targets#

Controlled Latent Space Sampling with Pre-trained Protein Embeddings#

Experimental Setup: COVID-19 Targets and In Silico Screening#

Key Results#

Generated Molecules Show Favorable Binding and Drug-Like Properties#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#