What kind of paper is this?
This is a Resource and Benchmarking paper. It introduces Molecular Sets (MOSES), a platform designed to standardize the training, comparison, and evaluation of molecular generative models. It provides a standardized dataset, a suite of evaluation metrics, and a collection of baseline models to serve as reference points for the field.
What is the motivation?
Generative models are increasingly popular for drug discovery and material design, capable of exploring the vast chemical space ($10^{23}$ to $10^{80}$ compounds) more efficiently than traditional methods. However, the field faces a significant reproducibility crisis:
- Lack of Standardization: There is no consensus on how to properly compare and rank the efficacy of different generative models.
- Inconsistent Metrics: Different papers use different metrics or distinct implementations of the same metrics.
- Data Variance: Models are often trained on different subsets of chemical databases (like ZINC), making direct comparison impossible.
MOSES aims to solve these issues by providing a unified “measuring stick” for distribution learning models in chemistry.
What is the novelty here?
The core contribution is the standardization of the distribution learning definition for molecular generation. Why focus on distribution learning? While simple rule-based restrictions (like molecular weight limits) are easy to apply, distribution learning allows chemists to apply implicit or soft restrictions. This ensures that generated molecules not only satisfy hard constraints but also reflect complex chemical realities—such as the prevalence of certain substructures or avoiding unstable motifs—defined by the training distribution rather than explicit programming.
Unlike MoleculeNet (which focuses on regression/classification), MOSES focuses on:
- A Clean, Standardized Dataset: A specific subset of the ZINC Clean Leads collection with rigorous filtering.
- Diverse Metrics: A comprehensive suite of metrics that measure not just validity, but also novelty, diversity (internal and external), chemical properties (properties distribution), and substructure similarity.
- Open-Source Platform: A Python library (`molsets`) that decouples the data and evaluation logic from the model implementation, ensuring everyone measures performance exactly the same way (a minimal usage sketch follows this list).
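To make the decoupling concrete, here is a minimal usage sketch of the library (installed as `molsets`, imported as `moses`); the entry points follow the project README, and the `generated` list is a stand-in for samples from your own model.

```python
# Minimal usage sketch of the MOSES platform (pip install molsets).
import moses

# Standardized splits, returned as lists of SMILES strings
train = moses.get_dataset('train')
test = moses.get_dataset('test')

# Stand-in for SMILES sampled from your own generative model
generated = ['CCO', 'c1ccccc1', 'CC(=O)Nc1ccc(O)cc1']

# Computes the full metric suite (validity, unique@k, FCD, SNN, Frag, Scaff, ...)
metrics = moses.get_all_metrics(generated)
print(metrics)
```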
What experiments were performed?
The authors benchmarked a wide variety of generative models against the MOSES dataset to establish baselines:
- Baselines: Character-level RNN (CharRNN), Variational Autoencoder (VAE), Adversarial Autoencoder (AAE), Junction Tree VAE (JTN-VAE), and LatentGAN.
- Non-Neural Baselines: HMM, n-gram models, and a combinatorial generator (randomly connecting fragments).
- Evaluation: Models were trained on the standard set and evaluated on:
- Validity/Uniqueness: Can the model generate valid, non-duplicate SMILES?
- Feature Distribution: Do generated molecules match the physicochemical properties of the training set? Evaluated using the Wasserstein-1 distance on 1D distributions of the following properties (a minimal sketch follows this list):
- LogP: Octanol-water partition coefficient (lipophilicity).
- SA: Synthetic Accessibility score (ease of synthesis).
- QED: Quantitative Estimation of Drug-likeness.
- MW: Molecular Weight.
- Fréchet ChemNet Distance (FCD): Measures similarity in biological/chemical space using the penultimate-layer activations of a pre-trained network (ChemNet).
- Similarity to Nearest Neighbor (SNN): Measures the precision of generation as the average Tanimoto similarity between each generated molecule and its closest match in a reference set.
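As a concrete illustration of the property-distribution comparison, here is a hedged sketch using RDKit descriptors and `scipy.stats.wasserstein_distance`; the SA score is omitted because its reference implementation lives in RDKit's Contrib directory.

```python
# Hedged sketch: Wasserstein-1 comparison of 1D property distributions
# using RDKit descriptors and scipy.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED
from scipy.stats import wasserstein_distance

def property_values(smiles_list, prop_fn):
    """Evaluate a property function on every parseable SMILES string."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [prop_fn(m) for m in mols if m is not None]

def property_w1(generated, reference, prop_fn):
    """Wasserstein-1 distance between two 1D property distributions."""
    return wasserstein_distance(property_values(generated, prop_fn),
                                property_values(reference, prop_fn))

generated = ['CCO', 'c1ccccc1O', 'CC(=O)Nc1ccc(O)cc1']
reference = ['CCN', 'c1ccccc1N', 'CC(=O)Oc1ccccc1C(=O)O']
for name, fn in [('MW', Descriptors.MolWt),
                 ('LogP', Descriptors.MolLogP),
                 ('QED', QED.qed)]:
    print(name, round(property_w1(generated, reference, fn), 3))
```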
What outcomes/conclusions?
- CharRNN Performance: Surprisingly, the simple character-level RNN (CharRNN) outperformed more complex models (like VAEs and GANs) on many metrics, achieving the best FCD scores ($0.073$).
- Metric Trade-offs: No single metric captures “quality.”
- The Combinatorial Generator achieved 100% validity and high diversity but struggled with distribution learning metrics (FCD), indicating it explores chemical space broadly but not naturally.
- VAEs often achieve high Similarity to Nearest Neighbor (SNN) but suffer from low novelty. This indicates they tend to memorize the training set prototypes rather than learning the generalized distribution.
- Implicit Constraints: A major finding was that neural models successfully learned implicit chemical rules (like avoiding PAINS structures) purely from the data distribution rather than explicit programming.
- Recommendation: The authors suggest using FCD/Test for general model ranking, but emphasize checking specific metrics (validity, diversity) to diagnose model failure modes.
Reproducibility Details
Data
The benchmark uses a curated subset of the ZINC Clean Leads collection.
- Source Size: ~4.6M molecules.
- Final Size: 1,936,962 molecules.
- Splits: Train (~1.6M), Test (176k), Scaffold Test (176k).
- Scaffold Test Split: This split isolates generalization. It contains molecules whose Bemis-Murcko scaffolds are completely absent from the training and test sets, so evaluating on it tests a model's ability to generate genuinely novel chemical structures (generalization) rather than variations of known scaffolds (memorization). See the scaffold-extraction sketch after this list.
- Filters Applied (sketched in RDKit after this list):
- Molecular weight: 250–350 Da
- Rotatable bonds: $\leq 7$
- XlogP: $\leq 3.5$
- Atom types: C, N, S, O, F, Cl, Br, H
- No charged atoms or cycles > 8 atoms
- Medicinal Chemistry Filters (MCF) and PAINS filters applied.
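Below is a hedged RDKit sketch of the filtering rules above, plus the Bemis-Murcko scaffold extraction behind the scaffold split; RDKit's Crippen logP stands in for XlogP, and the MCF/PAINS substructure catalogs are omitted for brevity.

```python
# Hedged sketch of the lead-like filters plus Bemis-Murcko scaffold extraction.
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.Scaffolds import MurckoScaffold

ALLOWED_ATOMS = {'C', 'N', 'S', 'O', 'F', 'Cl', 'Br', 'H'}

def passes_filters(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if not 250 <= Descriptors.MolWt(mol) <= 350:
        return False
    if Descriptors.NumRotatableBonds(mol) > 7:
        return False
    if Descriptors.MolLogP(mol) > 3.5:  # Crippen logP as an XlogP stand-in
        return False
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ALLOWED_ATOMS or atom.GetFormalCharge() != 0:
            return False
    # Reject rings with more than 8 atoms
    if any(len(ring) > 8 for ring in mol.GetRingInfo().AtomRings()):
        return False
    return True

def murcko_scaffold(smiles):
    """Canonical Bemis-Murcko scaffold; holding out entire scaffold groups
    yields a scaffold test set disjoint from train/test scaffolds."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))
```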
Evaluation Metrics
MOSES introduces a standard suite of metrics. Key definitions (a hedged RDKit sketch of several of them follows this list):
- Validity: Fraction of valid SMILES strings (via RDKit).
- Unique@k: Fraction of unique molecules in the first $k$ valid samples.
- Novelty: Fraction of generated molecules not present in the training set.
- Internal Diversity (IntDiv): Average Tanimoto distance between generated molecules ($G$), useful for detecting mode collapse: $$ \text{IntDiv}_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p} $$
- Fragment Similarity (Frag): Cosine similarity of fragment frequency vectors (BRICS decomposition) between generated and test sets.
- Scaffold Similarity (Scaff): Cosine similarity of Bemis-Murcko scaffold frequency vectors between sets. Measures how well the model captures higher-level structural motifs.
- Similarity to Nearest Neighbor (SNN): The average Tanimoto similarity between a generated molecule’s fingerprint and its nearest neighbor in the reference set. This serves as a measure of precision; high SNN suggests the model produces molecules very similar to the training distribution, potentially indicating memorization if novelty is low.
- Fréchet ChemNet Distance (FCD): Fréchet distance between the Gaussian approximations (mean $\mu$, covariance $\Sigma$) of penultimate-layer ChemNet activations for the generated set $G$ and the reference set $R$: $$ \text{FCD}(G, R) = \|\mu_G - \mu_R\|^2 + \text{Tr}\left(\Sigma_G + \Sigma_R - 2(\Sigma_G \Sigma_R)^{1/2}\right) $$ This measures how close the distribution of generated molecules is to the real distribution in chemical/biological space. The authors note that FCD correlates heavily with other metrics; for example, low diversity or low uniqueness shrinks the variance of the generated distribution, which worsens (increases) the FCD.
- Properties Distribution (Wasserstein-1): The 1D Wasserstein-1 distance between the distributions of molecular properties (MW, LogP, SA, QED) in the generated and test sets.
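The sample-based metrics above are straightforward to approximate. Below is a hedged sketch of validity, unique@k, novelty, IntDiv (with $p=1$), and SNN using RDKit Morgan fingerprints, plus the Fréchet distance between Gaussian fits of two activation matrices; MOSES's reference implementation differs in details (fingerprint parameters, canonicalization, batching), so treat this as illustrative.

```python
# Hedged sketch of the MOSES sample-based metrics; not the reference implementation.
import numpy as np
from scipy.linalg import sqrtm
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def canonical(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def validity(gen):
    return sum(canonical(s) is not None for s in gen) / len(gen)

def unique_at_k(gen, k):
    valid = [c for c in map(canonical, gen) if c is not None][:k]
    return len(set(valid)) / len(valid)

def novelty(gen, train):
    valid = {c for c in map(canonical, gen) if c is not None}
    return len(valid - set(map(canonical, train))) / len(valid)

def fingerprints(smiles_list):
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024)
            for m in mols if m is not None]

def int_div(gen, p=1):
    # 1 - (mean pairwise Tanimoto^p)^(1/p), including self-pairs as in the formula
    fps = fingerprints(gen)
    sims = [DataStructs.TanimotoSimilarity(a, b) ** p for a in fps for b in fps]
    return 1 - (sum(sims) / len(sims)) ** (1 / p)

def snn(gen, ref):
    # Average similarity of each generated molecule to its nearest reference neighbor
    gen_fps, ref_fps = fingerprints(gen), fingerprints(ref)
    return sum(max(DataStructs.TanimotoSimilarity(g, r) for r in ref_fps)
               for g in gen_fps) / len(gen_fps)

def frechet_distance(act_g, act_r):
    # Fréchet distance between Gaussian fits of (n_samples, dim) activation arrays
    mu_g, mu_r = act_g.mean(0), act_r.mean(0)
    cov_g = np.cov(act_g, rowvar=False)
    cov_r = np.cov(act_r, rowvar=False)
    covmean = sqrtm(cov_g @ cov_r).real
    return float(np.sum((mu_g - mu_r) ** 2) + np.trace(cov_g + cov_r - 2 * covmean))
```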
Models & Baselines
The paper selects baselines to represent different theoretical approaches to distribution learning:
- Explicit Density Models: Models where the probability mass function $P(x)$ can be computed analytically.
- N-gram (and the HMM): Simple statistical models; they failed to generate valid molecules reliably because they capture only short-range dependencies in SMILES strings (see the factorization after this list).
- Implicit Density Models: Models that cannot compute $P(x)$ explicitly but can sample from the distribution.
- VAE/AAE: The VAE optimizes a lower bound on the log-likelihood (the ELBO); the AAE replaces the KL regularizer with adversarial training on the latent space.
- GANs (LatentGAN): Directly minimize the distance between real and generated distributions via a discriminator, without ever evaluating $P(x)$.
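To make the explicit/implicit distinction concrete, consider an $n$-gram model: it factorizes the probability of a SMILES string character by character, $$ P(x) = \prod_{t=1}^{T} P(x_t \mid x_{t-n+1}, \dots, x_{t-1}), $$ so the likelihood of any string can be evaluated exactly. A GAN, by contrast, only provides a sampler; $P(x)$ is never computed.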
Models are also distinguished by their data representation:
- String-based (SMILES): Models like CharRNN, VAE, and AAE treat molecules as SMILES strings. They learn the syntax of this formal language by predicting the next character; a SMILES string is itself a depth-first traversal of the molecular graph (e.g., benzene is written c1ccccc1).
- Graph-based: JTN-VAE operates directly on molecular subgraphs (junction tree), ensuring chemical validity by construction but often requiring more complex training.
Key baselines implemented in PyTorch (a minimal CharRNN sketch follows this list):
- CharRNN: LSTM-based sequence model.
- VAE/AAE: Encoder-decoder architectures with KL or adversarial regularization.
- LatentGAN: GAN trained on the latent space of a pre-trained autoencoder.
- JTN-VAE: Tree-structured graph generation.
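For flavor, here is a minimal PyTorch sketch in the spirit of CharRNN; the layer sizes, special-token indices, and sampling loop are illustrative assumptions, not the paper's exact configuration (the released models live in the MOSES repository).

```python
# Minimal character-level SMILES RNN sketch; hyperparameters are illustrative.
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer-encoded SMILES characters
        out, state = self.lstm(self.embed(tokens), state)
        return self.head(out), state  # logits over the next character

    @torch.no_grad()
    def sample(self, bos_idx, eos_idx, max_len=100):
        # Autoregressive sampling, one character at a time
        token = torch.tensor([[bos_idx]])
        state, generated = None, []
        for _ in range(max_len):
            logits, state = self(token, state)
            token = torch.multinomial(logits[:, -1].softmax(-1), 1)
            if token.item() == eos_idx:
                break
            generated.append(token.item())
        return generated  # decode back to a SMILES string downstream
```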
Paper Information
Citation: Polykovskiy, D., et al. (2020). Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Frontiers in Pharmacology, 11, 565644. https://doi.org/10.3389/fphar.2020.565644
Publication: Frontiers in Pharmacology, 2020
@article{polykovskiy2020moses,
title={Molecular Sets (MOSES): A benchmarking platform for molecular generation models},
author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and others},
journal={Frontiers in pharmacology},
volume={11},
pages={565644},
year={2020},
publisher={Frontiers}
}