What kind of paper is this?
This is a Resource and Benchmarking paper. It introduces Molecular Sets (MOSES), a platform designed to standardize the training, comparison, and evaluation of molecular generative models. It provides a standardized dataset, a suite of evaluation metrics, and a collection of baseline models to serve as reference points for the field.
What is the motivation?
Generative models are increasingly popular for drug discovery and material design, capable of exploring the vast chemical space ($10^{23}$ to $10^{80}$ compounds) more efficiently than traditional methods. However, the field faces a significant reproducibility crisis:
- Lack of Standardization: There is no consensus on how to properly compare and rank the efficacy of different generative models.
- Inconsistent Metrics: Different papers use different metrics or distinct implementations of the same metrics.
- Data Variance: Models are often trained on different subsets of chemical databases (like ZINC), making direct comparison impossible.
MOSES aims to solve these issues by providing a unified “measuring stick” for distribution learning models in chemistry.
What is the novelty here?
The core contribution is the standardization of the distribution learning definition for molecular generation. Why focus on distribution learning? While simple rule-based restrictions (like molecular weight limits) are easy to apply, distribution learning allows chemists to apply implicit or soft restrictions. This ensures that generated molecules not only satisfy hard constraints but also reflect complex chemical realities—such as the prevalence of certain substructures or avoiding unstable motifs—defined by the training distribution rather than explicit programming.
Unlike MoleculeNet (which focuses on regression/classification), MOSES focuses on:
- A Clean, Standardized Dataset: A specific subset of the ZINC Clean Leads collection with rigorous filtering.
- Diverse Metrics: A comprehensive suite of metrics that measure not just validity, but also novelty, diversity (internal and external), chemical properties (properties distribution), and substructure similarity.
- Open-Source Platform: A Python library (`molsets`) that decouples the data and evaluation logic from the model implementation, ensuring everyone measures performance exactly the same way (a minimal usage sketch follows this list).
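To make the decoupling concrete, here is a minimal usage sketch of the library (installed as `molsets`, imported as `moses`); the entry points follow the project README, and the `generated` list is a stand-in for samples from your own model.

```python
# Minimal usage sketch of the MOSES platform (pip install molsets).
import moses

# Standardized splits, returned as lists of SMILES strings
train = moses.get_dataset('train')
test = moses.get_dataset('test')

# Stand-in for SMILES sampled from your own generative model
generated = ['CCO', 'c1ccccc1', 'CC(=O)Nc1ccc(O)cc1']

# Computes the full metric suite (validity, unique@k, FCD, SNN, Frag, Scaff, ...)
metrics = moses.get_all_metrics(generated)
print(metrics)
```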
What experiments were performed?
The authors benchmarked a wide variety of generative models against the MOSES dataset to establish baselines:
- Baselines: Character-level RNN (CharRNN), Variational Autoencoder (VAE), Adversarial Autoencoder (AAE), Junction Tree VAE (JTN-VAE), and LatentGAN.
- Non-Neural Baselines: HMM, n-gram models, and a combinatorial generator (randomly connecting fragments).
- Evaluation: Models were trained on the standard set and evaluated on:
- Validity/Uniqueness: Can the model generate valid, non-duplicate SMILES?
- Feature Distribution: Do generated molecules match the physicochemical properties of the training set? Evaluated using the Wasserstein-1 distance on 1D distributions of the following properties (a minimal sketch follows this list):
- LogP: Octanol-water partition coefficient (lipophilicity).
- SA: Synthetic Accessibility score (ease of synthesis).
- QED: Quantitative Estimation of Drug-likeness.
- MW: Molecular Weight.
- Fréchet ChemNet Distance (FCD): Measures similarity in biological/chemical space using the penultimate-layer activations of a pre-trained network (ChemNet).
- Similarity to Nearest Neighbor (SNN): Measures the precision of generation as the average Tanimoto similarity between each generated molecule and its closest match in a reference set.
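As a concrete illustration of the property-distribution comparison, here is a hedged sketch using RDKit descriptors and `scipy.stats.wasserstein_distance`; the SA score is omitted because its reference implementation lives in RDKit's Contrib directory.

```python
# Hedged sketch: Wasserstein-1 comparison of 1D property distributions
# using RDKit descriptors and scipy.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED
from scipy.stats import wasserstein_distance

def property_values(smiles_list, prop_fn):
    """Evaluate a property function on every parseable SMILES string."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [prop_fn(m) for m in mols if m is not None]

def property_w1(generated, reference, prop_fn):
    """Wasserstein-1 distance between two 1D property distributions."""
    return wasserstein_distance(property_values(generated, prop_fn),
                                property_values(reference, prop_fn))

generated = ['CCO', 'c1ccccc1O', 'CC(=O)Nc1ccc(O)cc1']
reference = ['CCN', 'c1ccccc1N', 'CC(=O)Oc1ccccc1C(=O)O']
for name, fn in [('MW', Descriptors.MolWt),
                 ('LogP', Descriptors.MolLogP),
                 ('QED', QED.qed)]:
    print(name, round(property_w1(generated, reference, fn), 3))
```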
What outcomes/conclusions?
- CharRNN Performance: Surprisingly, the simple character-level RNN (CharRNN) outperformed more complex models (like VAEs and GANs) on many metrics, achieving the best FCD scores ($0.073$).
- Metric Trade-offs: No single metric captures “quality.”
- The Combinatorial Generator achieved 100% validity and high diversity but struggled with distribution learning metrics (FCD), indicating it explores chemical space broadly but not naturally.
- VAEs often achieve high Similarity to Nearest Neighbor (SNN) but suffer from low novelty. This indicates they tend to memorize the training set prototypes rather than learning the generalized distribution.
- Implicit Constraints: A major finding was that neural models successfully learned implicit chemical rules (like avoiding PAINS structures) purely from the data distribution rather than explicit programming.
- Recommendation: The authors suggest using FCD/Test for general model ranking, but emphasize checking specific metrics (validity, diversity) to diagnose model failure modes.
Reproducibility Details
Data
The benchmark uses a curated subset of the ZINC Clean Leads collection.
- Source Size: ~4.6M molecules.
- Final Size: 1,936,962 molecules.
- Splits: Train (~1.6M), Test (176k), Scaffold Test (176k).
- Scaffold Test Split: This split isolates generalization. It contains molecules whose Bemis-Murcko scaffolds are completely absent from the training and test sets, so evaluating on it tests a model's ability to generate genuinely novel chemical structures (generalization) rather than variations of known scaffolds (memorization). See the scaffold-extraction sketch after this list.
- Filters Applied (sketched in RDKit after this list):
- Molecular weight: 250–350 Da
- Rotatable bonds: $\leq 7$
- XlogP: $\leq 3.5$
- Atom types: C, N, S, O, F, Cl, Br, H
- No charged atoms or cycles > 8 atoms
- Medicinal Chemistry Filters (MCF) and PAINS filters applied.
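Below is a hedged RDKit sketch of the filtering rules above, plus the Bemis-Murcko scaffold extraction behind the scaffold split; RDKit's Crippen logP stands in for XlogP, and the MCF/PAINS substructure catalogs are omitted for brevity.

```python
# Hedged sketch of the lead-like filters plus Bemis-Murcko scaffold extraction.
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.Scaffolds import MurckoScaffold

ALLOWED_ATOMS = {'C', 'N', 'S', 'O', 'F', 'Cl', 'Br', 'H'}

def passes_filters(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    if not 250 <= Descriptors.MolWt(mol) <= 350:
        return False
    if Descriptors.NumRotatableBonds(mol) > 7:
        return False
    if Descriptors.MolLogP(mol) > 3.5:  # Crippen logP as an XlogP stand-in
        return False
    for atom in mol.GetAtoms():
        if atom.GetSymbol() not in ALLOWED_ATOMS or atom.GetFormalCharge() != 0:
            return False
    # Reject rings with more than 8 atoms
    if any(len(ring) > 8 for ring in mol.GetRingInfo().AtomRings()):
        return False
    return True

def murcko_scaffold(smiles):
    """Canonical Bemis-Murcko scaffold; holding out entire scaffold groups
    yields a scaffold test set disjoint from train/test scaffolds."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(MurckoScaffold.GetScaffoldForMol(mol))
```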
Evaluation Metrics
MOSES introduces a standard suite of metrics. Key definitions (a hedged RDKit sketch of several of them follows this list):
- Validity: Fraction of valid SMILES strings (via RDKit).
- Unique@k: Fraction of unique molecules in the first $k$ valid samples.
- Novelty: Fraction of generated molecules not present in the training set.
- Internal Diversity (IntDiv): Average Tanimoto distance between generated molecules ($G$), useful for detecting mode collapse: $$ \text{IntDiv}_p(G) = 1 - \sqrt[p]{\frac{1}{|G|^2} \sum_{m_1, m_2 \in G} T(m_1, m_2)^p} $$
- Fragment Similarity (Frag): Cosine similarity of fragment frequency vectors (BRICS decomposition) between generated and test sets.
- Scaffold Similarity (Scaff): Cosine similarity of Bemis-Murcko scaffold frequency vectors between sets. Measures how well the model captures higher-level structural motifs.
- Similarity to Nearest Neighbor (SNN): The average Tanimoto similarity between a generated molecule’s fingerprint and its nearest neighbor in the reference set. This serves as a measure of precision; high SNN suggests the model produces molecules very similar to the training distribution, potentially indicating memorization if novelty is low.
- Fréchet ChemNet Distance (FCD): Fréchet distance between the Gaussian approximations (mean $\mu$, covariance $\Sigma$) of penultimate-layer ChemNet activations for the generated set $G$ and the reference set $R$: $$ \text{FCD}(G, R) = \|\mu_G - \mu_R\|^2 + \text{Tr}\left(\Sigma_G + \Sigma_R - 2(\Sigma_G \Sigma_R)^{1/2}\right) $$ This measures how close the distribution of generated molecules is to the real distribution in chemical/biological space. The authors note that FCD correlates heavily with other metrics; for example, low diversity or low uniqueness shrinks the variance of the generated distribution, which worsens (increases) the FCD.
- Properties Distribution (Wasserstein-1): The 1D Wasserstein-1 distance between the distributions of molecular properties (MW, LogP, SA, QED) in the generated and test sets.
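The sample-based metrics above are straightforward to approximate. Below is a hedged sketch of validity, unique@k, novelty, IntDiv (with $p=1$), and SNN using RDKit Morgan fingerprints, plus the Fréchet distance between Gaussian fits of two activation matrices; MOSES's reference implementation differs in details (fingerprint parameters, canonicalization, batching), so treat this as illustrative.

```python
# Hedged sketch of the MOSES sample-based metrics; not the reference implementation.
import numpy as np
from scipy.linalg import sqrtm
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def canonical(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def validity(gen):
    return sum(canonical(s) is not None for s in gen) / len(gen)

def unique_at_k(gen, k):
    valid = [c for c in map(canonical, gen) if c is not None][:k]
    return len(set(valid)) / len(valid)

def novelty(gen, train):
    valid = {c for c in map(canonical, gen) if c is not None}
    return len(valid - set(map(canonical, train))) / len(valid)

def fingerprints(smiles_list):
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024)
            for m in mols if m is not None]

def int_div(gen, p=1):
    # 1 - (mean pairwise Tanimoto^p)^(1/p), including self-pairs as in the formula
    fps = fingerprints(gen)
    sims = [DataStructs.TanimotoSimilarity(a, b) ** p for a in fps for b in fps]
    return 1 - (sum(sims) / len(sims)) ** (1 / p)

def snn(gen, ref):
    # Average similarity of each generated molecule to its nearest reference neighbor
    gen_fps, ref_fps = fingerprints(gen), fingerprints(ref)
    return sum(max(DataStructs.TanimotoSimilarity(g, r) for r in ref_fps)
               for g in gen_fps) / len(gen_fps)

def frechet_distance(act_g, act_r):
    # Fréchet distance between Gaussian fits of (n_samples, dim) activation arrays
    mu_g, mu_r = act_g.mean(0), act_r.mean(0)
    cov_g = np.cov(act_g, rowvar=False)
    cov_r = np.cov(act_r, rowvar=False)
    covmean = sqrtm(cov_g @ cov_r).real
    return float(np.sum((mu_g - mu_r) ** 2) + np.trace(cov_g + cov_r - 2 * covmean))
```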
Models & Baselines
The paper selects baselines to represent different theoretical approaches to distribution learning:
- Explicit Density Models: Models where the probability mass function $P(x)$ can be computed analytically.
- N-gram (and the HMM): Simple statistical models; they failed to generate valid molecules reliably because they capture only short-range dependencies in SMILES strings (see the factorization after this list).
- Implicit Density Models: Models that cannot compute $P(x)$ explicitly but can sample from the distribution.
- VAE/AAE: The VAE optimizes a lower bound on the log-likelihood (the ELBO); the AAE replaces the KL regularizer with adversarial training on the latent space.
- GANs (LatentGAN): Directly minimize the distance between real and generated distributions via a discriminator, without ever evaluating $P(x)$.
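To make the explicit/implicit distinction concrete, consider an $n$-gram model: it factorizes the probability of a SMILES string character by character, $$ P(x) = \prod_{t=1}^{T} P(x_t \mid x_{t-n+1}, \dots, x_{t-1}), $$ so the likelihood of any string can be evaluated exactly. A GAN, by contrast, only provides a sampler; $P(x)$ is never computed.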
Models are also distinguished by their data representation:
- String-based (SMILES): Models like CharRNN, VAE, and AAE treat molecules as SMILES strings. They learn the syntax of this formal language by predicting the next character; a SMILES string is itself a depth-first traversal of the molecular graph (e.g., benzene is written c1ccccc1).
- Graph-based: JTN-VAE operates directly on molecular subgraphs (junction tree), ensuring chemical validity by construction but often requiring more complex training.
Key baselines implemented in PyTorch (a minimal CharRNN sketch follows this list):
- CharRNN: LSTM-based sequence model.
- VAE/AAE: Encoder-decoder architectures with KL or adversarial regularization.
- LatentGAN: GAN trained on the latent space of a pre-trained autoencoder.
- JTN-VAE: Tree-structured graph generation.
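For flavor, here is a minimal PyTorch sketch in the spirit of CharRNN; the layer sizes, special-token indices, and sampling loop are illustrative assumptions, not the paper's exact configuration (the released models live in the MOSES repository).

```python
# Minimal character-level SMILES RNN sketch; hyperparameters are illustrative.
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=256, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, seq_len) integer-encoded SMILES characters
        out, state = self.lstm(self.embed(tokens), state)
        return self.head(out), state  # logits over the next character

    @torch.no_grad()
    def sample(self, bos_idx, eos_idx, max_len=100):
        # Autoregressive sampling, one character at a time
        token = torch.tensor([[bos_idx]])
        state, generated = None, []
        for _ in range(max_len):
            logits, state = self(token, state)
            token = torch.multinomial(logits[:, -1].softmax(-1), 1)
            if token.item() == eos_idx:
                break
            generated.append(token.item())
        return generated  # decode back to a SMILES string downstream
```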
Paper Information
Citation: Polykovskiy, D., et al. (2020). Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Frontiers in Pharmacology, 11, 565644. https://doi.org/10.3389/fphar.2020.565644
Publication: Frontiers in Pharmacology, 2020
@article{polykovskiy2020moses,
title={Molecular Sets (MOSES): A benchmarking platform for molecular generation models},
author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and others},
journal={Frontiers in pharmacology},
volume={11},
pages={565644},
year={2020},
publisher={Frontiers}
}