A Training-Free Algorithm for Molecular Generation
This is a Method paper that introduces STONED (Superfast Traversal, Optimization, Novelty, Exploration and Discovery), a suite of algorithms for molecular generation and chemical space exploration. STONED operates entirely through string manipulations on the SELFIES molecular representation, avoiding the need for deep learning models, training data, or GPU resources. The key claim is that simple character-level mutations and interpolations in SELFIES can achieve results competitive with state-of-the-art deep generative models on standard benchmarks.
Why Deep Generative Models May Be Overkill
Deep generative models (VAEs, GANs, RNNs, reinforcement learning) have become popular for inverse molecular design, but they come with practical costs: large training datasets, expensive GPU compute, and long training times. Fragile representations like SMILES compound the problem, since large portions of a latent space can map to invalid molecules. Even with the introduction of SELFIES (a 100% valid string representation), prior work still embedded it within neural network architectures.
The authors argue that for tasks like local chemical space exploration and molecular interpolation, the guarantees of SELFIES alone may be sufficient. Because every SELFIES string maps to a valid molecule, random character mutations always produce valid structures. This observation eliminates the need for learned generation procedures entirely.
Core Innovation: SELFIES String Mutations as Molecular Operators
STONED relies on four key techniques built on SELFIES string manipulations:
1. Random character mutations. A point mutation in SELFIES (character replacement, deletion, or addition) always yields a valid molecule. The position of mutations serves as a hyperparameter controlling exploration vs. exploitation: terminal character mutations preserve more structural similarity to the seed, while random mutations explore more broadly.
2. Multiple SMILES orderings. A single molecule has many valid SMILES strings, each mapping to a different SELFIES. By generating 50,000 SMILES orderings and converting to SELFIES before mutation, the diversity of generated structures increases substantially.
3. Deterministic interpolation. Given two SELFIES strings (padded to equal length), characters at equivalent positions can be successively replaced from the start molecule to the target molecule. Every intermediate string is a valid molecule. A chemical path is extracted by keeping only those intermediates that increase fingerprint similarity to the target.
4. Fingerprint-based filtering. Since edit distance in SELFIES does not reflect molecular similarity, STONED uses fingerprint comparisons (ECFP4, FCFP4, atom-pair) to enforce structural similarity constraints.
The authors also propose a revised joint molecular similarity metric for evaluating median molecules. Given $n$ reference molecules $M = {m_1, m_2, \ldots, m_n}$, the joint similarity of a candidate molecule $m$ is:
$$ F(m) = \frac{1}{n} \sum_{i=1}^{n} \text{sim}(m_i, m) - \left[\max_{i} \text{sim}(m_i, m) - \min_{i} \text{sim}(m_i, m)\right] $$
This penalizes candidates that are similar to only a subset of references, unlike the geometric mean metric used in GuacaMol which can yield high scores even with lopsided similarities.
Experimental Setup and Applications
Local chemical subspace formation
Starting from a single seed molecule (aripiprazole, albuterol, mestranol, or celecoxib), the algorithm generates 50,000 SMILES orderings and performs 1-5 point mutations per ordering, producing 250,000 candidate strings. Unique valid molecules are filtered by fingerprint similarity thresholds.
| Starting structure | Fingerprint | Molecules at $\delta > 0.75$ | Molecules at $\delta > 0.60$ | Molecules at $\delta > 0.40$ |
|---|---|---|---|---|
| Aripiprazole (SELFIES, random) | ECFP4 | 513 (0.25%) | 4,206 (2.15%) | 34,416 (17.66%) |
| Albuterol (SELFIES, random) | FCFP4 | 587 (0.32%) | 4,156 (2.33%) | 16,977 (9.35%) |
| Mestranol (SELFIES, random) | AP | 478 (0.22%) | 4,079 (1.90%) | 45,594 (21.66%) |
| Celecoxib (SELFIES, random) | ECFP4 | 198 (0.10%) | 1,925 (1.00%) | 18,045 (9.44%) |
| Celecoxib (SELFIES, terminal 10%) | ECFP4 | 864 (2.02%) | 9,407 (21.99%) | 34,187 (79.91%) |
Key finding: restricting mutations to terminal characters yields a 20x increase in high-similarity molecules compared to random positions. Compared to SMILES mutations (0.30% valid) and DeepSMILES (1.44% valid), SELFIES mutations are all valid by construction.
A two-step expansion (mutating all unique first-round neighbors) produced over 17 million unique molecules, with 120,000 having similarity greater than 0.4 to celecoxib.
Chemical path formation and drug design
Deterministic SELFIES interpolation between tadalafil and sildenafil generated paths where logP and QED values varied smoothly. A more challenging application docked intermediates between dihydroergotamine (5-HT1B binder) and prinomastat (CYP2D6 binder), finding molecules with non-trivial binding affinity to both proteins without any optimization routine.
Median molecules for photovoltaics
Using 100 triplets from the Harvard Clean Energy (HCE) dataset, each with one molecule optimized for high LUMO energy, one for high dipole moment, and one for high HOMO-LUMO gap, generalized chemical paths produced median molecules. These were evaluated with GFN2-xTB semiempirical calculations. The generated medians matched or exceeded the best molecules available in the HCE database in both structural similarity and target properties.
GuacaMol benchmarks
Without any training, STONED achieved an overall GuacaMol score of 14.70, competitive with several deep generative models. The approach simply identifies the single best molecule in the benchmark’s training set and generates its local chemical subspace. 38% of the top-100 molecules from each benchmark passed compound quality filters, comparable to Graph GA and SMILES GA.
Results Summary and Limitations
STONED demonstrates that SELFIES string mutations can match or approach deep generative models on standard molecular design benchmarks while being orders of magnitude faster and requiring no training. The most expensive benchmark (aripiprazole subspace) completed in 500 seconds on a laptop CPU.
The method comparison table from the paper highlights STONED’s unique position:
| Feature | Expert Systems | VAE | GAN | RL | STONED |
|---|---|---|---|---|---|
| Expert rule-free | No | Yes | Yes | Yes | Yes |
| Structure coverage | Partial | Partial | Partial | Partial | Yes |
| Interpolatability | No | Yes | Yes | No | Yes |
| Property-based navigation | Partial | Yes | Yes | Yes | Partial |
| Training-free | Yes | No | No | No | Yes |
| Data independence | Yes | No | No | No | Yes |
Limitations acknowledged by the authors:
- STONED lacks property-based navigation (gradient-guided optimization toward specific property targets). It can only do stochastic property optimization when wrapped in a genetic algorithm.
- The success rate of mutations leading to structurally similar molecules is relatively low (0.1-2% at high similarity thresholds), though speed compensates.
- Chemical paths can contain molecules with unstable functional groups or tautomerization issues, requiring post-hoc filtering with domain-specific rules.
- Fingerprint similarity does not capture all aspects of chemical similarity (3D geometry, reactivity, synthesizability).
- The penalized logP and QED benchmarks used by GuacaMol do not represent the full complexity of practical molecular design.
Reproducibility Details
Data
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Photovoltaics | Harvard Clean Energy (HCE) database | ~2.3M molecules | Used for median molecule triplet experiments |
| Benchmarking | GuacaMol benchmark suite | Varies per task | Standard benchmarks for generative molecular design |
| Comparison | ChEMBL (SCScore <= 2.5 subset) | Fragment database | Used for CReM comparison experiments |
Algorithms
- Local subspace formation: 50,000 SMILES orderings per seed molecule, 1-5 SELFIES point mutations each, totaling 250,000 candidates per experiment.
- Chemical paths: Deterministic character-by-character interpolation between padded SELFIES strings, with monotonic fingerprint similarity filtering.
- Median molecules: Generalized paths between 3+ reference molecules using 10,000 paths per triplet with randomized SMILES orderings.
- Docking: SMINA with crystal structures from PDB (4IAQ for 5-HT1B, 3QM4 for CYP2D6). Top-5 binding poses averaged.
- Quantum chemistry: GFN2-xTB for dipole moments, LUMO energies, and HOMO-LUMO gaps.
Evaluation
| Metric | Value | Baseline | Notes |
|---|---|---|---|
| GuacaMol overall score | 14.70 | Varies by model | Competitive with deep generative models |
| Quality filter pass rate | 38% | Graph GA/SMILES GA comparable | Top-100 molecules per benchmark |
| Celecoxib neighbors ($\delta > 0.75$) | 198-864 | CReM: 239 | Depends on mutation position strategy |
Hardware
All experiments run on a laptop with Intel i7-8750H CPU at 2.20 GHz. No GPU required. Most expensive single experiment (aripiprazole subspace) completed in 500 seconds.
Artifacts
| Artifact | Type | License | Notes |
|---|---|---|---|
| stoned-selfies | Code | Not specified | Official implementation of STONED algorithms |
Paper Information
Citation: Nigam, A. K., Pollice, R., Krenn, M., dos Passos Gomes, G., & Aspuru-Guzik, A. (2021). Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery (STONED) algorithm for molecules using SELFIES. Chemical Science, 12(20), 7079-7090. https://doi.org/10.1039/d1sc00231g
@article{nigam2021stoned,
title={Beyond generative models: superfast traversal, optimization, novelty, exploration and discovery ({STONED}) algorithm for molecules using {SELFIES}},
author={Nigam, AkshatKumar and Pollice, Robert and Krenn, Mario and dos Passos Gomes, Gabriel and Aspuru-Guzik, Al{\'a}n},
journal={Chemical Science},
volume={12},
number={20},
pages={7079--7090},
year={2021},
publisher={Royal Society of Chemistry},
doi={10.1039/d1sc00231g}
}
