Paper Information

Citation: Fang, Y., Zhang, N., Chen, Z., Guo, L., Fan, X., & Chen, H. (2024). Domain-Agnostic Molecular Generation with Chemical Feedback. Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024).

Publication: ICLR 2024

A SELFIES-Based Method for Molecular Generation

This is a method paper that introduces MolGen, a pre-trained molecular language model for generating molecules with desired chemical properties. The primary contribution is a three-part framework: (1) pre-training on more than 100 million SELFIES-encoded molecules to learn structural and grammatical knowledge, (2) domain-agnostic molecular prefix tuning for cross-domain knowledge transfer, and (3) a chemical feedback paradigm that aligns the model’s generative probabilities with real-world chemical preferences. MolGen is the first language model pre-trained on SELFIES rather than SMILES, which guarantees 100% syntactic validity of generated molecules.

Challenges in Language Model-Based Molecule Generation

Generating novel molecules with desirable properties is a central task in drug discovery and chemical design. The molecular space is estimated at $10^{33}$ possible structures, making exhaustive search impractical. Prior deep generative approaches face several limitations:

  1. Syntactic invalidity: SMILES-based language models frequently generate strings that do not correspond to valid molecular graphs. A single random mutation of a SMILES string has only a 9.9% chance of remaining valid.
  2. Narrow domain focus: Most existing models focus exclusively on synthetic molecules and neglect natural products, which have distinct structural complexity and scaffold diversity.
  3. Molecular hallucinations: Generated molecules may satisfy chemical structural rules yet fail to exhibit anticipated chemical activity in practical applications. The authors formally define this as molecules that “comply with chemical structural rules, yet fail to exhibit practical utility or the anticipated properties.”
  4. Limited optimization signals: Existing approaches rely on reinforcement learning (high variance), fixed-dimensional latent spaces, or expert-provided generation rules, all of which impede efficient exploration of chemical space.

Core Innovation: Pre-training with SELFIES and Chemical Feedback

MolGen’s novelty rests on three interconnected components.

SELFIES-Based Pre-training

MolGen uses SELFIES (Self-Referencing Embedded Strings) instead of SMILES. SELFIES guarantees that every possible combination of symbols in the alphabet corresponds to a chemically valid molecular graph. The model uses a compact vocabulary of 185 tokens.
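
The validity guarantee comes from SELFIES’ derivation-state semantics: each symbol is interpreted relative to the remaining valence of the current atom, so an out-of-valence bond request is clipped rather than producing an invalid graph. The toy decoder below illustrates that idea only; it is a deliberately simplified sketch (invented atom set, chain-only bonding), not the real SELFIES grammar:

```python
import random

# Toy derivation-state decoder illustrating the SELFIES idea:
# requested bond orders are clipped to the valence still available,
# so ANY token sequence decodes to a valence-respecting structure.
VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}  # max bonds per atom

def decode(tokens):
    """tokens: list of (atom, requested_bond_order) pairs."""
    atoms, bonds, used = [], [], []  # used: bond orders consumed per atom
    for atom, want in tokens:
        if not atoms:                # first atom starts the chain
            atoms.append(atom)
            used.append(0)
            continue
        prev = len(atoms) - 1
        free_prev = VALENCE[atoms[prev]] - used[prev]
        order = min(want, free_prev, VALENCE[atom])
        if order <= 0:               # previous atom saturated: skip token
            continue
        atoms.append(atom)
        used.append(order)
        used[prev] += order
        bonds.append((prev, len(atoms) - 1, order))
    return atoms, bonds, used

def is_valid(atoms, used):
    return all(u <= VALENCE[a] for a, u in zip(atoms, used))

# Every random token string decodes without valence violations.
random.seed(0)
alphabet = [(a, o) for a in VALENCE for o in (1, 2, 3)]
for _ in range(1000):
    toks = [random.choice(alphabet) for _ in range(12)]
    atoms, bonds, used = decode(toks)
    assert is_valid(atoms, used)
print("all random strings decoded to valid structures")
```

By contrast, a random SMILES string usually violates syntax or valence rules, which is exactly the failure mode SELFIES removes.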

The first pre-training stage uses a BART-style encoder-decoder. Tokens from a SELFIES string $S = \{s_1, \ldots, s_l\}$ are randomly replaced with [MASK], then the corrupted input is encoded bidirectionally and decoded left-to-right. The reconstruction loss is:

$$ \mathcal{L}_{\text{ce}}(S) = -\sum_{j=1}^{l} \sum_{s} p_{\text{true}}(s \mid S, S_{< j}) \log p_{\theta}(s \mid S, S_{< j}; \theta) $$

where $S_{< j}$ denotes the partial sequence $\{s_0, \ldots, s_{j-1}\}$ and $p_{\text{true}}$ is the one-hot distribution under standard maximum likelihood estimation.
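
With one-hot targets, the double sum collapses to the negative log-likelihood of the ground-truth token at each decoding step. A minimal sketch of the corruption-and-reconstruction objective (toy vocabulary, helper names, and the uniform stand-in for decoder outputs are all illustrative, not MolGen’s):

```python
import math
import random

random.seed(0)
VOCAB = ["[C]", "[=C]", "[O]", "[N]", "[MASK]"]

def corrupt(tokens, mask_prob=0.3):
    """BART-style corruption: randomly replace tokens with [MASK]."""
    return [t if random.random() > mask_prob else "[MASK]" for t in tokens]

def cross_entropy(targets, predicted_dists):
    """-sum_j log p_theta(s_j | ...): the one-hot p_true picks out
    the ground-truth token at each position."""
    return -sum(math.log(dist[tok]) for tok, dist in zip(targets, predicted_dists))

S = ["[C]", "[=C]", "[O]", "[C]"]
corrupted = corrupt(S)

# Stand-in for decoder outputs: one distribution over VOCAB per position.
uniform = {t: 1 / len(VOCAB) for t in VOCAB}
loss = cross_entropy(S, [uniform] * len(S))
print(corrupted, round(loss, 4))
```

Note the loss is computed against the original, uncorrupted tokens: the decoder learns to reconstruct $S$ from its masked version.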

Domain-Agnostic Molecular Prefix Tuning

The second pre-training stage introduces shared prefix vectors $P_k, P_v \in \mathbb{R}^{m \times d}$ prepended to the keys and values of multi-head attention at each layer. Unlike conventional prefix tuning that freezes model parameters, MolGen updates the entire model. The attention output becomes:

$$ \text{head} = \text{Attn}\left(xW_q, [P_k, XW_k], [P_v, XW_v]\right) $$

This decomposes into a linear interpolation between prefix attention and standard attention:

$$ \text{head} = \lambda(x) \cdot \text{Attn}(xW_q, P_k, P_v) + (1 - \lambda(x)) \cdot \text{Attn}(xW_q, XW_k, XW_v) $$

where $\lambda(x)$ is a scalar representing the sum of normalized attention weights on the prefixes. The prefixes are trained simultaneously across synthetic and natural product domains, acting as a domain instructor.
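
The decomposition can be verified numerically: attention over the concatenated $[P_k; XW_k]$ and $[P_v; XW_v]$ equals a convex combination of prefix-only attention and standard attention, weighted by the softmax mass $\lambda(x)$ that falls on the prefix positions. A small numpy check (random dimensions, not MolGen’s actual sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 8, 5, 10  # head dim, prefix length, sequence length

def attn(q, K, V):
    """Single-query scaled dot-product attention; returns output and weights."""
    w = np.exp(q @ K.T / np.sqrt(d))
    w /= w.sum()
    return w @ V, w

q = rng.normal(size=d)                                     # x W_q
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))    # X W_k, X W_v
Pk, Pv = rng.normal(size=(m, d)), rng.normal(size=(m, d))  # prefix keys/values

# Attention over the concatenated [P_k; K], [P_v; V].
full, w = attn(q, np.vstack([Pk, K]), np.vstack([Pv, V]))
lam = w[:m].sum()  # lambda(x): softmax mass on the prefix positions

prefix_only, _ = attn(q, Pk, Pv)
standard, _ = attn(q, K, V)
decomposed = lam * prefix_only + (1 - lam) * standard

print(np.allclose(full, decomposed))  # the decomposition is exact
```

The identity holds exactly because the shared softmax normalizer splits into the prefix and sequence partition sums, whose ratio is $\lambda(x)$.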

Chemical Feedback Paradigm

To address molecular hallucinations, MolGen aligns the model’s probabilistic rankings with chemical preference rankings. Given a molecule $S$ and a set of candidate outputs $\mathcal{S}^*$ with distinct property scores $\text{Ps}(\cdot)$, the model should satisfy:

$$ p_{\theta}(S_i \mid S) > p_{\theta}(S_j \mid S), \quad \forall S_i, S_j \in \mathcal{S}^*, \ \text{Ps}(S_i) > \text{Ps}(S_j) $$

This is enforced via a rank loss:

$$ \mathcal{L}_{\text{rank}}(S) = \sum_{i} \sum_{j > i} \max\left(0, f(S_j) - f(S_i) + \gamma_{ij}\right) $$

where $\gamma_{ij} = (j - i) \cdot \gamma$ is a margin scaled by rank difference and $f(S) = \sum_{t=1}^{l} \log p_{\theta}(s_t \mid S, S_{< t}; \theta)$ is the estimated log-probability. The overall training objective combines cross-entropy and rank loss:

$$ \mathcal{L} = \mathcal{L}_{\text{ce}} + \alpha \mathcal{L}_{\text{rank}} $$

Label smoothing is applied to the target distribution in $\mathcal{L}_{\text{ce}}$, allocating probability mass $\beta$ to non-target tokens to maintain generative diversity.
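
A compact sketch of these two pieces: candidates are assumed sorted best-property first, sequence log-probabilities stand in for $f(S)$, and rank-order violations are penalized with margins that grow with rank distance. The numbers and the $\gamma$, $\beta$ values below are made up for illustration:

```python
def rank_loss(log_probs, gamma=0.01):
    """Pairwise hinge loss with margin gamma_ij = (j - i) * gamma.
    log_probs: model log-probabilities f(S_i), ordered best-property first."""
    loss = 0.0
    n = len(log_probs)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, log_probs[j] - log_probs[i] + (j - i) * gamma)
    return loss

def smoothed_target(vocab_size, target_idx, beta=0.1):
    """Label smoothing: probability mass beta spread over non-target tokens."""
    p = [beta / (vocab_size - 1)] * vocab_size
    p[target_idx] = 1.0 - beta
    return p

# A model whose log-probs already follow the property ranking pays no loss...
print(rank_loss([-1.0, -2.0, -3.0]))      # 0.0
# ...while a misaligned ranking is penalized.
print(rank_loss([-3.0, -2.0, -1.0]) > 0)  # True

# The smoothed target is still a proper distribution.
print(abs(sum(smoothed_target(10, 3)) - 1.0) < 1e-12)  # True
```

The growing margin $\gamma_{ij}$ demands a larger probability gap between candidates that are far apart in the property ranking than between near-ties.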

Experiments Across Distribution Learning and Property Optimization

Datasets

  • Stage 1 pre-training: 100M+ unlabeled molecules from ZINC-15 (molecular weight $\leq$ 500 Da, LogP $\leq$ 5)
  • Stage 2 pre-training: 2.22M molecules spanning synthetic (ZINC, MOSES) and natural product (NPASS, 30,926 compounds) domains
  • Downstream evaluation: MOSES synthetic dataset, ZINC250K, and natural product molecules

Molecular Distribution Learning

MolGen generates 10,000 synthetic and 80,000 natural product molecules, evaluated on seven metrics (Validity, Fragment similarity, Scaffold similarity, SNN, Internal Diversity, FCD, and Novelty). Baselines include AAE, LatentGAN, CharRNN, VAE, JT-VAE, LIMO, and Chemformer.

| Model | Validity | Frag | Scaf | SNN | IntDiv | FCD | Novelty |
|---|---|---|---|---|---|---|---|
| Chemformer | 0.9843 | 0.9889 | 0.9248 | 0.5622 | 0.8553 | 0.0061 | 0.9581 |
| MolGen | 1.000 | 0.9999 | 0.9999 | 0.9996 | 0.8567 | 0.0015 | 1.000 |

On synthetic molecules, MolGen achieves 100% validity, near-perfect fragment and scaffold similarity, and the lowest FCD (0.0015). For natural products, MolGen achieves FCD of 0.6519 compared to Chemformer’s 0.8346.

Targeted Molecule Discovery

For penalized logP maximization (top-3 scores):

| Model | 1st | 2nd | 3rd |
|---|---|---|---|
| MARS (no length limit) | 44.99 | 44.32 | 43.81 |
| MolGen (no length limit) | 80.30 | 74.70 | 69.85 |
| MolGen (length-limited) | 30.51 | 28.98 | 28.95 |

For QED maximization, MolGen achieves the maximum score of 0.948 across the top-3.

Molecular Docking

MolGen optimizes binding affinity for two protein targets (ESR1 and ACAA1), measured by dissociation constant $K_D$ (lower is better):

| Model | ESR1 1st | ESR1 2nd | ESR1 3rd | ACAA1 1st | ACAA1 2nd | ACAA1 3rd |
|---|---|---|---|---|---|---|
| LIMO | 0.72 | 0.89 | 1.4 | 37 | 37 | 41 |
| MolGen | 0.13 | 0.35 | 0.47 | 3.36 | 3.98 | 8.50 |

MolGen achieves the lowest dissociation constants across both targets. Optimization of the 1,000 worst-affinity molecules yields 96.7% relative improvement for ESR1 and 70.4% for ACAA1.

Constrained Molecular Optimization

Optimizing 800 molecules from ZINC250K with lowest p-logP scores under Tanimoto similarity constraints:

| Model | $\delta = 0.6$ | $\delta = 0.4$ |
|---|---|---|
| RetMol | 3.78 (3.29) | 11.55 (11.27) |
| MolGen | 12.08 (0.82) | 12.35 (1.21) |

Values are mean improvement (standard deviation).

MolGen achieves the highest mean improvement with the lowest standard deviation under both constraints.
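
The constraint $\delta$ is a Tanimoto similarity threshold between the fingerprints of the optimized and original molecules. On binary fingerprints represented as sets of on-bit indices, Tanimoto similarity is simply intersection over union; the bit positions below are illustrative placeholders, not real Morgan fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two binary fingerprints given as
    sets of on-bit indices: |A ∩ B| / |A ∪ B|."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Hypothetical on-bits for an original and an optimized molecule.
original = {3, 17, 42, 88, 120}
optimized = {3, 17, 42, 88, 301}

sim = tanimoto(original, optimized)
print(round(sim, 3))  # 4 shared bits / 6 total bits
# Keep the candidate only if it stays similar enough, e.g. delta = 0.6.
print(sim >= 0.6)     # True
```

In the constrained setting, any generated candidate falling below $\delta$ is rejected regardless of its p-logP gain.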

Ablation Studies

  • Chemical feedback: Without it, the model generates molecules with property scores similar to initial molecules. With it ($\alpha = 3$), property scores increase progressively across generation rounds.
  • Prefix tuning: Removing prefix tuning reduces constrained optimization improvement by 0.45 at $\delta = 0.6$ and 2.12 at $\delta = 0.4$.
  • Label smoothing: Enhances diversity of generated molecules as measured by Internal Diversity.
  • Substructure attention: MolGen focuses attention on chemically meaningful functional groups (fluoro, phenyl, hydroxyl), while SMILES-based PLMs scatter attention across syntactic tokens. The Substructure Attention Level (SAL) metric confirms MolGen’s superior focus.

Key Findings, Limitations, and Future Directions

Key Findings

  1. SELFIES pre-training guarantees 100% molecular validity, eliminating the need for external valency checks.
  2. Domain-agnostic prefix tuning enables effective knowledge transfer between synthetic and natural product domains.
  3. The chemical feedback paradigm aligns model outputs with chemical preferences without requiring external annotated data or reference databases.
  4. MolGen achieves the best or competitive results across all evaluated tasks: distribution learning, targeted molecule discovery, constrained optimization, and molecular docking.

Limitations

The authors acknowledge several limitations:

  • Computational cost: Training and fine-tuning on large datasets is computationally intensive.
  • Model interpretability: The transformer architecture makes it difficult to understand explicit rationale behind decisions.
  • Single-target optimization only: The chemical feedback paradigm handles single-target optimization; multiple conflicting objectives could create ambiguous optimization trajectories.
  • Task specificity: MolGen is designed for 2D molecular generation; 3D conformation information is not incorporated.
  • Reaction prediction: When applied to reaction prediction (an off-target task), MolGen achieves only 71.4% accuracy on 39,990 reaction samples.

Future Directions

The authors suggest applying MolGen to retrosynthesis and reaction prediction, exploring multimodal pre-training, and incorporating additional knowledge sources.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Stage 1 pre-training | ZINC-15 | 100M+ molecules | MW $\leq$ 500 Da, LogP $\leq$ 5 |
| Stage 2 pre-training | ZINC + MOSES + NPASS | 2.22M molecules | Synthetic and natural product domains |
| Distribution learning (synthetic) | MOSES | ~1.9M molecules | Standard benchmark split |
| Distribution learning (natural) | NPASS | 30,926 compounds | 30,126 train / 800 test |
| Constrained optimization | ZINC250K | 800 molecules | Lowest p-logP scores |

Algorithms

  • Architecture: BART-based encoder-decoder with SELFIES vocabulary (185 tokens)
  • Prefix length: 5 tunable vectors per layer
  • Optimizer: LAMB (pre-training), AdamW (fine-tuning)
  • Pre-training: 600M steps with linear warm-up (180,000 steps) followed by linear decay
  • Rank loss weight ($\alpha$): Recommended values of 3 or 5
  • Candidate generation: 30 candidates per molecule (synthetic), 8 candidates (natural products)

Models

MolGen is publicly available on Hugging Face. The model uses a vocabulary of 185 SELFIES tokens and is comparable in size to Chemformer-large.

Evaluation

| Metric | Domain | MolGen | Best Baseline | Notes |
|---|---|---|---|---|
| FCD (lower is better) | Synthetic | 0.0015 | 0.0061 (Chemformer) | Distribution learning |
| p-logP top-1 (no limit) | Synthetic | 80.30 | 44.99 (MARS) | Targeted discovery |
| QED top-1 | Synthetic | 0.948 | 0.948 (several) | Tied at maximum |
| ESR1 $K_D$ top-1 | Docking | 0.13 | 0.72 (LIMO) | Binding affinity |
| p-logP improvement ($\delta=0.4$) | Synthetic | 12.35 (1.21) | 11.55 (11.27) (RetMol) | Constrained optimization |

Hardware

  • 6 NVIDIA V100 GPUs
  • Pre-training batch size: 256 molecules per GPU
  • Fine-tuning batch size: 6 (synthetic and natural product)
  • Training: 100 epochs for fine-tuning tasks

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| zjunlp/MolGen | Code | Unknown | Official PyTorch implementation |
| zjunlp/MolGen-large | Model | Unknown | Pre-trained weights on Hugging Face |

Citation

@inproceedings{fang2024domain,
  title={Domain-Agnostic Molecular Generation with Chemical Feedback},
  author={Fang, Yin and Zhang, Ningyu and Chen, Zhuo and Guo, Lingbing and Fan, Xiaohui and Chen, Huajun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=9rPyHyjfwP}
}