Perplexity for Molecule Ranking and CLM Bias Detection

A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models

This is a Method paper that introduces two contributions to the chemical language model (CLM) pipeline for de novo molecular design. First, the authors propose using perplexity as a model-intrinsic score to rank generated SMILES strings by how well they match the design objectives encoded in the fine-tuning data. Second, they introduce a “delta score” that compares molecule rankings from pretrained and fine-tuned CLMs to detect pretraining bias, where molecules are generated primarily based on generic pretraining knowledge rather than task-specific fine-tuning objectives.

The Ranking and Bias Problem in CLM-Based Molecule Generation

Chemical language models generate new molecules as SMILES strings by iteratively predicting the next character based on learned probability distributions. After training, CLMs can produce large virtual libraries of candidate molecules via multinomial sampling. However, two key challenges remain: (1) the generated molecules lack a natural ranking, requiring external scoring methods such as similarity assessment or activity prediction for prioritization, and (2) transfer learning (pretraining on a large corpus followed by fine-tuning on a small target set) can introduce “pretraining bias,” where some generated molecules reflect generic chemical knowledge from pretraining rather than the specific design objectives of the fine-tuning data.

Beam search offers an alternative sampling approach that produces inherently ranked molecules by greedily selecting the most probable SMILES strings. However, beam search explores only a narrow portion of chemical space. The authors sought to combine the ranking advantage of beam search with the chemical space exploration of multinomial sampling by applying perplexity scoring as a post-hoc ranking criterion.

Perplexity Scoring and the Delta Score for Bias Estimation

The core innovation is the application of perplexity, a standard evaluation metric from natural language processing, to score SMILES strings generated by CLMs. For a SMILES string of length $N$ with character probabilities $p_i$ assigned by the CLM, perplexity is computed as:

$$ \text{perplexity} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log(p_{i})} $$

Low perplexity indicates that the CLM assigns high probability to each character in the SMILES string, suggesting the molecule closely matches the learned distribution of the fine-tuning data. The metric is normalized by string length, making it comparable across molecules of different sizes.

To address pretraining bias, the authors introduce a delta score. For each generated molecule, the perplexity-based rank from the fine-tuned model ($\text{rank}_{ft}$) is compared against the rank from the pretrained model ($\text{rank}_{pt}$):

$$ \text{delta} = \text{rank}_{ft} - \text{rank}_{pt} $$

A positive delta score indicates that the fine-tuned model ranks the molecule higher than the pretrained model, suggesting the molecule was generated based on task-specific fine-tuning knowledge. A negative delta score flags molecules that may have been generated primarily from pretraining information, which do not necessarily match the design objectives.

The multinomial sampling probability for each character is computed via the softmax function:

$$ p_{i} = \frac{e^{z_{i}/T}}{\sum_{j} e^{z_{j}/T}} $$

where $z_{i}$ is the CLM output logit for the $i$th character, $j$ runs over all dictionary characters, and $T$ is the temperature parameter (set to $T = 1$ in this study).

Experimental Setup: 10 Protein Targets Across Four Data Regimes

The authors systematically evaluated perplexity scoring across 10 macromolecular targets and four low-data fine-tuning regimes (5, 10, 20, and 40 molecules per target).

Model architecture: A four-layer LSTM-based RNN (5,820,515 parameters) with batch normalization layers, LSTM layers of 1024 and 256 units, trained using the Adam optimizer with a learning rate of $10^{-4}$.

Pretraining: The model was pretrained on 1,683,181 molecules from ChEMBL (version 28), encoded as canonical SMILES (20-90 characters), for 90 epochs.

Fine-tuning: For each of 10 randomly selected protein targets (Table 1), bioactive ligands with pChEMBL > 6 were selected. Fine-tuning sets of 5, 10, 20, and 40 molecules were compiled for each target. Fine-tuning ran for 100 epochs, with 1,000 SMILES strings sampled every second epoch via multinomial sampling ($T = 1$).

CHEMBL ID	Target	Protein Classification
CHEMBL1836	Prostanoid EP4 receptor	G protein-coupled receptor
CHEMBL1945	Melatonin receptor 1A	G protein-coupled receptor
CHEMBL1983	Serotonin 1D (5-HT1D) receptor	Family A GPCR
CHEMBL202	Dihydrofolate reductase	Oxidoreductase
CHEMBL3522	Cytochrome P450 17A1	Cytochrome P450
CHEMBL4029	Interleukin-8 receptor A	Family A GPCR
CHEMBL5073	CaM kinase I delta	Kinase
CHEMBL5137	Metabotropic glutamate receptor 2	G protein-coupled receptor
CHEMBL5408	Serine/threonine-protein kinase TBK1	Kinase
CHEMBL5608	NT-3 growth factor receptor	Kinase

Sampling comparison: Beam search sampling was performed with beam widths $k = 10$ and $k = 50$ for comparison against multinomial sampling.

Molecular similarity: Tanimoto similarity was computed using Morgan fingerprints (radius 2, length 1024) and 2D pharmacophore fingerprints via RDKit (2019.03.2).

Key Findings: Multinomial Sampling Outperforms Beam Search

Perplexity correlates with molecular similarity. The Pearson correlation between perplexity and Tanimoto distance to the fine-tuning set stabilized at approximately 0.5 across all data regimes. This correlation emerged earlier with larger fine-tuning sets. The result confirms that perplexity captures both substructural and pharmacophore features while also incorporating additional CLM-learned information.

Multinomial sampling produces better-ranked molecules than beam search. With the smallest fine-tuning sets (5 molecules), the top 50 molecules from multinomial sampling consistently exhibited lower (better) perplexity values than beam search at $k = 10$ or $k = 50$. Increasing the beam width from 10 to 50 did not markedly improve beam search performance. For novel molecules (Tanimoto similarity below 50% to the nearest fine-tuning compound), multinomial sampling identified lower-perplexity molecules in 72% of cases with the smallest fine-tuning sets.

Perplexity scoring narrows the quality distribution. The top 50 molecules selected by perplexity from multinomial sampling spanned a narrower range of perplexity values compared to beam search, suggesting a more consistent pool of high-quality candidates for follow-up synthesis.

Pretraining bias is substantial. The delta score analysis revealed that more than 40% of sampled molecules had negative delta scores during the first 20 fine-tuning epochs, meaning they were ranked higher by the pretrained model than the fine-tuned model. This fraction remained above 10% even at the end of 100 fine-tuning epochs across all data regimes, confirming that 10-40% of generated molecules reflect “generic” pretraining rather than task-focused fine-tuning.

Perplexity alone partially mitigates bias. Among the top 50 molecules selected by perplexity from multinomial sampling, only up to 3% had negative delta scores, compared to 10-40% in the unfiltered population. This suggests that perplexity-based ranking already reduces pretraining bias, though the delta score provides additional filtering power.

SMILES validity remained high. Mean SMILES string validity consistently exceeded 90% across all fine-tuned models and fine-tuning epochs.

Limitations

The authors note several limitations and future directions. The study used a fixed temperature of $T = 1$ for multinomial sampling; combining perplexity with temperature tuning or SMILES augmentation remains unexplored. The evaluation focused on 10 protein targets, and broader validation across diverse target classes would strengthen the conclusions. The authors also suggest that combining CLMs with perplexity scoring could be applied to screen large collections of commercially available compounds, which has not yet been tested.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pretraining	ChEMBL v28	1,683,181 molecules	Canonical SMILES, 20-90 characters, salts and duplicates removed
Validation	ChEMBL v28 (split)	84,160 molecules	Random split from pretraining set
Fine-tuning	ChEMBL v28 (per target)	5, 10, 20, or 40 molecules	pChEMBL > 6, 10 targets

Algorithms

LSTM-based CLM with character-level SMILES prediction
Multinomial sampling at $T = 1$
Beam search at $k = 10$ and $k = 50$
Perplexity computed per Equation 1; delta score per Equation 2
Adam optimizer, learning rate $10^{-4}$, 90 pretraining epochs, 100 fine-tuning epochs

Models

4-layer LSTM RNN: batch normalization, LSTM (1024 units), LSTM (256 units), batch normalization
5,820,515 parameters total
One-hot encoded SMILES input
Pretrained weights available in the GitHub repository

Evaluation

Metric	Description	Notes
Perplexity	Model confidence in generated SMILES	Lower is better
Delta score	Rank difference between fine-tuned and pretrained models	Positive indicates task-relevant generation
Tanimoto similarity	Morgan and pharmacophore fingerprints	Compared to fine-tuning set
Pearson correlation	Perplexity vs. Tanimoto distance	Stabilizes at ~0.5
SMILES validity	Fraction of valid SMILES strings	Consistently > 90%

Hardware

Hardware specifications are not reported in the paper. The implementation uses Keras (v2.2.0) with TensorFlow GPU backend (v1.9.0).

Artifacts

Artifact	Type	License	Notes
CLM_perplexity	Code	MIT	Framework, pretrained weights, and training data
Beam search implementation	Code	Unknown	Referenced beam search implementation

Paper Information

Citation: Moret, M., Grisoni, F., Katzberger, P., & Schneider, G. (2022). Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models. Journal of Chemical Information and Modeling, 62(5), 1199-1206. https://doi.org/10.1021/acs.jcim.2c00079

Publication: Journal of Chemical Information and Modeling, 2022

Additional Resources:

GitHub: CLM_perplexity (MIT License)

Citation

@article{moret2022perplexity,
  title={Perplexity-Based Molecule Ranking and Bias Estimation of Chemical Language Models},
  author={Moret, Michael and Grisoni, Francesca and Katzberger, Paul and Schneider, Gisbert},
  journal={Journal of Chemical Information and Modeling},
  volume={62},
  number={5},
  pages={1199--1206},
  year={2022},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.2c00079}
}

A Method for Intrinsic Scoring and Bias Detection in Chemical Language Models#

The Ranking and Bias Problem in CLM-Based Molecule Generation#

Perplexity Scoring and the Delta Score for Bias Estimation#

Experimental Setup: 10 Protein Targets Across Four Data Regimes#

Key Findings: Multinomial Sampling Outperforms Beam Search#

Limitations#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#

Citation#