Exposing Limitations of Molecular ML with Activity Cliffs

Paper Information

Citation: van Tilborg, D., Alenicheva, A., & Grisoni, F. (2022). Exposing the Limitations of Molecular Machine Learning with Activity Cliffs. Journal of Chemical Information and Modeling, 62(23), 5938-5951. https://doi.org/10.1021/acs.jcim.2c01073

Publication: Journal of Chemical Information and Modeling 2022

Additional Resources:

A Benchmark for Activity Cliff Prediction

This is a Systematization paper ($\Psi_{\text{Systematization}}$) with a significant Resource component ($\Psi_{\text{Resource}}$).

The paper systematically benchmarks 24 machine learning and deep learning approaches on their ability to predict bioactivity for activity cliff compounds: pairs of structurally similar molecules that exhibit large differences in potency. These cases violate the similarity principle (similar structure implies similar activity) and represent a practical failure mode for molecular property prediction in drug discovery. The authors release MoleculeACE, an open-source benchmarking platform for evaluating ML models on activity cliffs.

The similarity principle underpins most molecular ML: structurally similar compounds should have similar properties. Activity cliffs are the exceptions, where small structural changes cause large potency shifts (e.g., a single substituent change causing a 10x difference in $K_i$).

Despite their importance for hit-to-lead optimization, activity cliffs have received limited attention in ML benchmarking. Standard metrics like RMSE computed over entire test sets can mask poor predictions on cliff compounds. A model might achieve low overall error while systematically mispredicting these edge cases, which are precisely the molecules that matter most for medicinal chemistry applications.

The authors identify 7-52% of compounds as activity cliff molecules across their 30 target datasets, showing this is not a rare phenomenon.

Defining and Detecting Activity Cliffs

The authors use three complementary similarity metrics to identify activity cliffs:

Substructure similarity: Tanimoto coefficient on extended connectivity fingerprints (ECFPs), capturing shared radial substructures
Scaffold similarity: Tanimoto coefficient on ECFPs computed from molecular graph frameworks, detecting core/decoration differences
SMILES similarity: Levenshtein distance on canonical SMILES strings, capturing character-level insertions, deletions, and translocations

Pairs with $\geq 90%$ similarity on any one of the three metrics and $> 10\times$ difference in bioactivity ($K_i$ or $\text{EC}_{50}$) are classified as activity cliff pairs. This union-based approach (rather than requiring agreement across all metrics) captures different types of structural relationships relevant to medicinal chemistry.

24 Methods Across 30 Drug Targets

The benchmark evaluates 16 traditional ML configurations (4 algorithms $\times$ 4 descriptor types) and 8 deep learning approaches across 30 curated ChEMBL v29 datasets (48,707 total molecules).

Traditional ML algorithms: KNN, RF, GBM, SVM, each combined with ECFPs, MACCS keys, WHIM descriptors, or physicochemical properties.

Deep learning methods: MPNN, GCN, GAT, Attentive FP (graph-based), plus LSTM, CNN, Transformer/ChemBERTa (SMILES-based), and an MLP on ECFPs.

Performance is measured with both standard RMSE and a dedicated $\text{RMSE}_{\text{cliff}}$ computed only on activity cliff compounds in the test set:

$$ \text{RMSE}_{\text{cliff}} = \sqrt{\frac{\sum_{j=1}^{n_c} (\hat{y}_j - y_j)^2}{n_c}} $$

Key results:

Molecular descriptors matter more than algorithms: The choice of descriptor (ECFPs vs. MACCS vs. WHIM vs. physicochemical) had a larger impact on $\text{RMSE}_{\text{cliff}}$ than the choice of ML algorithm ($p < 0.05$, Wilcoxon rank-sum test with Benjamini-Hochberg correction).
SVM + ECFPs wins on average: The best overall method for activity cliff prediction, though the difference from RF + ECFPs or GBM + ECFPs was not statistically significant.
Deep learning underperforms: All graph and SMILES-based deep learning methods performed worse than a simple MLP on ECFPs. Among deep learning, LSTM with transfer learning (pretrained on 36K molecules) was the best, outperforming the ChemBERTa transformer pretrained on 10M compounds.
Large case-by-case variation: $\text{RMSE}_{\text{cliff}}$ ranged from 0.62 to 1.60 log units across datasets, with no method consistently best. Deep learning methods showed the highest variance across targets.

Simple Descriptors Beat Complex Architectures on Cliffs

The core finding is that activity cliffs expose a gap in learned molecular representations. Despite graph neural networks and transformers being able to learn directly from molecular structure, they fail to capture the subtle structural differences that drive activity cliffs.

Key observations:

RMSE and $\text{RMSE}_{\text{cliff}}$ correlate ($r = 0.81$ on average), so optimizing overall error usually helps with cliffs too. But this correlation breaks down for some targets (e.g., CLK4), where methods with similar RMSE can have very different $\text{RMSE}_{\text{cliff}}$.
Training set size matters for the RMSE/$\text{RMSE}_{\text{cliff}}$ correlation: Datasets with $> 1000$ training molecules show $r > 0.80$ between the two metrics. In low-data regimes, the correlation weakens, making dedicated cliff evaluation more important.
No relationship between % cliff compounds and model performance, and no target-family-specific effects were found.
Transfer learning helped SMILES models (LSTM) but not graph models: Self-supervised pretraining strategies (context prediction, infomax, edge prediction, masking) did not improve GNN performance, consistent with findings from other studies.

The MoleculeACE platform provides standardized data curation, activity cliff detection, and cliff-specific evaluation, enabling researchers to assess new methods against this benchmark.

Reproducibility Details

Data

Purpose	Source	Size	Notes
Benchmarking	ChEMBL v29	48,707 molecules (35,632 unique) across 30 targets	Curated for duplicates, salts, outliers
Smallest dataset	JAK1	615 molecules	7% activity cliffs
Largest dataset	DRD3	3,657 molecules	39% activity cliffs

Algorithms

Activity cliff detection: Pairwise similarity $\geq 0.9$ (Tanimoto on ECFPs, scaffold ECFPs, or Levenshtein on SMILES) with $> 10\times$ potency difference
Splitting: Spectral clustering on ECFPs (5 clusters), 80/20 stratified split preserving cliff proportion
Hyperparameter optimization: Bayesian optimization with Gaussian process, max 50 combinations, 5-fold cross-validation
SMILES augmentation: 10-fold for all SMILES-based methods
Transfer learning: LSTM pretrained on 36,281 merged training molecules (next-character prediction); ChemBERTa pretrained on 10M PubChem compounds

Models

Traditional ML: KNN, RF, GBM, SVM (scikit-learn v1.0.2)
Descriptors: ECFPs (1024-bit, radius 2), MACCS keys (166-bit), WHIM (114 descriptors), physicochemical (11 properties)
GNNs: MPNN, GCN, GAT, AFP (PyTorch Geometric v2.0.4), with graph multiset transformer pooling
SMILES models: LSTM (4 layers, 5.8M params), 1D CNN, ChemBERTa transformer
Total models trained: 720 (24 methods $\times$ 30 targets)

Evaluation

Metric	Scope	Details
RMSE	All test molecules	Standard root-mean-square error on $\text{pK}_i$ / $\text{pEC}_{50}$
$\text{RMSE}_{\text{cliff}}$	Activity cliff compounds only	RMSE restricted to cliff molecules in test set

Artifacts

Artifact	Type	License	Notes
MoleculeACE	Code + Data	MIT	Benchmark platform with all 30 curated datasets
Curated datasets	Data	MIT	Processed ChEMBL bioactivity data

Citation

@article{vantilborg2022activity,
  title={Exposing the Limitations of Molecular Machine Learning with Activity Cliffs},
  author={van Tilborg, Derek and Alenicheva, Alisa and Grisoni, Francesca},
  journal={Journal of Chemical Information and Modeling},
  volume={62},
  number={23},
  pages={5938--5951},
  year={2022},
  publisher={American Chemical Society},
  doi={10.1021/acs.jcim.2c01073}
}

Paper Information#

A Benchmark for Activity Cliff Prediction#

Activity Cliffs as a Blind Spot in Molecular ML#

Defining and Detecting Activity Cliffs#

24 Methods Across 30 Drug Targets#

Simple Descriptors Beat Complex Architectures on Cliffs#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Artifacts#

Citation#