SPECTRA: Evaluating Generalizability of Molecular AI

A Spectral Framework for Evaluating Molecular ML Generalizability

This is a Method paper that introduces SPECTRA (SPECtral framework for model evaluaTion on moleculaR dAtasets), a systematic approach for evaluating how well machine learning models generalize on molecular sequencing data. The primary contribution is a framework that generates train-test splits with controlled, decreasing levels of overlap, producing a spectral performance curve (SPC) and a single summary metric, the area under the spectral performance curve (AUSPC), for comparing model generalizability across tasks and architectures.

Why Existing Molecular Benchmarks Overestimate Generalizability

Deep learning has achieved high performance on molecular sequencing benchmarks, but a persistent gap exists between benchmark performance and real-world deployment. The authors identify the root cause: existing evaluation approaches use either metadata-based (MB) splits or similarity-based (SB) splits, both of which provide an incomplete picture of generalizability.

MB splits partition data by metadata properties (e.g., temporal splits, random splits) without controlling sequence similarity between train and test sets. This means high train-test similarity can inflate performance metrics. SB splits control similarity at a single threshold, but the model’s behavior at other similarity levels remains unknown.

For example, the TAPE benchmark’s remote homology family split has 97% cross-split overlap, while the superfamily split has 71%. Model accuracy drops by 50% between these two points, yet the full curve of performance degradation is never characterized. This gap between evaluated and real-world overlap levels leads to overoptimistic deployment expectations, as demonstrated by the case of rifampicin resistance prediction in M. tuberculosis, where commercial genotypic assays later proved unreliable in specific geographic regions.

The SPECTRA Framework: Spectral Properties, Graphs, and Performance Curves

SPECTRA takes three inputs: a molecular sequencing dataset, a machine learning model, and a spectral property definition. A spectral property (SP) is a molecular sequence property expected to influence model generalizability for a specific task. For sequence-to-sequence datasets, the spectral property is typically sequence identity: two sequences share the property when the proportion of identical aligned positions exceeds 0.3. For mutational scan datasets, it is defined by sample barcodes (string representations of the mutations present in each sample).
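As a rough illustration of the sequence-identity spectral property, the sketch below aligns two sequences globally and reports the fraction of alignment columns that are identical. It is a pure-Python stand-in (Needleman-Wunsch with a linear gap penalty), not the authors' implementation. The function names `global_identity` and `shares_spectral_property`, the use of alignment length as the denominator, and the linear gap model are assumptions made here; the default scores and the 0.3 threshold are the values listed under Algorithms.

```python
def global_identity(a, b, match=1.0, mismatch=-2.0, gap=-2.5):
    """Fraction of identical columns in one optimal global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal path, counting columns and identical pairs.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                if a[i - 1] == b[j - 1]:
                    identical += 1
                i -= 1
                j -= 1
                continue
        if j == 0 or (i > 0 and score[i][j] == score[i - 1][j] + gap):
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
    return identical / columns if columns else 0.0


def shares_spectral_property(a, b, threshold=0.3):
    """Hypothetical predicate: True when identity exceeds the threshold."""
    return global_identity(a, b) > threshold
```

With these defaults, two identical sequences score an identity of 1.0 and two sequences with no matching residues score 0.0, so only the first pair would be connected by an edge.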

Spectral Property Graph Construction

SPECTRA constructs a spectral property graph (SPG) where nodes represent samples and edges connect samples that share the spectral property. The goal is to generate train-test splits with controlled levels of cross-split overlap by finding approximate maximal independent sets of this graph.
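A minimal sketch of building such a graph follows, assuming an all-pairs comparison and an adjacency-set representation. The names `build_spg` and `share_mutation`, the comma-separated barcode format, and the rule that two barcodes share the property when they have a mutation in common are all hypothetical choices for illustration; the toy barcodes are not taken from any dataset.

```python
from itertools import combinations


def build_spg(samples, shares_property):
    """Return the graph as {node index: set of neighbor indices}."""
    adjacency = {i: set() for i in range(len(samples))}
    # Compare every unordered pair once; quadratic in the number of samples.
    for i, j in combinations(range(len(samples)), 2):
        if shares_property(samples[i], samples[j]):
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def share_mutation(barcode_a, barcode_b):
    """Hypothetical predicate: barcodes overlap in at least one mutation."""
    return bool(set(barcode_a.split(",")) & set(barcode_b.split(",")))


# Toy barcodes with placeholder mutation labels.
barcodes = ["m1", "m1,m2", "m3", "m2"]
spg = build_spg(barcodes, share_mutation)
```

In this toy graph, sample 2 is isolated, so it can be placed in either split without contributing any cross-split overlap.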

Finding a maximum (largest possible) independent set is NP-hard, so SPECTRA uses a greedy randomized algorithm parameterized by a spectral parameter $\mathbf{SP} \in [0, 1]$:

Randomly order SPG vertices
Select the first vertex and delete each neighbor with probability equal to $\mathbf{SP}$
Continue until no vertices remain

When $\mathbf{SP} = 0$, no neighbors are deleted and the result is a random split (maximum cross-split overlap). When $\mathbf{SP} = 1$, every neighbor of a selected vertex is deleted, so the retained vertices form an independent set (minimum cross-split overlap). For each spectral parameter value (incremented by 0.05 from 0 to 1), three splits with different random seeds are generated.
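The greedy procedure described above can be sketched as follows. This is an interpretation of the three steps, not the released SPECTRA code: the names `spectra_subset` and `split_subset` are hypothetical, and the final step (shuffling the retained vertices and cutting at 80%) is an assumption about how the retained set becomes a train-test split.

```python
import random


def spectra_subset(adjacency, spectral_parameter, seed=0):
    """Visit vertices in random order; keep each surviving vertex and delete
    each of its undecided neighbors with probability spectral_parameter."""
    rng = random.Random(seed)
    order = list(adjacency)
    rng.shuffle(order)
    deleted, kept, kept_set = set(), [], set()
    for vertex in order:
        if vertex in deleted:
            continue
        kept.append(vertex)
        kept_set.add(vertex)
        # sorted() keeps the neighbor order reproducible for a given seed.
        for neighbor in sorted(adjacency[vertex]):
            if neighbor in kept_set or neighbor in deleted:
                continue
            if rng.random() < spectral_parameter:
                deleted.add(neighbor)
    return kept


def split_subset(kept, train_fraction=0.8, seed=0):
    """Assumed final step: random train/test cut of the retained vertices."""
    rng = random.Random(seed)
    shuffled = list(kept)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
everything = spectra_subset(path, 0.0, seed=1)   # nothing is deleted
independent = spectra_subset(path, 1.0, seed=1)  # no two kept vertices adjacent
```

Because `rng.random()` returns values in [0, 1), a parameter of 0 never deletes a neighbor and a parameter of 1 always does, which matches the two extremes described in the text.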

The Spectral Performance Curve and AUSPC

The model is trained and evaluated on each split. Plotting test performance against the spectral parameter produces the spectral performance curve (SPC). The area under this curve, the AUSPC, serves as a single summary metric for model generalizability that captures behavior across the full spectrum of train-test overlap.
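One plausible way to compute the AUSPC is to average the per-seed scores at each spectral parameter and integrate the resulting curve with the trapezoidal rule. The summary does not state which integration rule the authors use, so the trapezoidal rule, the function name `auspc`, and the input layout are assumptions.

```python
def auspc(curve):
    """curve maps spectral parameter -> list of test scores (one per seed).

    Returns the trapezoidal area under the mean-score curve.
    """
    points = sorted((sp, sum(scores) / len(scores)) for sp, scores in curve.items())
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy curves (illustrative values only, not results from the paper).
flat_curve = {i / 20: [0.8, 0.8, 0.8] for i in range(21)}
linear_drop = {0.0: [1.0], 1.0: [0.0]}
```

Since the spectral parameter spans [0, 1], a model whose score stays constant across all splits has an AUSPC equal to that score, which makes the metric directly comparable to the underlying performance measure.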

Handling Mutational Scan Datasets

For mutational scan datasets where sample barcodes map to multiple samples, SPECTRA introduces two modifications: (1) weighting nodes in the SPG by the number of samples they represent, and (2) running a subset sum algorithm to ensure 80/20 train-test splits by sample count.
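The second modification can be illustrated with a small subset-sum sketch: given the number of samples behind each retained barcode node, pick the group of nodes whose total sample count is closest to 80% of the samples. How SPECTRA actually implements this step is not described in the summary, so the dynamic-programming formulation, the tie-breaking rule, and the name `pick_train_groups` are assumptions.

```python
def pick_train_groups(group_sizes, train_fraction=0.8):
    """Choose node indices whose sample counts sum closest to the train target.

    Returns (chosen indices, their total sample count).
    """
    target = train_fraction * sum(group_sizes)
    # reachable[s] is one list of node indices whose sizes add up to s.
    reachable = {0: []}
    for index, size in enumerate(group_sizes):
        # Snapshot the current sums so each node is used at most once.
        for subtotal, members in list(reachable.items()):
            new_total = subtotal + size
            if new_total not in reachable:
                reachable[new_total] = members + [index]
    # Closest sum to the target; ties go to the smaller training set.
    best = min(reachable, key=lambda s: (abs(s - target), s))
    return reachable[best], best


# Toy node weights: number of samples sharing each barcode.
sizes = [4, 3, 2, 1]
train_nodes, train_count = pick_train_groups(sizes)
test_nodes = [i for i in range(len(sizes)) if i not in train_nodes]
```

Here the ten samples are divided so that the training nodes cover exactly eight of them, and whole barcode nodes are never split across train and test.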

Evaluation Across 18 Datasets and 19 Models

The authors apply SPECTRA to 18 molecular sequencing datasets spanning three benchmarks (TAPE, PEER, ProteinGym) plus PDBBind, evaluating 19 models including CNNs, LSTMs, GNNs (GearNet), LLMs (ESM2), diffusion models (DiffDock), variational autoencoders (EVE), and logistic regression.

Benchmark Datasets

The core evaluation covers five primary tasks:

Task	Dataset	Type	Metric	Samples
Rifampicin resistance (RIF)	TB clinical isolates	MSD	AUROC	17,474
Isoniazid resistance (INH)	TB clinical isolates	MSD	AUROC	26,574
Pyrazinamide resistance (PZA)	TB clinical isolates	MSD	AUROC	12,146
Fluorescence prediction	GFP variants	MSD	Spearman’s $\rho$	54,024
Vaccine escape	SARS-CoV-2 RBD	MSD	Spearman’s $\rho$	438,046

Additional benchmarks include remote homology detection, secondary structure prediction, subcellular localization, and protein-ligand binding (PDBBind, Astex diverse set, Posebusters).

Models Evaluated

Eight models were evaluated in depth across the five primary tasks: logistic regression, CNN, ESM2 (pretrained), ESM2-Finetuned, GearNet, GearNet-Finetuned, EVE, and SeqDesign. Additional models (LSTM, ResNet, DeepSF, Transformer, HHblits, Equibind, DiffDock, TankBind, Transception, MSA Transformer, ESM1v, Progen2) were evaluated on specific benchmark tasks.

Existing Splits as Points on the SPC

SPECTRA reveals that existing benchmark splits correspond to specific points on the spectral performance curve. For instance:

Task	Benchmark Split	Cross-Split Overlap	Spectral Parameter
Remote homology	TAPE family	97%	0.025
Remote homology	TAPE superfamily	71%	0.475
Secondary structure	CASP12	48%	0.5
Protein-ligand binding	Equibind temporal	76%	0.55
Protein-ligand binding	LPPDBind similarity	91%	0.275
Protein-ligand binding	Posebusters	70%	0.575
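To place an existing split on the curve, one first needs its cross-split overlap. The sketch below assumes the overlap is the fraction of test samples that share the spectral property with at least one training sample; the summary does not spell out the definition, so treat that reading, the function name `cross_split_overlap`, and the toy barcode predicate as assumptions.

```python
def cross_split_overlap(train, test, shares_property):
    """Fraction of test samples with at least one similar training sample."""
    if not test:
        return 0.0
    hits = sum(
        1 for candidate in test
        if any(shares_property(candidate, reference) for reference in train)
    )
    return hits / len(test)


def share_mutation(barcode_a, barcode_b):
    """Hypothetical predicate: barcodes overlap in at least one mutation."""
    return bool(set(barcode_a.split(",")) & set(barcode_b.split(",")))


# Toy split with placeholder mutation labels.
train_barcodes = ["m1", "m2"]
test_barcodes = ["m1,m3", "m4"]
overlap = cross_split_overlap(train_barcodes, test_barcodes, share_mutation)
```

Under this reading, a benchmark split can be matched to the spectral parameter whose generated splits have the closest measured overlap.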

Performance Degradation and Foundation Model Insights

Universal Performance Decline

All evaluated models demonstrate decreased performance as cross-split overlap decreases. Logistic regression drops from AUROC > 0.9 to 0.5 for rifampicin resistance. ESM2-Finetuned decreases from Spearman’s $\rho > 0.9$ to less than 0.4 for GFP fluorescence prediction.

No single model achieves the highest AUSPC across all tasks. CNN maintains AUSPC > 0.6 across all tasks but is surpassed by ESM2-Finetuned and ESM2 on rifampicin resistance. Some models retain reasonable performance even at $\mathbf{SP} = 1$ (minimal overlap): ESM2, ESM2-Finetuned, and CNN maintain AUROC > 0.7 for RIF and PZA at this extreme.

Uncovering Hidden Spectral Properties

SPECTRA can detect unconsidered spectral properties through high variance in model performance at fixed spectral parameters. For rifampicin resistance, the CNN shows high variance at $\mathbf{SP} = 0.9$, $0.95$, and $1.0$ (standard deviations of 0.09, 0.10, and 0.08 respectively).
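A simple way to screen for this signal is to compute the spread of scores across seeds at each spectral parameter and flag values above a cutoff. The 0.05 cutoff, the use of the sample standard deviation, and the name `high_variance_parameters` are assumptions for illustration; the scores below are toy values, not the paper's results.

```python
from statistics import stdev


def high_variance_parameters(curve, threshold=0.05):
    """Return {spectral parameter: std} for parameters with unstable scores."""
    flagged = {}
    for sp, scores in sorted(curve.items()):
        if len(scores) < 2:
            continue  # spread is undefined for a single seed
        spread = stdev(scores)
        if spread > threshold:
            flagged[sp] = spread
    return flagged


# Toy per-seed scores at two spectral parameters.
toy_curve = {
    0.5: [0.90, 0.91, 0.89],
    1.0: [0.60, 0.75, 0.80],
}
flagged = high_variance_parameters(toy_curve)
```

Flagged parameters are candidates for a closer look at what differs between the splits generated with different seeds.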

The authors trace this to the rifampicin resistance determining region (RRDR), a 26-amino-acid region of the rpoB gene. They define diff-RRDR as:

$$ \text{diff-RRDR} = \left(\max\left(\text{position}_{\text{train}}\right) - \max\left(\text{position}_{\text{test}}\right)\right) + \left(\min\left(\text{position}_{\text{train}}\right) - \min\left(\text{position}_{\text{test}}\right)\right) $$

diff-RRDR correlates with CNN performance variance (Spearman’s $\rho = -0.51$, p-value $= 1.79 \times 10^{-5}$) but not with ESM2 performance. The authors attribute this to ESM2’s larger context window (512 positions vs. CNN’s 12), making it more invariant to positional shifts in resistance-determining mutations.
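The diff-RRDR definition translates directly into code. The function name `diff_rrdr` and the representation of each split as a list of mutation positions are assumptions; the positions in the usage lines are made-up values for illustration.

```python
def diff_rrdr(train_positions, test_positions):
    """Sum of the differences between the extreme mutation positions
    seen in the training split and those seen in the test split."""
    upper_shift = max(train_positions) - max(test_positions)
    lower_shift = min(train_positions) - min(test_positions)
    return upper_shift + lower_shift


# Toy positions: the train range sits symmetrically around the test range.
symmetric = diff_rrdr([430, 450], [435, 445])   # (450-445) + (430-435)
# Toy positions: the train range extends further on the upper side.
skewed = diff_rrdr([426, 452], [430, 440])      # (452-440) + (426-430)
```

Note that the two terms can cancel: a training range that extends equally beyond the test range on both sides gives a value of zero, so the quantity measures a net positional shift rather than a difference in range width.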

Foundation Model Generalizability

For protein foundation models, SPECTRA reveals that AUSPC correlates with the similarity between task-specific datasets and the pretraining dataset. ESM2’s AUSPC varies from 0.91 (RIF) to 0.26 (SARS-CoV-2). The correlation between UniRef50 overlap and AUSPC is strong (Spearman’s $\rho = 0.9$, p-value $= 1.4 \times 10^{-27}$).

This finding holds across multiple foundation models (Transception, MSA Transformer, ESM1v, Progen2) evaluated on five ProteinGym datasets (Spearman’s $\rho = 0.9$, p-value $= 0.04$). Fine-tuning improves AUSPC for tasks with low pretraining overlap (PZA, SARS-CoV-2, GFP).
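For readers who want to reproduce this kind of rank correlation, the sketch below computes Spearman's $\rho$ as the Pearson correlation of average ranks. It is a plain-Python stand-in for a statistics library, and the helper names `average_ranks` and `spearman_rho` are hypothetical; the input values are toy numbers, not the overlap or AUSPC values reported by the authors.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation between the rank vectors of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Toy example: pretraining overlap vs. AUSPC for four hypothetical tasks.
toy_overlap = [0.1, 0.4, 0.6, 0.9]
toy_auspc = [0.30, 0.50, 0.55, 0.90]
rho = spearman_rho(toy_overlap, toy_auspc)
```

Because the statistic depends only on ranks, it captures whether tasks with more pretraining overlap tend to have higher AUSPC without assuming a linear relationship.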

Computational Cost

Generating SPECTRA splits takes between 5 minutes (amyloid beta aggregation) and 9 hours (PDBBind). Generating spectral performance curves takes between 1 hour (logistic regression) and 5 days (ESM2-Finetuned). The authors recommend releasing SPECTRA splits alongside new benchmarks to amortize this cost.

Limitations and Future Directions

The authors acknowledge several limitations:

Spectral property selection is pivotal: The choice of spectral property must be biologically informed and task-specific. Standardized definitions across the community are needed.
Computational cost: Running SPECTRA is expensive, especially for large models. The authors mitigate this with multi-core CPU parallelization and multi-GPU training.
Not a model ranking tool: SPECTRA is designed for understanding generalizability patterns, not for ranking models. Proper ranking requires averaging AUSPCs across many tasks in a standardized benchmark.
Spectral parameter vs. cross-split overlap: The minimal achievable cross-split overlap varies across tasks, so SPECTRA plots performance against the spectral parameter rather than overlap directly. This means the AUSPC reflects relative impact on performance per unit decrease in overlap.

The authors envision SPECTRA as a foundation for next-generation molecular benchmarks that explicitly characterize generalizability across the full spectrum of distribution shift. They also see it as applicable beyond molecular sequencing data, including small molecule therapeutics, inverse protein folding, and patient-level clinical datasets.

Reproducibility Details

Data

All data used in this study is publicly available.

Purpose	Dataset	Size	Notes
Evaluation	TB RIF resistance	17,474 isolates	From Green et al. (2022)
Evaluation	TB INH resistance	26,574 isolates	From Green et al. (2022)
Evaluation	TB PZA resistance	12,146 isolates	From Green et al. (2022)
Evaluation	GFP fluorescence	54,024 samples	From Sarkisyan et al. (2016)
Evaluation	SARS-CoV-2 escape	438,046 samples	From Greaney et al. (2021)
Benchmark	TAPE (remote homology, secondary structure)	Various	From Rao et al. (2019)
Benchmark	PEER (subcellular localization)	13,949 samples	From Xu et al. (2022)
Benchmark	ProteinGym (amyloid, RRM)	Various	From Notin et al. (2022)
Benchmark	PDBBind (protein-ligand binding)	14,993-16,742 complexes	From Wang et al. (2005)

Data is also available on Harvard Dataverse.

Algorithms

Spectral property comparison uses Biopython pairwise alignment (match=1, mismatch=-2, gap=-2.5) with a 0.3 similarity threshold for sequence-to-sequence datasets
Greedy randomized maximal independent set approximation for split generation
Spectral parameter incremented in 0.05 steps from 0 to 1
Three random seeds per spectral parameter value
80/20 train-test split ratio enforced via subset sum for mutational scan datasets
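The sweep implied by these settings can be enumerated as below: 21 spectral parameter values times three seeds gives 63 splits per dataset. The function name `spectra_grid` and the seed values 0, 1, 2 are assumptions; integer step indices are used so that the grid ends exactly at 1.0 rather than accumulating floating-point error.

```python
def spectra_grid(step_count=20, seeds=(0, 1, 2)):
    """All (spectral parameter, seed) pairs for one dataset sweep."""
    return [
        (step / step_count, seed)
        for step in range(step_count + 1)
        for seed in seeds
    ]


grid = spectra_grid()
```

Each pair in `grid` corresponds to one generated split and one training run, which is why the cost of a full spectral performance curve scales with both the step size and the number of seeds.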

Models

ESM2: 650M parameter version from Lin et al. (2023)
ESM2-Finetuned: First 30 layers frozen, masked language head replaced with linear prediction layer
GearNet and GearNet-Finetuned: Protein structures generated via ESMFold
CNN: Architecture from Green et al. (2022), one-hot encoded sequences
Logistic regression: One-hot encoded mutational barcodes
EVE and SeqDesign: MSAs constructed via Jackhmmer against UniRef100

Evaluation

Metric	Task	Notes
AUROC	TB resistance (RIF, INH, PZA)	Binary classification
Spearman’s $\rho$	GFP fluorescence, SARS-CoV-2 escape	Regression tasks
Accuracy	Remote homology, secondary structure, subcellular localization	Per-label/class accuracy
RMSE	Protein-ligand binding	Predicted vs. actual complex
AUSPC	All tasks	Area under spectral performance curve

Hardware

Most models: 1x Tesla A10 GPU
ESM2-Finetuned: 4x Tesla A100 GPUs on Azure cluster
Hyperparameter optimization: Weights & Biases random search over learning rate
All code in PyTorch

Artifacts

Artifact	Type	License	Notes
SPECTRA Code	Code	MIT	Framework implementation and reproduction scripts
Harvard Dataverse	Dataset	CC0 1.0	All datasets and generated splits

Paper Information

Citation: Ektefaie, Y., Shen, A., Bykova, D., Marin, M. G., Zitnik, M., & Farhat, M. (2024). Evaluating generalizability of artificial intelligence models for molecular datasets. Nature Machine Intelligence, 6(12), 1512-1524. https://doi.org/10.1038/s42256-024-00931-6

@article{ektefaie2024evaluating,
  title={Evaluating generalizability of artificial intelligence models for molecular datasets},
  author={Ektefaie, Yasha and Shen, Andrew and Bykova, Daria and Marin, Maximillian G. and Zitnik, Marinka and Farhat, Maha},
  journal={Nature Machine Intelligence},
  volume={6},
  number={12},
  pages={1512--1524},
  year={2024},
  publisher={Nature Publishing Group},
  doi={10.1038/s42256-024-00931-6}
}
