BioT5: Cross-Modal Integration of Biology and Chemistry

A Unified Pretraining Framework for Molecules, Proteins, and Text

BioT5 is a Method paper that introduces a comprehensive T5-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses SELFIES (instead of SMILES) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.

Bridging the Gap Between Molecular Sequences and Scientific Knowledge

Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in PubMed abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and Galactica share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom “Br” in SMILES gets split into “B” (boron) and “r”, producing erroneous downstream predictions.

BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.

SELFIES, Separate Tokenization, and Multi-Task Pretraining

The core innovations of BioT5 center on three design decisions:

SELFIES for Robust Molecular Representation

BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Every permutation of symbols within the SELFIES alphabet generates a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.

Modality-Specific Tokenization

Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:

Molecules: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., [C], [=C], [Br]).
Proteins: Amino acids are prefixed with a special <p> token to distinguish them from text characters (e.g., <p>M, <p>K, <p>R).
Text: The standard T5 vocabulary is retained.

This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.

Multi-Task Pretraining Objectives

BioT5 uses six pretraining tasks organized into three categories:

Single-modal T5 objective: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein FASTA (task 2), and general text from C4 (task 3).
Wrapped text T5 objective (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.
Bidirectional translation (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from PubChem), and protein FASTA to text description and vice versa (using 569K pairs from Swiss-Prot).

The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.

Evaluation Across 15 Downstream Tasks

BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.

Molecule Property Prediction (MoleculeNet)

BioT5 is evaluated on six binary classification tasks from MoleculeNet using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.

Dataset	GEM	MolXPT	BioT5
BBBP	72.4	80.0	77.7
Tox21	78.1	77.1	77.9
ClinTox	90.1	95.3	95.4
HIV	80.6	78.1	81.0
BACE	85.6	88.4	89.4
SIDER	67.2	71.7	73.2
Avg	79.0	81.9	82.4

BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).

Protein Property Prediction (PEER Benchmark)

On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:

Model	Params	Solubility (Acc)	Localization (Acc)
ESM-1b	652.4M	70.23	92.40
ProtBert	419.9M	68.15	91.32
BioT5	252.1M	74.65	91.69

BioT5 achieves the best solubility prediction accuracy (74.65%) despite being 2-3x smaller than dedicated protein language models like ESM-1b and ProtBert.

Drug-Target Interaction Prediction

BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:

Method	BioSNAP AUROC	Human AUROC	BindingDB AUROC
DrugBAN	0.903	0.982	0.960
BioT5	0.937	0.989	0.963

BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.

Molecule Captioning and Text-Based Molecule Generation

On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:

Model	Params	BLEU-4	METEOR	Text2Mol
MolT5-large	783M	0.508	0.614	0.582
MolXPT	350M	0.505	0.626	0.594
BioT5	252M	0.556	0.656	0.603

For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.

Protein-Protein Interaction Prediction

On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).

Key Findings, Limitations, and Future Directions

BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5’s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.

The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.

The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pretraining (molecules)	ZINC20	~300M molecules	Converted from SMILES to SELFIES
Pretraining (proteins)	UniRef50	27M proteins	Filtered by length
Pretraining (text)	C4	Large	Standard T5 corpus
Pretraining (wrapped text)	PubMed	33M articles	Entity linking via BERN2
Pretraining (molecule-text pairs)	PubChem	339K pairs	Excludes ChEBI-20 molecules
Pretraining (protein-text pairs)	Swiss-Prot	569K pairs	High-quality annotations
Evaluation (molecular properties)	MoleculeNet	6 datasets	Scaffold splitting
Evaluation (protein properties)	PEER	2 tasks	Solubility and localization
Evaluation (DTI)	BioSNAP, Human, BindingDB	3 datasets	Binary classification
Evaluation (PPI)	Yeast, Human	2 datasets	From PEER benchmark
Evaluation (generation)	ChEBI-20	33K pairs	Molecule captioning and text-to-molecule

Algorithms

Architecture: T5-v1.1-base (encoder-decoder transformer)
Optimizer: AdamW with RMS scaling
Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$
Warmup steps: 10,000
Dropout: 0.0
Maximum input length: 512 tokens
Pretraining steps: 350K
Batch size: 96 per GPU (6 data types per batch)
Prompt-based fine-tuning for all downstream tasks

Models

Model	Parameters	Vocabulary Size	Architecture
BioT5	252M	35,073	T5-v1.1-base

Evaluation

Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)
Protein property prediction: accuracy on PEER benchmark (3 runs)
Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)
Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)
Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20
Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20

Hardware

8x NVIDIA A100 80GB GPUs for pretraining
Codebase: nanoT5

Artifacts

Artifact	Type	License	Notes
BioT5 Code	Code	MIT	Official implementation

Paper Information

Citation: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., & Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 1102-1123. https://doi.org/10.18653/v1/2023.emnlp-main.70

@inproceedings{pei2023biot5,
  title={BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations},
  author={Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages={1102--1123},
  year={2023},
  publisher={Association for Computational Linguistics},
  doi={10.18653/v1/2023.emnlp-main.70}
}

A Unified Pretraining Framework for Molecules, Proteins, and Text#

Bridging the Gap Between Molecular Sequences and Scientific Knowledge#

SELFIES, Separate Tokenization, and Multi-Task Pretraining#

SELFIES for Robust Molecular Representation#

Modality-Specific Tokenization#

Multi-Task Pretraining Objectives#

Evaluation Across 15 Downstream Tasks#

Molecule Property Prediction (MoleculeNet)#

Protein Property Prediction (PEER Benchmark)#

Drug-Target Interaction Prediction#

Molecule Captioning and Text-Based Molecule Generation#

Protein-Protein Interaction Prediction#

Key Findings, Limitations, and Future Directions#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#