<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Multimodal Molecular Models on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/</link><description>Recent content in Multimodal Molecular Models on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sat, 28 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/index.xml" rel="self" type="application/rss+xml"/><item><title>MoMu: Bridging Molecular Graphs and Natural Language</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/momu-molecular-multimodal-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/momu-molecular-multimodal-foundation/</guid><description>MoMu bridges molecular graphs and natural language via contrastive pre-training, enabling cross-modal retrieval, captioning, and property prediction.</description><content:encoded><![CDATA[<h2 id="bridging-molecular-graphs-and-natural-language-through-contrastive-learning">Bridging Molecular Graphs and Natural Language Through Contrastive Learning</h2>
<p>MoMu (Molecular Multimodal foundation model) is a <strong>Method</strong> paper that proposes a multimodal pre-training approach to associate molecular graphs with natural language descriptions. The primary contribution is a dual-encoder architecture, consisting of a Graph Isomorphism Network (GIN) for molecular graphs and a BERT-based text encoder, jointly trained through contrastive learning on weakly-correlated graph-text pairs collected from scientific literature. The pre-trained model supports four downstream capabilities: cross-modal retrieval (graph-to-text and text-to-graph), molecule captioning, zero-shot text-to-graph molecule generation, and molecular property prediction.</p>
<h2 id="why-single-modality-models-are-insufficient-for-molecular-understanding">Why Single-Modality Models Are Insufficient for Molecular Understanding</h2>
<p>Existing AI models for molecular tasks generally operate on a single modality and learn a single cognitive ability. Language-based models process <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings or natural language texts and handle tasks like property prediction from strings, literature comprehension, or SMILES-based generation. Graph-based models use molecular graph representations and handle graph-level property prediction or graph generation. Neither category connects structural information from molecular graphs with the rich semantic knowledge encoded in scientific texts.</p>
<p>Prior work by Zeng et al. (KV-PLM) jointly modeled molecule-related texts and SMILES strings, but SMILES representations have inherent drawbacks: they are one-dimensional and may lose structural information, they cannot capture structural similarities between molecules, and a single molecule can have multiple valid SMILES representations. Molecular graphs, by contrast, are more intuitive and better reveal functional structures. Human experts learn molecular knowledge by associating both graphical representations and textual descriptions, yet no prior model bridged these two modalities directly.</p>
<p>The key challenge is the scarcity of paired molecular graph-text data compared to general image-text datasets. Additionally, learning specialized molecular knowledge requires foundational cognitive abilities in both the graph and text domains, making training from scratch infeasible with limited data.</p>
<h2 id="contrastive-pre-training-with-inter-modal-and-intra-modal-objectives">Contrastive Pre-Training with Inter-Modal and Intra-Modal Objectives</h2>
<p>MoMu consists of two encoders initialized from pre-trained unimodal models: a GIN graph encoder initialized from GraphCL self-supervised weights, and a BERT text encoder initialized from either Sci-BERT (yielding MoMu-S) or KV-PLM (yielding MoMu-K).</p>
<h3 id="data-collection">Data Collection</h3>
<p>The authors collect 15,613 molecular graph-document pairs by:</p>
<ol>
<li>Gathering names, synonyms, and SMILES for the top 50K compounds in <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a></li>
<li>Converting SMILES to molecular graphs using the OGB <code>smiles2graph</code> function</li>
<li>Retrieving related text from the S2ORC corpus (136M+ papers) by querying with molecule names, filtering to Medicine, Biology, Chemistry, and Computer Science fields</li>
<li>Restricting retrieval to abstract, introduction, and conclusion sections to avoid experimental data artifacts</li>
</ol>
<h3 id="contrastive-training-objective">Contrastive Training Objective</h3>
<p>For each graph-text pair in a mini-batch of $N$ pairs, MoMu applies two graph augmentations (node dropping and subgraph extraction) to create two augmented graphs, and randomly samples two sentences from the document. This produces $2N$ graph representations $\{z_1^G, \tilde{z}_1^G, \ldots, z_N^G, \tilde{z}_N^G\}$ and $2N$ text representations $\{z_1^T, \tilde{z}_1^T, \ldots, z_N^T, \tilde{z}_N^T\}$.</p>
<p>The cross-modal contrastive loss for a pair $(z_i^G, z_i^T)$ is:</p>
<p>$$
\ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp(\text{sim}(z_i^G, z_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, z_j^T) / \tau)}
$$</p>
<p>where $\tau$ is the temperature parameter and $\text{sim}(\cdot, \cdot)$ projects both representations into a shared 256-dimensional space before computing cosine similarity. The total cross-modal loss includes four contrastive terms for each pair: $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$, and $(\tilde{z}_i^G, \tilde{z}_i^T)$.</p>
<p>An intra-modal graph contrastive loss further strengthens the graph encoder:</p>
<p>$$
\ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp(\text{sim}(z_i^G, \tilde{z}_i^G) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, \tilde{z}_j^G) / \tau)}
$$</p>
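<p>The inter- and intra-modal objectives above share the same InfoNCE form. The following is a minimal NumPy sketch with random stand-in embeddings (the batch size, dimensionality, and loose summation of terms are illustrative assumptions, not MoMu's actual implementation):</p>

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE: row i of z_a is contrasted against all rows of z_b,
    with row i of z_b as the positive. Shapes: (N, d)."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)  # unit-normalize so the
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)  # dot product is cosine sim
    logits = z_a @ z_b.T / tau                              # (N, N) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                     # -log p(positive), averaged

rng = np.random.default_rng(0)
N, d = 8, 256
z_g, z_g_tilde = rng.normal(size=(N, d)), rng.normal(size=(N, d))  # two graph augmentations
z_t, z_t_tilde = rng.normal(size=(N, d)), rng.normal(size=(N, d))  # two sampled sentences

# Cross-modal loss: all four graph-view / text-view combinations.
cross = sum(info_nce(g, t) for g in (z_g, z_g_tilde) for t in (z_t, z_t_tilde))
# Intra-modal loss: the two augmentations of the same graph are positives.
intra = info_nce(z_g, z_g_tilde)
loss = cross + intra
```

<p>Aligned pairs score strictly lower loss than mismatched ones, which is what drives the shared embedding space.</p>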
<h3 id="zero-shot-text-to-graph-generation">Zero-Shot Text-to-Graph Generation</h3>
<p>MoMu enables a zero-shot generation pipeline by combining the pre-trained MoMu encoders with MoFlow, a flow-based molecular generator. Given an input text description $x^T$, the method:</p>
<ol>
<li>Samples a latent variable $q$ from MoFlow&rsquo;s Gaussian prior $P(q)$</li>
<li>Generates a molecular graph through MoFlow&rsquo;s reverse flows: $\hat{E} = f_g^{-1}(q_e)$ and $\hat{V} = f_c^{-1}(q_v \mid GN(\hat{E}))$</li>
<li>Feeds $\hat{V}$ (using soft atom type probabilities instead of hard assignments) into MoMu&rsquo;s graph encoder</li>
<li>Optimizes $q$ to maximize the cosine similarity between the resulting graph and text representations:</li>
</ol>
<p>$$
\ell_q = -\text{sim}(z^G, z^T) / \tau
$$</p>
<p>All MoMu and MoFlow parameters are frozen; only $q$ is updated via Adam for up to 500 iterations. The final molecule is obtained by applying argmax to the optimized probability matrices $\hat{V}$ and $\hat{E}$.</p>
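<p>The optimization loop can be illustrated with frozen random linear maps standing in for MoFlow's reverse flow and MoMu's graph encoder. Everything here is a toy assumption: the analytic cosine gradient and plain gradient ascent replace the paper's Adam updates, and the dimensions are arbitrary.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_q, d_z = 64, 32
W_gen = rng.normal(size=(d_q, d_q)) / np.sqrt(d_q)  # stand-in for MoFlow's frozen reverse flow
W_enc = rng.normal(size=(d_z, d_q)) / np.sqrt(d_q)  # stand-in for MoMu's frozen graph encoder
z_t = rng.normal(size=d_z)                           # text embedding of the query description

def cos_and_grad(q):
    """Cosine similarity between the encoded generated graph and the text,
    plus its analytic gradient with respect to the latent q."""
    g = W_enc @ (W_gen @ q)                          # "generate" a graph, then encode it
    ng, nt = np.linalg.norm(g), np.linalg.norm(z_t)
    c = g @ z_t / (ng * nt)
    dg = z_t / (ng * nt) - c * g / ng**2             # d cos / d g (perpendicular to g)
    return c, (W_enc @ W_gen).T @ dg                 # chain rule back to q

q = rng.normal(size=d_q)                             # sample from the Gaussian prior
c0, _ = cos_and_grad(q)
for _ in range(500):                                 # paper uses Adam; plain ascent here
    _, grad = cos_and_grad(q)
    q += 1.0 * grad                                  # only q is updated; weights stay frozen
c_final, _ = cos_and_grad(q)
```

<p>Only the latent $q$ moves; the similarity between the generated graph's embedding and the text embedding increases over the iterations.</p>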
<h2 id="evaluation-across-four-downstream-tasks">Evaluation Across Four Downstream Tasks</h2>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>MoMu is evaluated on the PCdes dataset (15K SMILES-description pairs from PubChem, split 10,500/1,500/3,000 for train/val/test). Retrieval is performed in mini-batches of 64 pairs, reporting top-1 accuracy and Recall@20.</p>
<p><strong>Graph-to-Text Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.38</td>
          <td>62.11</td>
          <td>62.57</td>
          <td>60.67</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>53.79</td>
          <td>66.63</td>
          <td>64.81</td>
          <td>63.87</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.92</td>
          <td>68.59</td>
          <td>77.92</td>
          <td>75.93</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>58.64</td>
          <td>80.59</td>
          <td>80.62</td>
          <td>79.11</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>58.74</td>
          <td>81.29</td>
          <td>81.09</td>
          <td>80.15</td>
      </tr>
  </tbody>
</table>
<p><strong>Text-to-Graph Retrieval (PCdes, fine-tuned)</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Sentence Acc</th>
          <th>Sentence R@20</th>
          <th>Paragraph Acc</th>
          <th>Paragraph R@20</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Sci-BERT</td>
          <td>50.12</td>
          <td>68.02</td>
          <td>61.75</td>
          <td>60.77</td>
      </tr>
      <tr>
          <td>KV-PLM</td>
          <td>54.22</td>
          <td>71.80</td>
          <td>64.95</td>
          <td>64.27</td>
      </tr>
      <tr>
          <td>KV-PLM*</td>
          <td>55.61</td>
          <td>74.77</td>
          <td>77.03</td>
          <td>75.47</td>
      </tr>
      <tr>
          <td>MoMu-S</td>
          <td>55.44</td>
          <td>76.92</td>
          <td>80.22</td>
          <td>79.02</td>
      </tr>
      <tr>
          <td>MoMu-K</td>
          <td>54.94</td>
          <td>78.29</td>
          <td>81.45</td>
          <td>80.62</td>
      </tr>
  </tbody>
</table>
<p>In zero-shot retrieval (on a separate test set of 5,562 pairs not seen during pre-training), MoMu achieves approximately 39-46% accuracy compared to below 2% for Sci-BERT and KV-PLM, demonstrating strong generalization.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>MoMu&rsquo;s graph features are appended to MolT5&rsquo;s encoder inputs through a learned MLP mapping module on the ChEBI-20 dataset. Results show improvements in BLEU, METEOR, and Text2Mol scores when incorporating graph features, though ROUGE-L drops slightly. The graph structural information leads to more accurate captions for complex molecular structures.</p>
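<p>The fusion step can be sketched as a small mapping module that projects the 300-dimensional graph feature into the text model's hidden space and prepends it as one extra "token" for the decoder to attend to. The MLP width, hidden size, and prepend-vs-append choice below are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_graph, d_t5, seq_len = 300, 768, 20            # 768 = base-size hidden dim (assumed)

# Learned two-layer MLP mapping graph features into the text model's space.
W1, b1 = rng.normal(size=(512, d_graph)) * 0.02, np.zeros(512)
W2, b2 = rng.normal(size=(d_t5, 512)) * 0.02, np.zeros(d_t5)

def map_graph_feature(h_graph):
    return W2 @ np.maximum(W1 @ h_graph + b1, 0.0) + b2   # ReLU MLP

h_graph = rng.normal(size=d_graph)               # MoMu graph-level feature
text_states = rng.normal(size=(seq_len, d_t5))   # MolT5 encoder token states
# Concatenate the mapped graph feature with the token sequence for decoding.
fused = np.vstack([map_graph_feature(h_graph)[None, :], text_states])
```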
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>The pre-trained graph encoder from MoMu is fine-tuned on eight <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> datasets using scaffold splitting and ROC-AUC evaluation (10 runs).</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>No Pre-Train</th>
          <th>GraphCL</th>
          <th>MoMu-S</th>
          <th>MoMu-K</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>65.8</td>
          <td>69.7</td>
          <td><strong>70.5</strong></td>
          <td>70.1</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>74.0</td>
          <td>73.9</td>
          <td>75.6</td>
          <td>75.6</td>
      </tr>
      <tr>
          <td>ToxCast</td>
          <td>63.4</td>
          <td>62.4</td>
          <td>63.4</td>
          <td>63.0</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>57.3</td>
          <td>60.5</td>
          <td>60.5</td>
          <td>60.4</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>58.0</td>
          <td>76.0</td>
          <td><strong>79.9</strong></td>
          <td>77.4</td>
      </tr>
      <tr>
          <td>MUV</td>
          <td>71.8</td>
          <td>69.8</td>
          <td>70.5</td>
          <td>71.1</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>75.3</td>
          <td><strong>78.5</strong></td>
          <td>75.9</td>
          <td>76.2</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>70.1</td>
          <td>75.4</td>
          <td>76.7</td>
          <td>77.1</td>
      </tr>
      <tr>
          <td><strong>Average</strong></td>
          <td>66.96</td>
          <td>70.78</td>
          <td><strong>71.63</strong></td>
          <td>71.36</td>
      </tr>
  </tbody>
</table>
<p>MoMu-S achieves the best average ROC-AUC (71.63%) across all eight datasets, outperforming GraphCL (70.78%), the self-supervised method used to initialize MoMu&rsquo;s graph encoder. MoMu outperforms GraphCL on six of eight datasets. Notably, MoMu-S and MoMu-K perform comparably, indicating that KV-PLM&rsquo;s SMILES-based knowledge does not transfer well to graph-based representations.</p>
<h3 id="zero-shot-text-to-graph-generation-1">Zero-Shot Text-to-Graph Generation</h3>
<p>The method generates molecules from three types of text descriptions:</p>
<ol>
<li><strong>High-level vague descriptions</strong> (e.g., &ldquo;The molecule is beautiful&rdquo;): MoMu generates diverse, interpretable molecules where &ldquo;beautiful&rdquo; tends to produce locally symmetric and stretched graphs, &ldquo;versatile&rdquo; produces molecules with varied elements and functional groups, and &ldquo;strange&rdquo; produces cluttered, irregular structures.</li>
<li><strong>Functional descriptions</strong> (e.g., &ldquo;fluorescent molecules&rdquo;, &ldquo;high water solubility and barrier permeability with low toxicity&rdquo;): MoMu successfully generates molecules with appropriate functional groups and properties. For the solubility/permeability/toxicity query, MoMu generates molecules that satisfy all three evaluable properties.</li>
<li><strong>Structural descriptions</strong> (e.g., &ldquo;molecules containing <a href="https://en.wikipedia.org/wiki/Nucleophile">nucleophilic</a> groups&rdquo;): MoMu generates diverse molecules with appropriate functional groups (amino, hydroxyl, carbonyl, halogen atoms).</li>
</ol>
<h2 id="promising-multimodal-transfer-with-clear-data-limitations">Promising Multimodal Transfer with Clear Data Limitations</h2>
<p>MoMu demonstrates that contrastive pre-training on weakly-correlated graph-text data can bridge molecular graphs and natural language in a shared representation space. The key findings are:</p>
<ol>
<li><strong>Cross-modal alignment works with limited data</strong>: With only 15K graph-text pairs (far fewer than the millions used in vision-language models like CLIP), MoMu achieves meaningful cross-modal retrieval and enables zero-shot generation.</li>
<li><strong>Multimodal supervision improves graph representations</strong>: The graph encoder supervised by text descriptions outperforms self-supervised methods (GraphCL, AttrMasking, ContextPred) on average across molecular property prediction benchmarks.</li>
<li><strong>SMILES knowledge does not transfer to graphs</strong>: MoMu-S and MoMu-K perform comparably across all tasks, showing that structural information learned from one-dimensional SMILES strings does not readily generalize to graph neural networks.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several important limitations:</p>
<ul>
<li><strong>Data scarcity</strong>: 15K graph-text pairs is substantially smaller than general image-text datasets, potentially leaving the common space insufficiently aligned.</li>
<li><strong>Noisy supervision</strong>: Retrieved texts may mention a molecule by name without describing its properties or structure, leading to spurious correlations.</li>
<li><strong>Generator constraints</strong>: The zero-shot generation method is limited by MoFlow&rsquo;s capacity (maximum 38 atoms, 9 element types from ZINC250K training).</li>
<li><strong>Property coverage</strong>: Generation quality degrades for molecular properties that appear infrequently or not at all in the training texts.</li>
</ul>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose four avenues: (1) collecting larger-scale multimodal molecular data including 3D conformations, (2) using strongly-correlated paired data with more advanced generators, (3) developing interpretable tools for the learned cross-modal space, and (4) wet-lab validation of generated molecules.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>Collected graph-text pairs (PubChem + S2ORC)</td>
          <td>15,613 pairs</td>
          <td>~37M paragraphs total; top 50K PubChem compounds</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>15K pairs (10.5K/1.5K/3K split)</td>
          <td>SMILES-description pairs from PubChem</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>ChEBI-20</td>
          <td>~33K pairs</td>
          <td>Used with MolT5</td>
      </tr>
      <tr>
          <td>Text-to-graph generation</td>
          <td><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC250K</a> (MoFlow)</td>
          <td>250K molecules</td>
          <td>Pre-trained generator, max 38 atoms</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (8 datasets)</td>
          <td>Varies</td>
          <td>BBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Graph augmentations</strong>: Node dropping (10% ratio) and subgraph extraction (80% of original size via random walk)</li>
<li><strong>Contrastive learning</strong>: InfoNCE loss with temperature $\tau = 0.1$, following the DeClip paradigm with both inter-modal and intra-modal objectives</li>
<li><strong>Zero-shot generation</strong>: Adam optimizer on latent variable $q$ for up to 500 iterations; formal charges prohibited in output</li>
</ul>
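<p>The two graph augmentations listed above can be sketched on a plain edge-list representation (toy code with an unlabeled cycle graph; real implementations operate on featurized molecular graphs):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def node_drop(edges, num_nodes, drop_ratio=0.1):
    """Drop a random fraction of nodes and every edge touching them."""
    keep = rng.random(num_nodes) >= drop_ratio
    return [(u, v) for u, v in edges if keep[u] and keep[v]]

def random_walk_subgraph(edges, num_nodes, target_ratio=0.8):
    """Grow a subgraph by random walk until ~80% of nodes are covered."""
    adj = {i: [] for i in range(num_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    node = int(rng.integers(num_nodes))
    visited = {node}
    while len(visited) < target_ratio * num_nodes:
        # Step to a random neighbor; restart if the walk gets stuck.
        node = int(rng.choice(adj[node])) if adj[node] else int(rng.integers(num_nodes))
        visited.add(node)
    return [(u, v) for u, v in edges if u in visited and v in visited]

edges = [(i, (i + 1) % 6) for i in range(6)]     # toy 6-cycle "molecule"
aug1 = node_drop(edges, 6)
aug2 = random_walk_subgraph(edges, 6)
```

<p>On the 6-cycle, the random-walk variant always covers a contiguous arc of 5 nodes, leaving 4 induced edges.</p>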
<h3 id="models">Models</h3>
<ul>
<li><strong>Graph encoder</strong>: GIN with 5 layers, 300-dimensional hidden size, initialized from GraphCL checkpoint</li>
<li><strong>Text encoder</strong>: BERT-base (768 hidden size), initialized from Sci-BERT or KV-PLM</li>
<li><strong>Projection heads</strong>: Two MLPs projecting graph (300-dim) and text (768-dim) features to 256-dimensional shared space</li>
<li><strong>Optimizer</strong>: AdamW, learning rate 0.0001, weight decay 1e-5, 300 epochs, batch size 256</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Best Result</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>G-T Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.09 / 80.15 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>T-G Retrieval (PCdes)</td>
          <td>Accuracy / R@20</td>
          <td>81.45 / 80.62 (paragraph)</td>
          <td>MoMu-K, fine-tuned</td>
      </tr>
      <tr>
          <td>Zero-shot G-T Retrieval</td>
          <td>Accuracy</td>
          <td>~46%</td>
          <td>vs. ~1.4% for baselines</td>
      </tr>
      <tr>
          <td>Property Prediction</td>
          <td>ROC-AUC (avg)</td>
          <td>71.63%</td>
          <td>MoMu-S, 8 MoleculeNet datasets</td>
      </tr>
      <tr>
          <td>Molecule Captioning</td>
          <td>Text2Mol</td>
          <td>Improved over MolT5</td>
          <td>MoMu + MolT5-large</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8x NVIDIA Tesla V100 PCIe 32GB GPUs</li>
<li>Framework: PyTorch</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BingSu12/MoMu">MoMu code</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Pre-training and downstream task code</td>
      </tr>
      <tr>
          <td><a href="https://github.com/yangzhao1230/GraphTextRetrieval">GraphTextRetrieval</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Data collection and cross-modal retrieval code</td>
      </tr>
      <tr>
          <td><a href="https://pan.baidu.com/s/1aHJoYTTZWDHPCcRuu9I7Fg">Pre-training dataset</a></td>
          <td>Dataset</td>
          <td>Not specified</td>
          <td>Hosted on Baidu Pan (Chinese cloud storage)</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., Sun, H., Lu, Z., &amp; Wen, J.-R. (2022). A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv preprint arXiv:2209.05481.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{su2022momu,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2209.05481}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2022}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MolFM: Trimodal Molecular Foundation Pre-training</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/molfm-multimodal-molecular-foundation/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/molfm-multimodal-molecular-foundation/</guid><description>MolFM fuses molecular graphs, biomedical text, and knowledge graphs via cross-modal attention for joint molecular representation learning.</description><content:encoded><![CDATA[<h2 id="trimodal-pre-training-for-molecular-understanding">Trimodal Pre-training for Molecular Understanding</h2>
<p>MolFM is a <strong>Method</strong> paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that uses fine-grained cross-modal attention to fuse information across all three modalities, combined with theoretical justification from a deep metric learning perspective. MolFM achieves the best reported results (at time of publication) on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.</p>
<h2 id="why-existing-molecular-models-fall-short">Why Existing Molecular Models Fall Short</h2>
<p>Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, generative approaches like KV-PLM and MolT5 rely on 1D <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings, which cannot capture complex topological and spatial molecular properties such as macrocycles. Contrastive approaches like <a href="/notes/computational-chemistry/chemical-language-models/multimodal-molecular/momu-molecular-multimodal-foundation/">MoMu</a> and MoleculeSTM learn global alignment between molecule graphs and text but overlook fine-grained connections between specific substructures and textual descriptions.</p>
<p>Second, and more fundamentally, no prior model incorporates <a href="https://en.wikipedia.org/wiki/Knowledge_graph">knowledge graphs</a> as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.</p>
<h2 id="cross-modal-attention-and-metric-learning-guarantees">Cross-Modal Attention and Metric Learning Guarantees</h2>
<h3 id="architecture">Architecture</h3>
<p>MolFM uses three pre-trained single-modal encoders:</p>
<ul>
<li><strong>Molecular graph encoder</strong>: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$</li>
<li><strong>Text encoder</strong>: A 6-layer transformer (61.8M parameters) initialized from KV-PLM&rsquo;s first 6 layers, producing token features $h_T$</li>
<li><strong>Knowledge graph encoder</strong>: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$</li>
</ul>
<p>A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule&rsquo;s entity and $N=4$ randomly sampled one-hop neighbors.</p>
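<p>The fusion pattern, text tokens as queries attending over concatenated atom and knowledge-graph neighbor features, can be sketched as single-head attention in NumPy (the single head and toy dimensions are simplifying assumptions; MolFM uses full multi-head transformer layers):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tok, n_atom, n_nbr = 64, 12, 20, 4          # toy sizes

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(h_text, h_atoms, h_kg, Wq, Wk, Wv):
    """Each text token attends jointly over atoms and KG neighbors."""
    kv = np.vstack([h_atoms, h_kg])              # keys/values: concat of two modalities
    Q, K, V = h_text @ Wq, kv @ Wk, kv @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))         # (n_tok, n_atom + n_nbr)
    return attn @ V

h_text = rng.normal(size=(n_tok, d))             # token features from the text encoder
h_atoms = rng.normal(size=(n_atom, d))           # atom features h_SA from the GIN
h_kg = rng.normal(size=(n_nbr, d))               # TransE features of 4 sampled neighbors
Wq, Wk, Wv = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
fused = cross_attention(h_text, h_atoms, h_kg, Wq, Wk, Wv)
```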
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>MolFM combines four losses:</p>
<p><strong>Structure-text contrastive (STC)</strong> aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:</p>
<p>$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S' \in B} \exp(s(z_{S'}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T' \in B} \exp(s(z_S, z_{T'}) / \tau)} \right]$$</p>
<p>where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.</p>
<p><strong>Cross-modal matching (CMM)</strong> predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder&rsquo;s CLS token:</p>
<p>$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),\; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$</p>
<p><strong>Masked language modeling (MLM)</strong> predicts masked text tokens conditioned on all three modalities:</p>
<p>$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),\; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$</p>
<p><strong>Knowledge graph embedding (KGE)</strong> regularizes entity embeddings with a max-margin TransE loss:</p>
<p>$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$</p>
<p>where $d(h,r,t) = \| f(h) + g(r) - f(t) \|_2$ and $\Delta = 0.2$.</p>
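<p>A minimal NumPy sketch of the max-margin TransE objective with corrupted-head and corrupted-tail negatives (random toy embeddings and single-triplet evaluation; MolFM trains this jointly with the other losses over the full knowledge graph):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d = 100, 10, 32
E = rng.normal(size=(n_ent, d)) * 0.1            # entity embeddings f(.)
R = rng.normal(size=(n_rel, d)) * 0.1            # relation embeddings g(.)

def dist(h, r, t):
    """d(h, r, t) = ||f(h) + g(r) - f(t)||_2"""
    return np.linalg.norm(E[h] + R[r] - E[t])

def kge_loss(h, r, t, margin=0.2):
    """Push the true triplet closer than randomly corrupted ones, by a margin."""
    t_neg = int(rng.integers(n_ent))             # corrupt the tail
    h_neg = int(rng.integers(n_ent))             # corrupt the head
    pos = dist(h, r, t)
    return (max(0.0, pos - dist(h, r, t_neg) + margin)
            + max(0.0, pos - dist(h_neg, r, t) + margin))

loss = kge_loss(3, 1, 7)
```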
<p>The total pre-training loss is:</p>
<p>$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$</p>
<h3 id="theoretical-justifications">Theoretical Justifications</h3>
<p>The authors provide metric learning interpretations for each objective. For CMM, they show that minimizing the loss amounts to assigning higher matching scores to matched triplets and lower scores to unmatched ones, aligning the feature space across all three modalities.</p>
<p>For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:</p>
<p><strong>Lemma 1</strong> (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:</p>
<p>$$\mathcal{L}_{kge}(h, r_s, t) \propto 2\|f(h) - f(t)\| - \mathbb{E}_{\tilde{t}}\|f(h) - f(\tilde{t})\| - \mathbb{E}_{\tilde{h}}\|f(\tilde{h}) - f(t)\|$$</p>
<p>This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.</p>
<p><strong>Lemma 2</strong> (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:</p>
<p>$$\|f(h) - f(t)\| \leq \alpha\,\mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$</p>
<p>where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.</p>
<h2 id="experiments-across-four-downstream-tasks">Experiments Across Four Downstream Tasks</h2>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>MolFM pre-trains on 15K molecules from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from <a href="https://en.wikipedia.org/wiki/DrugBank">DrugBank</a>, <a href="https://en.wikipedia.org/wiki/BindingDB">BindingDB</a>, and additional public databases with heuristic augmentation.</p>
<h3 id="cross-modal-retrieval">Cross-Modal Retrieval</h3>
<p>Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.</p>
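<p>The re-ranking strategy can be sketched as a two-stage procedure. The shortlist size $k$ and the weight <code>lam</code> below are illustrative assumptions; the paper specifies only that the two scores are linearly combined:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_cand, k, lam = 50, 8, 0.5                      # k and the weight lam are assumptions

cos_sim = rng.random(n_cand)                     # cheap dual-encoder similarities
cmm_logit = rng.random(n_cand)                   # multimodal matching scores (expensive)

# Stage 1: shortlist the top-k candidates by cosine similarity alone.
top_k = np.argsort(-cos_sim)[:k]
# Stage 2: re-rank the shortlist with a linear combination of both scores.
combined = lam * cos_sim[top_k] + (1 - lam) * cmm_logit[top_k]
reranked = top_k[np.argsort(-combined)]
```

<p>Only the expensive multimodal matching head runs on the shortlist, which keeps retrieval tractable over large candidate sets.</p>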
<table>
  <thead>
      <tr>
          <th>Mode</th>
          <th>Model</th>
          <th>S-T MRR</th>
          <th>S-T R@1</th>
          <th>S-T R@10</th>
          <th>T-S MRR</th>
          <th>T-S R@1</th>
          <th>T-S R@10</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Zero-shot</td>
          <td>MoMu</td>
          <td>9.89</td>
          <td>5.08</td>
          <td>18.93</td>
          <td>10.33</td>
          <td>4.90</td>
          <td>20.69</td>
      </tr>
      <tr>
          <td>Zero-shot</td>
          <td>MolFM</td>
          <td>21.42</td>
          <td>13.90</td>
          <td>36.21</td>
          <td>23.63</td>
          <td>16.14</td>
          <td>39.54</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MoMu</td>
          <td>34.29</td>
          <td>24.47</td>
          <td>53.84</td>
          <td>34.53</td>
          <td>24.87</td>
          <td>54.25</td>
      </tr>
      <tr>
          <td>Fine-tune</td>
          <td>MolFM</td>
          <td>39.56</td>
          <td>29.76</td>
          <td>58.63</td>
          <td>39.34</td>
          <td>29.39</td>
          <td>58.49</td>
      </tr>
  </tbody>
</table>
<p>MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.</p>
<h3 id="molecule-captioning">Molecule Captioning</h3>
<p>Evaluated on ChEBI-20 using MolT5 decoders. MolFM&rsquo;s structure encoder features are concatenated with the MolT5 encoder outputs.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>BLEU-4</th>
          <th>ROUGE-L</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.457</td>
          <td>0.578</td>
          <td>0.569</td>
          <td>0.547</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.462</td>
          <td>0.575</td>
          <td>0.576</td>
          <td>0.558</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>GraphMVP</td>
          <td>0.491</td>
          <td>0.592</td>
          <td>0.599</td>
          <td>0.570</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.498</td>
          <td>0.594</td>
          <td>0.607</td>
          <td>0.576</td>
      </tr>
  </tbody>
</table>
<h3 id="text-based-molecule-generation">Text-Based Molecule Generation</h3>
<p>Also on ChEBI-20 with MolT5 decoders. MolFM&rsquo;s text features are projected and fed to the decoder.</p>
<table>
  <thead>
      <tr>
          <th>Decoder</th>
          <th>Encoder</th>
          <th>Exact</th>
          <th>Valid</th>
          <th>Morgan FTS</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-base</td>
          <td>MolT5-base</td>
          <td>0.082</td>
          <td>0.786</td>
          <td>0.601</td>
          <td>0.543</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MoMu</td>
          <td>0.183</td>
          <td>0.863</td>
          <td>0.678</td>
          <td>0.580</td>
      </tr>
      <tr>
          <td>MolT5-base</td>
          <td>MolFM</td>
          <td>0.210</td>
          <td>0.892</td>
          <td>0.697</td>
          <td>0.583</td>
      </tr>
  </tbody>
</table>
<h3 id="molecular-property-prediction">Molecular Property Prediction</h3>
<p>On <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder&rsquo;s CLS feature to predict properties.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>BBBP</th>
          <th>Tox21</th>
          <th>ClinTox</th>
          <th>HIV</th>
          <th>BACE</th>
          <th>Avg</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>GraphMVP</td>
          <td>72.4</td>
          <td>74.4</td>
          <td>77.5</td>
          <td>77.0</td>
          <td>81.2</td>
          <td>73.07</td>
      </tr>
      <tr>
          <td>DeepEIK</td>
          <td>72.1</td>
          <td>72.4</td>
          <td>89.7</td>
          <td>75.0</td>
          <td>80.5</td>
          <td>73.27</td>
      </tr>
      <tr>
          <td>MolFM (w/o T+K)</td>
          <td>72.2</td>
          <td>76.6</td>
          <td>78.6</td>
          <td>78.2</td>
          <td>82.6</td>
          <td>73.95</td>
      </tr>
      <tr>
          <td>MolFM (w/ T+K)</td>
          <td>72.9</td>
          <td>77.2</td>
          <td>79.7</td>
          <td>78.8</td>
          <td>83.9</td>
          <td>74.62</td>
      </tr>
  </tbody>
</table>
<p>With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p>MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.</p>
<p>The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.</p>
<p>The authors acknowledge several limitations:</p>
<ol>
<li><strong>Data quality</strong>: The pre-training dataset (15K molecules) is small and may introduce biases</li>
<li><strong>Cold-start problem</strong>: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information</li>
<li><strong>Entity scope</strong>: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (molecules)</td>
          <td>PubChem</td>
          <td>15K molecules</td>
          <td>Follows MoMu&rsquo;s pre-training data</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>S2ORC</td>
          <td>37M paragraphs</td>
          <td>Biomedical literature paragraphs</td>
      </tr>
      <tr>
          <td>Knowledge graph</td>
          <td>DrugBank, BindingDB, public DBs</td>
          <td>49K entities, 3.2M relations</td>
          <td>Constructed with heuristics from MoCL</td>
      </tr>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>PCdes</td>
          <td>Paragraph-level</td>
          <td>Test split</td>
      </tr>
      <tr>
          <td>Captioning/Generation</td>
          <td>ChEBI-20</td>
          <td>-</td>
          <td>Following MolT5 splits</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet</td>
          <td>8 datasets</td>
          <td>Classification tasks, ROC-AUC metric</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: AdamW with weight decay $1 \times 10^{-4}$</li>
<li>Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$</li>
<li>Batch size: 128</li>
<li>Pre-training epochs: 300</li>
<li>Knowledge graph neighbors per molecule: $N = 4$</li>
<li>Temperature: $\tau = 0.1$</li>
<li>Margin: $\Delta = 0.2$</li>
</ul>
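<p>The warmup-plus-cosine schedule above can be written as a small function. The peak, floor, and warmup values come from the list above; the schedule shape itself is a standard sketch, not the authors' released code.</p>

```python
import math

def molfm_lr(step, total_steps, warmup=2000, peak=1e-4, floor=1e-5):
    # Linear warmup to the peak learning rate over `warmup` iterations,
    # then cosine annealing down to `floor` (hyperparameter values from
    # the paper; schedule shape is a generic warmup+cosine sketch).
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```

<p>At step 0 the rate is 0, it peaks at $10^{-4}$ after 2,000 iterations, and it decays smoothly to $10^{-5}$ by the final step.</p>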
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
          <th>Initialization</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Graph encoder</td>
          <td>5-layer GIN</td>
          <td>1.8M</td>
          <td>GraphMVP</td>
      </tr>
      <tr>
          <td>Text encoder</td>
          <td>6-layer Transformer</td>
          <td>61.8M</td>
          <td>KV-PLM (first 6 layers)</td>
      </tr>
      <tr>
          <td>Knowledge encoder</td>
          <td>TransE</td>
          <td>12.6M</td>
          <td>Trained 500 epochs on KG</td>
      </tr>
      <tr>
          <td>Multimodal encoder</td>
          <td>6-layer Transformer + cross-attention</td>
          <td>61.8M</td>
          <td>KV-PLM (last 6 layers)</td>
      </tr>
      <tr>
          <td><strong>Total</strong></td>
          <td></td>
          <td><strong>~138M</strong></td>
          <td></td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metrics</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Cross-modal retrieval</td>
          <td>MRR, Recall@1/5/10</td>
      </tr>
      <tr>
          <td>Molecule captioning</td>
          <td>BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol</td>
      </tr>
      <tr>
          <td>Text-to-molecule generation</td>
          <td>BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>ROC-AUC per dataset</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>4 NVIDIA A100 GPUs for pre-training</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/BioFM/OpenBioMed">OpenBioMed</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation including MolFM</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Yang, K., Hong, M., Liu, X. Y., &amp; Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. <em>arXiv preprint arXiv:2307.09484</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2023molfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MolFM: A Multimodal Molecular Foundation Model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2307.09484}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>BioT5: Cross-Modal Integration of Biology and Chemistry</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/biot5-cross-modal-biology/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/biot5-cross-modal-biology/</guid><description>BioT5 is a T5-based pretraining framework that jointly models molecules, proteins, and natural language using SELFIES for robust molecular generation.</description><content:encoded><![CDATA[<h2 id="a-unified-pretraining-framework-for-molecules-proteins-and-text">A Unified Pretraining Framework for Molecules, Proteins, and Text</h2>
<p>BioT5 is a <strong>Method</strong> paper that introduces a comprehensive T5-based pretraining framework for cross-modal integration of molecules, proteins, and natural language. The primary contribution is a multi-task pretraining approach that uses <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> (instead of <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a>) for 100% valid molecular representations, separate tokenization for each modality, and a combination of masked language modeling and translation objectives to connect structured biological data with unstructured scientific text. After fine-tuning, BioT5 (252M parameters) achieves state-of-the-art performance on 10 out of 15 downstream tasks spanning molecule property prediction, protein property prediction, drug-target interaction, protein-protein interaction, molecule captioning, and text-based molecule generation.</p>
<h2 id="bridging-the-gap-between-molecular-sequences-and-scientific-knowledge">Bridging the Gap Between Molecular Sequences and Scientific Knowledge</h2>
<p>Prior cross-modal models in computational biology face three recurring challenges. First, models like MolT5 and MolXPT rely on SMILES to represent molecules, but SMILES strings are syntactically fragile: random perturbations or model-generated sequences frequently produce invalid molecular structures. Edwards et al. (2022) and Li et al. (2023) both highlight this validity problem as a bottleneck for text-to-molecule generation. Second, the contextual information surrounding molecular and protein names in scientific literature (e.g., mentions in <a href="https://en.wikipedia.org/wiki/PubMed">PubMed</a> abstracts that describe properties, interactions, and experimental results) remains underutilized. Most models either ignore this context or treat it identically to structured database entries. Third, existing approaches like MolT5 and <a href="/notes/computational-chemistry/llms-for-chemistry/galactica-large-language-model-for-science/">Galactica</a> share a single tokenizer and embedding space across molecules, proteins, and text. This leads to chemically incorrect tokenization: the bromine atom &ldquo;Br&rdquo; in SMILES gets split into &ldquo;B&rdquo; (boron) and &ldquo;r&rdquo;, producing erroneous downstream predictions.</p>
<p>BioT5 addresses all three issues simultaneously by adopting SELFIES for molecular representation, extracting entity-linked contextual knowledge from PubMed, and employing separate vocabularies for each modality.</p>
<h2 id="selfies-separate-tokenization-and-multi-task-pretraining">SELFIES, Separate Tokenization, and Multi-Task Pretraining</h2>
<p>The core innovations of BioT5 center on three design decisions:</p>
<h3 id="selfies-for-robust-molecular-representation">SELFIES for Robust Molecular Representation</h3>
<p>BioT5 replaces SMILES with SELFIES (Self-referencing Embedded Strings) for all molecular representations. Any string assembled from symbols in the SELFIES alphabet decodes to a chemically valid molecular structure, guaranteeing 100% validity in generation tasks. Molecules from ZINC20 are converted from SMILES to SELFIES during data preprocessing.</p>
<h3 id="modality-specific-tokenization">Modality-Specific Tokenization</h3>
<p>Rather than sharing a single SentencePiece vocabulary across modalities, BioT5 maintains three separate dictionaries:</p>
<ul>
<li><strong>Molecules</strong>: Each SELFIES token corresponds to a chemically meaningful atom group enclosed in brackets (e.g., <code>[C]</code>, <code>[=C]</code>, <code>[Br]</code>).</li>
<li><strong>Proteins</strong>: Amino acids are prefixed with a special <code>&lt;p&gt;</code> token to distinguish them from text characters (e.g., <code>&lt;p&gt;M</code>, <code>&lt;p&gt;K</code>, <code>&lt;p&gt;R</code>).</li>
<li><strong>Text</strong>: The standard T5 vocabulary is retained.</li>
</ul>
<p>This prevents semantic conflation across modalities. The total vocabulary size is 35,073, and the model comprises 252M parameters using the T5-v1.1-base architecture.</p>
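<p>A minimal sketch of the three-dictionary idea: the token formats (bracketed groups for SELFIES, <code>&lt;p&gt;</code>-prefixed amino acids) follow the paper, but the helper functions themselves are illustrative rather than the released implementation.</p>

```python
import re

# One chemically meaningful token per bracketed SELFIES group.
SELFIES_TOKEN = re.compile(r"\[[^\]]*\]")

def tokenize_selfies(s):
    return SELFIES_TOKEN.findall(s)

def tokenize_protein(fasta):
    # Prefix every amino acid so 'M' (methionine) never collides
    # with the letter 'M' in the text vocabulary.
    return [f"<p>{aa}" for aa in fasta]

tokenize_selfies("[C][=C][Br]")  # -> ['[C]', '[=C]', '[Br]']
tokenize_protein("MKR")          # -> ['<p>M', '<p>K', '<p>R']
```

<p>Because the three vocabularies are disjoint, a molecule token, an amino-acid token, and a text token each receive their own embedding even when their surface characters overlap.</p>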
<h3 id="multi-task-pretraining-objectives">Multi-Task Pretraining Objectives</h3>
<p>BioT5 uses six pretraining tasks organized into three categories:</p>
<ol>
<li><strong>Single-modal T5 objective</strong>: Standard span corruption and recovery applied independently to molecule SELFIES (task 1), protein <a href="https://en.wikipedia.org/wiki/FASTA_format">FASTA</a> (task 2), and general text from C4 (task 3).</li>
<li><strong>Wrapped text T5 objective</strong> (task 4): Applied to PubMed articles where molecular names are replaced with corresponding SELFIES strings and gene names are appended with protein FASTA sequences, using BERN2 for named entity recognition and entity linking.</li>
<li><strong>Bidirectional translation</strong> (tasks 5 and 6): Molecule SELFIES to text description and vice versa (using 339K pairs from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>), and protein FASTA to text description and vice versa (using 569K pairs from <a href="https://en.wikipedia.org/wiki/UniProt">Swiss-Prot</a>).</li>
</ol>
<p>The translation direction is randomly sampled with probability 0.5 for each example. For downstream tasks, BioT5 uses prompt-based fine-tuning to cast all tasks into a sequence generation format, reducing the gap between pretraining and fine-tuning.</p>
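<p>The per-example direction sampling for tasks 5 and 6 can be sketched as follows; the prompt strings here are hypothetical placeholders, not BioT5's actual prompts.</p>

```python
import random

def make_translation_example(selfies, description, rng):
    # Direction is sampled with p = 0.5 per example, as in the paper;
    # the prompt wording below is illustrative only.
    if rng.random() < 0.5:
        return (f"Describe the molecule: {selfies}", description)
    return (f"Generate a molecule matching: {description}", selfies)

rng = random.Random(0)
pairs = [make_translation_example("[C][O]", "a small oxygenated fragment", rng)
         for _ in range(1000)]
mol_to_text = sum(p[0].startswith("Describe") for p in pairs)
# Roughly half the examples run in each direction.
```
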
<h2 id="evaluation-across-15-downstream-tasks">Evaluation Across 15 Downstream Tasks</h2>
<p>BioT5 is evaluated on 15 tasks organized into three categories: single-instance prediction, multi-instance prediction, and cross-modal generation.</p>
<h3 id="molecule-property-prediction-moleculenet">Molecule Property Prediction (MoleculeNet)</h3>
<p>BioT5 is evaluated on six binary classification tasks from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> using scaffold splitting: BBBP, Tox21, ClinTox, HIV, BACE, and SIDER. Results are averaged over three random runs.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GEM</th>
          <th>MolXPT</th>
          <th>BioT5</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>72.4</td>
          <td>80.0</td>
          <td>77.7</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>78.1</td>
          <td>77.1</td>
          <td>77.9</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>90.1</td>
          <td>95.3</td>
          <td>95.4</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>78.1</td>
          <td><strong>81.0</strong></td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>85.6</td>
          <td>88.4</td>
          <td><strong>89.4</strong></td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>67.2</td>
          <td>71.7</td>
          <td><strong>73.2</strong></td>
      </tr>
      <tr>
          <td><strong>Avg</strong></td>
          <td>79.0</td>
          <td>81.9</td>
          <td><strong>82.4</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best average AUROC (82.4) across all six datasets, surpassing both GNN-based methods (GEM) and language model baselines (MolXPT).</p>
<h3 id="protein-property-prediction-peer-benchmark">Protein Property Prediction (PEER Benchmark)</h3>
<p>On the PEER benchmark, BioT5 is evaluated on protein solubility and subcellular localization prediction:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>Solubility (Acc)</th>
          <th>Localization (Acc)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESM-1b</td>
          <td>652.4M</td>
          <td>70.23</td>
          <td><strong>92.40</strong></td>
      </tr>
      <tr>
          <td>ProtBert</td>
          <td>419.9M</td>
          <td>68.15</td>
          <td>91.32</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252.1M</td>
          <td><strong>74.65</strong></td>
          <td>91.69</td>
      </tr>
  </tbody>
</table>
<p>BioT5 achieves the best solubility prediction accuracy (74.65%) despite having roughly 1.7x fewer parameters than ProtBert and 2.6x fewer than ESM-1b, both dedicated protein language models.</p>
<h3 id="drug-target-interaction-prediction">Drug-Target Interaction Prediction</h3>
<p>BioT5 is evaluated on three DTI datasets (BioSNAP, Human, BindingDB) with five random runs:</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>BioSNAP AUROC</th>
          <th>Human AUROC</th>
          <th>BindingDB AUROC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DrugBAN</td>
          <td>0.903</td>
          <td>0.982</td>
          <td>0.960</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td><strong>0.937</strong></td>
          <td><strong>0.989</strong></td>
          <td><strong>0.963</strong></td>
      </tr>
  </tbody>
</table>
<p>BioT5 consistently outperforms DrugBAN and other specialized DTI models across all three datasets.</p>
<h3 id="molecule-captioning-and-text-based-molecule-generation">Molecule Captioning and Text-Based Molecule Generation</h3>
<p>On the ChEBI-20 dataset, BioT5 outperforms all baselines in molecule captioning:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Params</th>
          <th>BLEU-4</th>
          <th>METEOR</th>
          <th>Text2Mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MolT5-large</td>
          <td>783M</td>
          <td>0.508</td>
          <td>0.614</td>
          <td>0.582</td>
      </tr>
      <tr>
          <td>MolXPT</td>
          <td>350M</td>
          <td>0.505</td>
          <td>0.626</td>
          <td>0.594</td>
      </tr>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td><strong>0.556</strong></td>
          <td><strong>0.656</strong></td>
          <td><strong>0.603</strong></td>
      </tr>
  </tbody>
</table>
<p>For text-based molecule generation, BioT5 achieves an exact match score of 0.413 (vs. 0.311 for MolT5-large) while maintaining 100% validity, compared to 90.5% for MolT5-large. This demonstrates the direct benefit of SELFIES: every generated sequence is a valid molecule.</p>
<h3 id="protein-protein-interaction-prediction">Protein-Protein Interaction Prediction</h3>
<p>On the PEER PPI benchmarks (Yeast and Human), BioT5 achieves competitive results, outperforming fully fine-tuned ProtBert and ESM-1b on the Yeast dataset (64.89% vs. 63.72% for ProtBert) and placing second on Human (86.22% vs. 88.06% for ESM-1b with frozen weights).</p>
<h2 id="key-findings-limitations-and-future-directions">Key Findings, Limitations, and Future Directions</h2>
<p>BioT5 demonstrates that integrating molecular, protein, and textual modalities within a single pretraining framework yields consistent improvements across diverse biological tasks. Three factors drive BioT5&rsquo;s performance: (1) SELFIES guarantees 100% molecular validity in generation tasks, eliminating a persistent failure mode of SMILES-based models; (2) separate tokenization preserves the semantic integrity of each modality; (3) wrapped text pretraining on PubMed provides contextual biological knowledge that pure sequence models miss.</p>
<p>The authors acknowledge several limitations. BioT5 requires full-parameter fine-tuning for each downstream task because instruction-tuning does not generalize across tasks, and combining datasets via instructions causes data leakage (the authors note overlaps between BindingDB training data and BioSNAP/Human test sets). The model only handles sequence-format bio-entities and does not incorporate 2D or 3D structural information. Additional biological modalities such as DNA/RNA sequences and cell-level data are also left for future work.</p>
<p>The authors also note risks: BioT5 could potentially be misused to generate dangerous molecules, and it may fail to generate effective therapeutic molecules or produce compounds with adverse side effects.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining (molecules)</td>
          <td>ZINC20</td>
          <td>~300M molecules</td>
          <td>Converted from SMILES to SELFIES</td>
      </tr>
      <tr>
          <td>Pretraining (proteins)</td>
          <td><a href="https://en.wikipedia.org/wiki/UniProt">UniRef50</a></td>
          <td>27M proteins</td>
          <td>Filtered by length</td>
      </tr>
      <tr>
          <td>Pretraining (text)</td>
          <td>C4</td>
          <td>Large</td>
          <td>Standard T5 corpus</td>
      </tr>
      <tr>
          <td>Pretraining (wrapped text)</td>
          <td>PubMed</td>
          <td>33M articles</td>
          <td>Entity linking via BERN2</td>
      </tr>
      <tr>
          <td>Pretraining (molecule-text pairs)</td>
          <td>PubChem</td>
          <td>339K pairs</td>
          <td>Excludes ChEBI-20 molecules</td>
      </tr>
      <tr>
          <td>Pretraining (protein-text pairs)</td>
          <td>Swiss-Prot</td>
          <td>569K pairs</td>
          <td>High-quality annotations</td>
      </tr>
      <tr>
          <td>Evaluation (molecular properties)</td>
          <td>MoleculeNet</td>
          <td>6 datasets</td>
          <td>Scaffold splitting</td>
      </tr>
      <tr>
          <td>Evaluation (protein properties)</td>
          <td>PEER</td>
          <td>2 tasks</td>
          <td>Solubility and localization</td>
      </tr>
      <tr>
          <td>Evaluation (DTI)</td>
          <td>BioSNAP, Human, BindingDB</td>
          <td>3 datasets</td>
          <td>Binary classification</td>
      </tr>
      <tr>
          <td>Evaluation (PPI)</td>
          <td>Yeast, Human</td>
          <td>2 datasets</td>
          <td>From PEER benchmark</td>
      </tr>
      <tr>
          <td>Evaluation (generation)</td>
          <td>ChEBI-20</td>
          <td>33K pairs</td>
          <td>Molecule captioning and text-to-molecule</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5-v1.1-base (encoder-decoder transformer)</li>
<li>Optimizer: AdamW with RMS scaling</li>
<li>Learning rate: cosine annealing, base $1 \times 10^{-2}$, minimum $1 \times 10^{-5}$</li>
<li>Warmup steps: 10,000</li>
<li>Dropout: 0.0</li>
<li>Maximum input length: 512 tokens</li>
<li>Pretraining steps: 350K</li>
<li>Batch size: 96 per GPU (6 data types per batch)</li>
<li>Prompt-based fine-tuning for all downstream tasks</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Parameters</th>
          <th>Vocabulary Size</th>
          <th>Architecture</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BioT5</td>
          <td>252M</td>
          <td>35,073</td>
          <td>T5-v1.1-base</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Molecule property prediction: AUROC on 6 MoleculeNet tasks (scaffold split, 3 runs)</li>
<li>Protein property prediction: accuracy on PEER benchmark (3 runs)</li>
<li>Drug-target interaction: AUROC, AUPRC, accuracy on 3 DTI datasets (5 runs)</li>
<li>Protein-protein interaction: accuracy on 2 PPI datasets (3 runs)</li>
<li>Molecule captioning: BLEU, ROUGE, METEOR, Text2Mol on ChEBI-20</li>
<li>Text-based molecule generation: BLEU, exact match, fingerprint similarities, FCD, validity on ChEBI-20</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>8x NVIDIA A100 80GB GPUs for pretraining</li>
<li>Codebase: nanoT5</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/QizhiPei/BioT5">BioT5 Code</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Pei, Q., Zhang, W., Zhu, J., Wu, K., Gao, K., Wu, L., Xia, Y., &amp; Yan, R. (2023). BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations. <em>Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing</em>, 1102-1123. <a href="https://doi.org/10.18653/v1/2023.emnlp-main.70">https://doi.org/10.18653/v1/2023.emnlp-main.70</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{pei2023biot5,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Pei, Qizhi and Zhang, Wei and Zhu, Jinhua and Wu, Kehan and Gao, Kaiyuan and Wu, Lijun and Xia, Yingce and Yan, Rui}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{1102--1123}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Association for Computational Linguistics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.18653/v1/2023.emnlp-main.70}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MG-BERT: Graph BERT for Molecular Property Prediction</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/mg-bert-molecular-graph-bert/</guid><description>MG-BERT integrates graph neural network message passing into BERT with masked atom pretraining on 1.7M molecules for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-aware-bert-for-molecular-property-prediction">A Graph-Aware BERT for Molecular Property Prediction</h2>
<p>MG-BERT is a <strong>Method</strong> paper that adapts the BERT pretraining paradigm from NLP to molecular graphs. The primary contribution is a modified Transformer architecture that replaces global self-attention with bond-based local attention, allowing atoms to exchange information only through chemical bonds. This creates a deep message-passing network that avoids the oversmoothing problem of conventional graph neural networks (GNNs). Combined with a masked atom prediction pretraining strategy on 1.7 million unlabeled molecules from ChEMBL, MG-BERT learns context-sensitive atomic representations that transfer effectively to downstream property prediction tasks.</p>
<h2 id="data-scarcity-in-molecular-property-prediction">Data Scarcity in Molecular Property Prediction</h2>
<p><a href="/notes/computational-chemistry/chemical-language-models/property-prediction/">Molecular property prediction</a> is central to drug discovery, particularly for ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoints. While deep learning has advanced many domains, molecular property prediction faces a persistent challenge: labeled data scarcity. ADMET measurements require expensive, time-consuming experiments, and typical datasets contain only hundreds to thousands of examples.</p>
<p>Prior approaches fall into three categories, each with limitations:</p>
<ol>
<li><strong>Feature engineering</strong> (molecular fingerprints, descriptors): Requires expert design, suffers from low scalability, and fixed representations cannot be optimized for specific tasks.</li>
<li><strong>SMILES-based deep learning</strong> (CNNs, LSTMs, Transformers on SMILES strings): Must learn to parse molecular information from complex string syntax, increasing learning difficulty. Autoencoder-based methods (e.g., <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/cddd-translation-molecular-descriptors/">CDDD</a>) learn fixed representations that cannot be fine-tuned.</li>
<li><strong>Graph neural networks</strong> (GAT, GCN): Can learn directly from molecular topology, but are limited to 2-3 layers due to oversmoothing, restricting their capacity to capture deep-level patterns.</li>
</ol>
<p>The BERT model from NLP demonstrated that self-supervised pretraining on large unlabeled corpora followed by fine-tuning on small labeled datasets can substantially improve downstream performance. <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a> applied this idea to SMILES strings directly, but suffered from interpretability issues due to auxiliary characters in the SMILES syntax. MG-BERT addresses these limitations by operating directly on molecular graphs.</p>
<h2 id="bond-based-local-attention-and-masked-atom-pretraining">Bond-Based Local Attention and Masked Atom Pretraining</h2>
<p>The core innovation of MG-BERT has two components: a modified Transformer architecture for molecular graphs and a self-supervised pretraining strategy.</p>
<h3 id="architecture-modifications">Architecture Modifications</h3>
<p>The original BERT model uses three components: an embedding layer, Transformer encoder layers, and a task-specific output layer. MG-BERT makes four key modifications:</p>
<ol>
<li>
<p><strong>Atom embeddings replace word embeddings.</strong> The dictionary contains 16 tokens: 13 common atom types ([H], [C], [N], [O], [F], [S], [Cl], [P], [Br], [B], [I], [Si], [Se]), plus [UNK] for rare atoms, [MASK] for pretraining, and [GLOBAL] for graph-level readout.</p>
</li>
<li>
<p><strong>No positional encoding.</strong> Unlike sequential text, atoms in a molecular graph have no inherent ordering, so positional embeddings are removed.</p>
</li>
<li>
<p><strong>Local attention replaces global attention.</strong> The adjacency matrix of the molecular graph is used as a visibility matrix to modulate the attention scores. Each atom can only attend to atoms connected by chemical bonds. Formally, the attention is constrained so that:</p>
</li>
</ol>
<p>$$A'_{ij} = \begin{cases} A_{ij} &amp; \text{if a bond exists between atoms } i \text{ and } j \\ -\infty &amp; \text{otherwise} \end{cases}$$</p>
<p>where $A_{ij}$ is the standard scaled dot-product attention score. This local message passing makes MG-BERT a variant of GNN, but one that can stack many layers (6 in the medium configuration) without oversmoothing, thanks to the residual connections inherited from the Transformer architecture.</p>
<ol start="4">
<li><strong>Supernode for graph-level readout.</strong> A [GLOBAL] supernode is added to each molecular graph, connected to all atoms. This node aggregates information from the entire molecule and serves as the molecular representation for downstream prediction.</li>
</ol>
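<p>The visibility-matrix mechanism above can be sketched in a few lines of NumPy. This is a single-head toy illustration under assumed shapes, not the authors' implementation; in practice the mask is applied per head inside multi-head attention.</p>
<pre><code class="language-python">import numpy as np

def local_attention(Q, K, V, adj):
    """Single-head scaled dot-product attention restricted to bonded atoms.

    Q, K, V: (n_atoms, d) arrays; adj: (n_atoms, n_atoms) 0/1 adjacency,
    with self-loops so each atom also attends to itself.
    Returns (output, attention_weights).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # standard attention scores A_ij
    scores = np.where(adj > 0, scores, -np.inf)   # hide non-bonded pairs (A'_ij)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy 3-atom chain A-B-C: atom A never attends directly to atom C.
adj = np.array([[1, 1, 0],
                [1, 1, 1],
                [0, 1, 1]], dtype=float)
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
out, w = local_attention(Q, K, V, adj)
</code></pre>
<p>With many stacked layers, information still propagates between distant atoms, but only along bond paths, one hop per layer.</p>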
<h3 id="masked-atom-prediction">Masked Atom Prediction</h3>
<p>The pretraining strategy mirrors BERT&rsquo;s masked language model but operates on atoms:</p>
<ul>
<li>15% of atoms in each molecule are randomly selected (at least one atom per molecule)</li>
<li>Of selected atoms: 80% are replaced with [MASK], 10% are randomly replaced with another atom type, and 10% remain unchanged</li>
<li>The model is trained to predict the original atom type at masked positions</li>
<li>Loss is computed only at masked positions</li>
</ul>
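<p>The selection and corruption rule can be sketched in plain Python. The helper below is hypothetical (the paper does not publish this exact routine), but it follows the 15%/80%/10%/10% scheme described above, including the at-least-one-atom minimum.</p>
<pre><code class="language-python">import random

ATOMS = ["H", "C", "N", "O", "F", "S", "Cl", "P", "Br", "B", "I", "Si", "Se"]

def mask_atoms(atoms, rate=0.15, rng=random):
    """Apply BERT-style corruption to a list of atom tokens.

    Returns (corrupted_tokens, target_positions); the loss would be
    computed only at target_positions, against the original tokens.
    """
    n = len(atoms)
    k = max(1, round(n * rate))          # at least one atom per molecule
    positions = rng.sample(range(n), k)
    corrupted = list(atoms)
    for i in positions:
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"        # 80%: mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(ATOMS)  # 10%: random atom type
        # remaining 10%: left unchanged
    return corrupted, sorted(positions)
</code></pre>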
<h3 id="model-configurations">Model Configurations</h3>
<p>Three model sizes were compared:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Layers</th>
          <th>Heads</th>
          <th>Embedding Size</th>
          <th>FFN Size</th>
          <th>Recovery Accuracy</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MG-BERT Small</td>
          <td>3</td>
          <td>2</td>
          <td>128</td>
          <td>256</td>
          <td>95.27%</td>
      </tr>
      <tr>
          <td>MG-BERT Medium</td>
          <td>6</td>
          <td>4</td>
          <td>256</td>
          <td>512</td>
          <td>98.31%</td>
      </tr>
      <tr>
          <td>MG-BERT Large</td>
          <td>12</td>
          <td>8</td>
          <td>576</td>
          <td>1152</td>
          <td>98.35%</td>
      </tr>
  </tbody>
</table>
<p>The medium configuration was selected for all experiments because it achieved the best downstream performance, despite the large model having slightly higher pretraining recovery accuracy. The authors attribute this to overfitting risk with the larger model.</p>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<h3 id="pretraining">Pretraining</h3>
<p>MG-BERT was pretrained on 1.7 million compounds randomly selected from ChEMBL, with 10% held out for evaluation (1.53M training molecules). Molecules were converted to 2D undirected graphs using RDKit, with hydrogen atoms explicitly included. The model was pretrained for 10 epochs using Adam with learning rate 1e-4 and batch size 256.</p>
<h3 id="fine-tuning-datasets">Fine-tuning Datasets</h3>
<p>Sixteen datasets covering ADMET endpoints and common molecular properties were collected from ADMETlab and <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>:</p>
<table>
  <thead>
      <tr>
          <th>Type</th>
          <th>Dataset</th>
          <th>Category</th>
          <th>Size</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Regression</td>
          <td>Caco2</td>
          <td>Absorption</td>
          <td>979</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logD</td>
          <td>Physicochemical</td>
          <td>10,354</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>logS</td>
          <td>Physicochemical</td>
          <td>5,045</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>PPB</td>
          <td>Distribution</td>
          <td>1,480</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>tox</td>
          <td>Toxicity</td>
          <td>7,295</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>ESOL</td>
          <td>Physicochemical</td>
          <td>1,128</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>FreeSolv</td>
          <td>Physicochemical</td>
          <td>642</td>
      </tr>
      <tr>
          <td>Regression</td>
          <td>Lipo</td>
          <td>Physicochemical</td>
          <td>4,200</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Ames</td>
          <td>Toxicity</td>
          <td>6,719</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBB</td>
          <td>Distribution</td>
          <td>1,855</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>FDAMDD</td>
          <td>Toxicity</td>
          <td>795</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>H_HT</td>
          <td>Toxicity</td>
          <td>2,170</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_inh</td>
          <td>Absorption</td>
          <td>2,125</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>Pgp_sub</td>
          <td>Absorption</td>
          <td>1,210</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BACE</td>
          <td>Biophysics</td>
          <td>1,513</td>
      </tr>
      <tr>
          <td>Classification</td>
          <td>BBBP</td>
          <td>Physiology</td>
          <td>2,039</td>
      </tr>
  </tbody>
</table>
<p>Datasets were split 8:1:1 (train:validation:test) with stratified sampling by SMILES length. Each experiment was repeated 10 times with random splits, reporting mean and standard deviation. Regression was evaluated by R-squared, classification by ROC-AUC. Early stopping with a maximum of 100 epochs was used.</p>
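<p>One plausible reading of a length-stratified 8:1:1 split is sketched below; the authors' exact procedure is not specified, so <code>stratified_split</code> is an assumption (sort by length, then deal consecutive items into the three splits so each sees the full length range).</p>
<pre><code class="language-python">import random

def stratified_split(smiles, seed=0):
    """8:1:1 train/val/test split, roughly stratified by SMILES length."""
    rng = random.Random(seed)
    # Sort by length, breaking ties randomly, then deal round-robin by decade.
    order = sorted(range(len(smiles)), key=lambda i: (len(smiles[i]), rng.random()))
    train, val, test = [], [], []
    for k, i in enumerate(order):
        r = k % 10
        (train if r < 8 else val if r == 8 else test).append(smiles[i])
    return train, val, test
</code></pre>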
<h3 id="baselines">Baselines</h3>
<p>Five baselines were compared:</p>
<ol>
<li><strong>ECFP4-XGBoost</strong>: Extended connectivity fingerprints (diameter 4) with gradient-boosted trees</li>
<li><strong>GAT</strong>: Graph Attention Network</li>
<li><strong>GCN</strong>: Graph Convolutional Network</li>
<li><strong>CDDD</strong>: Continuous and Data-Driven Descriptors (pretrained RNN encoder on SMILES with a fully connected network)</li>
<li><strong>SMILES-BERT</strong>: Original BERT applied directly to SMILES strings</li>
</ol>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>Two ablation studies were conducted:</p>
<ol>
<li><strong>Pretraining effectiveness</strong>: Comparing pretrained vs. non-pretrained MG-BERT under identical hyperparameters</li>
<li><strong>Hydrogen atoms</strong>: Comparing MG-BERT with and without explicit hydrogen atoms in the molecular graph</li>
</ol>
<h2 id="consistent-improvements-across-admet-benchmarks">Consistent Improvements Across ADMET Benchmarks</h2>
<h3 id="main-results">Main Results</h3>
<p>MG-BERT consistently outperformed all baselines across all 16 datasets. Key results on the 11 ADMET datasets:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>ECFP4-XGBoost</th>
          <th>GAT</th>
          <th>GCN</th>
          <th>CDDD</th>
          <th>SMILES-BERT</th>
          <th>MG-BERT</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Caco2 (R2)</td>
          <td>61.41</td>
          <td>69.16</td>
          <td>67.15</td>
          <td>73.42</td>
          <td>72.39</td>
          <td><strong>74.68</strong></td>
      </tr>
      <tr>
          <td>logD (R2)</td>
          <td>70.84</td>
          <td>84.62</td>
          <td>86.22</td>
          <td>85.85</td>
          <td>86.31</td>
          <td><strong>87.46</strong></td>
      </tr>
      <tr>
          <td>logS (R2)</td>
          <td>73.73</td>
          <td>84.06</td>
          <td>83.47</td>
          <td>84.01</td>
          <td>85.20</td>
          <td><strong>87.66</strong></td>
      </tr>
      <tr>
          <td>PPB (R2)</td>
          <td>55.11</td>
          <td>59.96</td>
          <td>57.34</td>
          <td>54.12</td>
          <td>62.37</td>
          <td><strong>65.94</strong></td>
      </tr>
      <tr>
          <td>Ames (AUC)</td>
          <td>87.21</td>
          <td>86.38</td>
          <td>87.04</td>
          <td>86.82</td>
          <td>87.69</td>
          <td><strong>89.33</strong></td>
      </tr>
      <tr>
          <td>BBB (AUC)</td>
          <td>94.62</td>
          <td>93.03</td>
          <td>92.67</td>
          <td>94.44</td>
          <td>94.02</td>
          <td><strong>95.41</strong></td>
      </tr>
      <tr>
          <td>BBBP (AUC)</td>
          <td>89.16</td>
          <td>90.33</td>
          <td>90.74</td>
          <td>91.12</td>
          <td>91.32</td>
          <td><strong>92.08</strong></td>
      </tr>
  </tbody>
</table>
<p>The overall improvement across all datasets was 28.1% (7.02% on classification, 21.28% on regression). Improvements were statistically significant at the 95% confidence level (paired t-test, P &lt;= 0.001).</p>
<h3 id="pretraining-ablation">Pretraining Ablation</h3>
<p>Pretraining improved performance by more than 2% on all datasets. The benefit was largest for small datasets: Caco2 improved by approximately 10 percentage points (64.79 to 74.68 R2), and FDAMDD improved by about 7.5 points (80.76 to 88.23 AUC). This confirms that self-supervised pretraining effectively addresses the labeled data scarcity problem.</p>
<h3 id="hydrogen-atom-ablation">Hydrogen Atom Ablation</h3>
<p>Including explicit hydrogen atoms improved pretraining recovery accuracy from 92.25% to 98.31% and consistently improved downstream performance. The authors provide an intuitive explanation: hydrogen atoms help determine bond counts for neighboring atoms, which is critical for the masked atom recovery task. They also show that removing hydrogens can make structurally distinct molecules (e.g., benzene and cyclohexane) indistinguishable at the graph level.</p>
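<p>The benzene/cyclohexane point can be made concrete with a toy adjacency-list construction. This is a hypothetical illustration that, like MG-BERT's 2D graphs without bond-type features, ignores bond orders: without explicit hydrogens both molecules collapse to the same six-carbon ring, while with hydrogens their degree patterns differ.</p>
<pre><code class="language-python">def heavy_atom_graph(n_carbons, h_per_carbon):
    """Adjacency list for a carbon ring with explicit hydrogens attached."""
    adj = {i: set() for i in range(n_carbons)}
    for i in range(n_carbons):                  # ring bonds
        j = (i + 1) % n_carbons
        adj[i].add(j); adj[j].add(i)
    nxt = n_carbons
    for i in range(n_carbons):                  # attach hydrogens
        for _ in range(h_per_carbon):
            adj[i].add(nxt); adj[nxt] = {i}
            nxt += 1
    return adj

def degree_multiset(adj):
    return sorted(len(nb) for nb in adj.values())

benzene = heavy_atom_graph(6, 1)      # aromatic carbon: one H each
cyclohexane = heavy_atom_graph(6, 2)  # sp3 carbon: two H each
</code></pre>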
<h3 id="interpretability-via-attention-visualization">Interpretability via Attention Visualization</h3>
<p>The authors provide two forms of interpretability analysis:</p>
<ol>
<li>
<p><strong>t-SNE visualization of atomic representations</strong>: Pretrained atomic representations cluster by atom type and, more specifically, by local chemical environment (e.g., aromatic carbons separate from aliphatic carbons, C-N bonds from C-O bonds). This demonstrates that pretraining captures neighborhood context beyond simple atom identity.</p>
</li>
<li>
<p><strong>Attention weight visualization</strong>: On the logD task, the supernode&rsquo;s attention focuses on polar groups (which govern lipophilicity). On the Ames mutagenicity task, attention concentrates on known mutagenic structural alerts (acyl chloride, nitrosamide, and azide groups). This provides chemically meaningful explanations for predictions.</p>
</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The paper does not extensively discuss limitations, but several can be identified:</p>
<ul>
<li>The model uses only 2D molecular topology (atom types and bonds) without 3D conformational information or bond-type features</li>
<li>The atom dictionary is limited to 13 common types plus [UNK], which may lose information for molecules containing rarer elements</li>
<li>Evaluation is limited to ADMET-focused datasets; broader chemical spaces (e.g., materials, catalysts) are not tested</li>
<li>The comparison baselines do not include other graph-based pretraining methods (e.g., the contemporaneous Strategies for Pre-training Graph Neural Networks by Hu et al.)</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ChEMBL (random subset)</td>
          <td>1.7M molecules (1.53M train)</td>
          <td>10% held out for evaluation</td>
      </tr>
      <tr>
          <td>Fine-tuning</td>
          <td>ADMETlab + MoleculeNet</td>
          <td>16 datasets (642-10,354 molecules)</td>
          <td>8:1:1 splits, stratified by SMILES length</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Optimizer</strong>: Adam (pretraining: lr=1e-4, batch=256; fine-tuning: lr from {1e-5, 5e-5, 1e-4}, batch from {16, 32, 64})</li>
<li><strong>Pretraining epochs</strong>: 10</li>
<li><strong>Fine-tuning</strong>: Up to 100 epochs with early stopping</li>
<li><strong>Dropout</strong>: Optimized per task in range [0.0, 0.5]</li>
<li><strong>Masking</strong>: 15% of atoms (80% [MASK], 10% random, 10% unchanged)</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: MG-BERT Medium (6 layers, 4 heads, embedding size 256, FFN size 512)</li>
<li><strong>Molecule processing</strong>: RDKit for graph conversion with explicit hydrogens</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Task Type</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>R-squared (R2)</td>
          <td>Regression</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>ROC-AUC</td>
          <td>Classification</td>
          <td>Higher is better</td>
      </tr>
      <tr>
          <td>Accuracy, RMSE</td>
          <td>Both</td>
          <td>Reported in supplementary Table S1</td>
      </tr>
  </tbody>
</table>
<p>All results averaged over 10 random splits with standard deviations reported.</p>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify hardware requirements (GPU type, training time, or memory usage).</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/zhang-xuan1314/Molecular-graph-BERT">Molecular-graph-BERT</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Jupyter Notebook implementation; last code push August 2021</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhang, X.-C., Wu, C.-K., Yang, Z.-J., Wu, Z.-X., Yi, J.-C., Hsieh, C.-Y., Hou, T.-J., &amp; Cao, D.-S. (2021). MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction. <em>Briefings in Bioinformatics</em>, 22(6), bbab152. <a href="https://doi.org/10.1093/bib/bbab152">https://doi.org/10.1093/bib/bbab152</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhang2021mgbert,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{{MG-BERT}: leveraging unsupervised atomic representation learning for molecular property prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhang, Xiao-Chen and Wu, Cheng-Kun and Yang, Zhi-Jiang and Wu, Zhen-Xing and Yi, Jia-Cai and Hsieh, Chang-Yu and Hou, Ting-Jun and Cao, Dong-Sheng}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Briefings in Bioinformatics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{6}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{bbab152}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Oxford University Press}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1093/bib/bbab152}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>DMP: Dual-View Molecule Pre-training (SMILES+GNN)</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/dual-view-molecule-pretraining/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/dual-view-molecule-pretraining/</guid><description>DMP pre-trains molecular encoders using both SMILES Transformer and GNN branches with a BYOL-style dual-view consistency loss for property prediction.</description><content:encoded><![CDATA[<h2 id="a-dual-branch-pre-training-method-for-molecular-property-prediction">A Dual-Branch Pre-training Method for Molecular Property Prediction</h2>
<p>DMP (Dual-view Molecule Pre-training) is a <strong>Method</strong> paper that introduces a pre-training framework combining two complementary molecular encoders: a Transformer operating on <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> strings and a Graph Neural Network (GNN) operating on molecular graphs. The two branches are trained jointly with masked language modeling (MLM) objectives plus a BYOL-style dual-view consistency loss. After pre-training on 10M <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a> molecules, either branch (or both) can be fine-tuned for downstream tasks. The authors recommend the Transformer branch based on empirical results. DMP achieves the best reported performance on 7 of 9 <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> classification tasks and 3 retrosynthesis benchmarks (at the time of the 2021 arXiv version).</p>
<h2 id="why-combine-smiles-and-graph-views-for-molecules">Why Combine SMILES and Graph Views for Molecules</h2>
<p>Prior molecule pre-training methods used either graph representations with GNNs or SMILES representations with Transformers, but not both. The authors observe that the two views are complementary: Transformers handle molecules with large atom distances (long chains) well, while GNNs handle molecules with many concatenated rings better. Neither model alone captures the full range of molecular structures effectively.</p>
<p>Existing GNN-based pre-training methods (Hu et al. 2020, MolCLR, GROVER) and SMILES-based methods (<a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta/">ChemBERTa</a>, <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/smiles-bert/">SMILES-BERT</a>) each have blind spots dictated by their input representation. DMP addresses this by pre-training both views simultaneously and enforcing representation consistency between them, so each branch benefits from the structural knowledge of the other.</p>
<h2 id="dual-view-consistency-with-byol-style-training">Dual-View Consistency with BYOL-Style Training</h2>
<p>The core innovation is the dual-view consistency objective, inspired by Bootstrap Your Own Latent (BYOL). Given a molecule $M$ with SMILES representation $M_s$ and graph representation $M_g$, DMP obtains high-level features from each branch:</p>
<ul>
<li><strong>Transformer branch</strong>: A RoBERTa-base model encodes the SMILES sequence. The [CLS] token output serves as the molecule representation $f_s$.</li>
<li><strong>GNN branch</strong>: A DeeperGCN network encodes the molecular graph. Mean+max pooling over atom representations yields $f_g$.</li>
</ul>
<p>The dual-view consistency loss uses nonlinear projection heads $\psi_g, \psi_s$ and prediction heads $\rho_g, \rho_s$:</p>
<p>$$
p_g = \psi_g(f_g), \quad q_g = \rho_g(p_g); \quad p_s = \psi_s(f_s), \quad q_s = \rho_s(p_s)
$$</p>
<p>The consistency loss maximizes cross-view <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> with stop-gradient (SG) on the target:</p>
<p>$$
\ell_{\text{dual}}(\tilde{M}_g, \tilde{M}_s) = -\cos(q_s, \text{SG}(p_g)) - \cos(q_g, \text{SG}(p_s))
$$</p>
<p>where $\cos(p, q) = \frac{p^\top q}{\|p\|_2 \|q\|_2}$ and $\tilde{M}_g, \tilde{M}_s$ are the masked versions of the inputs. The stop-gradient prevents representation collapse without requiring negative samples or a momentum encoder.</p>
<p>The full training objective combines three losses:</p>
<ol>
<li><strong>MLM on Transformer</strong>: Recover masked tokens in SMILES sequences</li>
<li><strong>MLM on GNN</strong>: Recover masked atoms in molecular graphs</li>
<li><strong>Dual-view consistency</strong>: The BYOL-style loss above</li>
</ol>
<p>Both MLM objectives and the consistency loss are necessary. Ablations show that removing MLM (using only dual-view loss) degrades performance, and using two branches of the same type (two Transformers or two GNNs) is less effective than the heterogeneous Transformer+GNN combination.</p>
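<p>A minimal NumPy sketch of the symmetric consistency loss is below. It is illustrative only: the heads here are plain callables, and the stop-gradient is implicit because a NumPy forward pass computes no gradients at all; in the actual training code, SG would detach the targets $p_s, p_g$ from the graph.</p>
<pre><code class="language-python">import numpy as np

def cosine(p, q):
    """Cosine similarity between two vectors."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def dual_view_loss(f_s, f_g, psi_s, psi_g, rho_s, rho_g):
    """BYOL-style dual-view consistency loss.

    f_s, f_g: SMILES- and graph-branch features; psi_* projection heads,
    rho_* prediction heads. Each branch's prediction is pulled toward the
    other branch's (stop-gradient) projection.
    """
    p_s, p_g = psi_s(f_s), psi_g(f_g)
    q_s, q_g = rho_s(p_s), rho_g(p_g)
    return -cosine(q_s, p_g) - cosine(q_g, p_s)

# With identity heads and identical features, agreement is perfect
# and the loss reaches its minimum of -2.
rng = np.random.default_rng(1)
f = rng.normal(size=8)
ident = lambda x: x
loss = dual_view_loss(f, f, ident, ident, ident, ident)
</code></pre>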
<h2 id="experiments-on-moleculenet-and-retrosynthesis">Experiments on MoleculeNet and Retrosynthesis</h2>
<h3 id="pre-training-setup">Pre-training Setup</h3>
<p>DMP is pre-trained on 10M molecules from PubChem (matching prior work). The Transformer branch uses RoBERTa-base (12 layers, hidden dim 768, 87M parameters). The GNN branch uses DeeperGCN (12 layers, hidden dim 384, 7.4M parameters). Combined, DMP has 104.1M parameters. Training runs for 200K iterations on 8 V100 GPUs over 3.8 days with Adam optimizer (lr = 5e-4, weight decay 0.01).</p>
<h3 id="molecular-property-prediction-moleculenet">Molecular Property Prediction (MoleculeNet)</h3>
<p>DMP is evaluated on 6 binary classification tasks (BBBP, Tox21, ClinTox, HIV, BACE, SIDER) using official DeepChem splits, and on 6 further tasks using scaffold splits from GROVER: BBBP, SIDER, and ClinTox classification plus ESOL, QM7, and QM8 regression.</p>
<p>Key results on DeepChem splits (ROC-AUC %):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MolCLR</th>
          <th>TF (MLM)</th>
          <th>DMP_TF</th>
          <th>DMP_TF+GNN</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP</td>
          <td>73.6</td>
          <td>74.9</td>
          <td><strong>78.1</strong></td>
          <td>77.8</td>
      </tr>
      <tr>
          <td>Tox21</td>
          <td>79.8</td>
          <td>77.6</td>
          <td><strong>78.8</strong></td>
          <td>79.1</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>93.2</td>
          <td>92.9</td>
          <td><strong>95.0</strong></td>
          <td>95.6</td>
      </tr>
      <tr>
          <td>HIV</td>
          <td>80.6</td>
          <td>80.2</td>
          <td><strong>81.0</strong></td>
          <td>81.4</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>89.0</td>
          <td>88.0</td>
          <td><strong>89.3</strong></td>
          <td>89.4</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>68.0</td>
          <td>68.4</td>
          <td><strong>69.2</strong></td>
          <td>69.8</td>
      </tr>
  </tbody>
</table>
<p>On scaffold splits (comparison with GROVER and MPG):</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>GROVER</th>
          <th>MPG</th>
          <th>DMP_TF</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>BBBP (AUC)</td>
          <td>0.940</td>
          <td>0.922</td>
          <td><strong>0.945</strong></td>
      </tr>
      <tr>
          <td>SIDER (AUC)</td>
          <td>0.658</td>
          <td>0.661</td>
          <td><strong>0.695</strong></td>
      </tr>
      <tr>
          <td>ClinTox (AUC)</td>
          <td>0.944</td>
          <td>0.963</td>
          <td><strong>0.968</strong></td>
      </tr>
      <tr>
          <td>ESOL (RMSE)</td>
          <td>0.831</td>
          <td>0.741</td>
          <td><strong>0.700</strong></td>
      </tr>
      <tr>
          <td>QM7 (MAE)</td>
          <td>72.6</td>
          <td>-</td>
          <td><strong>69.6</strong></td>
      </tr>
      <tr>
          <td>QM8 (MAE)</td>
          <td>0.0125</td>
          <td>-</td>
          <td><strong>0.0124</strong></td>
      </tr>
  </tbody>
</table>
<h3 id="retrosynthesis">Retrosynthesis</h3>
<p>DMP is tested on USPTO-50K (reaction type known/unknown) and USPTO-full. Using a &ldquo;DMP fusion&rdquo; approach (fusing pre-trained representations into a Transformer encoder-decoder for <a href="/notes/computational-chemistry/chemical-language-models/reaction-prediction/">retrosynthesis</a>), DMP improves top-1 accuracy by 2-3 points over the baseline Transformer across all settings:</p>
<table>
  <thead>
      <tr>
          <th>Setting</th>
          <th>Transformer</th>
          <th>ChemBERTa fusion</th>
          <th>DMP fusion</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>USPTO-50K (unknown)</td>
          <td>42.3</td>
          <td>43.9</td>
          <td><strong>46.1</strong></td>
      </tr>
      <tr>
          <td>USPTO-50K (known)</td>
          <td>54.2</td>
          <td>56.4</td>
          <td><strong>57.5</strong></td>
      </tr>
      <tr>
          <td>USPTO-full</td>
          <td>42.9</td>
          <td>-</td>
          <td><strong>45.0</strong></td>
      </tr>
  </tbody>
</table>
<p>For GNN-based retrosynthesis, replacing GLN&rsquo;s GNN modules with DMP&rsquo;s pre-trained GNN branch improves top-1 accuracy from 52.5% to 54.2% (unknown type) and from 64.2% to 66.5% (known type).</p>
<h3 id="representation-quality">Representation Quality</h3>
<p><a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-SNE</a> visualization of pre-trained representations shows that DMP produces better scaffold-based clustering than either GNN-only or Transformer-only pre-training. The <a href="https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index">Davies-Bouldin index</a> improves from 3.56 (GNN) and 3.59 (Transformer) to 2.19 (DMP), indicating much tighter within-scaffold clusters.</p>
<h2 id="key-findings-and-limitations">Key Findings and Limitations</h2>
<p><strong>Key findings:</strong></p>
<ul>
<li>Combining heterogeneous views (SMILES + graph) during pre-training is more effective than using two branches of the same type. TF(x2) and GNN(x2) variants show smaller gains.</li>
<li>Both MLM and dual-view consistency loss contribute. Removing MLM (dual-view only) hurts performance, especially on BBBP (71.1 vs 78.1 with both losses).</li>
<li>The Transformer branch alone is recommended for downstream tasks, as it achieves strong results without adding GNN parameters at inference time.</li>
<li>Scaling pre-training data from 10M to 100M compounds yields marginal additional improvement.</li>
</ul>
<p><strong>Limitations acknowledged by the authors:</strong></p>
<ol>
<li>Training cost is higher than single-branch methods (3.8 days vs 2.5 days for TF-only on 8 V100s), since both branches must be trained jointly.</li>
<li>A fixed branch selection strategy is used at inference time. The authors note that a meta-controller for dynamic branch selection per molecule would be preferable.</li>
<li>The GNN branch uses simple atom masking without bond deletion or subgraph removal, leaving room for stronger graph-level pre-training objectives.</li>
</ol>
<p><strong>Relation to co-training:</strong> The authors clarify that DMP differs from classical <a href="https://en.wikipedia.org/wiki/Co-training">co-training</a> (Blum and Mitchell 1998) in that it does not require conditional independence between views and produces a pre-trained model rather than additional labeled data.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem subset</td>
          <td>10M compounds</td>
          <td>Same subset as MolCLR and ChemBERTa</td>
      </tr>
      <tr>
          <td>Pre-training (large)</td>
          <td>PubChem subset</td>
          <td>100M compounds</td>
          <td>Additional scale experiment</td>
      </tr>
      <tr>
          <td>Evaluation (classification)</td>
          <td>MoleculeNet (BBBP, Tox21, ClinTox, HIV, BACE, SIDER)</td>
          <td>1.5K-41K molecules</td>
          <td>Official DeepChem splits</td>
      </tr>
      <tr>
          <td>Evaluation (regression)</td>
          <td>MoleculeNet (ESOL, QM7, QM8)</td>
          <td>Varies</td>
          <td>Scaffold splits from GROVER</td>
      </tr>
      <tr>
          <td>Evaluation (retrosynthesis)</td>
          <td>USPTO-50K, USPTO-full</td>
          <td>50K / 950K reactions</td>
          <td>Splits from Dai et al. (2019)</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Transformer branch</strong>: RoBERTa-base with MLM. SMILES tokenized using regex from Schwaller et al. (2019).</li>
<li><strong>GNN branch</strong>: DeeperGCN with 12 layers, atom masking for MLM.</li>
<li><strong>Dual-view loss</strong>: BYOL-style with 3-layer MLP projection heads and 2-layer MLP prediction heads, stop-gradient on targets.</li>
<li><strong>Optimizer</strong>: Adam (lr=5e-4, beta1=0.9, beta2=0.98, epsilon=1e-6), weight decay 0.01, 10K warmup steps, linear decay.</li>
</ul>
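<p>The dual-view loss described above can be sketched compactly. The following is an illustrative PyTorch reconstruction, not the authors&rsquo; code: head sizes, the cosine-similarity form, and the symmetric weighting are assumptions, and <code>z_tf</code>/<code>z_gnn</code> stand for the projected outputs of the Transformer and GNN branches:</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(dim_in, dim_hidden, dim_out, n_layers):
    """Small MLP standing in for the projection/prediction heads
    (layer sizes here are illustrative, not the paper's)."""
    layers, d = [], dim_in
    for _ in range(n_layers - 1):
        layers += [nn.Linear(d, dim_hidden), nn.ReLU()]
        d = dim_hidden
    layers.append(nn.Linear(d, dim_out))
    return nn.Sequential(*layers)

def byol_dual_view_loss(z_tf, z_gnn, pred_tf, pred_gnn):
    """BYOL-style objective: each branch predicts the other branch's
    projected view, with a stop-gradient (detach) on the target side."""
    p_tf, p_gnn = pred_tf(z_tf), pred_gnn(z_gnn)
    loss_tf = -F.cosine_similarity(p_tf, z_gnn.detach(), dim=-1).mean()
    loss_gnn = -F.cosine_similarity(p_gnn, z_tf.detach(), dim=-1).mean()
    return 0.5 * (loss_tf + loss_gnn)
```

<p>The detach calls are the essential ingredient: without the stop-gradient on targets, BYOL-style objectives are prone to representation collapse.</p>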
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Architecture</th>
          <th>Parameters</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Transformer branch</td>
          <td>RoBERTa-base (12L, 768H, 12 heads)</td>
          <td>87M</td>
      </tr>
      <tr>
          <td>GNN branch</td>
          <td>DeeperGCN (12L, 384H)</td>
          <td>7.4M</td>
      </tr>
      <tr>
          <td>DMP (total)</td>
          <td>Transformer + GNN + projection/prediction heads</td>
          <td>104.1M</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Classification: ROC-AUC, averaged over 3 random seeds</li>
<li>Regression: RMSE (ESOL) or MAE (QM7, QM8)</li>
<li>Retrosynthesis: Top-k exact match accuracy (k=1,3,5,10,20,50)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Pre-training: 8 NVIDIA V100 GPUs, batch size 12288 tokens, gradient accumulation 16x</li>
<li>Pre-training time: 3.8 days (DMP), 2.5 days (TF-only), 1.7 days (GNN-only)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<p>No public code repository or pre-trained model weights were identified for this paper. The paper references GLN&rsquo;s code repository (<a href="https://github.com/Hanjun-Dai/GLN">https://github.com/Hanjun-Dai/GLN</a>) for the retrosynthesis baseline but does not release DMP-specific code.</p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Hanjun-Dai/GLN">GLN (baseline)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Retrosynthesis baseline, not DMP code</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhu, J., Xia, Y., Wu, L., Xie, S., Zhou, W., Qin, T., Li, H., &amp; Liu, T.-Y. (2023). Dual-view Molecular Pre-training. In <em>Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</em> (pp. 3615-3627). <a href="https://doi.org/10.1145/3580305.3599317">https://doi.org/10.1145/3580305.3599317</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{zhu2023dualview,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Dual-view Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhu, Jinhua and Xia, Yingce and Wu, Lijun and Xie, Shufang and Zhou, Wengang and Qin, Tao and Li, Houqiang and Liu, Tie-Yan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{3615--3627}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3580305.3599317}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>SPMM: A Bidirectional Molecular Foundation Model</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/spmm-bidirectional-structure-property/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/spmm-bidirectional-structure-property/</guid><description>SPMM is a multimodal molecular foundation model that aligns SMILES structures with property vectors for bidirectional generation and prediction tasks.</description><content:encoded><![CDATA[<h2 id="a-multimodal-foundation-model-for-structure-property-comprehension">A Multimodal Foundation Model for Structure-Property Comprehension</h2>
<p>This is a <strong>Method</strong> paper that introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is enabling bidirectional generation through a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks including MoleculeNet benchmarks and reaction prediction.</p>
<h2 id="bridging-the-gap-between-molecular-structure-and-properties">Bridging the Gap Between Molecular Structure and Properties</h2>
<p>Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.</p>
<p>The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.</p>
<h2 id="treating-property-vectors-as-a-language">Treating Property Vectors as a Language</h2>
<p>The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a &ldquo;language&rdquo; where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.</p>
<h3 id="dual-stream-architecture">Dual-Stream Architecture</h3>
<p>SPMM follows the dual-stream VLP architecture. The model has three components:</p>
<ol>
<li><strong>SMILES Encoder</strong>: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention</li>
<li><strong>PV Encoder</strong>: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings</li>
<li><strong>Fusion Encoder</strong>: 6 BERT-base layers with cross-attention that combines both modalities, using one modality&rsquo;s features as queries and the other as keys/values</li>
</ol>
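<p>The fusion encoder&rsquo;s cross-attention step can be illustrated with a single attention layer. This is a minimal sketch, not the released implementation; the dimensions (768 hidden, 12 heads) follow BERT-base as described above, and the batch and sequence sizes are arbitrary:</p>

```python
import torch
import torch.nn as nn

# One modality's features act as queries, the other's as keys/values.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)

smiles_feats = torch.randn(2, 40, 768)  # (batch, SMILES tokens, hidden)
pv_feats = torch.randn(2, 53, 768)      # (batch, 53 property tokens, hidden)

# SMILES queries attend over property keys/values; the output keeps the
# query sequence length, so each SMILES token gathers property context.
fused, weights = attn(query=smiles_feats, key=pv_feats, value=pv_feats)
print(fused.shape)  # torch.Size([2, 40, 768])
```

<p>Swapping <code>query</code> and <code>key</code>/<code>value</code> gives the opposite direction, with property tokens attending over SMILES features.</p>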
<h3 id="pre-training-objectives">Pre-training Objectives</h3>
<p>The model is pre-trained with four complementary losses:</p>
<p><strong>Contrastive Learning</strong> aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:</p>
<p>$$
\text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls})
$$</p>
<p>The intermodal similarities are computed with a learnable temperature $\tau$:</p>
<p>$$
s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)}
$$</p>
<p>The contrastive loss uses cross-entropy with one-hot labels (1 for same-molecule pairs):</p>
<p>$$
L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right)
$$</p>
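<p>In code, the SMILES-to-PV direction of this loss reduces to a softmax cross-entropy over the batch. A NumPy sketch follows; the p2s and intra-modal terms are symmetric, and $\tau$ is fixed here for illustration whereas SPMM learns it:</p>

```python
import numpy as np

def contrastive_loss(S, P, tau=0.07):
    """s2p InfoNCE-style loss over a batch of projected [CLS] features:
    row i of S and row i of P come from the same molecule (the positive
    pair); all other rows in the batch serve as negatives."""
    # L2-normalize so sim() is a dot product of unit vectors.
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    logits = S @ P.T / tau                        # (N, N) similarity matrix
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with one-hot labels = negative diagonal log-probability.
    return -np.mean(np.diag(log_probs))
```

<p>For perfectly aligned features (<code>P == S</code>) the diagonal dominates and the loss approaches zero; for random pairs it sits near $\log N$.</p>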
<p><strong>Next Word Prediction (NWP)</strong> trains autoregressive SMILES generation conditioned on the PV:</p>
<p>$$
L_{NWP} = \sum_{i=1}^{n} H\left(y_{i}^{NWP}, p^{NWP}(s_{i} \mid s_{0:i-1}, \mathbf{P})\right)
$$</p>
<p><strong>Next Property Prediction (NPP)</strong> applies the same autoregressive concept to property values, using mean-square-error loss:</p>
<p>$$
L_{NPP} = \sum_{i=1}^{n} \left(p_{i} - \hat{p}_{i}(p_{0:i-1}, \mathbf{S})\right)^{2}
$$</p>
<p><strong>SMILES-PV Matching (SPM)</strong> is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.</p>
<p>The overall pre-training loss combines all four:</p>
<p>$$
L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM}
$$</p>
<p>where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.</p>
<h3 id="random-property-masking">Random Property Masking</h3>
<p>During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.</p>
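<p>The masking step itself is simple. A sketch, assuming a sentinel value stands in for the [UNK] token (in the actual model, [UNK] is a learned embedding, not a numeric placeholder):</p>

```python
import numpy as np

UNK = -999.0  # sentinel standing in for the [UNK] token (assumption)

def mask_properties(pv, p_mask=0.5, rng=None):
    """Randomly replace ~50% of the 53 normalized property values with the
    [UNK] placeholder. At inference the same mechanism lets a user condition
    on any subset of the properties by masking the rest."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(pv.shape) < p_mask
    out = pv.copy()
    out[mask] = UNK
    return out, mask
```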
<h2 id="experiments-across-bidirectional-and-unimodal-tasks">Experiments Across Bidirectional and Unimodal Tasks</h2>
<h3 id="pv-to-smiles-generation-conditional-molecule-design">PV-to-SMILES Generation (Conditional Molecule Design)</h3>
<p>The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:</p>
<table>
  <thead>
      <tr>
          <th>Sampling</th>
          <th>Input PV</th>
          <th>Validity</th>
          <th>Uniqueness</th>
          <th>Novelty</th>
          <th>Norm. RMSE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Deterministic</td>
          <td>1000 unseen PVs</td>
          <td>0.995</td>
          <td>0.999</td>
          <td>0.961</td>
          <td>0.216</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Full PV (molecule 1)</td>
          <td>0.974</td>
          <td>0.905</td>
          <td>0.998</td>
          <td>0.185</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>Molar mass = 150</td>
          <td>0.974</td>
          <td>0.945</td>
          <td>0.872</td>
          <td>0.192</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>4 properties controlled</td>
          <td>0.998</td>
          <td>0.981</td>
          <td>0.952</td>
          <td>0.257</td>
      </tr>
      <tr>
          <td>Stochastic</td>
          <td>No control (all [UNK])</td>
          <td>0.971</td>
          <td>0.991</td>
          <td>0.950</td>
          <td>-</td>
      </tr>
  </tbody>
</table>
<p>The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).</p>
<h3 id="smiles-to-pv-generation-multi-property-prediction">SMILES-to-PV Generation (Multi-Property Prediction)</h3>
<p>When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.</p>
<h3 id="moleculenet-benchmarks">MoleculeNet Benchmarks</h3>
<p>Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a> tasks:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Metric</th>
          <th>SPMM</th>
          <th>Best Baseline</th>
          <th>Baseline Model</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ESOL</td>
          <td>RMSE</td>
          <td>0.817</td>
          <td>0.798</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>LIPO</td>
          <td>RMSE</td>
          <td>0.681</td>
          <td>0.660</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>FreeSolv</td>
          <td>RMSE</td>
          <td>1.868</td>
          <td>1.877</td>
          <td>ChemRL-GEM</td>
      </tr>
      <tr>
          <td>BACE (reg)</td>
          <td>RMSE</td>
          <td>1.041</td>
          <td>1.047</td>
          <td><a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/molformer/">MolFormer</a></td>
      </tr>
      <tr>
          <td>Clearance</td>
          <td>RMSE</td>
          <td>42.607</td>
          <td>43.175</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>AUROC</td>
          <td>75.1%</td>
          <td>73.6%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>BACE (cls)</td>
          <td>AUROC</td>
          <td>84.4%</td>
          <td>86.3%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>ClinTox</td>
          <td>AUROC</td>
          <td>92.7%</td>
          <td>91.2%</td>
          <td>MolFormer</td>
      </tr>
      <tr>
          <td>SIDER</td>
          <td>AUROC</td>
          <td>66.9%</td>
          <td>67.2%</td>
          <td>ChemRL-GEM</td>
      </tr>
  </tbody>
</table>
<p>SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.</p>
<h3 id="dili-classification">DILI Classification</h3>
<p>On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.</p>
<h3 id="reaction-prediction">Reaction Prediction</h3>
<p>On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including <a href="/notes/computational-chemistry/chemical-language-models/molecular-generation/autoregressive/chemformer/">Chemformer</a> at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.</p>
<h2 id="bidirectional-generation-from-a-single-pre-trained-model">Bidirectional Generation From a Single Pre-trained Model</h2>
<p>SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:</p>
<ol>
<li><strong>Flexible conditional generation</strong>: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.</li>
<li><strong>Interpretable cross-attention</strong>: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).</li>
<li><strong>Competitive unimodal transfer</strong>: Despite using only 6 BERT layers and 50M pre-training molecules (smaller than <a href="/notes/computational-chemistry/chemical-language-models/molecular-encoders/chemberta-2/">ChemBERTa-2</a>&rsquo;s 77M or Chemformer&rsquo;s 100M), the SMILES encoder alone achieves best or second-best results on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.</li>
</ol>
<h3 id="limitations">Limitations</h3>
<p>The authors acknowledge several limitations:</p>
<ul>
<li><strong>SMILES representation constraints</strong>: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.</li>
<li><strong>Stereochemistry blindness</strong>: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.</li>
<li><strong>No wet-lab validation</strong>: Generated molecules and predicted properties are not experimentally verified.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training</td>
          <td>PubChem</td>
          <td>50M molecules</td>
          <td>SMILES + 53 RDKit properties</td>
      </tr>
      <tr>
          <td>Property prediction</td>
          <td>MoleculeNet (9 tasks)</td>
          <td>642-4200 per task</td>
          <td>Scaffold split via DeepChem (8:1:1)</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>Ai et al. dataset</td>
          <td>Not specified</td>
          <td>Following published preparation</td>
      </tr>
      <tr>
          <td>Forward reaction</td>
          <td>USPTO-480k</td>
          <td>479,035 pairs</td>
          <td>Reactant-product pairs</td>
      </tr>
      <tr>
          <td>Retro reaction</td>
          <td>USPTO-50k</td>
          <td>50,037 pairs</td>
          <td>Product-reactant pairs, no reaction types used</td>
      </tr>
      <tr>
          <td>SMILES-to-PV test</td>
          <td>ZINC15</td>
          <td>1000 molecules</td>
          <td>Not in pre-training set</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Tokenization</strong>: BPE with 300-subword dictionary</li>
<li><strong>Property masking</strong>: 50% random replacement with [UNK] during pre-training</li>
<li><strong>Momentum distillation</strong>: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over first epoch</li>
<li><strong>Contrastive queue</strong>: Size $k = 24{,}576$ for storing recent SMILES and PV instances</li>
<li><strong>Beam search</strong>: $k = 2$ for PV-to-SMILES generation</li>
<li><strong>SMILES augmentation</strong>: Random non-canonical augmentation with probability 0.5 for reaction tasks</li>
</ul>
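<p>The momentum teacher behind the distillation entry above is maintained by a plain exponential moving average of the student&rsquo;s parameters, with $\lambda = 0.995$. A minimal sketch over flat parameter lists:</p>

```python
def ema_update(teacher_params, student_params, lam=0.995):
    """Momentum-teacher update (sketch): after each student step, teacher
    parameters move toward the student by a factor of (1 - lam)."""
    return [lam * t + (1.0 - lam) * s
            for t, s in zip(teacher_params, student_params)]
```

<p>The teacher&rsquo;s soft outputs then replace the hard one-hot contrastive labels, which is how multiple valid SMILES-PV pairings are accommodated.</p>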
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture</strong>: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)</li>
<li><strong>Vocabulary</strong>: 300 BPE subwords for SMILES; 53 property tokens for PV</li>
<li><strong>Pre-trained weights</strong>: Available via GitHub</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>Value</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Validity</td>
          <td>99.5%</td>
          <td>1000 unseen PubChem PVs</td>
      </tr>
      <tr>
          <td>PV-to-SMILES (deterministic)</td>
          <td>Normalized RMSE</td>
          <td>0.216</td>
          <td>Across 53 properties</td>
      </tr>
      <tr>
          <td>SMILES-to-PV</td>
          <td>Mean $r^{2}$</td>
          <td>0.924</td>
          <td>1000 ZINC15 molecules</td>
      </tr>
      <tr>
          <td>Forward reaction (USPTO-480k)</td>
          <td>Top-1 accuracy</td>
          <td>91.5%</td>
          <td>Best among all tested models</td>
      </tr>
      <tr>
          <td>Retro reaction (USPTO-50k)</td>
          <td>Top-1 accuracy</td>
          <td>53.4%</td>
          <td>Second-best string-based</td>
      </tr>
      <tr>
          <td>DILI classification</td>
          <td>AUROC</td>
          <td>92.6%</td>
          <td>Single model vs. 5-ensemble</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Pre-training</strong>: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours</li>
<li><strong>Batch size</strong>: 96</li>
<li><strong>Optimizer</strong>: AdamW with weight decay 0.02</li>
<li><strong>Learning rate</strong>: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/jinhojsk515/SPMM">SPMM Source Code</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Official implementation with experimental scripts</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.5281/zenodo.10567599">SPMM Zenodo Archive</a></td>
          <td>Code</td>
          <td>Apache-2.0</td>
          <td>Archived version for reproducibility</td>
      </tr>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>50M molecules for pre-training</td>
      </tr>
      <tr>
          <td><a href="https://moleculenet.org/">MoleculeNet</a></td>
          <td>Dataset</td>
          <td>Varies</td>
          <td>Benchmark datasets via DeepChem</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Chang, J., &amp; Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. <em>Nature Communications</em>, 15, 2323. <a href="https://doi.org/10.1038/s41467-024-46440-3">https://doi.org/10.1038/s41467-024-46440-3</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{chang2024bidirectional,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Bidirectional generation of structure and properties through a single molecular foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Chang, Jinho and Ye, Jong Chul}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Nature Communications}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{2323}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1038/s41467-024-46440-3}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>nach0: A Multimodal Chemical and NLP Foundation Model</title><link>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/nach0-multimodal-chemical-language-model/</link><pubDate>Thu, 26 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/computational-chemistry/chemical-language-models/multimodal-molecular/nach0-multimodal-chemical-language-model/</guid><description>nach0 is a T5-based encoder-decoder model pre-trained on SMILES, scientific text, and patents, then instruction-tuned for chemical and NLP tasks.</description><content:encoded><![CDATA[<h2 id="a-multi-domain-encoder-decoder-for-chemistry-and-nlp">A Multi-Domain Encoder-Decoder for Chemistry and NLP</h2>
<p>nach0 is a <strong>Method</strong> paper that introduces a unified encoder-decoder foundation model capable of handling both natural language processing (NLP) tasks and chemistry tasks within a single architecture. The primary contribution is demonstrating that a T5-based model pre-trained on scientific text, patents, and <a href="/notes/computational-chemistry/molecular-representations/smiles/">SMILES</a> molecular strings can be instruction-tuned to perform molecular property prediction, reaction prediction, molecular generation, named entity recognition, question answering, and cross-domain translation (text-to-molecule and molecule-to-text) simultaneously. The model is available in base (250M parameters) and large (780M parameters) configurations.</p>
<h2 id="bridging-chemical-and-linguistic-representations">Bridging Chemical and Linguistic Representations</h2>
<p>Most existing biomedical language models (BioBERT, SciFive, BioMegatron) are trained exclusively on natural language text from sources like PubMed, omitting chemical structure information encoded in SMILES strings. Conversely, chemistry-specific models trained on SMILES data often lack the ability to process natural language instructions or perform NLP tasks. Models like <a href="/notes/computational-chemistry/llms-for-chemistry/galactica-large-language-model-for-science/">Galactica</a> and MolT5 attempted to bridge this gap by training on both natural language and chemical data, but they were not fine-tuned on a diverse set of chemical tasks using instruction tuning in a multi-task fashion.</p>
<p>nach0 addresses this by creating a shared representation space for both modalities and fine-tuning across a comprehensive set of tasks spanning three domains: NLP-only tasks, chemistry-only tasks, and cross-domain tasks that require translating between natural language and molecular representations.</p>
<h2 id="unified-text-to-text-framework-with-smiles-tokenization">Unified Text-to-Text Framework with SMILES Tokenization</h2>
<p>The core innovation in nach0 is formulating all chemical and linguistic tasks as text-to-text problems within a single encoder-decoder transformer, combined with a specialized SMILES tokenization strategy.</p>
<h3 id="smiles-token-integration">SMILES Token Integration</h3>
<p>Rather than treating SMILES as plain text, nach0 extends the T5 vocabulary with dedicated SMILES tokens. Each SMILES token is annotated with special symbols in the format <code>&lt;sm_{token}&gt;</code>, creating a distinct vocabulary space for molecular representations while preserving the natural language vocabulary from FLAN-T5. The embedding matrix reuses the learned embeddings of the pre-trained model for the original tokens, while the newly added chemical tokens are initialized by copying the first entries of the original embedding matrix.</p>
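<p>The <code>&lt;sm_{token}&gt;</code> wrapping is easy to sketch. The tokenizer below uses the regex style popularized by Schwaller et al. (2019) as an illustrative stand-in; nach0&rsquo;s actual vocabulary construction may differ:</p>

```python
import re

# Regex-based SMILES tokenizer (illustrative): brackets, two-letter
# elements, single atoms, bonds, ring-closure digits, etc.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def to_sm_tokens(smiles):
    """Tokenize a SMILES string and wrap each token as <sm_{token}>,
    keeping chemical tokens distinct from the natural language vocabulary."""
    return [f"<sm_{tok}>" for tok in SMILES_REGEX.findall(smiles)]

print(to_sm_tokens("CCO"))  # ['<sm_C>', '<sm_C>', '<sm_O>']
```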
<h3 id="architecture">Architecture</h3>
<p>Both model sizes use the standard T5 encoder-decoder architecture:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>Parameters</th>
          <th>Layers</th>
          <th>Hidden Size</th>
          <th>FFN Size</th>
          <th>Attention Heads</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Base</td>
          <td>250M</td>
          <td>12</td>
          <td>768</td>
          <td>3072</td>
          <td>12</td>
      </tr>
      <tr>
          <td>Large</td>
          <td>780M</td>
          <td>24</td>
          <td>1024</td>
          <td>4096</td>
          <td>16</td>
      </tr>
  </tbody>
</table>
<h3 id="pre-training-data">Pre-training Data</h3>
<p>The model is pre-trained with a language modeling objective on three data sources:</p>
<table>
  <thead>
      <tr>
          <th>Source</th>
          <th>Documents</th>
          <th>Tokens</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubMed abstracts (chemistry-filtered)</td>
          <td>13M</td>
          <td>355M</td>
      </tr>
      <tr>
          <td>USPTO patent descriptions</td>
          <td>119K</td>
          <td>2.9B</td>
      </tr>
      <tr>
          <td><a href="/notes/computational-chemistry/datasets/zinc-22/">ZINC</a> molecular database</td>
          <td>~100M</td>
          <td>4.7B</td>
      </tr>
  </tbody>
</table>
<h3 id="instruction-tuning">Instruction Tuning</h3>
<p>Following the approach of Raffel et al. and Chung et al., nach0 uses natural language prompts to formulate each task. For example, a retrosynthesis task might be phrased as &ldquo;What reactants could be used to synthesize [SMILES]?&rdquo; and a property prediction task as &ldquo;Can [SMILES] penetrate the <a href="https://en.wikipedia.org/wiki/Blood%E2%80%93brain_barrier">BBB</a>?&rdquo; This enables multi-task training across all domains with a single loss function and shared hyperparameters.</p>
<p>Training uses a batch size of 1024, learning rate of $1 \times 10^{-4}$, and weight decay of 0.01. Pre-training runs for one epoch, and fine-tuning for 10 epochs. Data mixing follows the examples-proportional mixing strategy from T5.</p>
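<p>Examples-proportional mixing caps each task&rsquo;s effective size so that the enormous datasets (e.g. ZINC) do not drown out the small ones. A sketch of the T5-style rate computation; the cap $K = 2^{16}$ here is illustrative, as nach0&rsquo;s exact value is not stated in this summary:</p>

```python
def mixing_rates(dataset_sizes, K=2**16):
    """T5-style examples-proportional mixing: task m is sampled with rate
    min(e_m, K) / sum_n min(e_n, K), where e_m is its example count."""
    capped = {name: min(n, K) for name, n in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: c / total for name, c in capped.items()}

rates = mixing_rates({"ner": 20_000, "retro": 50_000, "zinc_lm": 100_000_000})
```

<p>Without the cap, the 100M-molecule ZINC corpus would dominate nearly every batch; with it, the language-modeling task is sampled only slightly more often than retrosynthesis.</p>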
<h2 id="multi-task-evaluation-across-nlp-and-chemistry-benchmarks">Multi-Task Evaluation Across NLP and Chemistry Benchmarks</h2>
<p>nach0 is evaluated on a comprehensive set of benchmarks spanning three task categories.</p>
<h3 id="task-categories">Task Categories</h3>
<p><strong>NLP tasks</strong>: Named entity recognition (BC5CDR-Chemical, BC5CDR-Disease, NCBI-Disease, BC2GM, JNLPBA), PICO extraction (EBM PICO), textual entailment (MedNLI, SciTail), relation extraction (ChemProt, DDI, GAD), sentence similarity (BIOSSES), document classification (HoC), and question answering (PubMedQA, BioASQ, MedMCQA, MMLU).</p>
<p><strong>Chemistry tasks</strong>: Molecular property prediction (ESOL, FreeSolv, Lipophilicity, BBBP, HIV, BACE from <a href="/notes/computational-chemistry/benchmark-problems/moleculenet-benchmark-molecular-ml/">MoleculeNet</a>; QM9 from Mol-Instructions), molecular generation (<a href="/notes/computational-chemistry/benchmark-problems/molecular-sets-moses/">MOSES</a>), forward reaction prediction, reagent prediction, and <a href="https://en.wikipedia.org/wiki/Retrosynthetic_analysis">retrosynthesis</a> (from Mol-Instructions/USPTO).</p>
<p><strong>Cross-domain tasks</strong>: Description-guided molecule design and molecular description generation (from Mol-Instructions).</p>
<h3 id="baselines">Baselines</h3>
<p>nach0 is compared against FLAN-T5 (250M), SciFive (220M), and MolT5 (220M), all trained in multi-task fashion.</p>
<h3 id="key-results">Key Results</h3>
<p>On chemistry and cross-domain tasks, nach0 base consistently outperforms all base-sized baselines. Selected highlights from Table 3:</p>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>Metric</th>
          <th>MolT5</th>
          <th>SciFive</th>
          <th>FLAN</th>
          <th>nach0 Base</th>
          <th>nach0 Large</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Forward reaction</td>
          <td>Acc@1</td>
          <td>27.0%</td>
          <td>60.0%</td>
          <td>59.0%</td>
          <td>88.0%</td>
          <td>89.9%</td>
      </tr>
      <tr>
          <td>Retrosynthesis</td>
          <td>Acc@1</td>
          <td>15.0%</td>
          <td>31.0%</td>
          <td>31.0%</td>
          <td>53.0%</td>
          <td>56.3%</td>
      </tr>
      <tr>
          <td>Reagent prediction</td>
          <td>Acc@1</td>
          <td>1.1%</td>
          <td>3.8%</td>
          <td>4.0%</td>
          <td>6.3%</td>
          <td>13.1%</td>
      </tr>
      <tr>
          <td>BACE</td>
          <td>BA</td>
          <td>0.58</td>
          <td>0.65</td>
          <td>0.65</td>
          <td>0.74</td>
          <td>0.71</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>BA</td>
          <td>0.55</td>
          <td>0.66</td>
          <td>0.60</td>
          <td>0.67</td>
          <td>0.68</td>
      </tr>
      <tr>
          <td>HFE (FreeSolv)</td>
          <td>R²</td>
          <td>-0.36</td>
          <td>0.51</td>
          <td>0.55</td>
          <td>0.77</td>
          <td>0.78</td>
      </tr>
      <tr>
          <td>MOSES (FCD)</td>
          <td>FCD/Test</td>
          <td>0.521</td>
          <td>0.578</td>
          <td>0.529</td>
          <td>0.311</td>
          <td>0.304</td>
      </tr>
      <tr>
          <td>Description-guided mol. design</td>
          <td>BLEU-2</td>
          <td>30.3%</td>
          <td>44.2%</td>
          <td>43.6%</td>
          <td>49.0%</td>
          <td>48.8%</td>
      </tr>
      <tr>
          <td>Mol. description gen.</td>
          <td>BLEU-2</td>
          <td>35.6%</td>
          <td>39.6%</td>
          <td>38.6%</td>
          <td>43.9%</td>
          <td>41.7%</td>
      </tr>
  </tbody>
</table>
<p>On NLP tasks, nach0 base performs comparably to FLAN base, with the two models trading wins across different tasks. nach0 large improves substantially over nach0 base on most tasks.</p>
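<p>Several of the property-prediction rows above report balanced accuracy (BA), the unweighted mean of per-class recall, which is more informative than raw accuracy on the imbalanced labels typical of datasets like BBBP and BACE. A minimal sketch of the metric (not the paper's evaluation code):</p>

```python
def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the unweighted mean of per-class recall.
    Matches sklearn.metrics.balanced_accuracy_score for these inputs."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(1 for p in preds_for_c if p == c) / len(preds_for_c))
    return sum(recalls) / len(recalls)

# Imbalanced toy labels: plain accuracy (7/8) would hide the weak minority class.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]
print(balanced_accuracy(y_true, y_pred))  # (6/6 + 1/2) / 2 = 0.75
```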
<h3 id="ablation-study">Ablation Study</h3>
<p>The ablation study (Table 4) examines the impact of multi-task training across chemical task groups. Key findings:</p>
<ul>
<li>nach0 trained on all chemical tasks jointly outperforms models trained on individual task groups (prediction-only, reaction-only, or generation-only) on the total set of metrics</li>
<li>The joint model shows lower novelty scores on MOSES than the generation-only model, but this reflects reduced overfitting to the training data rather than worse generative performance</li>
<li>nach0 consistently outperforms MolT5 across all chemical task configurations, demonstrating the benefit of pre-training on both natural language and chemical data with specialized SMILES tokens</li>
</ul>
<h3 id="case-studies">Case Studies</h3>
<p>Two applied case studies demonstrate nach0 in drug discovery scenarios:</p>
<ol>
<li>
<p><strong>End-to-end drug discovery for <a href="https://en.wikipedia.org/wiki/Diabetes">diabetes mellitus</a></strong>: Using a sequence of prompts, nach0 identifies biological targets, analyzes mechanisms of action, generates molecular structures, proposes synthesis routes, and predicts molecular properties.</p>
</li>
<li>
<p><strong><a href="https://en.wikipedia.org/wiki/Janus_kinase_3">JAK3</a> inhibitor generation with Chemistry42</strong>: nach0 replaces 42 specialized generative models in Insilico Medicine&rsquo;s Chemistry42 platform. In 45 minutes, nach0 generates 8 molecules satisfying all 2D and 3D requirements (hinge binding, active-site binding), versus a 0.04% discovery rate from a combinatorial generator run for 24 hours. Chemistry42&rsquo;s full pipeline (72 hours) still produces better structures because it uses reinforcement-learning feedback and explicit structural constraints.</p>
</li>
</ol>
<h3 id="comparison-with-chatgpt">Comparison with ChatGPT</h3>
<p>On a subset evaluation, fine-tuned nach0 base outperforms GPT-3.5-turbo on all tested tasks: EBM PICO (F1: 67.6% vs. 64.4%), MedMCQA-Open (BLEU-2: 6.3% vs. 1.7%), and molecular description generation (BLEU-2: 42.8% vs. 2.2%).</p>
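<p>The BLEU-2 figures quoted throughout are the geometric mean of unigram and bigram precision with a brevity penalty. A simplified single-reference, unsmoothed sketch (actual evaluations typically use a library implementation such as NLTK&rsquo;s <code>sentence_bleu</code>):</p>

```python
import math
from collections import Counter

def bleu2(reference, candidate):
    """Sentence-level BLEU-2: geometric mean of unigram and bigram
    precision with a brevity penalty (single reference, no smoothing)."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in (1, 2):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    brevity = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / 2)

ref = "the molecule is a monocarboxylic acid".split()
print(bleu2(ref, ref))                                # 1.0
print(bleu2(ref, "the molecule is an acid".split()))  # partial overlap, < 1.0
```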
<h2 id="competitive-multi-task-performance-with-clear-limitations">Competitive Multi-Task Performance with Clear Limitations</h2>
<p>nach0 demonstrates that a single encoder-decoder model can achieve competitive results across both chemical and NLP tasks when pre-trained on mixed-modality data and fine-tuned with instruction tuning. The model&rsquo;s strongest advantages appear on chemistry tasks (reaction prediction, property prediction, molecular generation), where specialized SMILES tokenization and chemical pre-training provide clear benefits over general-purpose models of similar scale.</p>
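<p>nach0&rsquo;s vocabulary extends the T5 tokenizer with dedicated SMILES tokens so that molecules are not fragmented by a natural-language subword model. The paper&rsquo;s exact tokenizer is not reproduced here; the widely used atom-level regex (in the style of Schwaller et al.) illustrates the kind of chemically meaningful segmentation involved:</p>

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter elements (Cl, Br),
# ring-closure digits, and bond/branch symbols each become a single token.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|@@?|%\d{2}|[=#\-+\\/:~*$().]|\d)"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "untokenizable characters in input"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```

<p>Note the order of alternatives: <code>Cl</code> and <code>Br</code> must be matched before the single-letter elements, otherwise chlorine would be split into carbon plus a stray <code>l</code>.</p>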
<h3 id="limitations-acknowledged-by-the-authors">Limitations Acknowledged by the Authors</h3>
<ol>
<li>
<p><strong>Not at expert-chemist level</strong>: Human evaluations indicate the model does not match domain-expert performance. Key gaps include chemical reasoning, alignment with domain-specific knowledge graphs, and the ability to learn from expert feedback.</p>
</li>
<li>
<p><strong>SMILES-only molecular representation</strong>: The model lacks 3D geometric information. SMILES notation is not one-to-one with molecular structures, and the model does not incorporate molecular graphs or 3D coordinates. The authors suggest <a href="/notes/computational-chemistry/molecular-representations/selfies/">SELFIES</a> as a potential alternative representation.</p>
</li>
<li>
<p><strong>Prompt sensitivity</strong>: Performance depends on prompt quality and specificity. Over-reliance on domain-specific prompts may limit response diversity.</p>
</li>
<li>
<p><strong>Limited chemical diversity</strong>: Cross-domain datasets from Mol-Instructions primarily cover known drugs and chemical probes from <a href="https://en.wikipedia.org/wiki/PubChem">PubChem</a>, representing only a fraction of predicted chemical space.</p>
</li>
</ol>
<h3 id="future-directions">Future Directions</h3>
<p>The authors propose extending nach0 with protein sequence modalities (using <a href="/notes/computational-chemistry/molecular-representations/group-selfies-fragment-molecular-representation/">Group SELFIES</a>), expanding zero-shot evaluation capabilities, and integrating knowledge graph information through self-supervised approaches.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pre-training (text)</td>
          <td>PubMed abstracts</td>
          <td>13M docs, 355M tokens</td>
          <td>Filtered for chemistry-related content</td>
      </tr>
      <tr>
          <td>Pre-training (text)</td>
          <td>USPTO patents</td>
          <td>119K docs, 2.9B tokens</td>
          <td>Patent descriptions</td>
      </tr>
      <tr>
          <td>Pre-training (chemical)</td>
          <td>ZINC</td>
          <td>~100M docs, 4.7B tokens</td>
          <td>Molecular SMILES strings</td>
      </tr>
      <tr>
          <td>Fine-tuning (NLP)</td>
          <td>17 NLP datasets</td>
          <td>Varies</td>
          <td>See Table 1 in paper</td>
      </tr>
      <tr>
          <td>Fine-tuning (chemistry)</td>
          <td>MoleculeNet, MOSES, Mol-Instructions</td>
          <td>Varies</td>
          <td>Predefined or random splits</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Architecture: T5 encoder-decoder (base: 250M, large: 780M parameters)</li>
<li>Pre-training objective: Language modeling (masked span prediction)</li>
<li>Fine-tuning: Multi-task instruction tuning with examples-proportional mixing</li>
<li>Hyperparameters: batch size 1024, learning rate $1 \times 10^{-4}$, weight decay 0.01</li>
<li>Pre-training: 1 epoch; fine-tuning: 10 epochs</li>
</ul>
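<p>Examples-proportional mixing samples each fine-tuning task in proportion to its dataset size, usually with a cap so that huge datasets do not drown out small ones. The sizes and cap below are illustrative assumptions, not nach0&rsquo;s actual values:</p>

```python
def mixing_rates(dataset_sizes, limit=None):
    """Examples-proportional mixing (T5-style): task m is sampled with
    probability min(e_m, limit) / sum_k min(e_k, limit), where e_m is
    that task's number of training examples."""
    capped = {name: min(n, limit) if limit is not None else n
              for name, n in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: n / total for name, n in capped.items()}

# Hypothetical sizes -- the cap keeps BBBP from vanishing next to MOSES.
sizes = {"reaction_prediction": 1_000_000, "bbbp": 2_000, "moses": 1_600_000}
print(mixing_rates(sizes, limit=500_000))
```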
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_base">nach0 Base (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>250M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/insilicomedicine/nach0_large">nach0 Large (HuggingFace)</a></td>
          <td>Model</td>
          <td>CC-BY-NC-4.0</td>
          <td>780M parameter encoder-decoder</td>
      </tr>
      <tr>
          <td><a href="https://github.com/insilicomedicine/nach0">nach0 GitHub Repository</a></td>
          <td>Code</td>
          <td>Not specified</td>
          <td>Training and inference code</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<p>Evaluation spans 17+ NLP benchmarks and 10+ chemistry benchmarks. Metrics include F1 (NER, RE, classification), accuracy (QA, entailment, reaction prediction), balanced accuracy (molecular property classification), R²/RMSE (regression), BLEU-2 (generation), and FCD/SNN/validity/novelty (molecular generation via MOSES).</p>
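<p>The coefficient of determination R² used for the regression tasks can be negative when a model predicts worse than a constant mean-value baseline, which is how MolT5&rsquo;s FreeSolv score in Table 3 ends up at &minus;0.36. A minimal sketch:</p>

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot; 1.0 is perfect, 0.0 matches the
    mean-only baseline, and values below 0 are worse than that baseline."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
print(r_squared([1.0, 2.0, 3.0], [3.0, 3.0, 3.0]))  # -1.5 (worse than the mean)
```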
<h3 id="hardware">Hardware</h3>
<ul>
<li>Base models: NVIDIA A4000 and A5000 GPUs</li>
<li>Large models: NVIDIA DGX cloud platform</li>
<li>Training used tensor and pipeline parallelism via NeMo toolkit</li>
<li>Specific GPU counts and training times not reported</li>
</ul>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Livne, M., Miftahutdinov, Z., Tutubalina, E., Kuznetsov, M., Polykovskiy, D., Brundyn, A., Jhunjhunwala, A., Costa, A., Aliper, A., Aspuru-Guzik, A., &amp; Zhavoronkov, A. (2024). nach0: Multimodal Natural and Chemical Languages Foundation Model. <em>Chemical Science</em>, 15(22), 8380-8389. <a href="https://doi.org/10.1039/D4SC00966E">https://doi.org/10.1039/D4SC00966E</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{livne2024nach0,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{nach0: multimodal natural and chemical languages foundation model}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Livne, Micha and Miftahutdinov, Zulfat and Tutubalina, Elena and Kuznetsov, Maksim and Polykovskiy, Daniil and Brundyn, Annika and Jhunjhunwala, Aastha and Costa, Anthony and Aliper, Alex and Aspuru-Guzik, Al{\&#39;a}n and Zhavoronkov, Alex}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{Chemical Science}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{22}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{8380--8389}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{Royal Society of Chemistry}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1039/D4SC00966E}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>