MolFM: Trimodal Molecular Foundation Pre-training

Trimodal Pre-training for Molecular Understanding

MolFM is a Method paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that uses fine-grained cross-modal attention to fuse information across all three modalities, combined with theoretical justification from a deep metric learning perspective. MolFM achieves the best reported results (at time of publication) on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.

Why Existing Molecular Models Fall Short

Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, generative approaches like KV-PLM and MolT5 rely on 1D SMILES strings, which cannot capture complex topological and spatial molecular properties such as macrocycles. Contrastive approaches like MoMu and MoleculeSTM learn global alignment between molecule graphs and text but overlook fine-grained connections between specific substructures and textual descriptions.

Second, and more fundamentally, no prior model incorporates knowledge graphs as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.

Architecture

MolFM uses three pre-trained single-modal encoders:

Molecular graph encoder: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$
Text encoder: A 6-layer transformer (61.8M parameters) initialized from KV-PLM’s first 6 layers, producing token features $h_T$
Knowledge graph encoder: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$

A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule’s entity and $N=4$ randomly sampled one-hop neighbors.

Pre-training Objectives

MolFM combines four losses:

Structure-text contrastive (STC) aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:

$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S’ \in B} \exp(s(z_{S’}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T’ \in B} \exp(s(z_S, z_{T’}) / \tau)} \right]$$

where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.

Cross-modal matching (CMM) predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder’s CLS token:

$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$

Masked language modeling (MLM) predicts masked text tokens conditioned on all three modalities:

$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$

Knowledge graph embedding (KGE) regularizes entity embeddings with a max-margin TransE loss:

$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$

where $d(h,r,t) = | f(h) + g(r) - f(t) |_2$ and $\Delta = 0.2$.

The total pre-training loss is:

$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$

Theoretical Justifications

The authors provide metric learning interpretations for each objective. For CMM, they show that the loss is proportional to assigning higher scores to matched triplets and lower scores to unmatched ones, aligning the feature space across all three modalities.

For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:

Lemma 1 (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:

$$\mathcal{L}_{kge}(h, r_s, t) \propto 2|f(h) - f(t)| - \mathbb{E}_{\tilde{t}}|f(h) - f(\tilde{t})| - \mathbb{E}_{\tilde{h}}|f(\tilde{h}) - f(t)|$$

This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.

Lemma 2 (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:

$$|f(h) - f(t)| \leq \alpha,\mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$

where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.

Experiments Across Four Downstream Tasks

Pre-training Data

MolFM pre-trains on 15K molecules from PubChem paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from DrugBank, BindingDB, and additional public databases with heuristic augmentation.

Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.

Mode	Model	S-T MRR	S-T R@1	S-T R@10	T-S MRR	T-S R@1	T-S R@10
Zero-shot	MoMu	9.89	5.08	18.93	10.33	4.90	20.69
Zero-shot	MolFM	21.42	13.90	36.21	23.63	16.14	39.54
Fine-tune	MoMu	34.29	24.47	53.84	34.53	24.87	54.25
Fine-tune	MolFM	39.56	29.76	58.63	39.34	29.39	58.49

MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.

Molecule Captioning

Evaluated on ChEBI-20 using MolT5 decoders. MolFM’s structure encoder features are concatenated with the MolT5 encoder outputs.

Decoder	Encoder	BLEU-4	ROUGE-L	METEOR	Text2Mol
MolT5-base	MolT5-base	0.457	0.578	0.569	0.547
MolT5-base	MoMu	0.462	0.575	0.576	0.558
MolT5-base	GraphMVP	0.491	0.592	0.599	0.570
MolT5-base	MolFM	0.498	0.594	0.607	0.576

Text-Based Molecule Generation

Also on ChEBI-20 with MolT5 decoders. MolFM’s text features are projected and fed to the decoder.

Decoder	Encoder	Exact	Valid	Morgan FTS	Text2Mol
MolT5-base	MolT5-base	0.082	0.786	0.601	0.543
MolT5-base	MoMu	0.183	0.863	0.678	0.580
MolT5-base	MolFM	0.210	0.892	0.697	0.583

Molecular Property Prediction

On MoleculeNet (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder’s CLS feature to predict properties.

Model	BBBP	Tox21	ClinTox	HIV	BACE	Avg
GraphMVP	72.4	74.4	77.5	77.0	81.2	73.07
DeepEIK	72.1	72.4	89.7	75.0	80.5	73.27
MolFM (w/o T+K)	72.2	76.6	78.6	78.2	82.6	73.95
MolFM (w/ T+K)	72.9	77.2	79.7	78.8	83.9	74.62

With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.

Ablation Studies

Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.

Key Findings and Limitations

MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.

The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.

The authors acknowledge several limitations:

Data quality: The pre-training dataset (15K molecules) is small and may introduce biases
Cold-start problem: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information
Entity scope: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pre-training (molecules)	PubChem	15K molecules	Follows MoMu’s pre-training data
Pre-training (text)	S2ORC	37M paragraphs	Biomedical literature paragraphs
Knowledge graph	DrugBank, BindingDB, public DBs	49K entities, 3.2M relations	Constructed with heuristics from MoCL
Cross-modal retrieval	PCdes	Paragraph-level	Test split
Captioning/Generation	ChEBI-20	-	Following MolT5 splits
Property prediction	MoleculeNet	8 datasets	Classification tasks, ROC-AUC metric

Algorithms

Optimizer: AdamW with weight decay $1 \times 10^{-4}$
Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, cosine annealing to $1 \times 10^{-5}$
Batch size: 128
Pre-training epochs: 300
Knowledge graph neighbors per molecule: $N = 4$
Temperature: $\tau = 0.1$
Margin: $\Delta = 0.2$

Models

Component	Architecture	Parameters	Initialization
Graph encoder	5-layer GIN	1.8M	GraphMVP
Text encoder	6-layer Transformer	61.8M	KV-PLM (first 6 layers)
Knowledge encoder	TransE	12.6M	Trained 500 epochs on KG
Multimodal encoder	6-layer Transformer + cross-attention	61.8M	KV-PLM (last 6 layers)
Total		~138M

Evaluation

Task	Metrics
Cross-modal retrieval	MRR, Recall@1/5/10
Molecule captioning	BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol
Text-to-molecule generation	BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol
Property prediction	ROC-AUC per dataset

Hardware

4 NVIDIA A100 GPUs for pre-training

Artifacts

Artifact	Type	License	Notes
OpenBioMed	Code	MIT	Official implementation including MolFM

Paper Information

Citation: Luo, Y., Yang, K., Hong, M., Liu, X. Y., & Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. arXiv preprint arXiv:2307.09484.

@article{luo2023molfm,
  title={MolFM: A Multimodal Molecular Foundation Model},
  author={Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing},
  journal={arXiv preprint arXiv:2307.09484},
  year={2023}
}

Trimodal Pre-training for Molecular Understanding#

Why Existing Molecular Models Fall Short#

Cross-Modal Attention and Metric Learning Guarantees#

Architecture#

Pre-training Objectives#

Theoretical Justifications#

Experiments Across Four Downstream Tasks#

Pre-training Data#

Cross-Modal Retrieval#

Molecule Captioning#

Text-Based Molecule Generation#

Molecular Property Prediction#

Ablation Studies#

Key Findings and Limitations#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#