Trimodal Pre-training for Molecular Understanding

MolFM is a methods paper that introduces a multimodal molecular foundation model integrating three distinct sources of molecular knowledge: 2D molecular graphs, biomedical text, and knowledge graphs. The primary contribution is a pre-training framework that fuses information across all three modalities with fine-grained cross-modal attention, supported by a theoretical justification from a deep metric learning perspective. At the time of publication, MolFM reported state-of-the-art results on cross-modal retrieval, molecule captioning, text-based molecule generation, and molecular property prediction.

Why Existing Molecular Models Fall Short

Prior multimodal molecular foundation models operate on at most two modalities (structures and text) and suffer from two key limitations. First, they represent molecular structure imperfectly: generative approaches like KV-PLM and MolT5 rely on 1D SMILES strings, which cannot capture complex topological and spatial features such as macrocycles, while contrastive approaches like MoMu and MoleculeSTM learn only global alignment between molecular graphs and text and overlook fine-grained connections between specific substructures and their textual descriptions.

Second, and more fundamentally, no prior model incorporates knowledge graphs as a third modality. Knowledge graphs encode global-level relationships among molecules, target ligands, diseases, and other biomedical entities. These relationships capture functional and structural similarity patterns that cannot be learned from individual molecule-text pairs alone. MolFM addresses both gaps by introducing cross-modal attention across all three modalities and providing theoretical guarantees about what the pre-training objectives learn.

Cross-Modal Attention and Metric Learning Guarantees

Architecture

MolFM uses three pre-trained single-modal encoders:

  • Molecular graph encoder: A 5-layer GIN (1.8M parameters) initialized from GraphMVP, producing atom-level features $h_{SA}$ and a graph-level feature $h_{SM}$
  • Text encoder: A 6-layer transformer (61.8M parameters) initialized from KV-PLM’s first 6 layers, producing token features $h_T$
  • Knowledge graph encoder: A TransE model (12.6M parameters) trained on the knowledge graph for 500 epochs, producing entity features $h_K$

A multimodal encoder (61.8M parameters, 6 transformer layers with cross-attention) fuses the three modalities. The cross-attention uses text token features as queries and the concatenation of atom features and knowledge graph neighbor features as keys and values. For each molecule, the knowledge graph input is the molecule’s entity and $N=4$ randomly sampled one-hop neighbors.
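
A minimal PyTorch sketch of one such fusion layer, under these design choices (hidden size, head count, and the pre/post-norm layout are illustrative assumptions, not the released implementation): text token features serve as queries, and the concatenation of atom features and knowledge-graph neighbor features serves as keys and values.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """One fusion layer: self-attention over text tokens, then cross-attention
    from text queries to concatenated [atom; KG-neighbor] features.
    Hyperparameters are illustrative, not taken from the MolFM release."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, h_text, h_atoms, h_kg_neighbors):
        # h_text: (B, L_t, d); h_atoms: (B, L_a, d); h_kg_neighbors: (B, N, d)
        x = self.norm1(h_text + self.self_attn(h_text, h_text, h_text)[0])
        # Text tokens attend jointly over atom and KG-neighbor features.
        kv = torch.cat([h_atoms, h_kg_neighbors], dim=1)   # (B, L_a + N, d)
        x = self.norm2(x + self.cross_attn(x, kv, kv)[0])
        return self.norm3(x + self.ffn(x))

# Example shapes; six such layers would form the multimodal encoder.
fusion = CrossModalFusionLayer()
out = fusion(torch.randn(2, 32, 768), torch.randn(2, 40, 768), torch.randn(2, 4, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```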

Pre-training Objectives

MolFM combines four losses:

Structure-text contrastive (STC) aligns the global feature spaces of structure and text encoders using a symmetric InfoNCE loss:

$$\mathcal{L}_{stc} = -\frac{1}{2} \left[ \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{S' \in B} \exp(s(z_{S'}, z_T) / \tau)} + \log \frac{\exp(s(z_S, z_T) / \tau)}{\sum_{T' \in B} \exp(s(z_S, z_{T'}) / \tau)} \right]$$

where $s(\cdot, \cdot)$ is cosine similarity and $\tau = 0.1$ is a temperature parameter.
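
A minimal sketch of this symmetric InfoNCE objective over in-batch negatives (the projection to the global features $z_S$, $z_T$ is assumed to have already happened; this is not the official code):

```python
import torch
import torch.nn.functional as F

def stc_loss(z_s, z_t, tau=0.1):
    """Symmetric InfoNCE over a batch of structure/text global features.
    z_s, z_t: (B, d) graph-level and text [CLS] features (projected).
    A sketch of the STC objective described above, not the official code."""
    z_s = F.normalize(z_s, dim=-1)
    z_t = F.normalize(z_t, dim=-1)
    logits = z_s @ z_t.T / tau            # (B, B) cosine similarities / tau
    labels = torch.arange(z_s.size(0), device=z_s.device)
    # Average of the structure-to-text and text-to-structure directions.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```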

Cross-modal matching (CMM) predicts whether a structure-text-knowledge triplet corresponds to the same molecule, using cross-entropy over the multimodal encoder’s CLS token:

$$\mathcal{L}_{cmm} = \sum_{(\tilde{S}, \tilde{T}, \tilde{K}) \in \tilde{B}} H\left[y_{cmm}(\tilde{S}, \tilde{T}, \tilde{K}),\; p_{cmm}\left(\mathcal{M}_\theta(h_{\tilde{S}}, h_{\tilde{T}}, h_{\tilde{K}})\right)\right]$$
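
A minimal sketch of the matching head (feature size and negative-sampling details are assumptions): the fused [CLS] feature of each (structure, text, knowledge) triplet, together with in-batch mismatched triplets, is classified as matched or unmatched with cross-entropy.

```python
import torch.nn as nn
import torch.nn.functional as F

class MatchingHead(nn.Module):
    """Binary matched/unmatched classifier over the multimodal encoder's
    [CLS] feature; a sketch of the CMM head, dimensions are assumptions."""
    def __init__(self, d_model=768):
        super().__init__()
        self.classifier = nn.Linear(d_model, 2)

    def forward(self, cls_feature, labels):
        # cls_feature: (B, d) fused [CLS] of (structure, text, knowledge) triplets;
        # labels: (B,) with 1 for matched triplets, 0 for in-batch negatives.
        logits = self.classifier(cls_feature)
        return F.cross_entropy(logits, labels)
```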

Masked language modeling (MLM) predicts masked text tokens conditioned on all three modalities:

$$\mathcal{L}_{mlm} = H\left[y_{mlm}(\hat{T}),\; p_{mlm}\left(\mathcal{M}_\theta(h_S, h_{\hat{T}}, h_K)\right)\right]$$

Knowledge graph embedding (KGE) regularizes entity embeddings with a max-margin TransE loss:

$$\mathcal{L}_{kge} = \sum_{h \in K} \left[\max(0, d(h,r,t) - d(h,r,\tilde{t}) + \Delta) + \max(0, d(h,r,t) - d(\tilde{h},r,t) + \Delta)\right]$$

where $d(h,r,t) = \|f(h) + g(r) - f(t)\|_2$ and $\Delta = 0.2$.
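
A minimal sketch of this max-margin TransE loss with corrupted heads and tails (the batching and negative-sampling details are assumptions):

```python
import torch
import torch.nn.functional as F

def kge_loss(f_h, g_r, f_t, f_h_neg, f_t_neg, margin=0.2):
    """Max-margin TransE loss with corrupted heads and tails.
    f_*: (B, d) entity embeddings, g_r: (B, d) relation embeddings.
    A sketch of the KGE term above; negative sampling is an assumption."""
    def dist(h, r, t):
        return torch.norm(h + r - t, p=2, dim=-1)    # d(h, r, t)

    pos = dist(f_h, g_r, f_t)
    neg_tail = dist(f_h, g_r, f_t_neg)               # corrupted tail
    neg_head = dist(f_h_neg, g_r, f_t)               # corrupted head
    return (F.relu(pos - neg_tail + margin) +
            F.relu(pos - neg_head + margin)).sum()
```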

The total pre-training loss is:

$$\mathcal{L} = \mathbb{E}_{(S,T,K)}\left[\mathcal{L}_{stc} + \mathcal{L}_{cmm} + \mathcal{L}_{mlm} + \mathcal{L}_{kge}\right]$$

Theoretical Justifications

The authors provide metric learning interpretations for each objective. For CMM, they show that minimizing the loss drives the model to assign higher matching scores to triplets describing the same molecule and lower scores to mismatched ones, aligning the feature space across all three modalities.

For KGE, two lemmas provide guarantees about structurally and functionally similar molecules:

Lemma 1 (Structural similarity): For a symmetric structural-similarity relation $r_s$, the KGE loss satisfies:

$$\mathcal{L}_{kge}(h, r_s, t) \propto 2\,\|f(h) - f(t)\| - \mathbb{E}_{\tilde{t}}\|f(h) - f(\tilde{t})\| - \mathbb{E}_{\tilde{h}}\|f(\tilde{h}) - f(t)\|$$

This shows KGE pulls structurally similar molecules closer while pushing dissimilar ones apart.

Lemma 2 (Functional similarity): For molecules $h$ and $t$ that interact with a common entity $o$, the distance between their embeddings is upper-bounded:

$$\|f(h) - f(t)\| \leq \alpha\,\mathbb{E}_{(e_1, r, e_2) \sim \mathcal{I}}\left[\mathcal{L}_{kge}(e_1, r, e_2)\right] + C$$

where $\alpha \approx 1$ and $C \approx 0$. This guarantees that minimizing KGE also brings functionally similar molecules closer in the embedding space.
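
A sketch of the intuition (our paraphrase, not the paper's proof): when $h$ and $t$ both connect to a common entity $o$ through a relation $r$, the triangle inequality gives

$$\|f(h) - f(t)\| = \big\|[f(h) + g(r) - f(o)] - [f(t) + g(r) - f(o)]\big\| \leq d(h, r, o) + d(t, r, o),$$

and both residuals $d(h, r, o)$ and $d(t, r, o)$ are exactly the quantities penalized (relative to corrupted triples) by $\mathcal{L}_{kge}$ over the interaction triples, which is where the expectation over $\mathcal{I}$ in Lemma 2 enters.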

Experiments Across Four Downstream Tasks

Pre-training Data

MolFM pre-trains on 15K molecules from PubChem paired with 37M paragraphs from S2ORC. The knowledge graph contains 49K entities and 3.2M relations, constructed from DrugBank, BindingDB, and additional public databases with heuristic augmentation.

Cross-Modal Retrieval

Evaluated on PCdes (paragraph-level) in zero-shot and fine-tuning settings. MolFM uses a re-ranking strategy that linearly combines cosine similarity with CMM logits over the top-$k$ retrieved candidates.
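
A minimal sketch of this re-ranking step (the interface, the top-$k$ value, and the mixing weight are illustrative assumptions, not MolFM's API):

```python
import torch

def rerank(sim_scores, cmm_logit_fn, k=32, alpha=0.5):
    """Re-rank retrieval candidates for one query: keep the top-k by global
    cosine similarity, then linearly combine that similarity with the CMM
    matching logit from the multimodal encoder.
    sim_scores: (num_candidates,) cosine similarities for this query.
    cmm_logit_fn(idx): returns the matching logit for candidate idx (scalar tensor)."""
    topk_scores, topk_idx = sim_scores.topk(k)
    # One multimodal forward pass per surviving candidate.
    match_logits = torch.stack([cmm_logit_fn(int(i)) for i in topk_idx])
    combined = alpha * topk_scores + (1 - alpha) * match_logits
    order = combined.argsort(descending=True)
    return topk_idx[order]  # candidate indices, best first
```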

| Mode | Model | S-T MRR | S-T R@1 | S-T R@10 | T-S MRR | T-S R@1 | T-S R@10 |
|---|---|---|---|---|---|---|---|
| Zero-shot | MoMu | 9.89 | 5.08 | 18.93 | 10.33 | 4.90 | 20.69 |
| Zero-shot | MolFM | 21.42 | 13.90 | 36.21 | 23.63 | 16.14 | 39.54 |
| Fine-tune | MoMu | 34.29 | 24.47 | 53.84 | 34.53 | 24.87 | 54.25 |
| Fine-tune | MolFM | 39.56 | 29.76 | 58.63 | 39.34 | 29.39 | 58.49 |

MolFM achieves 12.13% and 5.04% absolute gains over MoMu under zero-shot and fine-tuning settings, respectively.

Molecule Captioning

Evaluated on ChEBI-20 using MolT5 decoders. MolFM’s structure encoder features are concatenated with the MolT5 encoder outputs.

| Decoder | Encoder | BLEU-4 | ROUGE-L | METEOR | Text2Mol |
|---|---|---|---|---|---|
| MolT5-base | MolT5-base | 0.457 | 0.578 | 0.569 | 0.547 |
| MolT5-base | MoMu | 0.462 | 0.575 | 0.576 | 0.558 |
| MolT5-base | GraphMVP | 0.491 | 0.592 | 0.599 | 0.570 |
| MolT5-base | MolFM | 0.498 | 0.594 | 0.607 | 0.576 |
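
A rough sketch of the encoder fusion described above (the projection layer and the concatenation axis are assumptions about the wiring, not the official code):

```python
import torch

def fuse_encoder_states(molt5_hidden, molfm_atom_feats, proj):
    """Project MolFM structure-encoder features to the MolT5 hidden size and
    concatenate them with the MolT5 encoder outputs along the sequence axis,
    forming the memory the MolT5 decoder cross-attends to.
    molt5_hidden: (B, L_text, d_t5); molfm_atom_feats: (B, L_atoms, d_molfm);
    proj: a learned linear map from d_molfm to d_t5 (an assumption)."""
    return torch.cat([molt5_hidden, proj(molfm_atom_feats)], dim=1)
```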

Text-Based Molecule Generation

Also on ChEBI-20 with MolT5 decoders. MolFM’s text features are projected and fed to the decoder.

| Decoder | Encoder | Exact | Valid | Morgan FTS | Text2Mol |
|---|---|---|---|---|---|
| MolT5-base | MolT5-base | 0.082 | 0.786 | 0.601 | 0.543 |
| MolT5-base | MoMu | 0.183 | 0.863 | 0.678 | 0.580 |
| MolT5-base | MolFM | 0.210 | 0.892 | 0.697 | 0.583 |

Molecular Property Prediction

On MoleculeNet (8 classification datasets), MolFM concatenates the structure feature and the multimodal encoder’s CLS feature to predict properties.

| Model | BBBP | Tox21 | ClinTox | HIV | BACE | Avg (8 datasets) |
|---|---|---|---|---|---|---|
| GraphMVP | 72.4 | 74.4 | 77.5 | 77.0 | 81.2 | 73.07 |
| DeepEIK | 72.1 | 72.4 | 89.7 | 75.0 | 80.5 | 73.27 |
| MolFM (w/o T+K) | 72.2 | 76.6 | 78.6 | 78.2 | 82.6 | 73.95 |
| MolFM (w/ T+K) | 72.9 | 77.2 | 79.7 | 78.8 | 83.9 | 74.62 |

With multimodal inputs, MolFM averages 74.62% ROC-AUC, a 1.55% absolute gain over GraphMVP.
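
A rough sketch of the prediction head described above (feature dimensions and the MLP size are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PropertyHead(nn.Module):
    """Concatenate the graph-level structure feature with the multimodal
    encoder's [CLS] feature and classify; sizes are illustrative."""
    def __init__(self, d_struct=300, d_fused=768, n_tasks=1):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_struct + d_fused, 256), nn.ReLU(),
            nn.Linear(256, n_tasks))

    def forward(self, h_struct, h_cls):
        return self.mlp(torch.cat([h_struct, h_cls], dim=-1))  # task logits
```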

Ablation Studies

Zero-shot retrieval ablations reveal that cross-modal attention to atoms and CMM are the most critical components. Removing either causes a sharp drop (approximately 3% on S-T retrieval). Knowledge graph incorporation yields a 1.5% average improvement, with both attention to neighbors and KGE contributing marginally.

Key Findings and Limitations

MolFM demonstrates that incorporating knowledge graphs as a third modality provides consistent improvements across all evaluated tasks. The theoretical analysis connecting pre-training objectives to deep metric learning provides interpretability for why the model works: STC and CMM align representations of the same molecule across modalities, while KGE pulls structurally and functionally similar molecules closer in the embedding space.

The cross-modal attention visualizations show that MolFM learns to associate specific atom substructures with relevant text tokens and knowledge graph entities. For example, the model correctly attends to functional groups mentioned in textual descriptions.

The authors acknowledge several limitations:

  1. Data quality: The pre-training dataset (15K molecules) is small and may introduce biases
  2. Cold-start problem: MolFM provides limited benefit for newly emerged molecules lacking text and knowledge graph information
  3. Entity scope: The model focuses on molecules and does not incorporate proteins, genes, or cell lines, which could further improve biomedical understanding

Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Pre-training (molecules) | PubChem | 15K molecules | Follows MoMu's pre-training data |
| Pre-training (text) | S2ORC | 37M paragraphs | Biomedical literature paragraphs |
| Knowledge graph | DrugBank, BindingDB, public DBs | 49K entities, 3.2M relations | Constructed with heuristics from MoCL |
| Cross-modal retrieval | PCdes | Paragraph-level | Test split |
| Captioning/Generation | ChEBI-20 | - | Following MolT5 splits |
| Property prediction | MoleculeNet | 8 datasets | Classification tasks, ROC-AUC metric |

Algorithms

  • Optimizer: AdamW with weight decay $1 \times 10^{-4}$
  • Learning rate: linear warmup to $1 \times 10^{-4}$ over 2,000 iterations, then cosine annealing to $1 \times 10^{-5}$ (see the sketch after this list)
  • Batch size: 128
  • Pre-training epochs: 300
  • Knowledge graph neighbors per molecule: $N = 4$
  • Temperature: $\tau = 0.1$
  • Margin: $\Delta = 0.2$
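
A minimal sketch of how these optimization settings could be wired together in PyTorch (the exact combination of warmup and cosine annealing is an assumption, not the released training script):

```python
import math
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, total_iters, warmup_iters=2000,
                    peak_lr=1e-4, final_lr=1e-5, weight_decay=1e-4):
    """AdamW with linear warmup to peak_lr, then cosine annealing to final_lr.
    Call scheduler.step() once per training iteration."""
    opt = AdamW(model.parameters(), lr=peak_lr, weight_decay=weight_decay)

    def lr_lambda(step):
        if step < warmup_iters:
            return step / max(1, warmup_iters)                  # linear warmup
        progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
        cosine = 0.5 * (1 + math.cos(math.pi * progress))       # 1 -> 0
        return (final_lr + (peak_lr - final_lr) * cosine) / peak_lr

    return opt, LambdaLR(opt, lr_lambda)
```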

Models

| Component | Architecture | Parameters | Initialization |
|---|---|---|---|
| Graph encoder | 5-layer GIN | 1.8M | GraphMVP |
| Text encoder | 6-layer Transformer | 61.8M | KV-PLM (first 6 layers) |
| Knowledge encoder | TransE | 12.6M | Trained 500 epochs on KG |
| Multimodal encoder | 6-layer Transformer + cross-attention | 61.8M | KV-PLM (last 6 layers) |
| Total | | ~138M | |

Evaluation

| Task | Metrics |
|---|---|
| Cross-modal retrieval | MRR, Recall@1/5/10 |
| Molecule captioning | BLEU-2/4, ROUGE-1/2/L, METEOR, Text2Mol |
| Text-to-molecule generation | BLEU, Exact ratio, Validity, Levenshtein, Fingerprint Tanimoto (MACCS/RDKit/Morgan), Text2Mol |
| Property prediction | ROC-AUC per dataset |

Hardware

  • 4 NVIDIA A100 GPUs for pre-training

Artifacts

| Artifact | Type | License | Notes |
|---|---|---|---|
| OpenBioMed | Code | MIT | Official implementation including MolFM |

Paper Information

Citation: Luo, Y., Yang, K., Hong, M., Liu, X. Y., & Nie, Z. (2023). MolFM: A Multimodal Molecular Foundation Model. arXiv preprint arXiv:2307.09484.

@article{luo2023molfm,
  title={MolFM: A Multimodal Molecular Foundation Model},
  author={Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Nie, Zaiqing},
  journal={arXiv preprint arXiv:2307.09484},
  year={2023}
}