Bridging Molecular Graphs and Natural Language Through Contrastive Learning

MoMu (Molecular Multimodal foundation model) is a Method paper that proposes a multimodal pre-training approach to associate molecular graphs with natural language descriptions. The primary contribution is a dual-encoder architecture, consisting of a Graph Isomorphism Network (GIN) for molecular graphs and a BERT-based text encoder, jointly trained through contrastive learning on weakly-correlated graph-text pairs collected from scientific literature. The pre-trained model supports four downstream capabilities: cross-modal retrieval (graph-to-text and text-to-graph), molecule captioning, zero-shot text-to-graph molecule generation, and molecular property prediction.

Why Single-Modality Models Are Insufficient for Molecular Understanding

Existing AI models for molecular tasks generally operate on a single modality and learn a single cognitive ability. Language-based models process SMILES strings or natural language texts and handle tasks like property prediction from strings, literature comprehension, or SMILES-based generation. Graph-based models use molecular graph representations and handle graph-level property prediction or graph generation. Neither category connects structural information from molecular graphs with the rich semantic knowledge encoded in scientific texts.

Prior work by Zeng et al. (KV-PLM) jointly modeled molecule-related texts and SMILES strings, but SMILES representations have inherent drawbacks: they are one-dimensional and may lose structural information, they cannot capture structural similarities between molecules, and a single molecule can have multiple valid SMILES representations. Molecular graphs, by contrast, are more intuitive and better reveal functional structures. Human experts learn molecular knowledge by associating both graphical representations and textual descriptions, yet no prior model bridged these two modalities directly.

The key challenge is the scarcity of paired molecular graph-text data compared to general image-text datasets. Additionally, learning specialized molecular knowledge requires foundational cognitive abilities in both the graph and text domains, making training from scratch infeasible with limited data.

Contrastive Pre-Training with Inter-Modal and Intra-Modal Objectives

MoMu consists of two encoders initialized from pre-trained unimodal models: a GIN graph encoder initialized from GraphCL self-supervised weights, and a BERT text encoder initialized from either Sci-BERT (yielding MoMu-S) or KV-PLM (yielding MoMu-K).

Data Collection

The authors collect approximately 15,613 molecular graph-document pairs by:

  1. Gathering names, synonyms, and SMILES for the top 50K compounds in PubChem
  2. Converting SMILES to molecular graphs using the OGB smiles2graph function
  3. Retrieving related text from the S2ORC corpus (136M+ papers) by querying with molecule names, filtering to Medicine, Biology, Chemistry, and Computer Science fields
  4. Restricting retrieval to abstract, introduction, and conclusion sections to avoid experimental data artifacts

Contrastive Training Objective

For each graph-text pair in a mini-batch of $N$ pairs, MoMu applies two graph augmentations (node dropping and subgraph extraction) to create two augmented graphs, and randomly samples two sentences from the document. This produces $2N$ graph representations ${z_1^G, \tilde{z}_1^G, \ldots, z_N^G, \tilde{z}_N^G}$ and $2N$ text representations ${z_1^T, \tilde{z}_1^T, \ldots, z_N^T, \tilde{z}_N^T}$.

The cross-modal contrastive loss for a pair $(z_i^G, z_i^T)$ is:

$$ \ell_i^{(z_i^G, z_i^T)} = -\log \frac{\exp(\text{sim}(z_i^G, z_i^T) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, z_j^T) / \tau)} $$

where $\tau$ is the temperature parameter and $\text{sim}(\cdot, \cdot)$ projects both representations into a shared 256-dimensional space before computing cosine similarity. The total cross-modal loss includes four contrastive terms for each pair: $(z_i^G, z_i^T)$, $(\tilde{z}_i^G, z_i^T)$, $(z_i^G, \tilde{z}_i^T)$, and $(\tilde{z}_i^G, \tilde{z}_i^T)$.

An intra-modal graph contrastive loss further strengthens the graph encoder:

$$ \ell_i^{(z_i^G, \tilde{z}_i^G)} = -\log \frac{\exp(\text{sim}(z_i^G, \tilde{z}_i^G) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^G, \tilde{z}_j^G) / \tau)} $$

Zero-Shot Text-to-Graph Generation

MoMu enables a zero-shot generation pipeline by combining the pre-trained MoMu encoders with MoFlow, a flow-based molecular generator. Given an input text description $x^T$, the method:

  1. Samples a latent variable $q$ from MoFlow’s Gaussian prior $P(q)$
  2. Generates a molecular graph through MoFlow’s reverse flows: $\hat{E} = f_g^{-1}(q_e)$ and $\hat{V} = f_c^{-1}(q_v \mid GN(\hat{E}))$
  3. Feeds $\hat{V}$ (using soft atom type probabilities instead of hard assignments) into MoMu’s graph encoder
  4. Optimizes $q$ to maximize the cosine similarity between the resulting graph and text representations:

$$ \ell_q = -\text{sim}(z^G, z^T) / \tau $$

All MoMu and MoFlow parameters are frozen; only $q$ is updated via Adam for up to 500 iterations. The final molecule is obtained by applying argmax to the optimized probability matrices $\hat{V}$ and $\hat{E}$.

Evaluation Across Four Downstream Tasks

Cross-Modal Retrieval

MoMu is evaluated on the PCdes dataset (15K SMILES-description pairs from PubChem, split 10,500/1,500/3,000 for train/val/test). Retrieval is performed in mini-batches of 64 pairs, reporting top-1 accuracy and Recall@20.

Graph-to-Text Retrieval (PCdes, fine-tuned):

MethodSentence AccSentence R@20Paragraph AccParagraph R@20
Sci-BERT50.3862.1162.5760.67
KV-PLM53.7966.6364.8163.87
KV-PLM*55.9268.5977.9275.93
MoMu-S58.6480.5980.6279.11
MoMu-K58.7481.2981.0980.15

Text-to-Graph Retrieval (PCdes, fine-tuned):

MethodSentence AccSentence R@20Paragraph AccParagraph R@20
Sci-BERT50.1268.0261.7560.77
KV-PLM54.2271.8064.9564.27
KV-PLM*55.6174.7777.0375.47
MoMu-S55.4476.9280.2279.02
MoMu-K54.9478.2981.4580.62

In zero-shot retrieval (on a separate test set of 5,562 pairs not seen during pre-training), MoMu achieves approximately 39-46% accuracy compared to below 2% for Sci-BERT and KV-PLM, demonstrating strong generalization.

Molecule Captioning

MoMu’s graph features are appended to MolT5’s encoder inputs through a learned MLP mapping module on the ChEBI-20 dataset. Results show improvements in BLEU, METEOR, and Text2Mol scores when incorporating graph features, though ROUGE-L slightly drops. The graph structural information leads to more accurate captions for complex molecular structures.

Molecular Property Prediction

The pre-trained graph encoder from MoMu is fine-tuned on eight MoleculeNet datasets using scaffold splitting and ROC-AUC evaluation (10 runs).

DatasetNo Pre-TrainGraphCLMoMu-SMoMu-K
BBBP65.869.770.570.1
Tox2174.073.975.675.6
ToxCast63.462.463.463.0
SIDER57.360.560.560.4
ClinTox58.076.079.977.4
MUV71.869.870.571.1
HIV75.378.575.976.2
BACE70.175.476.777.1
Average66.9670.7871.6371.36

MoMu-S achieves the best average ROC-AUC (71.63%) across all eight datasets, outperforming GraphCL (70.78%), the self-supervised method used to initialize MoMu’s graph encoder. MoMu outperforms GraphCL on six of eight datasets. Notably, MoMu-S and MoMu-K perform comparably, indicating that KV-PLM’s SMILES-based knowledge does not transfer well to graph-based representations.

Zero-Shot Text-to-Graph Generation

The method generates molecules from three types of text descriptions:

  1. High-level vague descriptions (e.g., “The molecule is beautiful”): MoMu generates diverse, interpretable molecules where “beautiful” tends to produce locally symmetric and stretched graphs, “versatile” produces molecules with varied elements and functional groups, and “strange” produces cluttered, irregular structures.
  2. Functional descriptions (e.g., “fluorescent molecules”, “high water solubility and barrier permeability with low toxicity”): MoMu successfully generates molecules with appropriate functional groups and properties. For the solubility/permeability/toxicity query, MoMu generates molecules that satisfy three of three evaluable properties.
  3. Structural descriptions (e.g., “molecules containing nucleophilic groups”): MoMu generates diverse molecules with appropriate functional groups (amino, hydroxyl, carbonyl, halogen atoms).

Promising Multimodal Transfer with Clear Data Limitations

MoMu demonstrates that contrastive pre-training on weakly-correlated graph-text data can bridge molecular graphs and natural language in a shared representation space. The key findings are:

  1. Cross-modal alignment works with limited data: With only 15K graph-text pairs (far fewer than the millions used in vision-language models like CLIP), MoMu achieves meaningful cross-modal retrieval and enables zero-shot generation.
  2. Multimodal supervision improves graph representations: The graph encoder supervised by text descriptions outperforms self-supervised methods (GraphCL, AttrMasking, ContextPred) on average across molecular property prediction benchmarks.
  3. SMILES knowledge does not transfer to graphs: MoMu-S and MoMu-K perform comparably across all tasks, showing that structural information learned from one-dimensional SMILES strings does not readily generalize to graph neural networks.

Limitations

The authors acknowledge several important limitations:

  • Data scarcity: 15K graph-text pairs is substantially smaller than general image-text datasets, potentially leaving the common space insufficiently aligned.
  • Noisy supervision: Retrieved texts may mention a molecule by name without describing its properties or structure, leading to spurious correlations.
  • Generator constraints: The zero-shot generation method is limited by MoFlow’s capacity (maximum 38 atoms, 9 element types from ZINC250K training).
  • Property coverage: Generation quality degrades for molecular properties that appear infrequently or not at all in the training texts.

Future Directions

The authors propose four avenues: (1) collecting larger-scale multimodal molecular data including 3D conformations, (2) using strongly-correlated paired data with more advanced generators, (3) developing interpretable tools for the learned cross-modal space, and (4) wet-lab validation of generated molecules.


Reproducibility Details

Data

PurposeDatasetSizeNotes
Pre-trainingCollected graph-text pairs (PubChem + S2ORC)15,613 pairs~37M paragraphs total; top 50K PubChem compounds
Cross-modal retrievalPCdes15K pairs (10.5K/1.5K/3K split)SMILES-description pairs from PubChem
Molecule captioningChEBI-20~33K pairsUsed with MolT5
Text-to-graph generationZINC250K (MoFlow)250K moleculesPre-trained generator, max 38 atoms
Property predictionMoleculeNet (8 datasets)VariesBBBP, Tox21, ToxCast, SIDER, ClinTox, MUV, HIV, BACE

Algorithms

  • Graph augmentations: Node dropping (10% ratio) and subgraph extraction (80% of original size via random walk)
  • Contrastive learning: InfoNCE loss with temperature $\tau = 0.1$, following the DeClip paradigm with both inter-modal and intra-modal objectives
  • Zero-shot generation: Adam optimizer on latent variable $q$ for up to 500 iterations; formal charges prohibited in output

Models

  • Graph encoder: GIN with 5 layers, 300-dimensional hidden size, initialized from GraphCL checkpoint
  • Text encoder: BERT-base (768 hidden size), initialized from Sci-BERT or KV-PLM
  • Projection heads: Two MLPs projecting graph (300-dim) and text (768-dim) features to 256-dimensional shared space
  • Optimizer: AdamW, learning rate 0.0001, weight decay 1e-5, 300 epochs, batch size 256

Evaluation

TaskMetricBest ResultNotes
G-T Retrieval (PCdes)Accuracy / R@2081.09 / 80.15 (paragraph)MoMu-K, fine-tuned
T-G Retrieval (PCdes)Accuracy / R@2081.45 / 80.62 (paragraph)MoMu-K, fine-tuned
Zero-shot G-T RetrievalAccuracy~46%vs. ~1.4% for baselines
Property PredictionROC-AUC (avg)71.63%MoMu-S, 8 MoleculeNet datasets
Molecule CaptioningText2MolImproved over MolT5MoMu + MolT5-large

Hardware

  • Pre-training: 8x NVIDIA Tesla V100 PCIe 32GB GPUs
  • Framework: PyTorch

Artifacts

ArtifactTypeLicenseNotes
MoMu codeCodeNot specifiedPre-training and downstream task code
GraphTextRetrievalCodeNot specifiedData collection and cross-modal retrieval code
Pre-training datasetDatasetNot specifiedHosted on Baidu Pan (Chinese cloud storage)

Paper Information

Citation: Su, B., Du, D., Yang, Z., Zhou, Y., Li, J., Rao, A., Sun, H., Lu, Z., & Wen, J.-R. (2022). A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language. arXiv preprint arXiv:2209.05481.

@article{su2022momu,
  title={A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language},
  author={Su, Bing and Du, Dazhao and Yang, Zhao and Zhou, Yujie and Li, Jiangmeng and Rao, Anyi and Sun, Hao and Lu, Zhiwu and Wen, Ji-Rong},
  journal={arXiv preprint arXiv:2209.05481},
  year={2022}
}