Multimodal Molecular Models

This group covers models that jointly learn from more than one molecular modality, such as molecular graphs, SMILES strings, protein sequences, natural-language text, and knowledge graphs. Joint training enables capabilities that unimodal encoders lack, including text-guided molecular retrieval, cross-modal captioning, and joint property prediction across chemical and biological domains.

Paper	Year	Modalities	Key Idea
MG-BERT	2021	Graph + SMILES	GNN message passing integrated with BERT pre-training
MoMu	2022	Graph + text	Contrastive pre-training bridging molecular graphs and natural language
DMP	2023	SMILES + graph	Dual-view consistency learning over SMILES and GNN encoders
BioT5	2023	Molecule + protein + text	T5 model for cross-modal biology and chemistry
MolFM	2023	Graph + text + KG	Trimodal fusion of graphs, text, and knowledge graphs
SPMM	2024	Structure + properties	Bidirectional alignment of molecular structures and property vectors
nach0	2024	SMILES + text + patents	Multi-task instruction tuning over chemistry and NLP

All Notes

Computational Chemistry

BioT5 architecture showing SELFIES molecules, amino acid proteins, and scientific text feeding into a T5 encoder-decoder

BioT5: Cross-Modal Integration of Biology and Chemistry

BioT5 uses SELFIES representations and separate tokenization to pre-train a unified T5 model across molecules, proteins, and text, achieving state-of-the-art results on 10 of 15 downstream tasks.
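The idea of separate tokenization can be sketched in a few lines. This is a minimal illustration, not BioT5's actual tokenizer: the function names are hypothetical, and the marker strings (`<bom>`, `<eom>`, `<bop>`, `<eop>`, and the `<p>` amino-acid prefix) are assumptions based on the paper's description. The point is that a carbon atom in a molecule, a cysteine residue in a protein, and the letter "C" in text never share a token.

```python
import re

# One SELFIES symbol is a bracketed group such as [C] or [=O]; "." separates fragments.
SELFIES_TOKEN = re.compile(r"\[[^\[\]]+\]|\.")

def tokenize_selfies(selfies):
    """Split a SELFIES string into bracketed symbols, wrapped in molecule markers."""
    tokens = SELFIES_TOKEN.findall(selfies)
    if "".join(tokens) != selfies:
        raise ValueError(f"not a well-formed SELFIES string: {selfies!r}")
    return ["<bom>"] + tokens + ["<eom>"]  # marker names are assumptions

def tokenize_protein(sequence):
    """Prefix each amino acid so it cannot collide with plain-text letters."""
    return ["<bop>"] + [f"<p>{aa}" for aa in sequence] + ["<eop>"]

# Ethanol in SELFIES, and a short (made-up) peptide fragment.
print(tokenize_selfies("[C][C][O]"))
print(tokenize_protein("MKC"))
```

Because every molecule token keeps its brackets and every residue keeps its prefix, the three vocabularies stay disjoint even though all of them feed one shared T5 model.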

MolFM trimodal architecture fusing 2D graph, knowledge graph, and biomedical text via cross-modal attention

MolFM: Trimodal Molecular Foundation Pre-training

MolFM pre-trains a multimodal encoder that fuses 2D molecular graphs, biomedical text, and knowledge graph entities through fine-grained cross-modal attention, achieving strong gains on cross-modal retrieval, molecule captioning, text-based generation, and property prediction.
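Cross-modal attention of this kind can be illustrated with plain scaled dot-product attention, where text tokens act as queries over a combined set of atom and knowledge-graph entity embeddings. The sketch below is a simplification under stated assumptions: it is single-head, has no learned projections, and the function names and toy vectors are hypothetical rather than taken from MolFM.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (text token) returns a weighted average of the values
    (atom / KG-entity embeddings), weighted by scaled dot-product similarity."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs

# Toy example: one text token attending over one atom and one KG entity.
text_tokens = [[10.0, 0.0]]
atoms_and_entities = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(text_tokens, atoms_and_entities, atoms_and_entities)
```

In this toy case the text token aligns with the first key, so the fused vector is dominated by the first value; a real model would learn the query, key, and value projections.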

MoMu architecture showing contrastive alignment between molecular graph and scientific text modalities

MoMu: Bridging Molecular Graphs and Natural Language

MoMu pre-trains dual graph and text encoders on 15K molecule graph-text pairs using contrastive learning, enabling cross-modal retrieval, molecule captioning, zero-shot text-to-graph generation, and improved molecular property prediction.
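A common way to implement contrastive alignment between two encoders is a symmetric InfoNCE loss over a batch of paired embeddings. The sketch below assumes that formulation; the function names and the temperature value are illustrative and are not taken from the MoMu code. Row i of each input is assumed to describe the same molecule.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of similarities."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def symmetric_info_nce(graph_embs, text_embs, temperature=0.1):
    """Average of graph-to-text and text-to-graph cross-entropy,
    where the matching pair sits on the diagonal of the similarity matrix."""
    g = [normalize(v) for v in graph_embs]
    t = [normalize(v) for v in text_embs]
    n = len(g)
    sim = [[sum(a * b for a, b in zip(g[i], t[j])) / temperature
            for j in range(n)] for i in range(n)]
    g2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]  # transpose for the text-to-graph direction
    t2g = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (g2t + t2g)

graphs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
print(symmetric_info_nce(graphs, aligned_texts) < symmetric_info_nce(graphs, swapped_texts))  # True
```

Once the two embedding spaces are aligned this way, cross-modal retrieval reduces to ranking candidates by cosine similarity to the query embedding.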

Diagram showing dual-view molecule pre-training with a SMILES Transformer branch and a GNN branch connected by a consistency loss

DMP: Dual-View Molecule Pre-training (SMILES+GNN)

DMP combines a SMILES Transformer and a GNN branch during pre-training, using masked language modeling plus a BYOL-inspired dual-view consistency loss to learn complementary molecular representations.
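A BYOL-style consistency term is usually written as a negative cosine similarity between one branch's prediction and the other branch's target, with the target held fixed (stop-gradient). The sketch below shows only that term, under the assumption that DMP's loss takes this symmetric form; the argument names are hypothetical, and plain Python has no autograd, so the stop-gradient is indicated by a comment only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_view_consistency(pred_smiles, target_smiles, pred_graph, target_graph):
    """Symmetric negative-cosine loss between the two views of one molecule.

    pred_*   : output of a prediction head on that branch
    target_* : branch representation treated as a constant
               (in a real framework, wrap it in a stop-gradient / detach)
    """
    return -cosine(pred_smiles, target_graph) - cosine(pred_graph, target_smiles)

# Perfectly agreeing views reach the minimum of -2; orthogonal views give 0.
agree = dual_view_consistency([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
disagree = dual_view_consistency([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
```

During pre-training this term would be added to the masked language modeling losses of the two branches, so each encoder learns from its own view while being pulled toward the other.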

Bar chart comparing MG-BERT vs GNN baselines on six MoleculeNet classification tasks

MG-BERT: Graph BERT for Molecular Property Prediction

MG-BERT combines GNN-style local attention with BERT's masked pre-training on molecular graphs, learning context-sensitive atomic representations that improve ADMET property prediction across 11 benchmark datasets.
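The two ingredients named here, bond-restricted attention and masked-atom recovery, can be sketched without any deep learning library. This is a simplified illustration under stated assumptions: the helper names, the 15% default mask rate, and the always-replace-with-`[MASK]` policy are choices made for the example, not a reproduction of MG-BERT's exact recipe.

```python
import random

MASK = "[MASK]"

def local_attention_mask(num_atoms, bonds):
    """allowed[i][j] is True when atom i may attend to atom j:
    itself or a directly bonded neighbour, mimicking one round of message passing."""
    allowed = [[i == j for j in range(num_atoms)] for i in range(num_atoms)]
    for a, b in bonds:
        allowed[a][b] = True
        allowed[b][a] = True
    return allowed

def mask_atoms(atoms, mask_rate=0.15, seed=0):
    """Hide a random subset of atoms (at least one); return the corrupted
    sequence and the labels the model would be trained to recover."""
    rng = random.Random(seed)
    k = max(1, round(mask_rate * len(atoms)))
    positions = sorted(rng.sample(range(len(atoms)), k))
    corrupted = list(atoms)
    for p in positions:
        corrupted[p] = MASK
    labels = {p: atoms[p] for p in positions}
    return corrupted, labels

# Ethanol heavy atoms: C-C-O, with bonds 0-1 and 1-2.
atoms = ["C", "C", "O"]
mask = local_attention_mask(len(atoms), [(0, 1), (1, 2)])
corrupted, labels = mask_atoms(atoms)
```

Here the first carbon cannot attend directly to the oxygen, so information about the oxygen reaches it only through stacked layers, much as it would through repeated message passing in a GNN.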

Bar chart comparing nach0 vs T5-base across molecular captioning, Q/A, reaction prediction, retrosynthesis, and generation

nach0: A Multimodal Chemical and NLP Foundation Model

nach0 unifies natural language and SMILES-based chemical tasks in a single encoder-decoder model, achieving competitive results across molecular property prediction, reaction prediction, molecular generation, and biomedical NLP benchmarks.

Bar chart showing SPMM supports bidirectional tasks: molecule to property, property to molecule, molecule optimization, and property interpolation

SPMM: A Bidirectional Molecular Foundation Model

SPMM pre-trains a dual-stream transformer on SMILES strings and vectors of 53 molecular properties using contrastive learning and cross-attention, enabling bidirectional structure-property generation, property prediction, and reaction prediction through a single model.
