This group covers models that go beyond a single molecular representation by jointly learning from multiple modalities. These models enable capabilities that unimodal encoders cannot provide, including text-guided molecular retrieval, cross-modal molecule captioning, and joint property prediction across chemical and biological domains.

| Paper | Year | Modalities | Key Idea |
|---|---|---|---|
| MG-BERT | 2021 | Graph + SMILES | GNN message passing integrated with BERT pretraining |
| MoMu | 2022 | Graph + text | Contrastive pre-training bridging molecular graphs and natural language (see the sketch after the table) |
| DMP | 2023 | SMILES + graph | Dual-view consistency learning over SMILES and GNN encoders |
| BioT5 | 2023 | Molecule + protein + text | T5 model for cross-modal biology and chemistry |
| MolFM | 2023 | Graph + text + KG | Trimodal fusion of graphs, text, and knowledge graphs |
| SPMM | 2024 | Structure + properties | Bidirectional alignment of molecular structures and property vectors |
| nach0 | 2024 | SMILES + text + patents | Multi-task instruction tuning over chemistry and NLP |
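
Several of these models (most directly MoMu, and the graph-text branch of MolFM) rely on a cross-modal contrastive objective that pulls paired graph and text embeddings together and pushes mismatched pairs apart. The snippet below is a minimal sketch of that idea using a symmetric InfoNCE loss; the encoders are stand-in placeholders (random tensors), not the papers' actual GNN or SciBERT-style text encoders, and the `temperature` value is an illustrative assumption.

```python
# Minimal sketch of a symmetric InfoNCE loss for graph-text contrastive
# alignment (MoMu-style). Placeholder embeddings stand in for the outputs
# of a real GNN and text encoder.
import torch
import torch.nn.functional as F


def contrastive_loss(graph_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (graph, text) embeddings."""
    g = F.normalize(graph_emb, dim=-1)               # unit-normalize both views
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature                   # pairwise cosine similarities
    targets = torch.arange(g.size(0))                # matching pairs lie on the diagonal
    loss_g2t = F.cross_entropy(logits, targets)      # graph -> text retrieval direction
    loss_t2g = F.cross_entropy(logits.T, targets)    # text -> graph retrieval direction
    return (loss_g2t + loss_t2g) / 2


if __name__ == "__main__":
    batch, dim = 8, 256
    graph_emb = torch.randn(batch, dim)   # placeholder for GNN output
    text_emb = torch.randn(batch, dim)    # placeholder for text-encoder output
    print(contrastive_loss(graph_emb, text_emb))
```

Once trained, the same similarity matrix used in the loss supports text-guided molecular retrieval: rank all candidate graph embeddings by cosine similarity to a query caption's embedding.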
