Molecular-Representation

MolFM trimodal architecture fusing 2D graph, knowledge graph, and biomedical text via cross-modal attention

MolFM: Trimodal Molecular Foundation Pre-training

MolFM pre-trains a multimodal encoder that fuses 2D molecular graphs, biomedical text, and knowledge graph entities through fine-grained cross-modal attention, achieving strong gains on cross-modal retrieval, molecule captioning, text-based generation, and property prediction.

Molecular Representations

MoMu architecture showing contrastive alignment between molecular graph and scientific text modalities

MoMu: Bridging Molecular Graphs and Natural Language

MoMu pre-trains dual graph and text encoders on 15K molecule graph-text pairs using contrastive learning, enabling cross-modal retrieval, molecule captioning, zero-shot text-to-graph generation, and improved molecular property prediction.
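
The contrastive alignment mentioned above can be illustrated with a symmetric InfoNCE loss over a batch of paired embeddings. This is a minimal sketch, not MoMu's actual training objective: the function name `info_nce`, the temperature value, and the use of plain Python lists in place of encoder outputs are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(graph_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE: the i-th graph and i-th text form the positive pair.

    `temperature` is a hypothetical value chosen for this sketch.
    """
    g = [normalize(v) for v in graph_embs]
    t = [normalize(v) for v in text_embs]
    n = len(g)
    # logits[i][j] = scaled cosine similarity between graph i and text j.
    logits = [[dot(g[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def diagonal_cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    # Average the graph-to-text and text-to-graph directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))


graphs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(graphs, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce(graphs, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch, `aligned_loss` is close to zero while `shuffled_loss` is large, which is the behavior that pulls matching graph and text embeddings together.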

Molecular Representations

Diagram showing dual-view molecule pre-training with a SMILES Transformer branch and a GNN branch connected by a consistency loss

DMP: Dual-View Molecule Pre-training (SMILES+GNN)

DMP combines a SMILES Transformer and a GNN branch during pre-training, using masked language modeling plus a BYOL-inspired dual-view consistency loss to learn complementary molecular representations.
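
A BYOL-style consistency term between two views is commonly written as a negative cosine similarity between one branch's prediction and the other branch's target. The sketch below assumes that form; the names `p_smiles`, `z_gnn`, `p_gnn`, and `z_smiles` are hypothetical, and the stop-gradient that a real training loop would apply to the targets is only noted in a comment because plain Python has no autograd.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dual_view_consistency(p_smiles, z_gnn, p_gnn, z_smiles):
    """Symmetric negative-cosine loss between the two views of one molecule.

    p_* are predictor outputs and z_* are target projections (hypothetical
    names). In an autograd framework the z_* targets would be detached
    (stop-gradient); that step has no equivalent in this pure-Python sketch.
    """
    return -cosine(p_smiles, z_gnn) - cosine(p_gnn, z_smiles)


agree = dual_view_consistency([3.0, 4.0], [3.0, 4.0], [1.0, 0.0], [2.0, 0.0])
disagree = dual_view_consistency([1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0])
```

The loss reaches its minimum of -2 when both views agree in direction and rises to 0 when they are orthogonal, so minimizing it encourages the SMILES and GNN branches to produce consistent representations.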

Molecular Simulation

Bar chart comparing MAT average ROC-AUC against D-MPNN, GCN, and Weave baselines

MAT: Graph-Augmented Transformer for Molecules (2020)

Molecule Attention Transformer (MAT) augments Transformer self-attention with inter-atomic distances and graph adjacency. After self-supervised pretraining, it achieves strong property prediction across diverse molecular tasks with minimal hyperparameter tuning.
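
One way to picture the augmentation is as a weighted sum of three matrices: the usual softmax attention, a distance-derived term, and the adjacency matrix. The sketch below is an illustration under stated assumptions, not the reference implementation: `scores` stands for precomputed scaled query-key products, the distance term is taken to be a row-wise softmax of negative distances, and the equal default weights `lam_a`, `lam_d`, `lam_g` are hypothetical.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def molecule_attention_weights(scores, distances, adjacency,
                               lam_a=1.0 / 3, lam_d=1.0 / 3, lam_g=1.0 / 3):
    """Blend self-attention, distance, and adjacency into one weight matrix.

    scores    : n x n scaled query-key products (assumed precomputed)
    distances : n x n inter-atomic distances
    adjacency : n x n 0/1 bond adjacency matrix
    lam_*     : hypothetical mixing weights for this sketch
    """
    n = len(scores)
    attn = [softmax(row) for row in scores]
    # Assumption: closer atoms get larger weight via softmax over -distance.
    dist = [softmax([-d for d in row]) for row in distances]
    return [
        [lam_a * attn[i][j] + lam_d * dist[i][j] + lam_g * adjacency[i][j]
         for j in range(n)]
        for i in range(n)
    ]


scores = [[0.0, 0.0], [0.0, 0.0]]
distances = [[0.0, 1.0], [1.0, 0.0]]
adjacency = [[0, 1], [1, 0]]
weights = molecule_attention_weights(scores, distances, adjacency)
```

The resulting matrix would then multiply the value vectors exactly as ordinary attention weights do; setting `lam_d` and `lam_g` to zero recovers plain softmax attention.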

Predictive Chemistry

Bar chart showing RMSE improvement from SMILES augmentation across ESOL, FreeSolv, and lipophilicity datasets

Maxsmi: SMILES Augmentation for Property Prediction

A systematic study of SMILES augmentation strategies for molecular property prediction, showing that augmentation consistently improves CNN and RNN performance and that prediction variance across SMILES correlates with model uncertainty.
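
The link between prediction spread and uncertainty can be sketched as follows: predict on several equivalent SMILES strings for the same molecule, report the mean as the prediction and the standard deviation as a rough confidence signal. The function name `aggregate_over_smiles` and the `predict` callable are hypothetical, and the stand-in predictor below is a toy, not a trained model.

```python
import statistics


def aggregate_over_smiles(smiles_variants, predict):
    """Return (mean prediction, standard deviation) across SMILES variants.

    smiles_variants : equivalent SMILES strings for one molecule
    predict         : hypothetical callable mapping a SMILES string to a float
    """
    predictions = [predict(s) for s in smiles_variants]
    mean = statistics.mean(predictions)
    # Population standard deviation: spread across the variants we evaluated.
    spread = statistics.pstdev(predictions)
    return mean, spread


# Three equivalent SMILES strings for ethanol.
ethanol_variants = ["CCO", "OCC", "C(C)O"]


def toy_predict(smiles):
    # Stand-in for a trained model: depends on the string, not the chemistry.
    return float(len(smiles))


mean_pred, spread = aggregate_over_smiles(ethanol_variants, toy_predict)
```

A model that has learned the chemistry rather than the string form should give a small spread; a large spread flags molecules whose prediction depends heavily on how the SMILES happens to be written.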

Molecular Representations

Bar chart comparing Mol2vec ESOL RMSE against ECFP4, MACCS, and Neural Fingerprint baselines

Mol2vec: Unsupervised ML with Chemical Intuition

Mol2vec treats molecular substructures as words and compounds as sentences, training Word2vec on 19.9M molecules to produce dense embeddings that capture chemical intuition and enable competitive property prediction.
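
The words-and-sentences analogy can be made concrete with a small sketch: each substructure identifier looks up a learned vector, and the compound vector is the sum of those vectors. The embedding table, the identifier strings, and the function name `compound_vector` below are made-up toy values; in practice the identifiers would come from a fingerprinting step and the vectors from a trained Word2vec model.

```python
def compound_vector(sentence, embeddings, unknown_key="UNK"):
    """Sum substructure vectors to get one dense vector per compound.

    sentence    : list of substructure identifiers ("words") for one molecule
    embeddings  : dict mapping identifier -> vector (toy values here)
    unknown_key : entry used for identifiers missing from the table
    """
    dim = len(embeddings[unknown_key])
    total = [0.0] * dim
    for word in sentence:
        vector = embeddings.get(word, embeddings[unknown_key])
        total = [a + b for a, b in zip(total, vector)]
    return total


# Toy embedding table with hypothetical identifiers and 2-dimensional vectors.
toy_embeddings = {
    "sub_a": [1.0, 0.0],
    "sub_b": [0.0, 2.0],
    "UNK": [0.5, 0.5],
}

vec = compound_vector(["sub_a", "sub_b", "sub_a", "sub_unseen"], toy_embeddings)
```

The summed vector can then be fed to any standard regressor or classifier for property prediction.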

Predictive Chemistry

Bar chart showing MTL-BERT combining pretraining, multitask learning, and SMILES enumeration for best improvement

MTL-BERT: Multitask BERT for Property Prediction

MTL-BERT pretrains a BERT model on 1.7M unlabeled SMILES, then fine-tunes jointly on 60 ADMET and molecular property tasks using SMILES enumeration as data augmentation in all phases.

Molecular Representations

Bar chart comparing SMILES tokens vs Atom-in-SMILES across molecular generation, retrosynthesis, and reaction prediction

Atom-in-SMILES: Better Tokens for Chemical Models

Atom-in-SMILES (AIS) is a tokenization scheme that encodes local chemical environments into SMILES tokens, improving prediction quality across canonicalization, retrosynthesis, and property prediction tasks.

Molecular Generation

Bar chart showing BindGPT RL achieves best Vina binding scores compared to baselines

BindGPT: GPT for 3D Molecular Design and Docking

BindGPT formulates 3D molecular design as autoregressive text generation over combined SMILES and XYZ tokens, using large-scale pre-training and reinforcement learning to achieve competitive pocket-conditioned molecule generation.

Molecular Representations

Bar chart comparing CDDD virtual screening AUC against ECFP4, Mol2vec, Seq2seq FP, and VAE baselines

CDDD: Learning Descriptors by Translating SMILES

Winter et al. propose CDDD, a translation-based encoder-decoder that learns continuous molecular descriptors by translating between equivalent chemical representations like SMILES and InChI, pretrained on 72 million compounds.

Computational Chemistry

Bar chart showing GPT-4 relative performance across eight chemistry tasks grouped by understanding, reasoning, and explaining capabilities

ChemLLMBench: Benchmarking LLMs on Chemistry Tasks

A comprehensive benchmark evaluating GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica on eight practical chemistry tasks, revealing that LLMs are competitive on classification and text tasks but struggle with SMILES-dependent generation.

Molecular Representations

Bar chart comparing SMILES and DeepSMILES error types, showing DeepSMILES eliminates parenthesis errors

DeepSMILES: Adapting SMILES Syntax for Machine Learning

DeepSMILES replaces paired parentheses and ring closure symbols in SMILES with a postfix notation and single ring-size digits, making it easier for generative models to produce syntactically valid molecular strings.