Molecular Representations
Caffeine molecular structure with its InChIKey identifier

InChI: The International Chemical Identifier

InChI is an open IUPAC standard that represents molecular structures as hierarchical, layered strings designed for unique identification and database interoperability; its hashed companion, the InChIKey, enables web search and fixed-length indexing.
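The InChIKey's fixed three-block layout can be illustrated with a small stdlib-only sketch (the caffeine key below matches the structure shown above; the block names are descriptive labels, not official API names):

```python
import re

# Standard InChIKey layout: a 14-character hash of the connectivity layers,
# an 8-character hash of the remaining layers (stereo, isotopes, ...),
# a standard-flag character ('S') and version character ('A'),
# and a final protonation-state character.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])$")

def parse_inchikey(key):
    """Split an InChIKey into its named blocks (None if malformed)."""
    m = INCHIKEY_RE.match(key)
    if not m:
        return None
    skeleton, rest_hash, flag, version, protonation = m.groups()
    return {
        "skeleton_hash": skeleton,    # formula + connectivity
        "remainder_hash": rest_hash,  # stereo, isotopes, etc.
        "standard_flag": flag,        # 'S' for standard InChI
        "version": version,           # 'A' for InChI version 1
        "protonation": protonation,   # 'N' means neutral
    }

# Caffeine's InChIKey
print(parse_inchikey("RYYVLZVUVIJVGH-UHFFFAOYSA-N"))
```

Because the key is a hash, it supports exact lookup but cannot be decoded back to a structure; the full InChI string is needed for that.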

Overview of six categories of materials representations for machine learning

Materials Representations for ML Review

A comprehensive review of how solid-state materials can be numerically represented for machine learning, spanning structural features, graph neural networks, compositional descriptors, transfer learning, and generative models for inverse design.

BioT5 architecture showing SELFIES molecules, amino acid proteins, and scientific text feeding into a T5 encoder-decoder

BioT5: Cross-Modal Integration of Biology and Chemistry

BioT5 uses SELFIES representations and separate tokenization to pre-train a unified T5 model across molecules, proteins, and text, achieving state-of-the-art results on 10 of 15 downstream tasks.
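One reason SELFIES suits language-model vocabularies is that every symbol is bracketed, so tokenization is a lossless regex scan; a minimal sketch (independent of BioT5's actual tokenizer):

```python
import re

def split_selfies(selfies_string):
    # Every SELFIES symbol is a bracketed token like [C] or [=N], so a
    # model can tokenize losslessly with one regex -- no parenthesis
    # balancing or multi-character atom ambiguity as in raw SMILES.
    return re.findall(r"\[[^\]]*\]", selfies_string)

tokens = split_selfies("[C][N][=C][Branch1]")
```

Joining the tokens back together reproduces the original string exactly, which is what makes the vocabulary well defined.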

MolFM trimodal architecture fusing 2D graph, knowledge graph, and biomedical text via cross-modal attention

MolFM: Trimodal Molecular Foundation Pre-training

MolFM pre-trains a multimodal encoder that fuses 2D molecular graphs, biomedical text, and knowledge graph entities through fine-grained cross-modal attention, achieving strong gains on cross-modal retrieval, molecule captioning, text-based generation, and property prediction.
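The core operation behind such fusion is attention where queries come from one modality and keys/values from another; a minimal single-head, pure-Python sketch (illustrative only, not MolFM's implementation):

```python
import math

def cross_modal_attention(queries, keys, values):
    # Scaled dot-product attention with queries from one modality
    # (e.g. graph node states) and keys/values from another
    # (e.g. text token states). Lists of equal-length float vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)                              # stabilised softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out
```

With a single key, the output is exactly that key's value; with several, each graph-side query returns a softmax-weighted mixture of text-side values.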

MoMu architecture showing contrastive alignment between molecular graph and scientific text modalities

MoMu: Bridging Molecular Graphs and Natural Language

MoMu pre-trains dual graph and text encoders on 15K molecule graph-text pairs using contrastive learning, enabling cross-modal retrieval, molecule captioning, zero-shot text-to-graph generation, and improved molecular property prediction.
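The contrastive objective over paired graph/text embeddings is typically a symmetric InfoNCE loss: each graph should score highest against its own caption and vice versa. A pure-Python sketch under that assumption (MoMu's actual encoders and hyperparameters are not reproduced here):

```python
import math

def info_nce(graph_embs, text_embs, temperature=0.1):
    # Symmetric InfoNCE over a batch of paired (graph, text) embeddings.
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    g = [normalize(v) for v in graph_embs]
    t = [normalize(v) for v in text_embs]
    # Cosine-similarity logits, scaled by temperature.
    sims = [[sum(a * b for a, b in zip(gi, tj)) / temperature for tj in t]
            for gi in g]

    def ce_row(logits, target):
        # Cross-entropy of softmax(logits) against the matching index.
        m = max(logits)
        logsum = m + math.log(sum(math.exp(x - m) for x in logits))
        return logsum - logits[target]

    n = len(g)
    g2t = sum(ce_row(sims[i], i) for i in range(n)) / n
    t2g = sum(ce_row([sims[j][i] for j in range(n)], i) for i in range(n)) / n
    return (g2t + t2g) / 2
```

Perfectly aligned pairs drive the loss toward zero, while swapped pairings make it large, which is what pushes the two encoders into a shared embedding space.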

Diagram showing dual-view molecule pre-training with a SMILES Transformer branch and a GNN branch connected by a consistency loss

DMP: Dual-View Molecule Pre-training (SMILES+GNN)

DMP combines a SMILES Transformer and a GNN branch during pre-training, using masked language modeling plus a BYOL-inspired dual-view consistency loss to learn complementary molecular representations.
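At its core, a BYOL-style consistency term is just negative cosine similarity between the two views' embeddings (the real objective also involves a predictor head and stop-gradient, omitted in this sketch):

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dual_view_consistency_loss(smiles_emb, graph_emb):
    # Negative cosine similarity between the SMILES-branch and GNN-branch
    # embeddings of the same molecule: -1 when the views agree perfectly,
    # 0 when they are orthogonal.
    a, b = l2_normalize(smiles_emb), l2_normalize(graph_emb)
    return -sum(x * y for x, y in zip(a, b))
```

Minimizing this loss pulls the Transformer and GNN views of the same molecule together, which is how the two branches end up learning complementary but consistent representations.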

Bar chart comparing MG-BERT vs GNN baselines on six MoleculeNet classification tasks

MG-BERT: Graph BERT for Molecular Property Prediction

MG-BERT combines GNN-style local attention with BERT’s masked pretraining on molecular graphs, learning context-sensitive atomic representations that improve ADMET property prediction across 11 benchmark datasets.
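The masked-pretraining step amounts to hiding a fraction of atom tokens and training the model to recover them from graph context; a minimal, hypothetical sketch of the masking side (not MG-BERT's actual code):

```python
import random

def mask_atoms(atom_tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    # BERT-style masking over a molecule's atom list: hide a fraction of
    # atoms and return (masked sequence, {position: original atom}) so a
    # model can be trained to predict the hidden atoms from their neighbors.
    rng = random.Random(seed)
    masked, targets = list(atom_tokens), {}
    for i in range(len(atom_tokens)):
        if rng.random() < mask_rate:
            targets[i] = masked[i]
            masked[i] = mask_token
    return masked, targets
```

The returned target dictionary supplies the labels for the reconstruction loss; everything else in the sequence is left untouched.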

Bar chart comparing Mol2vec ESOL RMSE against ECFP4, MACCS, and Neural Fingerprint baselines

Mol2vec: Unsupervised ML with Chemical Intuition

Mol2vec treats molecular substructures as words and compounds as sentences, training Word2vec on 19.9M molecules to produce dense embeddings that capture chemical intuition and enable competitive property prediction.

Bar chart comparing SMILES tokens vs Atom-in-SMILES across molecular generation, retrosynthesis, and reaction prediction

Atom-in-SMILES: Better Tokens for Chemical Models

Atom-in-SMILES (AIS) is a tokenization scheme that encodes each atom's local chemical environment into its SMILES token, improving prediction quality across canonicalization, retrosynthesis, and property prediction tasks.
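For context, the baseline that AIS enriches is plain atom-level SMILES tokenization, commonly done with a single regex (the pattern below is a sketch in the style of the widely used molecular-transformer regex, slightly simplified):

```python
import re

# Atom-level SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, aromatic/organic-subset atoms, bonds and
# branch/ring symbols. AIS replaces the bare atom tokens with
# environment-annotated ones; the scan itself stays the same.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|@@|[=#/\\+\-\(\)\.@]|\d"
)

def tokenize_smiles(smiles):
    tokens = SMILES_TOKEN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenization must be lossless"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Note how `Cl` and `Br` must be matched before single-letter atoms so that chlorine is not split into carbon plus a stray character; AIS goes further and distinguishes, say, an aromatic carbon in a ring from an aliphatic one at the token level.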

Bar chart comparing CDDD virtual screening AUC against ECFP4, Mol2vec, Seq2seq FP, and VAE baselines

CDDD: Learning Descriptors by Translating SMILES

Winter et al. propose CDDD, a translation-based encoder-decoder that learns continuous molecular descriptors by translating between equivalent chemical representations like SMILES and InChI, pretrained on 72 million compounds.

Bar chart comparing SMILES and DeepSMILES error types, showing DeepSMILES eliminates parenthesis errors

DeepSMILES: Adapting SMILES Syntax for Machine Learning

DeepSMILES replaces paired parentheses and ring closure symbols in SMILES with a postfix notation and single ring-size digits, making it easier for generative models to produce syntactically valid molecular strings.
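The ring-closure half of the idea can be shown in a toy stdlib converter: the opening digit is dropped and the closing digit is replaced by the ring size. This sketch handles only single-digit closures on simple rings and ignores the branch (parenthesis) rewriting that the real `deepsmiles` library also performs:

```python
import re

# Atom tokens: bracket atoms, two-letter halogens, organic-subset atoms.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def ring_closures_to_deepsmiles(smiles):
    # Rewrite numeric ring-closure pairs DeepSMILES-style: drop the
    # opening digit, emit the ring size at the closing position.
    out = []
    open_bonds = {}   # closure digit -> atom count when it was opened
    atoms_seen = 0
    i = 0
    while i < len(smiles):
        m = ATOM.match(smiles, i)
        if m:
            atoms_seen += 1
            out.append(m.group())
            i = m.end()
            continue
        ch = smiles[i]
        if ch.isdigit():
            if ch in open_bonds:
                size = atoms_seen - open_bonds.pop(ch) + 1
                out.append(str(size))
            else:
                open_bonds[ch] = atoms_seen   # opening digit is dropped
        else:
            out.append(ch)                    # bonds, parens pass through
        i += 1
    return "".join(out)
```

Benzene `c1ccccc1` becomes `cccccc6`: a generative model only has to emit a valid ring size rather than remember to close a matching digit pair.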

Bar chart comparing Group SELFIES vs SELFIES on MOSES benchmark metrics

Group SELFIES: Fragment-Based Molecular Strings

Group SELFIES extends SELFIES with group tokens representing functional groups and substructures, maintaining chemical robustness while improving distribution learning and molecular generation quality.