Molecular Representations
Bar chart comparing binding affinity scores across SMILES, AIS, and SMI+AIS hybrid tokenization strategies

SMI+AIS: Hybridizing SMILES with Environment Tokens

Proposes SMI+AIS, a hybrid molecular representation combining standard SMILES tokens with chemical-environment-aware Atom-In-SMILES tokens, demonstrating improved molecular generation for drug design targets.

Molecular Representations
Diagram showing Transformer encoder-decoder architecture converting SMILES strings into molecular fingerprints

SMILES Transformer: Low-Data Molecular Fingerprints

A Transformer-based encoder-decoder pre-trained on 861K SMILES from ChEMBL24 produces 1024-dimensional molecular fingerprints that outperform ECFP and graph convolutions on 5 of 10 MoleculeNet tasks in low-data settings.

Molecular Representations
Bar chart comparing Atom Pair Encoding vs BPE tokenization on MoleculeNet classification tasks

SMILES vs SELFIES Tokenization for Chemical LMs

Introduces Atom Pair Encoding (APE), a chemistry-aware tokenizer for SMILES and SELFIES, and shows it consistently outperforms Byte Pair Encoding in RoBERTa-based molecular property classification on BBBP, HIV, and Tox21 benchmarks.

Molecular Representations
Bar chart comparing SMILES-BERT accuracy against baselines on HIV, LogP, and PCBA tasks

SMILES-BERT: BERT-Style Pre-Training for Molecules

SMILES-BERT pre-trains a Transformer encoder on 18M+ SMILES from ZINC using a masked recovery task, then fine-tunes for molecular property prediction, outperforming prior methods on three datasets.
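
The masked recovery task can be illustrated with a minimal sketch that hides a fraction of SMILES tokens and keeps the originals as prediction targets. The tokenizer regex, the `[MASK]` symbol, the 15% mask rate, and the function name `mask_tokens` are assumptions made for illustration; they are not taken from the SMILES-BERT paper.

```python
import random
import re

# Simple SMILES tokenizer (assumed for illustration): bracket atoms,
# two-letter halogens, two-digit ring closures, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
MASK = "[MASK]"  # hypothetical mask symbol


def mask_tokens(smiles, mask_rate=0.15, seed=0):
    """Return (inputs, labels): inputs with some tokens replaced by MASK,
    labels holding the original token at masked positions and None elsewhere."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask at least one token so every example carries a training signal.
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for p in positions:
        labels[p] = tokens[p]
        inputs[p] = MASK
    return inputs, labels


inputs, labels = mask_tokens("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```

A model trained on this task receives `inputs` and is scored only at the positions where `labels` is not `None`.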

Predictive Chemistry
Bar chart comparing SMILES2Vec and Graph Conv scores across five MoleculeNet tasks

SMILES2Vec: Interpretable Chemical Property Prediction

SMILES2Vec is a deep RNN that learns chemical features directly from SMILES strings using a Bayesian-optimized CNN-GRU architecture. It matches graph convolution baselines on toxicity and activity prediction, and its explanation mask identifies chemically meaningful functional groups with 88% accuracy.

Molecular Representations
Visualization of tokenizer vocabulary coverage across chemical space

Smirk: Complete Tokenization for Molecular Models

Introduces Smirk and Smirk-GPE tokenizers that fully cover the OpenSMILES specification, proposes n-gram language models as low-cost proxies for evaluating tokenizer quality, and benchmarks 34 tokenizers across intrinsic and extrinsic metrics.
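
To make the idea of an n-gram language model as a cheap proxy concrete, here is a minimal sketch of a bigram model with add-one smoothing that scores token sequences by perplexity. The smoothing scheme, the function names, and the toy corpus are assumptions for illustration; the summary above does not specify how the paper configures its n-gram models.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count context unigrams and bigrams over tokenized sequences."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        unigrams.update(padded[:-1])  # tokens that act as a context
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams, len(vocab)


def perplexity(seq, model):
    """Perplexity of one token sequence under an add-one-smoothed bigram model."""
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))


# Toy corpus, tokenized character by character (an assumption).
model = train_bigram([list("CCO"), list("CCN"), list("CCC")])
seen_like = perplexity(list("CCO"), model)
unusual = perplexity(list("OOO"), model)
```

Sequences that resemble the training data receive lower perplexity, so swapping the tokenizer that produces the sequences changes the score without training a neural model.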

Molecular Representations
Bar chart showing SMILES Pair Encoding reduces mean sequence length from 40 to 6 tokens

SPE: Data-Driven SMILES Substructure Tokenization

Introduces SMILES Pair Encoding (SPE), a data-driven tokenization algorithm that learns high-frequency SMILES substrings from ChEMBL to produce shorter, chemically interpretable token sequences for deep learning.
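
The idea of learning high-frequency substrings can be sketched as a byte-pair-style merge loop over atom-level SMILES tokens. This is a simplified illustration, not the SPE reference implementation: the tokenizer regex, the stopping rule (stop when no pair occurs at least twice), the function names, and the toy corpus are all assumptions.

```python
import re
from collections import Counter

# Atom-level SMILES tokenizer (regex assumed for illustration).
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/.:~@?>*$])"
)


def atom_tokenize(smiles):
    return ATOM_PATTERN.findall(smiles)


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one joined token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, num_merges):
    """Greedily learn up to `num_merges` merges, most frequent adjacent pair first."""
    sequences = [atom_tokenize(s) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # assumed frequency threshold
            break
        merges.append(best_pair)
        sequences = [apply_merge(seq, best_pair) for seq in sequences]
    return merges


def spe_tokenize(smiles, merges):
    """Tokenize at atom level, then apply the learned merges in order."""
    seq = atom_tokenize(smiles)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


merges = learn_merges(["CCO", "CCN", "CCCl", "c1ccccc1O"], num_merges=5)
tokens = spe_tokenize("CCOc1ccccc1", merges)
```

Because merges only concatenate neighbouring tokens, joining the output tokens always reproduces the input string, while the sequence is never longer than its atom-level tokenization.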

Molecular Representations
Bar chart showing SPMM supports bidirectional tasks: molecule to property, property to molecule, molecule optimization, and property interpolation

SPMM: A Bidirectional Molecular Foundation Model

SPMM pre-trains a dual-stream transformer on SMILES strings paired with vectors of 53 molecular properties, using contrastive learning and cross-attention. A single model then supports bidirectional structure-property generation, property prediction, and reaction prediction.

Molecular Representations
Diagram showing the t-SMILES pipeline from molecular graph fragmentation to binary tree traversal producing a string representation

t-SMILES: Tree-Based Fragment Molecular Encoding

t-SMILES represents molecules by fragmenting them into substructures, building full binary trees, and traversing them breadth-first to produce SMILES-type strings that reduce nesting depth and outperform SMILES, DeepSMILES, and SELFIES on generation benchmarks.

Molecular Representations
Diagram showing sequence-to-sequence translation from chemical names to SMILES with atom count constraints

Transformer Name-to-SMILES with Atom Count Losses

This paper applies a Transformer sequence-to-sequence model to predict SMILES strings from chemical compound names (Synonyms). Two enhancements, an atom-count constraint loss and SMILES/InChI multi-task learning, improve F-measure over rule-based and vanilla Transformer baselines.
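
As a rough illustration of what an atom-count constraint measures, the sketch below counts heavy atoms per element in a predicted and a reference SMILES string and sums the squared differences. This is an assumption-laden, non-differentiable stand-in: the summary above does not give the paper's loss formulation, and the regex, the function names, and the squared-error form are chosen only for illustration. Implicit hydrogens are not counted.

```python
import re
from collections import Counter

# Matches bracket atoms (optional isotope prefix), organic-subset atoms,
# and aromatic atoms. Assumed pattern, not a full SMILES parser.
ATOM_PATTERN = re.compile(
    r"\[\d*(?P<bracket>[A-Z][a-z]?|[a-z]{1,2})[^\]]*\]"
    r"|(?P<organic>Br|Cl|[BCNOSPFI])"
    r"|(?P<aromatic>[bcnosp])"
)


def count_atoms(smiles):
    """Count explicitly written atoms per element symbol."""
    counts = Counter()
    for m in ATOM_PATTERN.finditer(smiles):
        symbol = m.group("bracket") or m.group("organic") or m.group("aromatic")
        counts[symbol.capitalize()] += 1  # aromatic 'c' is counted as 'C'
    return counts


def atom_count_penalty(predicted, reference):
    """Sum of squared per-element count differences between two SMILES strings."""
    pred, ref = count_atoms(predicted), count_atoms(reference)
    return sum((pred[e] - ref[e]) ** 2 for e in set(pred) | set(ref))


penalty = atom_count_penalty("CCO", "CCCO")  # prediction is one carbon short
```

A penalty of zero means both strings contain the same number of each element, which is a necessary but not sufficient condition for describing the same molecule.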

Predictive Chemistry
Bar chart comparing Transformer-CNN RMSE against RF, SVM, CNN, and CDDD baselines

Transformer-CNN: SMILES Embeddings for QSAR Modeling

Transformer-CNN extracts dynamic SMILES embeddings from a Transformer trained on SMILES canonicalization and feeds them to a TextCNN for QSAR modeling, achieving strong results across 18 benchmarks with built-in LRP interpretability.

Molecular Generation
Bar chart comparing Char-RNN and Molecular VAE on validity and novelty metrics

VAE for Automatic Chemical Design (2018 Seminal)

This foundational paper introduces a variational autoencoder (VAE) that encodes SMILES strings into a continuous latent space, allowing gradient-based optimization of molecular properties. Joint training with a property predictor organizes the latent space by chemical properties, and Bayesian optimization over the latent space discovers drug-like molecules with improved QED and synthetic accessibility.