Molecular Generation
Diagram showing the dual formulation of S4 models with convolution during training and recurrence during generation for SMILES-based molecular design

S4 Structured State Space Models for De Novo Drug Design

This paper introduces structured state space sequence (S4) models to chemical language modeling, showing they combine the strengths of LSTMs (efficient recurrent generation) and GPTs (holistic sequence learning) for de novo molecular design.
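
The convolution/recurrence duality the summary refers to can be sketched for a plain linear state space model. This is a toy illustration only (made-up matrices, scalar inputs, none of the S4 parameterization or discretization): it shows that unrolling the recurrence and applying one long convolution compute identical outputs, which is why training can be parallel and generation can be step-by-step.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 8                          # state size, sequence length
A = 0.3 * rng.normal(size=(d, d))    # made-up state matrix (scaled for stability)
B = rng.normal(size=(d, 1))
C = rng.normal(size=(1, d))
u = rng.normal(size=L)               # scalar input sequence

# Recurrent view (generation): x_k = A x_{k-1} + B u_k,  y_k = C x_k
x = np.zeros((d, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())

# Convolutional view (training): y = K * u with kernel K_j = C A^j B
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(L)])
y_conv = [sum(K[j] * u[k - j] for j in range(k + 1)) for k in range(L)]

assert np.allclose(y_rec, y_conv)    # both views give identical outputs
```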

Molecular Representations
Diagram showing SMILES string flowing through encoder to fixed-length fingerprint vector and back through decoder

Seq2seq Fingerprint: Unsupervised Molecular Embedding

A GRU-based sequence-to-sequence model that learns fixed-length molecular fingerprints by translating SMILES strings to themselves, enabling unsupervised representation learning for drug discovery tasks.

Molecular Representations
Bar chart comparing SMI-TED ROC-AUC scores against ChemBERTa, ChemBERTa-2, MoLFormer, and GROVER on BBBP and HIV

SMI-TED: Encoder-Decoder Foundation Models for Chemistry

SMI-TED introduces encoder-decoder chemical foundation models (289M parameters) pre-trained on 91 million PubChem molecules, achieving strong results across property prediction, reaction yield, and molecule generation benchmarks.

Molecular Representations
Bar chart comparing binding affinity scores across SMILES, AIS, and SMI+AIS hybrid tokenization strategies

SMI+AIS: Hybridizing SMILES with Environment Tokens

Proposes SMI+AIS, a hybrid molecular representation combining standard SMILES tokens with chemical-environment-aware Atom-In-SMILES tokens, demonstrating improved molecular generation for drug design targets.

Molecular Representations
Diagram showing Transformer encoder-decoder architecture converting SMILES strings into molecular fingerprints

SMILES Transformer: Low-Data Molecular Fingerprints

A Transformer-based encoder-decoder pre-trained on 861K SMILES from ChEMBL24 produces 1024-dimensional molecular fingerprints that outperform ECFP and graph convolutions on 5 of 10 MoleculeNet tasks in low-data settings.

Molecular Representations
Bar chart comparing Atom Pair Encoding vs BPE tokenization on MoleculeNet classification tasks

SMILES vs SELFIES Tokenization for Chemical LMs

Introduces Atom Pair Encoding (APE), a chemistry-aware tokenizer for SMILES and SELFIES, and shows it consistently outperforms Byte Pair Encoding in RoBERTa-based molecular property classification on BBBP, HIV, and Tox21 benchmarks.

Molecular Representations
Bar chart comparing SMILES-BERT accuracy against baselines on HIV, LogP, and PCBA tasks

SMILES-BERT: BERT-Style Pre-Training for Molecules

SMILES-BERT pre-trains a Transformer encoder on 18M+ SMILES from ZINC using a masked recovery task, then fine-tunes for molecular property prediction, outperforming prior methods on three datasets.
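
A masked-recovery objective on SMILES can be sketched as below; `mask_smiles` is a hypothetical helper, tokenization is naive character-level, and the 15% rate is a common BERT-style default rather than a confirmed SMILES-BERT detail. The model would be trained to predict the original tokens at the masked positions.

```python
import random

def mask_smiles(tokens, mask_rate=0.15, mask_token="[MASK]", seed=1):
    """Randomly mask tokens; return masked sequence and recovery targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok          # the model must recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = list("CC(=O)Oc1ccccc1C(=O)O")   # aspirin, character-level tokens
masked, targets = mask_smiles(tokens)
```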

Predictive Chemistry
Bar chart comparing SMILES2Vec and Graph Conv scores across five MoleculeNet tasks

SMILES2Vec: Interpretable Chemical Property Prediction

SMILES2Vec is a deep RNN that learns chemical features directly from SMILES strings using a Bayesian-optimized CNN-GRU architecture. It matches graph convolution baselines on toxicity and activity prediction, and its explanation mask identifies chemically meaningful functional groups with 88% accuracy.

Molecular Representations
Visualization of tokenizer vocabulary coverage across chemical space

Smirk: Complete Tokenization for Molecular Models

Introduces Smirk and Smirk-GPE tokenizers that fully cover the OpenSMILES specification, proposes n-gram language models as low-cost proxies for evaluating tokenizer quality, and benchmarks 34 tokenizers across intrinsic and extrinsic metrics.
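
The n-gram-proxy idea can be sketched as follows; `bigram_nll` is a hypothetical helper and the toy corpus is illustrative only. The intuition: train a cheap smoothed bigram model over each tokenizer's output and compare held-out negative log-likelihood per token, avoiding full model training when screening tokenizers.

```python
import math
from collections import Counter, defaultdict

def bigram_nll(train, held_out, alpha=1.0):
    """Add-alpha smoothed bigram NLL per token over held-out sequences."""
    vocab = {t for seq in train for t in seq} | {"<s>", "</s>"}
    counts = defaultdict(Counter)
    for seq in train:
        padded = ["<s>"] + seq + ["</s>"]
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    V = len(vocab)
    nll, n = 0.0, 0
    for seq in held_out:
        padded = ["<s>"] + seq + ["</s>"]
        for a, b in zip(padded, padded[1:]):
            p = (counts[a][b] + alpha) / (sum(counts[a].values()) + alpha * V)
            nll += -math.log(p)
            n += 1
    return nll / n

# Toy corpora, character-level tokenization; lower NLL = more predictable tokens.
train = [list(s) for s in ["CCO", "CCN", "CCC", "c1ccccc1"]]
held = [list(s) for s in ["CCO", "CCF"]]
score = bigram_nll(train, held)
```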

Molecular Representations
Bar chart showing SMILES Pair Encoding reduces mean sequence length from 40 to 6 tokens

SPE: Data-Driven SMILES Substructure Tokenization

Introduces SMILES Pair Encoding (SPE), a data-driven tokenization algorithm that learns high-frequency SMILES substrings from ChEMBL to produce shorter, chemically interpretable token sequences for deep learning.
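
The pair-merging idea behind SPE resembles byte pair encoding applied to SMILES. A minimal sketch under simplifying assumptions (a four-molecule toy corpus, naive character tokens, four merge rules; the real algorithm trains on ChEMBL with atom-level tokens):

```python
from collections import Counter

corpus = [list(s) for s in ["CC(=O)O", "CC(=O)N", "CCO", "c1ccccc1"]]

def most_frequent_pair(seqs):
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge(seqs, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(seq[i] + seq[i + 1]); i += 2
            else:
                out.append(seq[i]); i += 1
        merged.append(out)
    return merged

for _ in range(4):                    # learn four merge rules
    corpus = merge(corpus, most_frequent_pair(corpus))

print([len(s) for s in corpus])       # sequences get shorter after merges
```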

Molecular Representations
Bar chart showing SPMM supports bidirectional tasks: molecule to property, property to molecule, molecule optimization, and property interpolation

SPMM: A Bidirectional Molecular Foundation Model

SPMM pre-trains a dual-stream transformer on SMILES and 53 molecular property vectors using contrastive learning and cross-attention, enabling bidirectional structure-property generation, property prediction, and reaction prediction through a single model.
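
The contrastive alignment between the SMILES stream and the property stream can be sketched with an InfoNCE-style loss; `info_nce` is a hypothetical helper and the random arrays stand in for encoder outputs. Matching (SMILES, property) pairs sit on the diagonal and are pushed together; mismatched pairs are pushed apart.

```python
import numpy as np

def info_nce(smiles_emb, prop_emb, temperature=0.1):
    """InfoNCE-style loss over cosine similarities; row i matches column i."""
    s = smiles_emb / np.linalg.norm(smiles_emb, axis=1, keepdims=True)
    p = prop_emb / np.linalg.norm(prop_emb, axis=1, keepdims=True)
    logits = s @ p.T / temperature                 # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # matching pairs on diagonal

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))          # stand-in embeddings for 8 molecules
loss_aligned = info_nce(x, x)         # perfectly aligned pairs -> low loss
```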

Computational Chemistry
Bar chart showing scientific LLM taxonomy across five modalities: textual, molecular, protein, genomic, and multimodal

Survey of Scientific LLMs in Bio and Chem Domains

This survey systematically reviews scientific LLMs (Sci-LLMs) across five modalities (textual, molecular, protein, genomic, and multimodal), analyzing architectures, datasets, evaluation methods, and open challenges for AI-driven scientific discovery.