Computational Chemistry
Diagram showing the CaR pipeline from SMILES to ChatGPT-generated captions to fine-tuned RoBERTa predictions

LLM4Mol: ChatGPT Captions as Molecular Representations

Proposes Captions as Representations (CaR), where ChatGPT generates textual explanations for SMILES strings that are then used to fine-tune small language models for molecular property prediction.

Molecular Generation
Bar chart showing language model validity rates across XYZ, CIF, and PDB 3D chemical file formats

LMs Generate 3D Molecules from XYZ, CIF, PDB Files

Demonstrates that standard transformer language models, trained with next-token prediction on sequences from XYZ, CIF, and PDB files, can generate valid 3D molecules, crystals, and protein binding sites competitive with domain-specific 3D generative models.

Molecular Representations
Bar chart showing MolBERT ablation: combining MLM, PhysChem, and SMILES equivalence tasks gives best improvement

MolBERT: Auxiliary Tasks for Molecular BERT Models

MolBERT pre-trains a BERT model on SMILES strings using masked language modeling, SMILES equivalence, and physicochemical property prediction as auxiliary tasks, achieving state-of-the-art results on virtual screening and QSAR benchmarks.

Predictive Chemistry
Diagram showing ULMFiT-style three-stage pipeline adapted for molecular property prediction

MolPMoFiT: Inductive Transfer Learning for QSAR

MolPMoFiT applies ULMFiT-style transfer learning to QSAR modeling, pre-training an AWD-LSTM on one million ChEMBL molecules and fine-tuning for property prediction on small datasets.

Molecular Representations
Bar chart comparing nach0 vs T5-base across molecular captioning, Q/A, reaction prediction, retrosynthesis, and generation

nach0: A Multimodal Chemical and NLP Foundation Model

nach0 unifies natural language and SMILES-based chemical tasks in a single encoder-decoder model, achieving competitive results across molecular property prediction, reaction prediction, molecular generation, and biomedical NLP benchmarks.

Molecular Representations
Encoder-decoder architecture diagram for translating chemical names between English and Chinese with performance comparison bar chart

Neural Machine Translation of Chemical Nomenclature

This paper applies character-level CNN and LSTM encoder-decoder networks to translate chemical names between English and Chinese, comparing them against an existing rule-based tool.

Molecular Generation
Distribution plot showing original QM9 logP shifted toward +6 and -6 targets via gradient-based dreaming

PASITHEA: Gradient-Based Molecular Design via Dreaming

PASITHEA adapts deep dreaming from computer vision to molecular design, directly optimizing SELFIES-encoded molecules for target chemical properties via gradient-based inversion of a trained regression network.

Molecular Generation
Bar chart showing PrefixMol Vina scores across different conditioning modes: target, property, combined, and scaffold

PrefixMol: Prefix Embeddings for Drug Molecule Design

PrefixMol prepends learnable condition vectors to a GPT transformer for SMILES generation, enabling joint control over binding pocket targeting and chemical properties like QED, SA, and LogP.

Molecular Generation
Bar chart comparing docking scores of generated vs known ligands for CDK2 and EGFR targets

Protein-to-Drug Molecule Translation via Transformer

Applies the Transformer architecture to generate drug-like molecules conditioned on protein amino acid sequences, treating target-specific de novo drug design as a sequence-to-sequence translation problem.

Molecular Generation
Diagram showing the dual formulation of S4 models with convolution during training and recurrence during generation for SMILES-based molecular design

S4 Structured State Space Models for De Novo Drug Design

This paper introduces structured state space sequence (S4) models to chemical language modeling, showing they combine the strengths of LSTMs (efficient recurrent generation) and GPTs (holistic sequence learning) for de novo molecular design.

Molecular Representations
Diagram showing SMILES string flowing through encoder to fixed-length fingerprint vector and back through decoder

Seq2seq Fingerprint: Unsupervised Molecular Embedding

A GRU-based sequence-to-sequence model that learns fixed-length molecular fingerprints by translating SMILES strings to themselves, enabling unsupervised representation learning for drug discovery tasks.

Molecular Representations
Bar chart comparing SMI-TED ROC-AUC scores against ChemBERTa, ChemBERTa-2, MoLFormer, and GROVER on BBBP and HIV

SMI-TED: Encoder-Decoder Foundation Models for Chemistry

SMI-TED introduces encoder-decoder chemical foundation models (289M parameters) pre-trained on 91 million PubChem molecules, achieving strong results across property prediction, reaction yield, and molecule generation benchmarks.