Molecular Representations
Bar chart showing MolBERT ablation: combining MLM, PhysChem, and SMILES equivalence tasks gives best improvement

MolBERT: Auxiliary Tasks for Molecular BERT Models

MolBERT pre-trains a BERT model on SMILES strings using masked language modeling, SMILES equivalence, and physicochemical property prediction as auxiliary tasks, achieving state-of-the-art results on virtual screening and QSAR benchmarks.

Molecular Representations
Bar chart comparing nach0 vs T5-base across molecular captioning, Q/A, reaction prediction, retrosynthesis, and generation

nach0: A Multimodal Chemical and NLP Foundation Model

nach0 unifies natural language and SMILES-based chemical tasks in a single encoder-decoder model, achieving competitive results across molecular property prediction, reaction prediction, molecular generation, and biomedical NLP benchmarks.

Molecular Representations
Encoder-decoder architecture diagram for translating chemical names between English and Chinese with performance comparison bar chart

Neural Machine Translation of Chemical Nomenclature

This paper applies character-level CNN and LSTM encoder-decoder networks to translate chemical names between English and Chinese, comparing them against an existing rule-based tool.

Computational Chemistry
Conceptual diagram showing natural language prompts flowing into code generation for chemistry tasks

NLP Models That Automate Programming for Chemistry

Hocky and White argue that NLP models capable of generating code from natural language prompts will fundamentally alter how chemists interact with scientific software, reducing barriers to computational research and reshaping programming pedagogy.

Molecular Generation
Bar chart showing PrefixMol Vina scores across different conditioning modes: target, property, combined, and scaffold

PrefixMol: Prefix Embeddings for Drug Molecule Design

PrefixMol prepends learnable condition vectors to a GPT transformer for SMILES generation, enabling joint control over binding pocket targeting and chemical properties like QED, SA, and LogP.

Molecular Generation
Bar chart comparing docking scores of generated vs known ligands for CDK2 and EGFR targets

Protein-to-Drug Molecule Translation via Transformer

Applies the Transformer architecture to generate drug-like molecules conditioned on protein amino acid sequences, treating target-specific de novo drug design as a sequence-to-sequence translation problem.

Molecular Generation
Horizontal bar chart showing REINVENT 4 unified framework supporting seven generative model types

REINVENT 4: Open-Source Generative Molecule Design

Overview of REINVENT 4, an open-source generative molecular design framework from AstraZeneca that unifies RNN and transformer generators within reinforcement learning, transfer learning, and curriculum learning optimization algorithms.

Molecular Generation
Bar chart comparing RNN and Transformer Wasserstein distances across drug-like, peptide-like, and polymer-like generation tasks

RNNs vs Transformers for Molecular Generation Tasks

Compares RNN-based and Transformer-based chemical language models across three molecular generation tasks of increasing complexity, finding that RNNs excel at local features while Transformers handle large molecules better.

Molecular Representations
Bar chart comparing SMI-TED ROC-AUC scores against ChemBERTa, ChemBERTa-2, MoLFormer, and GROVER on BBBP and HIV

SMI-TED: Encoder-Decoder Foundation Models for Chemistry

SMI-TED introduces encoder-decoder chemical foundation models (289M parameters) pre-trained on 91 million PubChem molecules, achieving strong results across property prediction, reaction yield, and molecule generation benchmarks.

Molecular Representations
Diagram showing Transformer encoder-decoder architecture converting SMILES strings into molecular fingerprints

SMILES Transformer: Low-Data Molecular Fingerprints

A Transformer-based encoder-decoder pre-trained on 861K SMILES from ChEMBL24 produces 1024-dimensional molecular fingerprints that outperform ECFP and graph convolutions on 5 of 10 MoleculeNet tasks in low-data settings.

Molecular Representations
Bar chart comparing Atom Pair Encoding vs BPE tokenization on MoleculeNet classification tasks

SMILES vs SELFIES Tokenization for Chemical LMs

Introduces Atom Pair Encoding (APE), a chemistry-aware tokenizer for SMILES and SELFIES, and shows it consistently outperforms Byte Pair Encoding in RoBERTa-based molecular property classification on BBBP, HIV, and Tox21 benchmarks.

Molecular Representations
Bar chart comparing SMILES-BERT accuracy against baselines on HIV, LogP, and PCBA tasks

SMILES-BERT: BERT-Style Pre-Training for Molecules

SMILES-BERT pre-trains a Transformer encoder on 18M+ SMILES from ZINC using a masked recovery task, then fine-tunes for molecular property prediction, outperforming prior methods on three datasets.