Computational Chemistry
Diagram showing the t-SMILES pipeline from molecular graph fragmentation to binary tree traversal producing a string representation

t-SMILES: Tree-Based Fragment Molecular Encoding

t-SMILES represents molecules by fragmenting them into substructures, building full binary trees over the fragments, and traversing those trees breadth-first to produce SMILES-like strings. The resulting encodings reduce nesting depth and outperform SMILES, DeepSMILES, and SELFIES on generation benchmarks.
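The breadth-first serialization at the heart of t-SMILES can be sketched in a few lines. The tree class, the separator `^`, and the padding symbol `&` below are illustrative stand-ins, not necessarily the paper's exact vocabulary:

```python
from collections import deque

class Node:
    def __init__(self, fragment, left=None, right=None):
        self.fragment = fragment  # SMILES of one substructure
        self.left = left
        self.right = right

def bfs_encode(root, sep="^", pad="&"):
    """Level-order traversal of a full binary tree of fragments.

    Emits fragment SMILES in breadth-first order; `pad` marks an absent
    child so the tree shape is recoverable from the flat string.
    """
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            out.append(pad)
            continue
        out.append(node.fragment)
        if node.left or node.right:  # internal node: record both slots
            queue.append(node.left)
            queue.append(node.right)
    return sep.join(out)

# Toy tree: a benzene core with two substituent fragments.
tree = Node("c1ccccc1", Node("C(=O)O"), Node("N"))
print(bfs_encode(tree))  # c1ccccc1^C(=O)O^N
```

Because the traversal is level-order rather than recursive, the string never nests fragments inside each other, which is the source of the reduced nesting depth.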

Computational Chemistry
Taxonomy of transformer-based chemical language models organized by architecture type

Transformer CLMs for SMILES: Literature Review 2024

A comprehensive review of transformer-based chemical language models operating on SMILES, categorizing encoder-only (BERT variants), decoder-only (GPT variants), and encoder-decoder models with analysis of tokenization strategies, pre-training approaches, and future directions.

Computational Chemistry
Diagram showing sequence-to-sequence translation from chemical names to SMILES with atom count constraints

Transformer Name-to-SMILES with Atom Count Losses

This paper applies a Transformer sequence-to-sequence model to predict SMILES strings from chemical compound names (synonyms). Two enhancements improve F-measure over rule-based and vanilla Transformer baselines: an atom-count constraint loss and SMILES/InChI multi-task learning.
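The idea behind the atom-count constraint can be sketched as an auxiliary penalty that compares per-element tallies of a predicted and a reference SMILES. The regex tokenizer below is a crude illustration (it ignores bracket atoms and hydrogens), and how the paper actually wires the constraint into training may differ:

```python
import re

# Two-letter organic-subset symbols first, then single letters
# (aromatic lowercase included); deliberately incomplete.
ATOM_RE = re.compile(r"Cl|Br|Si|[BCNOPSFIbcnops]")

def atom_counts(smiles):
    """Crude per-element atom tally from a SMILES string."""
    counts = {}
    for match in ATOM_RE.finditer(smiles):
        symbol = match.group(0).capitalize()  # fold aromatic 'c' into 'C'
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts

def atom_count_penalty(pred_smiles, ref_smiles):
    """L1 mismatch between atom tallies: a sketch of penalizing
    predictions whose molecular formula disagrees with the target."""
    p, r = atom_counts(pred_smiles), atom_counts(ref_smiles)
    return sum(abs(p.get(k, 0) - r.get(k, 0)) for k in set(p) | set(r))

print(atom_count_penalty("CCO", "CCN"))  # 2: one O extra, one N missing
```

A penalty of zero for any two SMILES of the same molecule (e.g. aromatic vs. Kekulé benzene) is what makes this usable as a soft constraint.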

Computational Chemistry
Bar chart comparing Transformer-CNN RMSE against RF, SVM, CNN, and CDDD baselines

Transformer-CNN: SMILES Embeddings for QSAR Modeling

Transformer-CNN extracts dynamic SMILES embeddings from a Transformer trained on SMILES canonicalization and feeds them to a TextCNN for QSAR modeling, achieving strong results across 18 benchmarks with built-in layer-wise relevance propagation (LRP) interpretability.

Computational Chemistry
Overview of 16 transformer models for molecular property prediction organized by architecture type

Transformers for Molecular Property Prediction Review

Sultan et al. review 16 sequence-based transformer models for molecular property prediction, systematically analyzing seven design decisions (database selection, chemical language, tokenization, positional encoding, model size, pre-training objectives, and fine-tuning strategy) and identifying a critical need for standardized evaluation practices.

Computational Chemistry
Bar chart comparing Char-RNN and Molecular VAE on validity and novelty metrics

VAE for Automatic Chemical Design (2018 Seminal)

This foundational paper introduces a variational autoencoder (VAE) that encodes SMILES strings into a continuous latent space, allowing gradient-based optimization of molecular properties. Joint training with a property predictor organizes the latent space by chemical properties, and Bayesian optimization over the latent surface discovers drug-like molecules with improved QED and synthetic accessibility.
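The latent-space optimization loop can be illustrated with a toy stand-in for the trained networks: a linear "property predictor" over an 8-dimensional latent space replaces the real VAE decoder and predictor (both are assumptions for illustration only), and plain gradient ascent replaces the paper's Bayesian optimization:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the jointly trained property predictor: a linear score
# f(z) = w.z (the real model is a neural network over VAE latents).
w = rng.normal(size=8)

def property_score(z):
    return float(w @ z)

def grad_score(z):
    return w  # gradient of the linear surrogate is constant

# Gradient ascent from a random latent point; decoding z back to a
# SMILES string would be the (omitted) VAE decoder's job.
z0 = rng.normal(size=8)
z = z0.copy()
for _ in range(100):
    z = z + 0.1 * grad_score(z)

print(property_score(z) - property_score(z0))  # strictly positive gain
```

The point of the continuous latent space is exactly that this kind of gradient-guided search is possible, which discrete string edits do not allow.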

Computational Chemistry
Horizontal bar chart showing X-MOL achieves best performance across five molecular tasks

X-MOL: Pre-training on 1.1B Molecules for SMILES

X-MOL applies large-scale Transformer pre-training on 1.1 billion molecules with a generative SMILES-to-SMILES strategy, then fine-tunes for five molecular analysis tasks including property prediction, reaction analysis, and de novo generation.

Computational Chemistry
Bar chart showing retrieval accuracy of chemical language models across four SMILES augmentation types

AMORE: Testing ChemLLM Robustness to SMILES Variants

Introduces AMORE, an embedding-based retrieval framework that evaluates whether chemical language models can recognize the same molecule across different SMILES representations. Results show current models are not robust to identity-preserving augmentations.
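The retrieval protocol behind AMORE can be mocked end-to-end: embed each molecule's canonical and augmented SMILES, then check whether the augmented embedding's nearest neighbour (by cosine similarity) is its own canonical form. The random "embeddings" below stand in for a real chemical language model encoder:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_accuracy(canon_emb, variant_emb):
    """For each molecule's augmented embedding, is its own canonical
    embedding the top-1 nearest neighbour? (AMORE-style check, sketch.)"""
    hits = 0
    for i, v in enumerate(variant_emb):
        sims = [cosine(v, c) for c in canon_emb]
        hits += int(int(np.argmax(sims)) == i)
    return hits / len(variant_emb)

# Mock embeddings for 3 molecules: variants are lightly perturbed
# copies of the canonical vectors (a robust-encoder best case).
rng = np.random.default_rng(1)
canon = rng.normal(size=(3, 16))
variants = canon + 0.1 * rng.normal(size=(3, 16))
print(retrieval_accuracy(canon, variants))
```

A robust model should score near 1.0 on identity-preserving augmentations such as kekulization or atom reordering; the paper's finding is that current models often do not.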

Computational Chemistry
Diagram showing back translation workflow with forward and reverse models mapping between source and target molecular domains, augmented by unlabeled ZINC molecules

Back Translation for Semi-Supervised Molecule Generation

Adapts back translation from NLP to molecular generation, using unlabeled molecules from ZINC to create synthetic training pairs that improve property optimization and retrosynthesis prediction across Transformer and graph-based architectures.
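The back-translation loop itself is simple to sketch. A stub reverse model (here, a toy string substitution standing in for a trained target-to-source network) maps unlabeled molecules back to synthetic sources, and the resulting pairs extend the forward model's training set:

```python
def reverse_model(target_smiles):
    """Stub for the trained target->source model; a real system would
    run a Transformer or graph decoder here, not a string edit."""
    return target_smiles.replace("O", "N")  # toy transformation only

# Unlabeled target-domain molecules, e.g. drawn from ZINC.
unlabeled_targets = ["CCO", "CC(=O)O", "OCCO"]

# Back translation: pair each unlabeled target with a synthetic source,
# then append the pairs to the supervised forward training data.
synthetic_pairs = [(reverse_model(t), t) for t in unlabeled_targets]
print(synthetic_pairs[0])  # ('CCN', 'CCO')
```

The synthetic source side may be noisy, but the target side is always a real molecule, which is what makes the forward model's supervision signal useful.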

Computational Chemistry
Heatmap showing LLM accuracy across nine chemistry coding task categories for four models, with green indicating high accuracy and red indicating low accuracy

Benchmarking Chemistry Knowledge in Code-Gen LLMs

A benchmark of 84 chemistry coding tasks for evaluating code-generating LLMs such as Codex; prompt engineering strategies lift accuracy by 30 percentage points, to 72%.

Computational Chemistry
Bar chart comparing LLM, DeBERTa, GCN, and GIN performance on three OGB molecular classification benchmarks

Benchmarking LLMs for Molecular Property Prediction

Benchmarks large language models on six molecular property prediction datasets, finding that LLMs lag behind GNNs but can augment ML models when used collaboratively.

Computational Chemistry
Bar chart comparing fixed molecular representations (RF, SVM, XGBoost) against learned representations (MolBERT, GROVER) across six property prediction benchmarks under scaffold split

Benchmarking Molecular Property Prediction at Scale

This study trains over 62,000 models to systematically evaluate molecular representations and models for property prediction, finding that traditional ML on fixed descriptors often outperforms deep learning approaches.