Notes on language models for chemistry, from text-to-text translation (SMILES to IUPAC) to vision-language models integrating molecular images and text.
This section covers the application of language-model architectures to chemistry. Most notes here focus on models that treat molecular strings (SMILES, IUPAC names, InChI) as sequences, including encoder-only transformers like ChemBERTa and sequence-to-sequence models like Chemformer and STOUT for structure-to-name translation. A growing set of notes also covers multimodal approaches: models like ChemVLM and InstructMol that connect 2D molecular images or graphs with natural language, enabling tasks like captioning, question answering, and structure retrieval.
ChemBERTa-3: Open Source Chemical Foundation Models
ChemBERTa-3 provides a unified, scalable infrastructure for pretraining and benchmarking chemical foundation models. It addresses reproducibility gaps in previous studies like MoLFormer through standardized scaffold splitting and open-source tooling.
ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge
ChemDFM-R is a 14B-parameter chemical reasoning model trained on atomized chemical knowledge drawn from a 101B-token corpus. Using a mix-sourced distillation strategy followed by domain-specific reinforcement learning, it outperforms similarly sized models and DeepSeek-R1 on ChemEval.
ChemBERTa-2: Scaling Molecular Transformers to 77M SMILES
This work investigates the scaling hypothesis for molecular transformers, training RoBERTa models on 77M SMILES from PubChem. It compares Masked Language Modeling (MLM) against Multi-Task Regression (MTR) pretraining, finding that MTR yields better downstream performance but is computationally heavier.
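To make the MLM objective being compared here concrete, the sketch below shows BERT-style token corruption over a pre-tokenized SMILES string. This is an illustrative simplification (the `mask_tokens` helper is hypothetical, and real MLM recipes also substitute random tokens or keep some masked positions unchanged):

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """BERT-style corruption: hide ~mask_rate of tokens behind [MASK].

    Returns (corrupted, labels); labels[i] is the original token the
    model must predict at a masked position, or None (no loss there).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append(mask_token)   # hide the token
            labels.append(tok)             # target for the MLM loss
        else:
            corrupted.append(tok)          # pass through unchanged
            labels.append(None)            # no loss at this position
    return corrupted, labels

# e.g. on a pre-tokenized SMILES for acetic acid:
corrupted, labels = mask_tokens(["C", "C", "(", "=", "O", ")", "O"])
```

MTR pretraining, by contrast, replaces this token-recovery loss with regression heads over ~200 precomputed molecular properties, which is why it costs more per step.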
GP-MoLFormer: Molecular Generation via Transformers
This methodological paper proposes a linear-attention transformer decoder trained on 1.1 billion molecules. It introduces pair-tuning for efficient property optimization and establishes empirical scaling laws relating inference compute to generation novelty.
ChemBERTa: Molecular Property Prediction via Transformers
This paper introduces ChemBERTa, a RoBERTa-based model pretrained on 77M SMILES strings. It systematically evaluates the impact of pretraining dataset size, tokenization strategies, and input representations (SMILES vs. SELFIES) on downstream MoleculeNet tasks, finding that performance scales positively with data size.
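As a concrete picture of what "tokenization strategies" means here, below is a minimal atom-level SMILES tokenizer in the style of the regex commonly used in molecular-transformer work. It is a sketch for illustration, not ChemBERTa's exact tokenizer (the paper compares BPE against a curated SmilesTokenizer vocabulary), and the pattern omits rarer tokens such as %-numbered ring closures:

```python
import re

# Order matters: bracket atoms and two-letter elements must be tried
# before single-letter atom symbols.
SMILES_TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Br|Cl|Si|Se|se|@@|@|"
    r"[BCNOPSFIbcnops]|[=#\-\+\\/().%:~]|\d"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom/bond/ring-closure tokens."""
    tokens = SMILES_TOKEN_RE.findall(smiles)
    # sanity check: the tokens must reconstruct the input exactly
    assert "".join(tokens) == smiles, "regex failed to cover input"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

Atom-level tokens like these keep chemically meaningful units (`Cl`, `[N+]`) intact, whereas character-level or BPE schemes can split them arbitrarily.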
Chemformer: A Pre-trained Transformer for Computational Chemistry
This paper introduces Chemformer, a BART-based sequence-to-sequence model pre-trained on 100M molecules using a combined masking-and-augmentation task. It achieves state-of-the-art top-1 accuracy on reaction-prediction benchmarks while significantly reducing training time through transfer learning.
ChemDFM-X: Multimodal Foundation Model for Chemistry
ChemDFM-X is a multimodal chemical foundation model that integrates five non-text modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) into a single LLM decoder. It overcomes data scarcity by generating a 7.6M instruction-tuning dataset through approximate calculations and model predictions, establishing strong baseline performance across multiple modalities.
InstructMol: Multi-Modal Molecular LLM for Drug Discovery
InstructMol integrates a pre-trained molecular graph encoder (MoleculeSTM) with a Vicuna-7B LLM using a linear projector. It employs a two-stage training process (alignment pre-training followed by task-specific instruction tuning with LoRA) to excel at property prediction, description generation, and reaction analysis.
MERMaid: Multimodal Chemical Reaction Mining from PDFs
MERMaid leverages fine-tuned vision models and VLM reasoning to mine chemical reaction data directly from PDF figures and tables. By handling context inference and coreference resolution, it builds high-fidelity knowledge graphs with 87% end-to-end accuracy.
Multimodal Search in Chemical Documents and Reactions
This paper presents a multimodal search system that facilitates passage-level retrieval of chemical reactions and molecular structures by linking diagrams, text, and reaction records extracted from scientific PDFs.
STOUT V2.0: Transformer-Based SMILES to IUPAC Translation
STOUT V2.0 uses Transformer models trained on ~1 billion SMILES-IUPAC pairs to accurately translate chemical structures into systematic names and vice versa, outperforming its RNN-based predecessor.
STOUT: SMILES to IUPAC Names via Neural Machine Translation
STOUT (SMILES-TO-IUPAC-name translator) uses neural machine translation to convert chemical line notations into IUPAC names and back, achieving BLEU scores of about 90. It addresses the lack of open-source tools for algorithmic IUPAC name generation.
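Since both STOUT notes quote BLEU, here is a compact reminder of what that metric computes: the geometric mean of n-gram precisions times a brevity penalty. This is a toy single-sentence version over whitespace tokens with no smoothing; real evaluations use corpus-level BLEU with standard tokenization:

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of 1..max_n n-gram
    precisions, scaled by a brevity penalty (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # clipped n-gram overlap between candidate and reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any empty precision zeroes the geometric mean
        log_precisions.append(math.log(overlap / total))
    # penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note that IUPAC names are not naturally whitespace-delimited, so any BLEU figure for name translation depends heavily on how the names are tokenized before scoring.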