LLMs for Chemistry

This section covers large language models and vision-language models applied to chemistry. These differ from chemical language models (ChemBERTa, MoLFormer, etc.) in that they build on general-purpose LLM or VLM backbones rather than learning representations directly from molecular string notations.

Foundation Models & Domain-Specific LLMs

Models built or fine-tuned specifically for chemical reasoning and molecular understanding.

Year	Paper	Focus
2022	Galactica	Large-scale scientific LLM from Meta AI trained on curated scientific corpora
2024	ChemLLM	Framework for building chemistry-focused LLMs with structured chemical instruction data
2024	LlaSMol	Instruction-tuned open-source LLMs (Mistral, Llama 2, and others) for core chemistry tasks
2024	Fine-Tuning GPT-3 for Molecular Properties	GPT-3 fine-tuning for molecular property prediction
2024	Fine-Tuning GPT-3 for Predictive Chemistry	GPT-3 fine-tuning for predictive chemistry tasks (yield, selectivity)
2024	PharmaGPT	Domain-specific LLMs for pharmaceutical and chemical applications
2025	ChemDFM-R	Chemical reasoning LLM with atomized step-by-step knowledge decomposition

Multimodal Models

Models that integrate molecular graphs, images, spectra, or documents with text.

Year	Paper	Focus
2024	ChemDFM-X	Multimodal foundation model aligning molecular graphs and text
2025	ChemVLM	Vision-language model for chemical image understanding
2025	InstructMol	Multi-modal molecular LLM bridging graphs, SMILES, and text for drug discovery
2025	MERMaid	Multimodal extraction of chemical reactions from scientific PDFs
2025	Multimodal Search in Chemical Documents	Cross-modal retrieval across chemical documents and reaction diagrams

Agentic & Tool-Augmented Systems

LLM agents that autonomously plan and execute chemistry workflows using external tools.

Year	Paper	Focus
2023	Coscientist	Autonomous multi-agent system for chemical research with robotic lab integration
2024	ChemCrow	LLM augmented with 18 chemistry tools for synthesis planning and safety

Drug Discovery & Molecular Optimization

LLM-based approaches for drug editing, molecule optimization, and compound QA.

Year	Paper	Focus
2023	DrugChat	Conversational QA over drug molecule graphs
2023	LLM4Mol	Using ChatGPT-generated captions as molecular representations
2024	ChatDrug	Conversational drug editing with retrieval-augmented generation
2024	DrugAssist	Interactive LLM-guided molecule optimization

Benchmarks & Evaluation

Datasets and evaluation frameworks for assessing LLM performance on chemistry tasks.

Year	Paper	Focus
2023	ChemLLMBench	Eight-task benchmark for LLM chemistry capabilities
2023	Code-Gen Chemistry Assessment	Evaluating chemistry knowledge in code-generation LLMs
2024	Benchmarking LLMs for Molecular Prediction	Systematic comparison of LLMs on molecular property prediction
2024	ChemEval	Fine-grained, multi-level evaluation framework for chemistry LLMs
2024	ChemSafetyBench	Safety-focused benchmark for chemistry LLMs
2025	ChemBench	Large-scale evaluation comparing LLMs against human chemistry experts
2025	MaCBench	Multimodal benchmark for chemistry and materials science

Surveys & Perspectives

Broad reviews and position papers on the role of LLMs in chemistry.

Year	Paper	Focus
2022	NLP Models That Automate Programming for Chemistry	Early perspective on NLP and code generation for chemical workflows
2024	Survey of Scientific LLMs in Bio and Chem	Comprehensive survey of LLM applications across biology and chemistry

Computational Chemistry

ChatDrug pipeline from prompt design through ChatGPT to domain feedback and edited molecule output

ChatDrug: Conversational Drug Editing with ChatGPT

ChatDrug is a parameter-free framework that combines ChatGPT with retrieval-augmented domain feedback and iterative conversation to edit drugs across small molecules, peptides, and proteins.
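The iterative conversation with retrieval-based feedback described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_llm`, `satisfies_goal`, and `retrieve_similar_hit` are hypothetical stand-ins, and the toy "property" (number of oxygen characters in a string) is invented purely so the loop can run.

```python
from typing import Callable, List, Optional


def conversational_edit(
    molecule: str,
    ask_llm: Callable[[List[str]], str],                   # hypothetical: conversation -> candidate
    satisfies_goal: Callable[[str, str], bool],            # hypothetical domain check (input, candidate)
    retrieve_similar_hit: Callable[[str], Optional[str]],  # hypothetical retrieval step
    max_rounds: int = 3,
) -> Optional[str]:
    """Ask for an edit, check it, and feed a retrieved example back on failure."""
    conversation = [f"Edit {molecule} to improve the target property."]
    for _ in range(max_rounds):
        candidate = ask_llm(conversation)
        if satisfies_goal(molecule, candidate):
            return candidate
        # Domain feedback: tell the model it failed and show a retrieved example.
        feedback = f"{candidate} does not meet the goal."
        hint = retrieve_similar_hit(candidate)
        if hint is not None:
            feedback += f" A similar molecule that does is {hint}."
        conversation.append(feedback)
    return None


# Toy stand-ins (all invented for illustration).
def toy_llm(conversation: List[str]) -> str:
    last = conversation[-1]
    if "that does is " in last:
        return last.split("that does is ")[1].rstrip(".")
    return "CCC"  # first attempt: returns the input unchanged


def more_oxygens(original: str, candidate: str) -> bool:
    return candidate.count("O") > original.count("O")


def toy_retrieval(candidate: str) -> Optional[str]:
    return "CCO"


result = conversational_edit("CCC", toy_llm, more_oxygens, toy_retrieval)
print(result)
```

In this toy run the first proposal fails the check, the retrieved example is appended to the conversation, and the second proposal passes.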

ChemCrow architecture with GPT-4 central planner connected to 18 chemistry tools via ReAct reasoning

ChemCrow: Augmenting LLMs with 18 Chemistry Tools

ChemCrow augments GPT-4 with 18 chemistry tools to autonomously plan and execute syntheses, guide the discovery of a novel chromophore, and solve diverse chemical reasoning tasks.
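A ReAct-style loop of the kind the architecture description mentions alternates between a planner choosing an action and a tool returning an observation. The sketch below is a simplified illustration, not ChemCrow's code: the planner is a scripted stub standing in for an LLM, and `carbon_count` is a toy tool invented here (it simply counts the character "C", so it would miscount atoms such as Cl).

```python
from typing import Callable, Dict, List, Tuple

# (kind, tool name or final answer, tool input)
Step = Tuple[str, str, str]


def react_loop(
    question: str,
    planner: Callable[[List[str]], Step],   # hypothetical stand-in for the LLM
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Alternate planner actions and tool observations until a final answer."""
    scratchpad: List[str] = [f"Question: {question}"]
    for _ in range(max_steps):
        kind, name, tool_input = planner(scratchpad)
        if kind == "final":
            return name
        if name in tools:
            observation = tools[name](tool_input)
        else:
            observation = f"unknown tool {name}"
        scratchpad.append(f"Action: {name}[{tool_input}]")
        scratchpad.append(f"Observation: {observation}")
    return "no answer within step budget"


def scripted_planner(scratchpad: List[str]) -> Step:
    """Toy planner: call one tool, then answer from its observation."""
    observations = [s for s in scratchpad if s.startswith("Observation: ")]
    if not observations:
        return ("tool", "carbon_count", "CCO")
    count = observations[-1].split(": ", 1)[1]
    return ("final", f"The molecule has {count} carbon atoms.", "")


toy_tools = {"carbon_count": lambda smiles: str(smiles.count("C"))}

answer = react_loop("How many carbons are in CCO?", scripted_planner, toy_tools)
print(answer)
```

Keeping the tools in a plain dictionary means adding a tool is one extra entry; the step budget guards against a planner that never emits a final answer.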

ChemLLM pipeline from ChemData structured templates through fine-tuned InternLM2 to ChemBench evaluation

ChemLLM: A Chemical Large Language Model Framework

ChemLLM presents a comprehensive framework for chemistry-specific language modeling, including a 7M-sample instruction tuning dataset (ChemData), a 4,100-question benchmark (ChemBench, distinct from the 2025 expert-comparison benchmark of the same name), and a two-stage fine-tuned model that matches GPT-4 on core chemical tasks.

Coscientist architecture with GPT-4 planner orchestrating web search, code execution, document search, and robot lab API modules

Coscientist: Autonomous Chemistry with LLM Agents

Coscientist is a GPT-4-driven AI system that autonomously designs and executes chemical experiments using web search, documentation search, code execution, and robotic lab automation.

DrugAssist workflow from user instruction through LoRA fine-tuned Llama2 to optimized molecule output

DrugAssist: Interactive LLM Molecule Optimization

DrugAssist fine-tunes Llama2-7B-Chat on over one million molecule pairs for interactive, dialogue-based molecule optimization across six molecular properties.
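The workflow description for DrugAssist mentions LoRA fine-tuning. As a rough illustration of why that is cheap, the snippet below counts trainable parameters for a LoRA update on a single square weight matrix. The hidden size 4096 matches Llama 2 7B; the rank of 8 is an assumption chosen for the example, not a value taken from the paper.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains B (d_out x rank) and A (rank x d_in); the base weight stays frozen."""
    return rank * (d_out + d_in)


hidden = 4096   # Llama 2 7B hidden size
rank = 8        # assumed LoRA rank, for illustration only

full = full_params(hidden, hidden)
lora = lora_trainable_params(hidden, hidden, rank)
fraction = lora / full

print(full, lora, fraction)
```

Under these assumptions the adapter trains well under 1% of the parameters of the matrix it modifies.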

DrugChat architecture showing GNN encoder, linear adaptor, and Vicuna LLM for conversational drug analysis

DrugChat: Conversational QA on Drug Molecule Graphs

DrugChat is a prototype system that bridges molecular graph neural networks with large language models for interactive, multi-turn question answering about drug compounds. It trains only a lightweight linear adaptor between a frozen GNN encoder and Vicuna-13B using 143K curated QA pairs from ChEMBL and PubChem.
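The role of the linear adaptor can be illustrated with a small dependency-free sketch: a graph embedding is projected into the LLM's embedding dimension and placed in front of the question's token embeddings. This assumes a single pooled graph vector and tiny made-up dimensions and weights; it is not DrugChat's actual code, and in the real system only the adaptor's weight and bias would be trained.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def linear_adaptor(graph_embedding: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Project a GNN embedding (dim g) into the LLM embedding space (dim h)."""
    return [
        sum(w * x for w, x in zip(row, graph_embedding)) + b
        for row, b in zip(weight, bias)
    ]


def build_llm_input(
    graph_embedding: Vector,
    weight: Matrix,
    bias: Vector,
    question_token_embeddings: List[Vector],
) -> List[Vector]:
    """Prepend the projected molecule vector to the question's token embeddings."""
    molecule_token = linear_adaptor(graph_embedding, weight, bias)
    return [molecule_token] + question_token_embeddings


# Toy values: GNN dim 2, LLM dim 3 (all numbers invented for illustration).
graph_vec = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.5]
question = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]

llm_input = build_llm_input(graph_vec, W, b, question)
print(llm_input[0], len(llm_input))
```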

Pipeline diagram showing natural language chemistry questions flowing through fine-tuned GPT-3 to chemical predictions across molecules, materials, and reactions

Fine-Tuning GPT-3 for Predictive Chemistry Tasks

Jablonka et al. show that GPT-3 fine-tuned on chemistry questions phrased in natural language matches or outperforms dedicated ML models across 15 benchmarks, with particular strength in low-data settings and inverse molecular design.
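Phrasing a prediction task as natural language questions amounts to turning each labeled example into a prompt and completion pair. The sketch below shows one plausible way to build such records as JSON Lines; the question template, the `solubility class` property name, and the labels are placeholders invented for illustration, not the paper's exact format or data.

```python
import json
from typing import Iterable, Tuple


def to_record(representation: str, property_name: str, label: int) -> dict:
    """Turn one labeled example into a question/answer training record."""
    return {
        "prompt": f"What is the {property_name} of {representation}?",
        "completion": str(label),
    }


def to_jsonl(rows: Iterable[Tuple[str, int]], property_name: str) -> str:
    """Serialize (representation, label) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_record(r, property_name, y)) for r, y in rows)


# Placeholder examples: the labels are made up, not measured values.
toy_rows = [("CCO", 1), ("c1ccccc1", 0)]
jsonl_text = to_jsonl(toy_rows, "solubility class")
print(jsonl_text)
```

The same template works for molecules, materials, or reactions as long as each can be written as a string, which is what makes the approach easy to apply across tasks.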

Visualization of Galactica corpus composition and benchmark performance comparing Galactica 120B against baselines

Galactica: A Curated Scientific LLM from Meta AI

Galactica trains a decoder-only Transformer on a curated 106B-token scientific corpus spanning papers, proteins, and molecules, achieving strong results on scientific QA, mathematical reasoning, and citation prediction.

SMolInstruct dataset feeding into four base models for chemistry instruction tuning

LlaSMol: Instruction-Tuned LLMs for Chemistry Tasks

LlaSMol fine-tunes Mistral, Llama 2, and other open-source LLMs on SMolInstruct, a 3.3M-sample instruction tuning dataset covering 14 chemistry tasks. The Mistral-based model outperforms GPT-4 and Claude 3 Opus across all tasks.

PharmaGPT two-stage training from domain continued pretraining to weighted supervised fine-tuning with RLHF

PharmaGPT: Domain-Specific LLMs for Pharma and Chem

PharmaGPT is a suite of domain-specific LLMs (13B and 70B parameters) built on LLaMA with continued pretraining on biopharmaceutical and chemical data, achieving strong results on NAPLEX and Chinese pharmacist exams.

Bar chart showing GPT-4 relative performance across eight chemistry tasks grouped by understanding, reasoning, and explaining capabilities

ChemLLMBench: Benchmarking LLMs on Chemistry Tasks

ChemLLMBench evaluates GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica on eight practical chemistry tasks, finding that LLMs are competitive on classification and text tasks but struggle with generation tasks that depend on SMILES.

Bar chart comparing GPT-3 ada and GNN accuracy across molecular classification tasks

Fine-Tuning GPT-3 for Molecular Property Prediction

This paper fine-tunes GPT-3's ada model on SMILES strings to classify electronic properties (HOMO, LUMO) of organic semiconductor molecules, finding accuracy competitive with graph neural networks and probing robustness through ablation studies.
