Computational Chemistry
SMolInstruct dataset feeding into four base models for chemistry instruction tuning

LlaSMol: Instruction-Tuned LLMs for Chemistry Tasks

LlaSMol fine-tunes Mistral, Llama 2, and other open-source LLMs on SMolInstruct, a 3.3M-sample instruction tuning dataset covering 14 chemistry tasks. The Mistral-based model outperforms GPT-4 and Claude 3 Opus across all tasks.

PharmaGPT two-stage training from domain continued pretraining to weighted supervised fine-tuning with RLHF

PharmaGPT: Domain-Specific LLMs for Pharma and Chem

PharmaGPT is a suite of domain-specific LLMs (13B and 70B parameters) built on LLaMA with continued pretraining on biopharmaceutical and chemical data, achieving strong results on NAPLEX and Chinese pharmacist exams.

Three-stage progression from task-specific transformers through multimodal models to LLM chemistry agents

Transformers and LLMs for Chemistry and Drug Discovery

A review chapter tracing three stages of transformer adoption in chemistry: task-specific single-modality models (reaction prediction, retrosynthesis), multimodal approaches bridging spectra and text, and LLM-powered agents like ChemCrow for general chemical reasoning.

Bar chart showing GPT-4 relative performance across eight chemistry tasks grouped by understanding, reasoning, and explaining capabilities

ChemLLMBench: Benchmarking LLMs on Chemistry Tasks

A comprehensive benchmark evaluating GPT-4, GPT-3.5, Davinci-003, Llama, and Galactica on eight practical chemistry tasks, revealing that LLMs are competitive on classification and text tasks but struggle with SMILES-dependent generation.

Bar chart comparing GPT-3 ada and GNN accuracy across molecular classification tasks

Fine-Tuning GPT-3 for Molecular Property Prediction

This paper fine-tunes GPT-3’s ada model on SMILES strings to classify electronic properties (HOMO, LUMO) of organic semiconductor molecules. It finds accuracy competitive with graph neural networks and probes robustness through ablation studies.
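
To make the setup concrete, here is a minimal sketch of how SMILES strings and class labels could be written as prompt/completion pairs in JSON Lines, the layout used by the legacy GPT-3 fine-tuning workflow. The prompt wording, the `###` and `@@@` separators, the helper names, and the example labels are all assumptions for illustration; they are not taken from the paper.

```python
import json

def to_finetune_record(smiles: str, label: int, property_name: str = "HOMO") -> dict:
    """Build one prompt/completion pair for a SMILES classification example.

    The prompt template and the separator tokens are hypothetical choices.
    """
    prompt = f"What is the {property_name} class of {smiles}?###"
    # A leading space and an explicit stop marker keep completions easy to parse.
    completion = f" {label}@@@"
    return {"prompt": prompt, "completion": completion}

def to_jsonl(rows, property_name: str = "HOMO") -> str:
    """Serialize (smiles, label) pairs as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_finetune_record(smiles, label, property_name))
        for smiles, label in rows
    )

# Placeholder molecules with made-up class labels (not real measurements).
examples = [("c1ccsc1", 0), ("c1ccc2ccccc2c1", 1)]
jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

Treating the property as a small set of discrete classes, as here, is what lets a text-completion model act as a classifier: the model only has to emit a single label token after the prompt.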

Bar chart comparing small and big foundation models surveyed across property prediction, MLIPs, inverse design, and multi-domain chemistry applications

Foundation Models in Chemistry: A 2025 Perspective

This perspective from Choi et al. reviews foundation models in chemistry, categorizing them as ‘small’ (domain-specific, e.g., property prediction, MLIPs, inverse design) and ‘big’ (multi-domain, e.g., multimodal and LLM-based). It surveys pretraining strategies, key architectures (GNNs and language models), and outlines future directions for scaling, efficiency, and interpretability.

Diagram showing the CaR pipeline from SMILES to ChatGPT-generated captions to fine-tuned RoBERTa predictions

LLM4Mol: ChatGPT Captions as Molecular Representations

LLM4Mol proposes Captions as Representations (CaR): ChatGPT generates textual explanations for SMILES strings, and those captions are then used to fine-tune small language models for molecular property prediction.
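
The data flow behind this idea can be sketched in a few lines: each SMILES string is passed to a captioning function, and the resulting (caption, label) pairs become the training set for a small text classifier. In the sketch below, `stub_captioner` is a hypothetical stand-in for the ChatGPT call, `build_caption_dataset` is an illustrative helper name, and the labels are placeholders; none of this is the authors' code.

```python
from typing import Callable, Iterable, List, Tuple

def stub_captioner(smiles: str) -> str:
    """Hypothetical stand-in for an LLM call that explains a molecule in text."""
    return f"A molecule described by the SMILES string {smiles}."

def build_caption_dataset(
    molecules: Iterable[Tuple[str, int]],
    captioner: Callable[[str], str],
) -> List[Tuple[str, int]]:
    """Replace each SMILES string with its caption, keeping the label unchanged."""
    return [(captioner(smiles), label) for smiles, label in molecules]

# Placeholder molecules with made-up binary labels.
molecules = [("CCO", 1), ("c1ccccc1", 0)]
dataset = build_caption_dataset(molecules, stub_captioner)

for caption, label in dataset:
    print(label, caption)
```

In a real pipeline the captions in `dataset` would be tokenized and used to fine-tune a small language model such as RoBERTa; that step is omitted here to keep the sketch dependency-free.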

Bar chart showing vision language model performance across chemistry tasks including equipment identification, molecule matching, spectroscopy, and laboratory safety

MaCBench: Multimodal Chemistry and Materials Benchmark

MaCBench evaluates frontier vision language models across 1,153 chemistry and materials science tasks spanning data extraction, experimental execution, and data interpretation, uncovering fundamental limitations in spatial reasoning and cross-modal integration.

Conceptual diagram showing natural language prompts flowing into code generation for chemistry tasks

NLP Models That Automate Programming for Chemistry

Hocky and White argue that NLP models capable of generating code from natural language prompts will fundamentally alter how chemists interact with scientific software, reducing barriers to computational research and reshaping programming pedagogy.

Bar chart showing scientific LLM taxonomy across five modalities: textual, molecular, protein, genomic, and multimodal

Survey of Scientific LLMs in Bio and Chem Domains

This survey systematically reviews scientific LLMs (Sci-LLMs) across five modalities: textual, molecular, protein, genomic, and multimodal, analyzing architectures, datasets, evaluation methods, and open challenges for AI-driven scientific discovery.

Heatmap showing LLM accuracy across nine chemistry coding task categories for four models, with green indicating high accuracy and red indicating low accuracy

Benchmarking Chemistry Knowledge in Code-Gen LLMs

A benchmark of 84 chemistry coding tasks for code-generating LLMs such as Codex. It reports 72% accuracy and shows that prompt engineering strategies improve performance by 30 percentage points.

Bar chart comparing LLM, DeBERTa, GCN, and GIN performance on three OGB molecular classification benchmarks

Benchmarking LLMs for Molecular Property Prediction

This study benchmarks large language models on six molecular property prediction datasets, finding that LLMs lag behind GNNs but can improve conventional ML models when the two are used together.