Notes on general-purpose and multimodal large language models applied to chemical reasoning, molecular understanding, and document extraction.
This section covers large language models and vision-language models applied to chemistry. These differ from chemical language models (ChemBERTa, MoLFormer, etc.) in that they build on general-purpose LLM or VLM backbones rather than learning representations directly from molecular string notations. Topics include multimodal models integrating molecular graphs, images, or spectra with text (ChemVLM, ChemDFM-X, InstructMol), chemical reasoning LLMs (ChemDFM-R), and systems for extracting or retrieving chemical information from scientific literature (MERMaid).
ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge
ChemDFM-R is a 14B-parameter chemical reasoning model that integrates a 101B-token dataset of atomized chemical knowledge. Using a mix-sourced distillation strategy and domain-specific reinforcement learning, it outperforms similarly sized models and DeepSeek-R1 on ChemEval.
ChemDFM-X: Multimodal Foundation Model for Chemistry
ChemDFM-X is a multimodal chemical foundation model that integrates five non-text modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) into a single LLM decoder. It addresses the scarcity of multimodal chemistry data by generating a 7.6M-example instruction-tuning dataset through approximate calculations and model predictions, and establishes strong baseline performance across these modalities.
InstructMol: Multi-Modal Molecular LLM for Drug Discovery
InstructMol integrates a pre-trained molecular graph encoder (MoleculeSTM) with a Vicuna-7B LLM using a linear projector. It employs a two-stage training process (alignment pre-training followed by task-specific instruction tuning with LoRA) to excel at property prediction, description generation, and reaction analysis.
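The projector step can be illustrated with a minimal sketch. It is not InstructMol's code: the toy dimensions, the random weights, the function names, and the choice to prepend the molecule tokens to the text tokens are all assumptions made for illustration. It only shows the general idea of a linear layer mapping graph-encoder outputs into the LLM's embedding width so that both can share one input sequence.

```python
import random

def linear_project(node_embeddings, weight, bias):
    """Map each graph embedding (width d_graph) to the LLM embedding width (d_llm)."""
    return [
        [sum(w * x for w, x in zip(row, emb)) + b for row, b in zip(weight, bias)]
        for emb in node_embeddings
    ]

def build_llm_input(graph_tokens, text_token_embeddings):
    """Concatenate projected molecule tokens with text token embeddings.

    Prepending is an assumption of this sketch, not a detail taken from the paper.
    """
    return graph_tokens + text_token_embeddings

# Toy sizes chosen for illustration only (real models use far larger widths).
D_GRAPH, D_LLM = 4, 6
rng = random.Random(0)

# Hypothetical projector parameters: one weight row per output dimension.
weight = [[rng.uniform(-0.1, 0.1) for _ in range(D_GRAPH)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM

# Stand-ins for the graph encoder output (3 atoms) and embedded prompt (5 tokens).
node_embeddings = [[rng.random() for _ in range(D_GRAPH)] for _ in range(3)]
text_embeddings = [[rng.random() for _ in range(D_LLM)] for _ in range(5)]

graph_tokens = linear_project(node_embeddings, weight, bias)
llm_input = build_llm_input(graph_tokens, text_embeddings)

print(len(llm_input), len(llm_input[0]))  # sequence length 8, embedding width 6
```

In a two-stage setup like the one described, only a projector of this kind would be trained during alignment, with the LLM adapted afterwards through low-rank updates; the sketch does not model either training stage.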
MERMaid: Multimodal Chemical Reaction Mining from PDFs
MERMaid leverages fine-tuned vision models and VLM reasoning to mine chemical reaction data directly from PDF figures and tables. By handling context inference and coreference resolution, it builds high-fidelity knowledge graphs with 87% end-to-end accuracy.
Multimodal Search in Chemical Documents and Reactions
This paper presents a multimodal search system for passage-level retrieval of chemical reactions and molecular structures. It links the diagrams, text, and reaction records extracted from scientific PDFs so that a query can reach the passages where a given reaction or molecule appears.
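A rough sketch of what passage-level linking might look like as a data structure is given below. It is a hypothetical illustration, not the system described in the paper: the Passage and PassageIndex names, the exact-match SMILES lookup, and the sample passages are all invented here. A real system would likely canonicalize structures and support similarity or substructure search.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Passage:
    passage_id: str
    text: str
    smiles: list = field(default_factory=list)        # molecules linked to this passage
    reaction_ids: list = field(default_factory=list)  # reaction records linked to it

class PassageIndex:
    """Toy inverted index keyed by text tokens and by exact SMILES strings."""

    def __init__(self):
        self.passages = {}
        self.by_token = defaultdict(set)
        self.by_smiles = defaultdict(set)

    def add(self, passage):
        self.passages[passage.passage_id] = passage
        for tok in passage.text.lower().split():
            self.by_token[tok.strip(".,;()")].add(passage.passage_id)
        for s in passage.smiles:
            self.by_smiles[s].add(passage.passage_id)

    def search(self, text=None, smiles=None):
        """Return ids of passages matching every text token and the SMILES, if given."""
        hits = None
        if text:
            for tok in text.lower().split():
                ids = self.by_token.get(tok, set())
                hits = ids if hits is None else hits & ids
        if smiles:
            ids = self.by_smiles.get(smiles, set())
            hits = ids if hits is None else hits & ids
        return sorted(hits or [])

index = PassageIndex()
index.add(Passage(
    "p1",
    "Suzuki coupling of bromobenzene with phenylboronic acid gave biphenyl.",
    smiles=["Brc1ccccc1", "c1ccc(-c2ccccc2)cc1"],
    reaction_ids=["rxn-1"],
))
index.add(Passage(
    "p2",
    "Bromobenzene was purchased and used as received.",
    smiles=["Brc1ccccc1"],
))

print(index.search(smiles="Brc1ccccc1"))
print(index.search(text="suzuki coupling", smiles="Brc1ccccc1"))
```

Combining a text query with a structure query narrows the result to passages that satisfy both, which is the kind of cross-modal lookup the summary describes.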
ChemVLM: A Multimodal Large Language Model for Chemistry
ChemVLM is a domain-specific multimodal LLM (26B parameters) introduced in a 2025 AAAI paper. It combines a vision model and a language model trained on curated chemistry data, and achieves state-of-the-art performance on chemical OCR, reasoning benchmarks, and molecular understanding tasks.