Paper Information
Citation: Li, J., et al. (2025). ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area. Proceedings of the AAAI Conference on Artificial Intelligence, 39(1), 415-423. https://doi.org/10.1609/aaai.v39i1.32020
Publication: AAAI 2025
What kind of paper is this?
This paper combines a Method contribution (primary) with a Resource contribution (secondary).
It is primarily a Method paper because it proposes ChemVLM, a novel multimodal architecture specifically tailored for the chemical domain, utilizing a “ViT-MLP-LLM” framework. The authors introduce a specific two-stage training strategy to align visual features with chemical text representations.
Secondarily, it is a Resource paper as it introduces a comprehensive suite of three new datasets: ChemOCR, MMCR-Bench, and MMChemBench, developed to rigorously evaluate multimodal capabilities in chemistry, covering OCR, reasoning, and property prediction.
What is the motivation?
The primary motivation is that existing models are ill-equipped for the multimodal nature of chemistry:
- Visual Data Gap: Chemical tasks heavily rely on visual information (molecular structures, reactions) which purely text-based chemical LLMs cannot process.
- Limitations of Generalist Models: General multimodal models (like GPT-4V or LLaVA) lack specialized chemical domain knowledge, leading to hallucinations or misinterpretations.
- Inadequacy of OCR Tools: Dedicated chemical OCR tools (like MolScribe) excel at modality conversion (image-to-SMILES) but cannot perform complex reasoning tasks.
What is the novelty here?
- Specialized Architecture: ChemVLM integrates InternViT-6B (a robust vision transformer) with ChemLLM-20B (a domain-specific LLM trained on chemical data) via an MLP projector.
- Bilingual Multimodal Dataset: The model is trained on a carefully curated dataset including molecular structures, reactions, and exam questions, enriched with style transfer techniques (e.g., handwritten styles, graph paper backgrounds) to improve robustness.
- New Benchmarks: The introduction of three tailored benchmarks to assess distinct capabilities:
- ChemOCR: For evaluating image-to-SMILES conversion.
- MMCR-Bench: Derived from college entrance exams to test complex multimodal reasoning.
- MMChemBench: For molecule captioning and property prediction.
What experiments were performed?
The authors benchmarked ChemVLM against both open-source (LLaVA, Qwen-VL, InternVL) and proprietary (GPT-4V) models across three primary task domains, plus a cross-domain generalization check:
- Chemical OCR: Evaluated on 1,000 image-text pairs from ChemOCR. The metric used was Tanimoto similarity between the generated and ground-truth SMILES.
- Multimodal Chemical Reasoning (MMCR): Tested on MMCR-Bench (1,000 exam questions), ScienceQA, and CMMU. Performance was scored based on accuracy for multiple-choice and fill-in-the-blank questions.
- Multimodal Molecule Understanding: Evaluated on MMChemBench for molecule captioning and property prediction.
- Generalization: Tested on non-chemistry subjects within the CMMU framework (Biology, Physics, Math) to assess cross-domain competence.
What were the outcomes and conclusions drawn?
- SOTA Performance: ChemVLM achieved state-of-the-art results on MMCR-Bench (41.7%), surpassing GPT-4V (40.1%) and significantly outperforming other open-source models.
- Superior Understanding: In molecule understanding tasks (captioning and prediction), ChemVLM showed drastic improvements over baselines, scoring 80.9% on prediction compared to GPT-4V’s 38.6%.
- Strong OCR Capabilities: While specialized OCR tools (MolScribe) still lead in pure structure recognition, ChemVLM outperformed all generalist MLLMs, achieving a Tanimoto similarity of 71.0% vs. GPT-4V’s 15.0%.
- Generalization: The model demonstrated strong performance in related scientific fields like Biology and Physics, suggesting the training strategy imparts robust scientific reasoning capabilities.
Reproducibility Details
Data
The training and evaluation data came from a mix of open-source repositories and custom curation.
| Purpose | Dataset | Source/Notes |
|---|---|---|
| Training (Molecule) | DECIMER HDM | 7,000+ hand-drawn molecular images. |
| Training (Molecule) | MolScribe Data | Scanned/photographed images from literature. |
| Training (Molecule) | Synthetic | Generated via ChemDraw, RDKit, and Indigo with style transfer (blurring, rotation, handwritten styles). |
| Training (Reaction) | PEACE & USPTO-50K | Inorganic and organic reaction schemes. |
| Training (Reasoning) | Exam Questions | 200,000 questions from OpenDataLab (Chinese educational exams), covering multiple-choice and fill-in-the-blank formats. |
| Evaluation | ChemOCR | 1,000 bilingual image-text pairs for SMILES recognition. |
| Evaluation | MMCR-Bench | 1,000 multimodal chemistry exam questions. |
| Evaluation | MMChemBench | Extension of ChemBench for captioning and property prediction. |
Preprocessing: Images were augmented using RanDepict for style variation. Text data (SMILES) was validated and cleaned. Prompts were diversified using GPT-4 to generate different linguistic styles.
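The following is a minimal, illustrative sketch of this preprocessing idea: it depicts a molecule with RDKit and applies simple style perturbations (rotation, blur) as a stand-in for the RanDepict-based augmentation, and it validates/canonicalizes SMILES strings. All function names and parameter values here are assumptions for illustration, not the authors' pipeline.

```python
import random

from rdkit import Chem
from rdkit.Chem import Draw
from PIL import Image, ImageFilter

def depict_with_style_variation(smiles: str) -> Image.Image:
    """Render a molecule image and apply simple style perturbations
    (rotation, blur) as a stand-in for RanDepict-style augmentation."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    img = Draw.MolToImage(mol, size=(448, 448))  # matches the ViT input resolution
    img = img.rotate(random.uniform(-15, 15), expand=True, fillcolor="white")
    if random.random() < 0.5:
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 1.5)))
    return img

def clean_smiles(raw: list[str]) -> list[str]:
    """Keep only SMILES that RDKit can parse, returned in canonical form."""
    parsed = (Chem.MolFromSmiles(s) for s in raw)
    return [Chem.MolToSmiles(m) for m in parsed if m is not None]
```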
Algorithms
- Architecture: "ViT-MLP-LLM" structure (a code sketch follows this list).
  - Vision Encoder: InternViT-6B, processing images at $448 \times 448$ resolution; images are segmented into at most 12 tiles.
  - Projector: A randomly initialized multi-layer perceptron (MLP) that maps visual features into the LLM's text embedding space.
  - LLM: ChemLLM-20B, a domain-specific chemical language model.
- Training Strategy: Two-stage supervised fine-tuning.
  - Modal Alignment: Freeze the LLM and the base vision encoder; train only the MLP projector and the vision encoder's LoRA layers on diverse multimodal data.
  - Supervised Fine-Tuning (SFT): Unfreeze all components, updating the LLM and vision encoder through LoRA adapters to reduce cost while fully training the projector. Data mixes specialized chemistry corpora with general corpora.
- Optimization:
  - Optimizer: AdamW
  - Context Length: 2048 tokens
  - Chat Template: InternLM2 dialogue schema
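A minimal PyTorch sketch of the projector wiring and the two-stage freezing pattern described above. Module and variable names (vision_encoder, projector, llm) and the learning rate are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Randomly initialized MLP mapping vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_tiles * num_patches, vision_dim)
        return self.net(vision_features)

def set_stage(vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module,
              stage: int) -> list[nn.Parameter]:
    """Stage 1 (modal alignment): only the projector is trained here (the paper
    additionally trains LoRA layers in the vision encoder, omitted for brevity).
    Stage 2 (SFT): all components receive updates; the paper uses LoRA adapters
    for the ViT and LLM, approximated here by unfreezing them directly."""
    for module in (vision_encoder, llm):
        for p in module.parameters():
            p.requires_grad = (stage == 2)
    for p in projector.parameters():
        p.requires_grad = True
    return [p for m in (vision_encoder, projector, llm)
            for p in m.parameters() if p.requires_grad]

# Usage with the optimizer listed above; the learning rate is an assumption.
# trainable = set_stage(internvit, projector, chemllm, stage=1)
# optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
```

In this style of architecture, the projected visual tokens are concatenated with the text embeddings before being passed to the language model, following the standard ViT-MLP-LLM recipe.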
Models
- ChemVLM-26B: The primary model released. It combines the 6B parameter vision encoder and the 20B parameter language model.
- Baselines: Comparisons were made against GPT-4V, Qwen-VL-Chat, LLaVA-v1.5-13B, InternVL-v1.5, and Yi-VL-Plus.
Evaluation
Performance was measured across three distinct task types:
| Metric | Task | Method |
|---|---|---|
| Tanimoto Similarity | ChemOCR | Generated SMILES compared against ground truth using RDKit (sketched in code below). Reports average similarity and Tanimoto@1.0 (exact-match rate). |
| Accuracy | MMCR (Reasoning) | +1 point for each correct multiple-choice/fill-in-the-blank answer, 0 otherwise; answers are judged by prompting the Qwen-max API. |
| Prediction Score | Property Prediction | Evaluated on MMChemBench subsets. |
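A sketch of how the Tanimoto similarity metric can be computed with RDKit. The Morgan-fingerprint radius and bit size are assumptions; the summary above does not specify the exact fingerprint settings used in the paper.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_similarity(pred_smiles: str, gt_smiles: str) -> float:
    """Tanimoto similarity between predicted and ground-truth SMILES,
    or 0.0 if the prediction cannot be parsed."""
    pred_mol = Chem.MolFromSmiles(pred_smiles)
    gt_mol = Chem.MolFromSmiles(gt_smiles)
    if pred_mol is None or gt_mol is None:
        return 0.0
    # Morgan fingerprints (radius 2, 2048 bits) -- assumed settings.
    pred_fp = AllChem.GetMorganFingerprintAsBitVect(pred_mol, 2, nBits=2048)
    gt_fp = AllChem.GetMorganFingerprintAsBitVect(gt_mol, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(pred_fp, gt_fp)

# Average similarity and exact-match rate (Tanimoto@1.0) over a toy benchmark.
pairs = [("c1ccccc1O", "Oc1ccccc1"), ("CCO", "CCN")]
scores = [tanimoto_similarity(p, g) for p, g in pairs]
avg_similarity = sum(scores) / len(scores)
exact_match_rate = sum(s == 1.0 for s in scores) / len(scores)
```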
Hardware
- Compute: Training utilized 16 NVIDIA A100 (80GB) GPUs.
- Configuration (see the config sketch after this list):
  - Batch size: 4
  - Gradient accumulation: 4 iterations
  - Precision: bfloat16 (bf16) via DeepSpeed with the ZeRO-3 offloading strategy
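A sketch of a DeepSpeed configuration consistent with the settings above (bf16, ZeRO-3 with offloading, micro-batch 4, four gradient-accumulation steps), written as a Python dict using DeepSpeed's standard config keys. The offload targets, per-GPU batch placement, and optimizer hyperparameters are assumptions.

```python
# Illustrative DeepSpeed config mirroring the reported training settings.
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 4,        # batch size of 4 (per-GPU placement assumed)
    "gradient_accumulation_steps": 4,           # accumulate gradients over 4 iterations
    "bf16": {"enabled": True},                  # bfloat16 mixed precision
    "zero_optimization": {
        "stage": 3,                             # ZeRO-3 partitioning
        "offload_optimizer": {"device": "cpu"}, # offload target assumed to be CPU
        "offload_param": {"device": "cpu"},
    },
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 1e-4, "weight_decay": 0.01},  # values are assumptions
    },
}
```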