ChemEval: Fine-Grained LLM Evaluation for Chemistry

A Hierarchical Benchmark for Chemistry LLMs

ChemEval is a Resource paper that introduces a comprehensive, hierarchical benchmark for evaluating large language models on chemical tasks. The benchmark spans four progressive levels of difficulty (Advanced Knowledge Question Answering, Literature Understanding, Molecular Understanding, and Scientific Knowledge Deduction), encompasses 13 capability dimensions, and contains 62 distinct tasks with 3,160 evaluation instances. It covers both text-only and multimodal settings, making it one of the most extensive chemistry-specific LLM evaluation frameworks to date.

Gaps in Existing Chemistry Benchmarks

Prior benchmarks for chemistry LLMs had several shortcomings:

General benchmarks (MMLU, XieZhi, C-Eval) include some chemistry questions but lack the depth needed for meaningful evaluation of domain expertise.
SciEVAL covers scientific tasks broadly but treats chemistry superficially with overly simplistic questions.
ChemLLMBench (Guo et al., 2023) includes only 8 task categories derived from existing public datasets, offering insufficient breadth.
ChemBench (Mirza et al., 2024) provides 7,000 samples but relies exclusively on multiple-choice questions and lacks open-ended evaluation for tasks like synthesis pathway recommendation.
MaCBench (Alampara et al., 2025) introduces multimodal evaluation but remains limited in task diversity.

None of these benchmarks address LLMs’ ability to extract chemical information from text and tables, and none provide a graduated, multi-level assessment of chemical competence from basic knowledge through to advanced scientific reasoning.

A Four-Level Hierarchical Evaluation Framework

ChemEval’s core innovation is its hierarchical structure that mirrors how chemical expertise develops, from foundational knowledge through applied scientific reasoning.

Level 1: Advanced Knowledge Question Answering

This level assesses fundamental chemical knowledge through 15 tasks across two dimensions:

Objective Questions (ObjQA): multiple choice, fill-in-the-blank, and true/false tasks spanning seven core chemistry disciplines (organic, inorganic, materials, analytical, biochemistry, physical, and polymer chemistry).
Subjective Questions (SubjQA): short answer and calculation tasks requiring detailed reasoning and explanation.

Level 2: Literature Understanding

This level evaluates the ability to interpret chemical literature through 19 tasks across three dimensions:

Information Extraction (InfoE): 11 tasks covering named entity recognition, relationship classification, substrate extraction, additive/solvent/temperature/time extraction, product extraction, characterization method extraction, catalysis type extraction, and yield extraction.
Inductive Generation (InducGen): abstract generation, research outline generation, topic classification, and reaction type recognition.
Molecular Name Recognition (MNR): molecular formula recognition, chemical reaction equation recognition, 2D molecular structure recognition, and synthetic pathway analysis (multimodal tasks).

Level 3: Molecular Understanding

This level tests molecular-level comprehension through 15 tasks across four dimensions:

Molecular Name Generation (MNGen): generating SMILES from text descriptions.
Molecular Name Translation (MNTrans): IUPAC to molecular formula, SMILES to molecular formula, IUPAC to SMILES, SMILES to IUPAC, and SMILES/SELFIES interconversion.
Molecular Property Prediction (MPP): classification (ClinTox, HIV inhibition, polarity) and regression (lipophilicity, boiling point).
Molecular Description (MolDesc): physicochemical property prediction from molecular structures and various spectral inputs (IR, Raman, UV-Vis, diffraction, mass spectrum, NMR).

Level 4: Scientific Knowledge Deduction

The most advanced level covers 13 tasks across four dimensions:

Retrosynthetic Analysis (ReSyn): substrate recommendation, synthetic pathway recommendation, and synthetic difficulty evaluation.
Reaction Condition Recommendation (RCRec): ligand, reagent, solvent, catalyst, temperature, and time recommendation.
Reaction Outcome Prediction (ROP): product prediction, yield prediction, and reaction rate prediction.
Reaction Mechanism Analysis (RMA): intermediate derivation.
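
The hierarchy described above can be written down as a small data structure, which makes the headline totals easy to check. The following is a minimal sketch: the task counts and dimension abbreviations are transcribed from this summary, while the dictionary layout and the `summarize` helper are illustrative names, not part of the ChemEval release.

```python
# Task counts per level and dimension abbreviations, transcribed from the summary.
# The dictionary layout itself is an assumption made for illustration.
CHEMEVAL_LEVELS = {
    "Advanced Knowledge Question Answering": {
        "tasks": 15,
        "dimensions": ["ObjQA", "SubjQA"],
    },
    "Literature Understanding": {
        "tasks": 19,
        "dimensions": ["InfoE", "InducGen", "MNR"],
    },
    "Molecular Understanding": {
        "tasks": 15,
        "dimensions": ["MNGen", "MNTrans", "MPP", "MolDesc"],
    },
    "Scientific Knowledge Deduction": {
        "tasks": 13,
        "dimensions": ["ReSyn", "RCRec", "ROP", "RMA"],
    },
}


def summarize(levels):
    """Return (number of levels, number of dimensions, number of tasks)."""
    n_dimensions = sum(len(spec["dimensions"]) for spec in levels.values())
    n_tasks = sum(spec["tasks"] for spec in levels.values())
    return len(levels), n_dimensions, n_tasks


print(summarize(CHEMEVAL_LEVELS))  # (4, 13, 62)
```

The per-level counts add up to the 13 dimensions and 62 tasks quoted in the overview.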

Data Construction

The benchmark combines open-source datasets (ChemRxnExtractor, Mol-Instructions, ChemLLMBench, SMolInstruct) with domain-expert data curated from approximately 500 university-level chemistry textbooks and 9,000 real-world experimental records. Expert-crafted questions were written from scratch to prevent data leakage. A three-tier quality assurance pipeline (annotation by undergraduate students, review by graduate students, final audit by chemistry faculty) ensures correctness.

The text subset contains 1,960 instances (18 open-source tasks, 24 in-house tasks), while the multimodal subset contains 1,200 instances (12 open-source tasks, 30 in-house tasks).

Experimental Setup and Model Comparison

Models Evaluated

ChemEval evaluates a broad set of models under both zero-shot and 3-shot settings:

General LLMs: OpenAI-o1, OpenAI-o3-mini, GPT-4o, Claude-3.7-Sonnet (thinking and non-thinking modes), Gemini-2.5-Pro, Grok3, DeepSeek-V3, DeepSeek-R1, Qwen2.5 (7B/14B/32B/72B), LLaMA3.3-8B.

Chemistry-specific LLMs: ChemDFM, LlaSMol, ChemLLM, ChemSpark.

Multimodal LLMs (for multimodal tasks): GPT-4o, Claude-3.7-Sonnet, Qwen-VL Max, Phi-Vision-3.5, Gemini-2.5-Pro, GLM-4V.

Evaluation Metrics

The benchmark employs task-appropriate metrics: F1 score, Accuracy, BLEU, Exact Match, Normalized RMSE, Tanimoto similarity (with valid output ratio), LLM Score (judged by GPT-4o), L2 Score for molecular formula similarity, and Overlap for range prediction.
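
Two of these metrics are simple enough to sketch directly. The code below is an illustration under stated assumptions, not the benchmark's implementation: fingerprints are represented as plain sets of on-bit indices (in practice they would come from a cheminformatics toolkit), an unparseable model output is represented as `None`, the similarity is averaged over valid outputs only, and NRMSE is normalized by the range of the reference values. The summary does not specify the fingerprint type, the aggregation rule, or the normalizer.

```python
import math
from typing import Optional, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(union)


def tanimoto_with_valid_ratio(
    preds: Sequence[Optional[Set[int]]], refs: Sequence[Set[int]]
) -> Tuple[float, float]:
    """Mean Tanimoto over valid predictions, plus the fraction that were valid.

    A prediction of None stands for an output that could not be parsed
    into a molecule (hypothetical convention for this sketch).
    """
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    scores = [tanimoto(p, r) for p, r in zip(preds, refs) if p is not None]
    valid_ratio = len(scores) / len(preds) if preds else 0.0
    mean_similarity = sum(scores) / len(scores) if scores else 0.0
    return mean_similarity, valid_ratio


def nrmse(preds: Sequence[float], refs: Sequence[float]) -> float:
    """RMSE divided by the range of the reference values (assumed normalizer)."""
    if len(preds) != len(refs) or not refs:
        raise ValueError("preds and refs must be non-empty and of equal length")
    value_range = max(refs) - min(refs)
    if value_range == 0:
        raise ValueError("reference values have zero range")
    mse = sum((p - r) ** 2 for p, r in zip(preds, refs)) / len(refs)
    return math.sqrt(mse) / value_range


# One of three predictions is invalid, so the valid output ratio is 2/3.
similarity, ratio = tanimoto_with_valid_ratio(
    [{1, 2, 3}, None, {4, 5}], [{1, 2, 3, 4}, {7}, {4, 5}]
)
print(similarity, ratio)
```

Reporting the valid output ratio alongside the similarity matters because a model that rarely emits a parseable molecule could otherwise look strong on the few outputs that do parse.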

Key Results (Zero-Shot Text Tasks)

Level (Task)	Top General LLM	General Score	Top Chemistry LLM	Chemistry Score
Knowledge QA (MCTask)	Gemini-2.5-Pro	87.60%	ChemCrow	58.00%
Literature (CNER)	Gemini-2.5-Pro	68.30 F1	ChemSpark	71.44 F1
Molecular (MolNG)	Gemini-2.5-Pro	71.11 Tan.	ChemSpark	74.81 Tan.
Molecular (IUPAC2SMILES)	Gemini-2.5-Pro	61.33 Tan.	ChemSpark	87.54 Tan.
Scientific (SubRec)	OpenAI-o3-mini	4.67 F1	ChemSpark	12.37 F1
Scientific (CatRec)	All models	0.00 F1	ChemSpark	0.20 F1

Key Findings and Performance Patterns

General vs. Chemistry-Specific LLMs

General-purpose LLMs excel at Advanced Knowledge QA and Literature Understanding, benefiting from strong document comprehension and instruction-following abilities. Chemistry-specialized models (particularly ChemSpark) outperform them on tasks demanding domain-specific molecular knowledge, such as molecular name translation and reaction condition recommendation. However, specialized models show notably weaker instruction-following capability and suffer from catastrophic forgetting of general language abilities during fine-tuning. For example, ChemLLM scores 0.00 on multiple information extraction tasks where general LLMs achieve 60–95%.

Impact of Few-Shot Learning

General LLMs tend to benefit from few-shot prompting, particularly for subjective QA and literature understanding tasks. OpenAI-o1 improved on 9 of 10 evaluated tasks. In contrast, chemistry-specialized models often show performance degradation with few-shot examples, likely due to loss of in-context learning capabilities during task-specific fine-tuning. ChemSpark decreased on 7 of 10 tasks in the 3-shot setting.

Impact of Model Scaling

Experiments with Qwen2.5 at 7B, 14B, 32B, and 72B parameters show that scaling improves performance on knowledge QA and literature understanding tasks. However, molecular understanding and scientific knowledge deduction tasks show minimal improvement, and some tasks (e.g., molecular property classification) even decline at the largest scale. Tasks requiring specialized chemical knowledge, like IUPAC-to-SMILES conversion and catalyst recommendation, remain near zero regardless of model size.

Thinking Models

In the comparisons of OpenAI-o1 with GPT-4o and of DeepSeek-R1 with DeepSeek-V3, thinking models perform comparably overall to their non-thinking counterparts. They occasionally excel on specific tasks (e.g., reaction product prediction) but do not consistently outperform across chemical tasks. The authors conclude that the primary bottleneck is insufficient domain-specific knowledge, not reasoning depth.

Multimodal Tasks

Multimodal LLMs handle basic tasks like molecular formula recognition well (GLM-4V and Qwen-VL Max: 100% accuracy) but struggle with advanced challenges. Synthetic pathway analysis yielded 0% F1 across all models. 2D molecular structure recognition produced Tanimoto scores below 21% for all models tested. The performance gap between basic recognition and advanced chemical reasoning is substantial.

Limitations

The authors acknowledge several limitations:

Limited instances per task: with 62 task types and 3,160 total instances, individual tasks may have as few as 20 samples.
Static, single-turn evaluation: the benchmark does not assess dynamic interaction, tool use, or agentic workflows.
No chemistry-specific multimodal models tested: only general-purpose VLMs were evaluated on multimodal tasks.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Evaluation (text)	ChemEval text subset	1,960 instances	18 open-source + 24 in-house tasks
Evaluation (multimodal)	ChemEval multimodal subset	1,200 instances	12 open-source + 30 in-house tasks
Source (open-source)	ChemRxnExtractor, Mol-Instructions, ChemLLMBench, SMolInstruct	Various	Adapted for ChemEval format
Source (expert)	~500 textbooks, ~9,000 experimental records	Various	Novel questions crafted by domain experts

Algorithms

Evaluation prompts: task-specific instructions designed for formatted output, with 0-shot and 3-shot variants.
Decoding: greedy decoding for all LLM inference.
LLM-as-judge: GPT-4o used for LLM Score metric on subjective tasks.
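
The 0-shot and 3-shot variants differ only in whether worked demonstrations precede the test question. A minimal sketch of such a prompt builder follows; the `Question:`/`Answer:` template, the function name `build_prompt`, and the example instruction are assumptions for illustration, since the actual ChemEval prompt wording is not reproduced in this summary.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    question: str,
    demos: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a task prompt; pass no demos for 0-shot, three for 3-shot.

    The layout (instruction, then demonstrations, then the open question)
    is a hypothetical template, not the benchmark's own.
    """
    parts = [instruction.strip()]
    for demo_question, demo_answer in demos:
        parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


instruction = "Give the molecular formula only, with no explanation."
demos = [
    ("What is the molecular formula of water?", "H2O"),
    ("What is the molecular formula of methane?", "CH4"),
    ("What is the molecular formula of ammonia?", "NH3"),
]
question = "What is the molecular formula of ethanol?"

zero_shot = build_prompt(instruction, question)
three_shot = build_prompt(instruction, question, demos)
print(three_shot)
```

With API-hosted models, greedy decoding is usually approximated by setting the sampling temperature to 0; the exact per-provider settings are not given in this summary.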

Evaluation

Key metrics by task type:

Metric	Task Types	Notes
Accuracy	MCTask, TFTask, MolPC, SubE, etc.	Standard classification accuracy
F1 Score	CNER, CERC, extraction tasks, reaction prediction	Precision-recall harmonic mean
BLEU	SMILES2IUPAC	N-gram overlap with brevity penalty
Exact Match	SMILES2IUPAC	Strict string match
Tanimoto Similarity	Molecular generation/translation tasks	Fingerprint-based molecular similarity
NRMSE	Regression tasks (property, temperature, time)	Normalized prediction error
LLM Score	Subjective QA, abstract generation, pathway rec.	GPT-4o evaluation (0-100)
L2 Score	Molecular formula tasks	$1 / (1 + \text{L2 distance})$ between formulas
Overlap	Rate prediction	Intersection/union of predicted vs. reference ranges
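
The F1, L2 Score, and Overlap rows above can be sketched in a few lines each. This is an illustration under stated assumptions rather than the benchmark's code: F1 is computed over sets of extracted strings with exact matching, the L2 distance is taken between per-element atom-count vectors, the formula parser handles only flat formulas such as `C2H6O` (no parentheses, charges, or hydrates), and a range is a `(low, high)` pair.

```python
import math
import re
from collections import Counter
from typing import Set, Tuple


def set_f1(predicted: Set[str], gold: Set[str]) -> float:
    """Harmonic mean of precision and recall over exact-match items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers an empty prediction or an empty gold set
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def parse_formula(formula: str) -> Counter:
    """Count atoms per element in a flat formula such as 'C2H6O'."""
    counts: Counter = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def l2_score(pred_formula: str, ref_formula: str) -> float:
    """1 / (1 + L2 distance) between element-count vectors (assumed encoding)."""
    pred, ref = parse_formula(pred_formula), parse_formula(ref_formula)
    elements = set(pred) | set(ref)
    distance = math.sqrt(sum((pred[e] - ref[e]) ** 2 for e in elements))
    return 1.0 / (1.0 + distance)


def range_overlap(pred: Tuple[float, float], ref: Tuple[float, float]) -> float:
    """Intersection length divided by union length of two closed ranges."""
    (p_lo, p_hi), (r_lo, r_hi) = pred, ref
    intersection = max(0.0, min(p_hi, r_hi) - max(p_lo, r_lo))
    union = (p_hi - p_lo) + (r_hi - r_lo) - intersection
    return intersection / union if union > 0 else 0.0


print(set_f1({"THF", "Pd(OAc)2"}, {"THF", "toluene", "K2CO3"}))
print(l2_score("C2H6O", "C2H4O"))
print(range_overlap((10.0, 20.0), (15.0, 25.0)))
```

An L2 Score of 1.0 means the predicted and reference formulas have identical atom counts, and the score decays toward 0 as the counts diverge.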

Hardware

Chemistry-specific models were run on two NVIDIA A40 48GB GPUs.
General models accessed via official APIs.

Artifacts

Artifact	Type	License	Notes
ChemEval Benchmark	Code + Data	Other (custom)	Evaluation framework and task data

Paper Information

Citation: Huang, Y., Zhang, R., He, X., Zhi, X., Wang, H., Chen, N., Liu, Z., Li, X., Xu, F., Liu, D., Liang, H., Li, Y., Cui, J., Xu, Y., Wang, S., Liu, Q., Lian, D., Liu, G., & Chen, E. (2024). ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models. arXiv preprint arXiv:2409.13989.

@article{huang2024chemeval,
  title={ChemEval: A Comprehensive Multi-Level Chemical Evaluation for Large Language Models},
  author={Huang, Yuqing and Zhang, Rongyang and He, Xuesong and Zhi, Xuyang and Wang, Hao and Chen, Nuo and Liu, Zongbo and Li, Xin and Xu, Feiyang and Liu, Deguang and Liang, Huadong and Li, Yi and Cui, Jian and Xu, Yin and Wang, Shijin and Liu, Qi and Lian, Defu and Liu, Guiquan and Chen, Enhong},
  journal={arXiv preprint arXiv:2409.13989},
  year={2024},
  doi={10.48550/arXiv.2409.13989}
}
