A Resource for Chemistry Instruction Tuning

This is a Resource paper that contributes both a large-scale instruction tuning dataset (SMolInstruct) and a family of fine-tuned LLMs (LlaSMol) for chemistry tasks. The primary contribution is SMolInstruct, a dataset of 3.3 million samples across 14 chemistry tasks, paired with systematic experiments showing that instruction-tuned open-source LLMs can substantially outperform GPT-4 and Claude 3 Opus on chemistry benchmarks. The dataset construction methodology, quality control pipeline, and careful data splitting are central to the paper’s value.

Why LLMs Struggle with Chemistry Tasks

Prior work demonstrated that general-purpose LLMs perform poorly on chemistry tasks. Guo et al. (2023) found that GPT-4, while outperforming other LLMs, falls far short of task-specific deep learning models, particularly on tasks requiring precise understanding of SMILES representations. Fang et al. (2023) attempted instruction tuning with Mol-Instructions, but the resulting models still performed well below task-specific baselines.

These results raised a fundamental question: are LLMs inherently limited for chemistry, or is the problem simply insufficient training data? The authors argue it is the latter. Previous instruction tuning datasets suffered from limited scale (Mol-Instructions had 1.3M samples with fewer task types), lower quality (numerous low-quality molecular descriptions, mislabeled reactants/reagents in reaction data), and suboptimal design choices (using SELFIES instead of canonical SMILES, inconsistent data splitting that allowed leakage).

SMolInstruct: A Comprehensive Chemistry Instruction Dataset

The core innovation is the SMolInstruct dataset, which addresses the limitations of prior datasets through three design principles:

Scale and comprehensiveness. SMolInstruct contains 3.3M samples across 14 tasks organized into four categories:

  • Name conversion (4 tasks): IUPAC-to-formula, IUPAC-to-SMILES, SMILES-to-formula, SMILES-to-IUPAC, sourced from PubChem
  • Property prediction (6 tasks): ESOL, Lipo, BBBP, ClinTox, HIV, SIDER, sourced from MoleculeNet
  • Molecule description (2 tasks): molecule captioning and molecule generation, sourced from ChEBI-20 and Mol-Instructions
  • Chemical reactions (2 tasks): forward synthesis and retrosynthesis, sourced from USPTO-full

Quality control. The authors apply rigorous curation: invalid SMILES are filtered using RDKit, mislabeled reactants/reagents in USPTO-full are corrected by comparing atom mappings with products, low-quality molecular descriptions are removed using pattern-based rules, and duplicates are eliminated.
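The description filtering and deduplication steps can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the regular expressions, the `min_words` threshold, and the sample records are all hypothetical, and the RDKit-based SMILES validity check is assumed to have already been applied.

```python
import re

# Hypothetical patterns for uninformative descriptions; the paper's
# actual pattern-based rules are not reproduced here.
LOW_QUALITY_PATTERNS = [
    re.compile(r"^\s*the molecule is a (chemical )?compound\.?\s*$", re.IGNORECASE),
    re.compile(r"no description available", re.IGNORECASE),
]


def is_low_quality(description: str, min_words: int = 5) -> bool:
    """Flag descriptions that are very short or match a known-bad pattern."""
    if len(description.split()) < min_words:
        return True
    return any(p.search(description) for p in LOW_QUALITY_PATTERNS)


def curate(samples):
    """Drop low-quality descriptions and exact duplicate (SMILES, description) pairs."""
    seen = set()
    kept = []
    for sample in samples:
        if is_low_quality(sample["description"]):
            continue
        key = (sample["smiles"], sample["description"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


# Toy records for illustration only.
samples = [
    {"smiles": "CCO", "description": "Ethanol is a primary alcohol that is ethane with one hydroxy substituent."},
    {"smiles": "CCO", "description": "Ethanol is a primary alcohol that is ethane with one hydroxy substituent."},
    {"smiles": "C", "description": "The molecule is a compound."},
    {"smiles": "O", "description": "No description available for this particular molecule entry."},
]
curated = curate(samples)
```

Only the first ethanol record survives: the second is an exact duplicate, and the other two match the hypothetical low-quality patterns.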

Careful data splitting. To prevent data leakage across related tasks (e.g., forward synthesis and retrosynthesis share the same reactions), the authors ensure matched samples across reverse tasks are placed together in either training or evaluation sets. Samples with identical inputs but different outputs are also grouped together to prevent exaggerated performance estimates.
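One way to implement this kind of leakage-aware split is to merge samples into groups before assigning them to a split. The sketch below assumes each sample carries a hypothetical `record_id` (the underlying reaction or molecule) plus `task` and `input` fields; it uses a union-find structure and is an illustration of the idea rather than the authors' code.

```python
import random
from collections import defaultdict


def group_samples(samples):
    """Merge samples that share a source record or an identical (task, input)."""
    parent = list(range(len(samples)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}
    for idx, s in enumerate(samples):
        keys = (("record", s["record_id"]), ("input", s["task"], s["input"]))
        for key in keys:
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(samples)):
        groups[find(idx)].append(idx)
    return list(groups.values())


def split_by_group(samples, test_fraction=0.2, seed=0):
    """Assign whole groups to either the training or the evaluation split."""
    groups = group_samples(samples)
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_idx = {i for group in groups[:n_test] for i in group}
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test


# Toy reactions: r1 and r2 yield the same product, so their retrosynthesis
# samples have identical inputs and must end up in the same split.
samples = [
    {"record_id": "r1", "task": "FS", "input": "CC(=O)O.CCO", "output": "CCOC(C)=O"},
    {"record_id": "r1", "task": "RS", "input": "CCOC(C)=O", "output": "CC(=O)O.CCO"},
    {"record_id": "r2", "task": "RS", "input": "CCOC(C)=O", "output": "CC(=O)Cl.CCO"},
    {"record_id": "r2", "task": "FS", "input": "CC(=O)Cl.CCO", "output": "CCOC(C)=O"},
    {"record_id": "r3", "task": "FS", "input": "CC(=O)O.CO", "output": "COC(C)=O"},
    {"record_id": "r3", "task": "RS", "input": "COC(C)=O", "output": "CC(=O)O.CO"},
]
train, test = split_by_group(samples)
```

Splitting at the group level rather than the sample level means a reaction seen in forward synthesis during training can never reappear as a retrosynthesis test case.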

Additionally, all SMILES representations are canonicalized, and special tags (e.g., <SMILES>...</SMILES>) encapsulate different information types within the instruction templates.
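The tagging convention makes it straightforward both to build queries and to pull structured answers out of free-form responses. The following sketch assumes a hypothetical template wording and helper names (`wrap`, `build_query`, `extract`); only the `<SMILES>...</SMILES>` tag itself comes from the text above.

```python
import re


def wrap(tag: str, content: str) -> str:
    """Encapsulate content in an opening and closing tag."""
    return f"<{tag}>{content}</{tag}>"


def build_query(template: str, smiles: str) -> str:
    """Fill an instruction template with a tagged SMILES string."""
    return template.format(smiles=wrap("SMILES", smiles))


def extract(tag: str, text: str):
    """Return every tagged span of the given type found in a response."""
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(text)]


# Hypothetical template wording, for illustration only.
template = "What is the molecular formula of {smiles}?"
query = build_query(template, "CCO")

# A made-up model response for a task whose answer is a SMILES string.
response = "The product is <SMILES> CCOC(C)=O </SMILES>."
predicted = extract("SMILES", response)
```

Because the answer is delimited explicitly, evaluation code can ignore any surrounding natural-language text in the model output.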

Experimental Setup: Four Base Models and Comprehensive Baselines

The authors fine-tune four open-source LLMs using LoRA (applied to all attention and FFN linear layers, with rank and alpha both set to 16):

  • Galactica 6.7B: pretrained on scientific text including chemistry data
  • Llama 2 7B: general-purpose LLM
  • Code Llama 7B: code-focused variant of Llama 2
  • Mistral 7B: general-purpose LLM

Training uses 8-bit AdamW with learning rate 1e-4, cosine scheduler, and 3 epochs. Only 0.58% of parameters are fine-tuned (approximately 41.9M parameters). Beam search is used at inference.
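The 41.9M figure can be checked with simple arithmetic: a rank-r LoRA adapter on a linear layer with shape (in, out) adds r × (in + out) parameters. The sketch below assumes the publicly documented Mistral 7B dimensions (hidden size 4096, FFN size 14336, 32 layers, key/value projection width 1024) and an approximate base parameter count; these values are not stated in the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderDims:
    hidden: int        # model (residual stream) width
    intermediate: int  # FFN inner width
    layers: int        # number of transformer blocks
    kv_dim: int        # output width of the key/value projections


def lora_params_per_layer(d: DecoderDims, rank: int) -> int:
    """Count LoRA parameters for all attention and FFN linear layers in one block."""
    shapes = {
        "q_proj": (d.hidden, d.hidden),
        "k_proj": (d.hidden, d.kv_dim),
        "v_proj": (d.hidden, d.kv_dim),
        "o_proj": (d.hidden, d.hidden),
        "gate_proj": (d.hidden, d.intermediate),
        "up_proj": (d.hidden, d.intermediate),
        "down_proj": (d.intermediate, d.hidden),
    }
    # Each adapter has an (in x rank) and a (rank x out) matrix.
    return sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())


# Assumed Mistral 7B shape (from its public configuration).
mistral_7b = DecoderDims(hidden=4096, intermediate=14336, layers=32, kv_dim=1024)
total = lora_params_per_layer(mistral_7b, rank=16) * mistral_7b.layers

# Approximate base parameter count for Mistral 7B (assumption).
base_params = 7_241_732_096
share = total / (base_params + total)

print(f"{total:,} trainable LoRA parameters ({share:.2%} of all parameters)")
```

Under these assumptions the count comes to 41,943,040 parameters, which agrees with the reported 41.9M and 0.58%.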

Baselines include:

  • General LLMs without fine-tuning: GPT-4, Claude 3 Opus, and the four base models
  • Chemistry-specific LLMs: Molinst (Llama 2 tuned on Mol-Instructions), ChemLLM
  • Task-specific non-LLM models: STOUT for name conversion, Uni-Mol for property prediction, MolT5 for molecule description, RSMILES and Molecular Transformer for reaction prediction

Main Results

| Task (metric) | Best LlaSMol | GPT-4 | Improvement |
| --- | --- | --- | --- |
| Name conversion (NC-I2F, EM%) | 87.9 (Mistral) | 8.7 | +79.2 |
| Name conversion (NC-I2S, EM%) | 70.1 (Mistral) | 3.3 | +66.8 |
| Property prediction (PP-ESOL, RMSE) | 1.150 (Mistral) | 2.570 | -1.42 (lower is better) |
| Property prediction (PP-BBBP, Acc%) | 74.6 (Mistral) | 62.9 | +11.7 |
| Molecule captioning (METEOR) | 0.452 (Mistral) | 0.188 | +0.264 |
| Molecule generation (FTS%) | 61.7 (Mistral) | 42.6 | +19.1 |
| Forward synthesis (EM%) | 63.3 (Mistral) | 1.6 | +61.7 |
| Retrosynthesis (EM%) | 32.9 (Mistral) | 0.0 | +32.9 |

LlaSMol-Mistral consistently outperforms all other LLMs and the other LlaSMol variants. It also surpasses task-specific SoTA models on PP-ClinTox (93.1 vs. 92.4) and PP-SIDER (70.7 vs. 70.0), though it has not yet matched SoTA on most other tasks.

Ablation Study

The ablation study examines three variants:

  1. Without canonicalization: Performance drops on most tasks, with substantial decreases on forward synthesis (63.3 to 53.7 EM%) and retrosynthesis (32.9 to 23.8 EM%), confirming that canonicalized SMILES reduce learning difficulty.

  2. Using SELFIES instead of SMILES: While SELFIES achieves slightly higher validity (100% vs. 99.7% on some tasks), it results in worse performance overall. SELFIES strings are typically longer than SMILES, making them harder for models to process accurately. This finding contradicts claims from prior work (Fang et al., 2023) that SELFIES should be preferred.

  3. Training on Mol-Instructions instead of SMolInstruct: Using the same base model (Mistral) and identical training settings, the Mol-Instructions-trained model performs drastically worse, achieving near-zero accuracy on name conversion and property prediction tasks, and much lower performance on shared tasks (MC, MG, FS, RS).

Additional Analysis

Multi-task training generally outperforms single-task training, with particularly large improvements on PP-ESOL (RMSE 20.616 to 1.150) and molecule generation (FTS 33.1% to 61.7%). Increasing the number of trainable LoRA parameters from 6.8M (0.09%) to 173.0M (2.33%) leads to consistent performance improvements across most tasks, suggesting further gains are possible with more extensive fine-tuning.

Key Findings and Limitations

The paper establishes several findings:

  1. LLMs can perform chemistry tasks effectively when provided with sufficient high-quality instruction tuning data. This refutes the notion that LLMs are fundamentally limited for chemistry.

  2. The choice of base model matters considerably. Mistral 7B outperforms Llama 2, Code Llama, and Galactica despite identical training, suggesting that general language understanding transfers well to chemistry.

  3. Canonical SMILES outperform both non-canonical SMILES and SELFIES for LLM-based chemistry, a practical recommendation for future work.

  4. Dataset quality matters independently of model architecture. With the base model (Mistral) and training settings held fixed, training on SMolInstruct vastly outperforms training on Mol-Instructions.

The authors acknowledge several limitations. The evaluation metrics for molecule captioning and generation (METEOR, FTS) measure text similarity rather than chemical correctness. The paper does not evaluate generalization to tasks beyond the 14 training tasks. LlaSMol models do not yet outperform task-specific SoTA models on most tasks, though the gap has narrowed substantially with only 0.58% of parameters fine-tuned.


Reproducibility Details

Data

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Training | SMolInstruct | 3.29M samples | 14 tasks, canonical SMILES, publicly available on HuggingFace |
| Evaluation | SMolInstruct test split | 33,061 samples | Careful splitting to prevent leakage across tasks |
| NC tasks | PubChem | ~300K molecules | IUPAC names, SMILES, molecular formulas |
| PP tasks | MoleculeNet | ~78K samples | 6 datasets (ESOL, Lipo, BBBP, ClinTox, HIV, SIDER) |
| MC/MG tasks | ChEBI-20 + Mol-Instructions | ~60K samples | Quality-filtered molecular descriptions |
| FS/RS tasks | USPTO-full | ~1.9M samples | Cleaned, with corrected reactant/reagent labels |

Algorithms

  • Fine-tuning: LoRA with rank=16, alpha=16, applied to all attention and FFN linear layers
  • Optimizer: 8-bit AdamW, learning rate 1e-4, cosine scheduler
  • Training: 3 epochs, max input length 512 tokens
  • Inference: Beam search with beam size = num_return_sequences + 3
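The beam-size rule in the last bullet can be expressed as a small helper that builds generation settings. This is a sketch under assumptions: the helper name is hypothetical, and the dictionary keys follow the common HuggingFace `generate` argument names rather than anything confirmed by the paper's scripts.

```python
def beam_search_kwargs(num_return_sequences: int, extra_beams: int = 3) -> dict:
    """Build decoding settings with beam size = num_return_sequences + extra_beams."""
    if num_return_sequences < 1:
        raise ValueError("num_return_sequences must be at least 1")
    return {
        "num_beams": num_return_sequences + extra_beams,
        "num_return_sequences": num_return_sequences,
        "do_sample": False,  # deterministic beam search, no sampling
    }


# Requesting the top 5 candidates uses a beam of width 8.
settings = beam_search_kwargs(num_return_sequences=5)
```

Keeping a few extra beams beyond the number of returned sequences leaves room for candidates that are later discarded, for example invalid SMILES.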

Models

| Model | Base | Parameters | LoRA Parameters |
| --- | --- | --- | --- |
| LlaSMol-Galactica | Galactica 6.7B | 6.7B | 41.9M (0.58%) |
| LlaSMol-Llama2 | Llama 2 7B | 7B | 41.9M (0.58%) |
| LlaSMol-CodeLlama | Code Llama 7B | 7B | 41.9M (0.58%) |
| LlaSMol-Mistral | Mistral 7B | 7B | 41.9M (0.58%) |

All models and the dataset are publicly released on HuggingFace.

Evaluation

| Metric | Task(s) | Notes |
| --- | --- | --- |
| Exact Match (EM) | NC, MG, FS, RS | Molecular identity comparison via RDKit |
| Fingerprint Tanimoto Similarity (FTS) | MG, FS, RS | Morgan fingerprints |
| METEOR | MC | Text similarity metric |
| RMSE | PP-ESOL, PP-Lipo | Regression tasks |
| Accuracy | PP-BBBP, PP-ClinTox, PP-HIV, PP-SIDER | Binary classification |
| Validity | NC-I2S, MG, FS, RS | Ratio of valid SMILES outputs |
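Several of these metrics reduce to short formulas. The sketch below gives plain-Python versions of RMSE, accuracy, and Tanimoto similarity; it assumes fingerprints are already available as sets of "on" bit indices (in practice they would come from a cheminformatics toolkit such as RDKit), and the empty-fingerprint convention is a choice made for this example.

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root mean squared error for regression tasks such as PP-ESOL."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def accuracy(y_true, y_pred) -> float:
    """Fraction of matching labels for binary classification tasks."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen for this sketch when both are empty
    return len(bits_a & bits_b) / len(union)


# Toy values for illustration only.
esol_error = rmse([1.0, 2.0], [1.0, 4.0])
bbbp_acc = accuracy([True, False, True, True], [True, True, True, False])
fts = tanimoto({1, 2, 3}, {2, 3, 4})
```

Exact match is not shown because it depends on comparing molecular identity rather than raw strings, which requires a toolkit-level canonical representation.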

Hardware

The paper does not specify exact GPU hardware or training times. Training uses the HuggingFace Transformers library with LoRA, and inference is run on Ohio Supercomputer Center resources.

Artifacts

| Artifact | Type | License | Notes |
| --- | --- | --- | --- |
| LlaSMol Code | Code | MIT | Training, evaluation, and inference scripts |
| SMolInstruct | Dataset | CC-BY-4.0 | 3.3M samples across 14 chemistry tasks |
| LlaSMol-Mistral-7B | Model | CC-BY-4.0 | Best-performing model (LoRA adapters) |
| LlaSMol-Galactica-6.7B | Model | CC-BY-4.0 | LoRA adapters for Galactica |
| LlaSMol-Llama2-7B | Model | CC-BY-4.0 | LoRA adapters for Llama 2 |
| LlaSMol-CodeLlama-7B | Model | CC-BY-4.0 | LoRA adapters for Code Llama |

Paper Information

Citation: Yu, B., Baker, F. N., Chen, Z., Ning, X., & Sun, H. (2024). LlaSMol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset. arXiv preprint arXiv:2402.09391.

@article{yu2024llamsmol,
  title={LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset},
  author={Yu, Botao and Baker, Frazier N. and Chen, Ziqi and Ning, Xia and Sun, Huan},
  journal={arXiv preprint arXiv:2402.09391},
  year={2024}
}