Computational Chemistry
Bar chart showing scientific LLM taxonomy across five modalities: textual, molecular, protein, genomic, and multimodal

Survey of Scientific LLMs in Bio and Chem Domains

This survey systematically reviews scientific LLMs (Sci-LLMs) across five modalities (textual, molecular, protein, genomic, and multimodal), analyzing architectures, datasets, evaluation methods, and open challenges for AI-driven scientific discovery.

Computational Chemistry
Heatmap showing LLM accuracy across nine chemistry coding task categories for four models, with green indicating high accuracy and red indicating low accuracy

Benchmarking Chemistry Knowledge in Code-Gen LLMs

A benchmark of 84 chemistry coding tasks for evaluating code-generating LLMs such as Codex, which reaches 72% accuracy when paired with prompt-engineering strategies that improve performance by 30 percentage points.
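Benchmarks like this typically score each task pass/fail by executing the model's generated code against a reference assertion. A minimal sketch of such a harness (the task schema and the two sample tasks here are hypothetical, not the benchmark's actual format):

```python
def evaluate_candidates(tasks):
    """Return the fraction of tasks whose generated code passes its test.

    Each task is a dict with 'candidate_code' (model output) and
    'test' (a reference assertion). Any exception counts as a failure.
    """
    passed = 0
    for task in tasks:
        ns = {}
        try:
            exec(task["candidate_code"], ns)  # run the model-generated code
            exec(task["test"], ns)            # run the benchmark's assertion
            passed += 1
        except Exception:
            pass                              # wrong answer or crash: failure
    return passed / len(tasks)

# Two illustrative tasks: one correct solution, one wrong one.
tasks = [
    {"candidate_code": "def mol_weight_h2o():\n    return 2 * 1.008 + 15.999",
     "test": "assert abs(mol_weight_h2o() - 18.015) < 0.01"},
    {"candidate_code": "def count_atoms(formula):\n    return None",
     "test": "assert count_atoms('CH4') == 5"},
]
```

Running `evaluate_candidates(tasks)` on this toy set yields 0.5, since only the first candidate passes its assertion.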

Computational Chemistry
Bar chart comparing LLM, DeBERTa, GCN, and GIN performance on three OGB molecular classification benchmarks

Benchmarking LLMs for Molecular Property Prediction

Benchmarks large language models on six molecular property prediction datasets, finding that LLMs lag behind GNNs but can augment ML models when used collaboratively.

Computational Chemistry
Radar chart comparing LLM and human chemist performance across chemistry topics in ChemBench

ChemBench: Evaluating LLM Chemistry Against Experts

ChemBench introduces an automated benchmark of 2,700+ chemistry questions to evaluate LLMs against human expert chemists, revealing that frontier models outperform domain experts on average while struggling with basic tasks and confidence calibration.

Computational Chemistry
Hierarchical pyramid showing ChemEval's four evaluation levels from basic knowledge QA to scientific knowledge deduction

ChemEval: Fine-Grained LLM Evaluation for Chemistry

ChemEval is a four-level, 62-task benchmark for evaluating LLMs across chemical knowledge, literature understanding, molecular reasoning, and scientific deduction, revealing that general LLMs excel at comprehension while chemistry-specific models perform better on domain tasks.

Computational Chemistry
Bar chart comparing LLM safety and quality scores across chemistry benchmark tasks

ChemSafetyBench: Benchmarking LLM Safety in Chemistry

A benchmark of 30K+ samples evaluating LLM safety on chemistry tasks including chemical properties, legality of use, and synthesis planning, with jailbreak testing via name hacking, AutoDAN, and chain-of-thought prompting.

Machine Learning
Log-log plot comparing scaling laws across six architectures showing the vanilla Transformer has the steepest slope

Scaling Laws vs Model Architectures: Inductive Bias

Tay et al. systematically compare scaling laws across ten diverse architectures (Transformers, Switch Transformers, Performers, MLP-Mixers, and others), finding that the vanilla Transformer has the best scaling coefficient and that the best-performing architecture changes across compute regimes.
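Comparing scaling coefficients amounts to fitting a power law L = a · C^(−b) to each architecture's loss-versus-compute curve and comparing the fitted exponents (the slopes in log-log space). A minimal sketch on synthetic data (the constants are illustrative, not the paper's measurements):

```python
import numpy as np

# Synthetic (compute, loss) points following L = a * C^(-b); the exponent b
# is the slope in log-log space compared across architectures.
a_true, b_true = 5.0, 0.1
compute = np.logspace(18, 22, 6)           # FLOPs, illustrative range
loss = a_true * compute ** (-b_true)

# Fit in log-log space: log L = log a - b * log C  (a straight line)
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
b_est, a_est = -slope, np.exp(intercept)
```

On real measurements, a steeper fitted `b_est` (the Transformer's "steepest slope" in the figure) means loss falls faster as compute grows.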

Document Processing
Chart showing the trade-off between accuracy and throughput in document automation

The Reliability Trap: The Limits of 99% Accuracy

We explore the 'silent failure' mode of LLMs in production: why 99% accuracy is not enough for reliable automation, how confidence decays in long documents, and why standard calibration techniques struggle to fix it.
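The arithmetic behind the reliability trap: if each extracted field is correct 99% of the time, whole-document accuracy decays exponentially with document length. A quick illustration (treating per-field errors as independent is a simplifying assumption):

```python
def doc_success_rate(field_accuracy, n_fields):
    """Probability that every field in a document is extracted correctly,
    assuming independent per-field errors."""
    return field_accuracy ** n_fields

# 99% per-field accuracy collapses quickly as documents grow:
rates = {n: doc_success_rate(0.99, n) for n in (10, 50, 100)}
# ~0.90 at 10 fields, ~0.60 at 50 fields, ~0.37 at 100 fields
```

At 100 fields, fewer than four documents in ten come out fully correct, even though every individual field looks 99% reliable.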

Computational Chemistry
Chemical structures and molecular representations feeding into a neural network model that processes atomized chemical knowledge

ChemDFM-R: Chemical Reasoning LLM with Atomized Knowledge

ChemDFM-R is a 14B-parameter chemical reasoning model that integrates a 101B-token dataset of atomized chemical knowledge. Using a mix-sourced distillation strategy and domain-specific reinforcement learning, it outperforms similarly sized models and DeepSeek-R1 on ChemEval.

Computational Chemistry
ChemDFM-X architecture showing five modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) feeding through separate encoders into unified LLM decoder

ChemDFM-X: Multimodal Foundation Model for Chemistry

ChemDFM-X is a multimodal chemical foundation model that integrates five non-text modalities (2D graphs, 3D conformations, images, MS2 spectra, IR spectra) into a single LLM decoder. It overcomes data scarcity by generating a 7.6M-sample instruction-tuning dataset through approximate calculations and model predictions, establishing strong baseline performance across multiple modalities.

Computational Chemistry
InstructMol architecture showing molecular graph and text inputs feeding through two-stage training to produce property predictions, descriptions, and reactions

InstructMol: Multi-Modal Molecular LLM for Drug Discovery

InstructMol integrates a pre-trained molecular graph encoder (MoleculeSTM) with a Vicuna-7B LLM using a linear projector. It employs a two-stage training process (alignment pre-training followed by task-specific instruction tuning with LoRA) to excel at property prediction, description generation, and reaction analysis.
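The architecture reduces to two ideas: a linear projector that maps graph-encoder embeddings into the LLM's hidden space (trained in stage 1), and LoRA's low-rank update to frozen LLM weights (trained in stage 2). A shape-level numpy sketch with illustrative dimensions (not the actual sizes of MoleculeSTM or Vicuna-7B):

```python
import numpy as np

rng = np.random.default_rng(0)
d_graph, d_llm, rank = 300, 512, 8   # illustrative dims, not the papers' values

# Stage 1 (alignment pre-training): a single linear projector maps frozen
# graph-encoder embeddings into the LLM hidden space; only W_proj is trained.
W_proj = rng.standard_normal((d_llm, d_graph)) / np.sqrt(d_graph)
graph_tokens = rng.standard_normal((5, d_graph))   # 5 node/motif embeddings
soft_tokens = graph_tokens @ W_proj.T              # (5, d_llm), prepended to text tokens

# Stage 2 (instruction tuning): LoRA adapts a frozen LLM weight W as
# W + (alpha / rank) * B @ A, training only the low-rank factors A and B.
W = rng.standard_normal((d_llm, d_llm)) / np.sqrt(d_llm)
A = rng.standard_normal((rank, d_llm)) * 0.01
B = np.zeros((d_llm, rank))                        # zero-init: update starts at zero
alpha = 16
W_adapted = W + (alpha / rank) * (B @ A)
```

With `B` zero-initialized, `W_adapted` starts identical to the frozen `W`, so fine-tuning begins from the pre-trained behavior and only the small `A`/`B` factors (plus the projector) carry new parameters.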

Computational Chemistry
MERMaid pipeline diagram showing PDF processing through VisualHeist segmentation, DataRaider VLM mining, and KGWizard graph construction to produce chemical knowledge graphs

MERMaid: Multimodal Chemical Reaction Mining from PDFs

MERMaid leverages fine-tuned vision models and VLM reasoning to mine chemical reaction data directly from PDF figures and tables. By handling context inference and coreference resolution, it builds high-fidelity knowledge graphs with 87% end-to-end accuracy.