Text-Based Crystal Property Prediction with LLMs

LLM-Prop is a Method paper that proposes using the encoder portion of T5 (a general-purpose language model) fine-tuned on crystal text descriptions to predict physical and electronic properties of crystalline materials. The primary contribution is demonstrating that text-based representations of crystals, generated by Robocrystallographer, can serve as effective inputs for property prediction, outperforming graph neural network (GNN) baselines on several tasks despite using a non-domain-specific pre-trained model with fewer parameters.

Why Text Instead of Crystal Graphs?

Graph neural networks have been the dominant approach for crystal property prediction. Models like CGCNN, MEGNet, and ALIGNN represent crystals as graphs where atoms are nodes and bonds are edges. However, GNNs face several fundamental challenges for crystals:

  1. Periodicity encoding: Crystals have repetitive unit cell arrangements that are distinct from standard molecular graphs, and GNNs struggle to encode this periodicity efficiently.
  2. Information incorporation: Critical structural information like bond angles, space group symmetry, and Wyckoff sites is difficult to incorporate into graph representations.
  3. Expressiveness: Graphs may lack the expressiveness needed to convey complex crystal information relevant to property prediction.

Meanwhile, textual descriptions of crystals (generated by tools like Robocrystallographer) naturally encode space group information, bond geometries, coordination environments, and symmetry details in human-readable form. Despite this richness, text-based approaches for crystal property prediction had been largely unexplored.

Core Innovation: T5 Encoder with Careful Fine-Tuning

The key insight of LLM-Prop is to take a pre-trained encoder-decoder model (T5-small) and discard the decoder entirely, using only the encoder with a linear prediction head. This design has several advantages:

  • Cutting the network in half (from ~60M to ~37M parameters) allows processing of longer input sequences
  • Longer sequences mean more crystal information can be included
  • The encoder-only approach avoids T5’s known weakness at regression in text-to-text format

The framework applies several preprocessing strategies to the crystal text descriptions:

  1. Stopword removal: Standard English stopwords are removed, except digits and symbols carrying chemical information
  2. Numerical token replacement: Bond distances are replaced with a [NUM] token and bond angles with [ANG], reducing sequence length while preserving structural cues
  3. [CLS] token prepending: A classification token is added at the start, and its learned embedding is used as input to the prediction layer
  4. Label scaling: For regression tasks, targets are normalized using z-score, min-max, or log normalization

The normalization schemes are defined as:

$$ \hat{Y}_{i}(\text{z-score}) = \frac{Y_{i} - \mu}{\sigma} $$

$$ \hat{Y}_{i}(\text{min-max}) = \frac{Y_{i} - Y_{\min}}{Y_{\max} - Y_{\min}} $$

$$ \hat{Y}_{i}(\text{log-norm}) = \log(Y_{i} + 1) $$

The tokenizer is also retrained on the crystal text corpus with a vocabulary size of 32k, and the special tokens [NUM], [ANG], and [CLS] are added to the vocabulary.

Experimental Setup and Baselines

Dataset: TextEdge

The authors collected data from the Materials Project database (as of November 2022), yielding 144,931 crystal structure-description pairs split into 125,098 training, 9,945 validation, and 9,888 test samples. Crystal text descriptions were generated using Robocrystallographer. The dataset covers six prediction tasks:

TaskTypeMetric
Band gap (eV)RegressionMAE (lower is better)
Unit cell volume (A^3/cell)RegressionMAE (lower is better)
Formation energy per atom (eV/atom)RegressionMAE (lower is better)
Energy per atom (eV/atom)RegressionMAE (lower is better)
Energy above hull (eV/atom)RegressionMAE (lower is better)
Is-gap-directClassificationAUC (higher is better)

Baselines

Seven baselines were compared:

  • GNN-based: CGCNN, MEGNet, ALIGNN, DeeperGATGNN
  • Classic ML: XGBoost, Random Forest (on Robocrystallographer features)
  • Text-based: MatBERT (domain-specific pre-trained BERT, ~110M parameters)

All models were trained and evaluated on the same dataset splits for fair comparison. GNN models were retrained on the new data rather than using results from older, smaller Materials Project versions.

Main Results: LLM-Prop vs. GNN Baselines

When using crystal text descriptions as input, LLM-Prop achieved:

ModelBand gap (eV)Volume (A^3/cell)FEPA (eV/atom)EPA (eV/atom)Ehull (eV/atom)Is-gap-direct (AUC)
CGCNN0.293188.8340.0460.0820.0400.830
MEGNet0.304297.9480.0770.0560.051N/A
ALIGNN0.250129.5800.0270.0590.0280.678
DeeperGATGNN0.291111.8570.0810.1160.045N/A
LLM-Prop (Descr.)0.23139.2520.0560.0670.0470.857

LLM-Prop outperformed the best GNN baseline (ALIGNN) by approximately 8% on band gap prediction, 65% on volume prediction, and 3% on band gap classification (Is-gap-direct). For formation energy per atom, energy per atom, and energy above hull, ALIGNN retained an advantage.

LLM-Prop vs. MatBERT

LLM-Prop also outperformed MatBERT (a domain-specific pre-trained BERT) across all tasks despite having roughly 3x fewer parameters. The table below shows the best result for each model across the three input preprocessing strategies (w/ Numbers, w/o Numbers, w/ [NUM]&[ANG]):

ModelBand gap (eV)Volume (A^3/cell)FEPA (eV/atom)EPA (eV/atom)Ehull (eV/atom)Is-gap-direct (AUC)
MatBERT (best)0.25854.9690.0710.0980.0500.722
LLM-Prop (best)0.23139.1380.0560.0670.0470.857

Note: LLM-Prop’s best band gap (0.231) comes from the “w/o Numbers” configuration, while the best volume (39.138) comes from “w/ Numbers”. The best Is-gap-direct AUC (0.857) uses the “[NUM]&[ANG]” configuration.

Ablation Studies

The contribution of each preprocessing strategy was evaluated:

ConfigurationBand gapVolumeIs-gap-direct (AUC)
LLM-Prop (baseline)0.25669.3520.796
+ modified tokenizer0.24778.6320.785
+ label scaling0.24244.515N/A
+ [CLS] token0.23139.5200.842
+ [NUM] token0.25186.0900.793
+ [ANG] token0.24264.9650.810
- stopwords0.25256.5930.779
LLM-Prop+all (no space group)0.23597.4570.705
LLM-Prop+all0.22942.2590.857

The [CLS] token provided the single largest improvement across all tasks. Label scaling was critical for volume prediction (reducing MAE from 69.352 to 44.515). Removing space group information from descriptions degraded volume prediction dramatically (from 42.259 to 97.457), confirming that space group symmetry is a key factor.

Data Efficiency and Transfer Learning

LLM-Prop achieved SOTA results on band gap and volume prediction with only about 90k training samples (35k fewer than baselines). For volume prediction specifically, LLM-Prop outperformed all GNN baselines with just 30k training samples.

Transfer learning experiments showed that LLM-Prop transferred well between band gap and volume prediction tasks:

ModelVolume-to-Band gap (Test)Band gap-to-Volume (Test)
CGCNN-transfer0.295182.997
ALIGNN-transfer0.322136.164
MatBERT-transfer0.26654.289
LLM-Prop-transfer0.24450.753

Key Findings, Limitations, and Future Directions

Key findings:

  • Text descriptions of crystals carry rich structural information (space groups, Wyckoff sites, coordination geometries) that is difficult to encode in graphs but naturally expressed in text
  • A carefully fine-tuned general-purpose LLM encoder can outperform domain-specific pre-trained models, challenging the assumption that in-domain pre-training is always necessary
  • Removing numerical information (bond distances and angles) from descriptions often improves performance, because current LLMs treat numbers as regular tokens without understanding their quantitative meaning
  • Longer input sequences correlate with better performance, with 888 tokens as the default maximum on the hardware used

Limitations acknowledged by the authors:

  • The origin of LLM-Prop’s performance advantage over GNNs is not fully understood. It remains unclear whether the boost comes from additional structured information in text or from the different data modality itself
  • LLM-Prop cannot perform zero-shot predictions since T5 was not pre-trained on materials science data
  • The approach depends on Robocrystallographer to generate text descriptions, adding a preprocessing dependency
  • Current LLMs’ inability to reason about numerical values limits the use of quantitative information in descriptions

Future directions suggested by the authors include investigating techniques to use CIF files directly as LLM inputs, developing new GNN architectures that incorporate space group and Wyckoff site information, and further exploring which information in crystal descriptions contributes most to each property prediction task.


Reproducibility Details

Data

PurposeDatasetSizeNotes
Training/EvalTextEdge144,931 crystalsFrom Materials Project (Nov 2022), text generated by Robocrystallographer
Training splitTextEdge125,098Random split
Validation splitTextEdge9,945Random split
Test splitTextEdge9,888Random split

Algorithms

  • Optimizer: Adam with one-cycle learning rate scheduler
  • Learning rate: 1e-3 for LLM-Prop, 5e-5 for MatBERT
  • Dropout: 0.2 for LLM-Prop, 0.5 for MatBERT
  • Batch size: 64 (888 tokens) or 16 (2000 tokens) for LLM-Prop
  • Epochs: 200-300 depending on task
  • Loss: MAE for regression, BCE for classification
  • Evaluation: MAE for regression, AUC for classification
  • Each model run 5 times on test set, averaged MAE reported

Models

  • Base model: T5-small encoder (~60M parameters total, ~37M after discarding decoder and adding prediction head)
  • Vocabulary size: 32k (retrained tokenizer)
  • Max input tokens: 888 (default) or 2000
  • Special tokens: [CLS], [NUM], [ANG]

Artifacts

ArtifactTypeLicenseNotes
LLM-PropCodeMITOfficial implementation
TextEdge + CheckpointsDataset + ModelNot specifiedBenchmark dataset and trained model checkpoints

Hardware

  • GPUs: NVIDIA RTX A6000
  • Training time: ~40 minutes per epoch for LLM-Prop
  • Inference: ~1 minute for 10,000 materials on one GPU

Paper Information

Citation: Rubungo, A. N., Arnold, C. B., Rand, B. P., & Dieng, A. B. (2025). LLM-Prop: predicting the properties of crystalline materials using large language models. npj Computational Materials, 11, 186. https://doi.org/10.1038/s41524-025-01536-2

@article{rubungo2025llmprop,
  title={LLM-Prop: predicting the properties of crystalline materials using large language models},
  author={Rubungo, Andre Niyongabo and Arnold, Craig B. and Rand, Barry P. and Dieng, Adji Bousso},
  journal={npj Computational Materials},
  volume={11},
  number={1},
  pages={186},
  year={2025},
  publisher={Nature Publishing Group},
  doi={10.1038/s41524-025-01536-2}
}