GPT-3 as a General-Purpose Chemistry Predictor

This is an empirical paper that systematically benchmarks fine-tuned GPT-3 against dedicated machine learning models across 15 chemistry and materials science prediction tasks. The primary contribution is demonstrating that a general-purpose large language model, with no chemistry-specific architecture or featurization, can match or outperform specialized ML approaches, particularly when training data is limited. The paper also demonstrates inverse molecular design through simple prompt inversion.

Why General-Purpose LLMs for Chemistry

Machine learning in chemistry typically requires domain-specific feature engineering: molecular fingerprints, graph neural network architectures, or hand-crafted descriptors tailored to each application. Developing these approaches demands specialized expertise and significant effort for each new problem. The small datasets common in experimental chemistry further complicate matters, as many sophisticated ML approaches require large training sets to learn meaningful representations.

Large language models like GPT-3, trained on vast internet text corpora, had shown surprising capability at tasks they were not explicitly trained for. The key question motivating this work was whether these general-purpose models could also answer scientific questions for which we lack answers, given that most chemistry problems can be represented in text form. For example: “If I change the metal in my metal-organic framework, will it be stable in water?”

Prior chemical language models (e.g., Transformer-CNN, Regression Transformer, SELFormer) were pre-trained on chemistry-specific corpora. In contrast, this work investigates models trained primarily on general internet text, examining whether the implicit chemical knowledge encoded during pre-training, combined with task-specific fine-tuning, can substitute for explicit chemical featurization.

Language-Interfaced Fine-Tuning for Chemistry

The core innovation is “language-interfaced fine-tuning” (LIFT): reformulating chemistry prediction tasks as natural language question-answering. Training examples take the form of question-completion pairs, where questions describe the chemical system in text and completions provide the target property. For example:

  • Classification: “What is the phase of Co1Cu1Fe1Ni1V1?” with completion “0” (multi-phase)
  • Regression: Property values are rounded to a fixed precision, converting continuous prediction into a text generation problem
  • Inverse design: Questions and completions are simply swapped, asking “What is a molecule with property X?” and expecting a SMILES string as completion
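The question-completion framing above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's exact templates: the "###" separator and "@@@" stop token are assumptions in the style of OpenAI fine-tuning conventions.

```python
# Sketch of LIFT-style training examples. The separator ("###") and stop
# token ("@@@") are illustrative assumptions, not necessarily the exact
# strings used in the paper.

def make_classification_example(composition: str, phase_label: int) -> dict:
    """Frame a phase-classification data point as a question-completion pair."""
    return {
        "prompt": f"What is the phase of {composition}?###",
        "completion": f" {phase_label}@@@",
    }

def make_inverse_example(prop_name: str, value: str, smiles: str) -> dict:
    """Inverse design: the question asks for a molecule, the answer is a SMILES."""
    return {
        "prompt": f"What is a molecule with {prop_name} of {value}?###",
        "completion": f" {smiles}@@@",
    }

example = make_classification_example("Co1Cu1Fe1Ni1V1", 0)
```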

The fine-tuning uses OpenAI’s API with the smallest ada variant of GPT-3, with uniform hyperparameters across all tasks (8 epochs, learning rate multiplier of 0.02). No optimization of prompt structure, tokenization, or training schedule was performed, making the approach deliberately simple.

For regression, since language models generate discrete tokens rather than continuous values, the authors round target values to a fixed precision (e.g., 1% for Henry coefficients). This converts regression into a form of classification over numeric strings, with the assumption that GPT-3 can interpolate between these discretized values.
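The rounding trick can be sketched as an encode/decode pair; the 2-decimal precision below is illustrative rather than the paper's per-task choice.

```python
# Minimal sketch of the regression-as-text trick: round continuous targets
# to a fixed precision so each value becomes a short numeric string the
# model can emit token by token. Precision here (2 decimals) is illustrative.

def encode_target(value: float, decimals: int = 2) -> str:
    """Continuous target -> discretized numeric string for the completion."""
    return f"{round(value, decimals):.{decimals}f}"

def decode_completion(text: str) -> float:
    """Model completion (numeric string) -> float prediction."""
    return float(text.strip())

y = -3.14159
token = encode_target(y)             # "-3.14"
y_hat = decode_completion(token)
quantization_error = abs(y - y_hat)  # bounded by half the rounding step
```

The quantization error introduced by this discretization is one of the limitations discussed later.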

The approach also extends to open-source models. The authors demonstrate that GPT-J-6B can be fine-tuned using parameter-efficient techniques (LoRA, 8-bit quantization) on consumer hardware, and provide the chemlift Python package for this purpose.
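Why LoRA makes a 6B-parameter model trainable on consumer hardware can be sketched with a back-of-the-envelope parameter count; the dimensions and rank below are illustrative, not GPT-J's actual shapes.

```python
import numpy as np

# Sketch of the LoRA idea: instead of updating a frozen d_out x d_in weight
# matrix W, train two small low-rank factors B (d_out x r) and A (r x d_in)
# with r << min(d_out, d_in). Dimensions and rank here are illustrative.

d_out, d_in, r = 4096, 4096, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # zero init: the update starts at zero

W_effective = W + B @ A                  # forward pass uses W + BA

full_params = d_out * d_in               # parameters to train without LoRA
lora_params = r * (d_out + d_in)         # trainable parameters with LoRA (~0.4%)
```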

Benchmarks Across Molecules, Materials, and Reactions

Datasets and Tasks

The evaluation spans three chemical domains with 15 total benchmarks:

Molecules:

  • Photoswitch transition wavelength prediction (2022)
  • Free energy of solvation (FreeSolv, 2014)
  • Aqueous solubility (ESOL, 2004)
  • Lipophilicity (ChEMBL, 2012)
  • HOMO-LUMO gap (QMugs, 2022)
  • Organic photovoltaic power conversion efficiency (2018)

Materials:

  • High-entropy alloy (HEA) phase classification (single-phase vs. multi-phase)
  • Henry coefficients for CO2 and CH4 adsorption in metal-organic frameworks (MOFs)

Reactions:

  • C-N cross-coupling reaction yield prediction

Baselines

The baselines include both traditional ML and deep learning approaches:

  • Non-DL: XGBoost with molecular descriptors/fragprints, Gaussian Process Regression (GPR), random forests, n-Gram models, Automatminer, differential reaction fingerprints (DRFP)
  • Deep learning: MolCLR, ModNet, CrabNet, TabPFN

Data Efficiency Analysis

To compare data efficiency, the authors fit power laws to each model's learning curve and compute a “data efficiency factor”: how much more (or less) data the best baseline needs to match GPT-3’s performance in the low-data regime.
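This comparison can be sketched numerically, assuming each learning curve follows a pure power law err(n) = a·n^(-b); the paper's exact functional form may differ, and the curves below are synthetic.

```python
import numpy as np

# Sketch of the data-efficiency comparison, assuming learning curves follow
# a pure power law err(n) = a * n**(-b). The factor answers: how many
# training points does the baseline need to reach the LLM's error at n?

def fit_power_law(n, err):
    """Fit err = a * n**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return np.exp(intercept), -slope  # a, b

def data_efficiency_factor(n, a_llm, b_llm, a_base, b_base):
    """Points the baseline needs to match the LLM at n, divided by n."""
    target_err = a_llm * n ** (-b_llm)
    n_needed = (target_err / a_base) ** (-1.0 / b_base)
    return n_needed / n

# Synthetic curves: the "LLM" has the lower error at small n.
n = np.array([10, 20, 50, 100, 200])
a1, b1 = fit_power_law(n, 2.0 * n ** -0.6)   # LLM-like curve
a2, b2 = fit_power_law(n, 3.0 * n ** -0.5)   # baseline-like curve
factor = data_efficiency_factor(50, a1, b1, a2, b2)  # > 1: LLM more data-efficient
```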

Domain    | Benchmark              | Data efficiency vs. non-DL | vs. DL baseline
----------|------------------------|----------------------------|----------------
Molecules | Photoswitch wavelength | 1.1x (n-Gram)              | 1.2x (TabPFN)
Molecules | Solvation free energy  | 3.1x (GPR)                 | 1.3x (TabPFN)
Molecules | Solubility             | 1.0x (XGBoost)             | 0.002x (MolCLR)
Molecules | Lipophilicity          | 3.43x (GPR)                | 0.97x (TabPFN)
Molecules | HOMO-LUMO gap          | 4.3x (XGBoost)             | 0.62x (TabPFN)
Materials | HEA phase              | 24x (RF)                   | 9.0x (CrabNet)
Materials | CO2 Henry coeff.       | 0.40x (XGBoost)            | 12x (TabPFN)
Reactions | C-N cross-coupling     | 2.9x (DRFP)                | -

Values >1 indicate GPT-3 is more data-efficient. For the HEA phase prediction task, GPT-3 achieved comparable accuracy to a random forest model trained on 1,126 data points using only about 50 training examples.

Representation Sensitivity

An important finding is that GPT-3 performs well regardless of molecular representation format. The authors tested IUPAC names, SMILES, and SELFIES, finding good results across all representations. IUPAC names often produced the best performance, which is notable because it makes the approach accessible to non-specialists who can simply use chemical names rather than learning specialized encodings.
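Because the model consumes plain text, swapping representations amounts to swapping strings in the prompt. A minimal illustration for ethanol (the SELFIES string is written by hand here; in practice one would convert with the selfies package):

```python
# The same molecule serialized three ways; a LIFT prompt just substitutes
# the string. Representations shown are for ethanol.

representations = {
    "IUPAC": "ethanol",
    "SMILES": "CCO",
    "SELFIES": "[C][C][O]",
}

prompts = {
    fmt: f"What is the aqueous solubility of {mol}?"
    for fmt, mol in representations.items()
}
```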

Inverse Design

For inverse design, the authors fine-tuned GPT-3 with reversed question-completion pairs. On photoswitches:

  • Generated molecules include both training set members and novel structures (some not in PubChem)
  • Transition wavelengths matched target values within about 10% mean absolute percentage error (validated using the GPR model from Griffiths et al.)
  • A temperature parameter controls the diversity-validity tradeoff: low temperatures produce training set copies, high temperatures produce diverse but potentially invalid structures
  • Across all temperatures, generated molecules showed low synthetic accessibility (SA) scores, suggesting synthesizability
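The temperature knob in the third bullet is standard softmax temperature scaling, which can be sketched directly; the logits below are arbitrary.

```python
import numpy as np

# Sketch of how sampling temperature trades diversity against fidelity:
# dividing the logits by T before the softmax sharpens (low T) or flattens
# (high T) the next-token distribution. Logits here are arbitrary.

def softmax_with_temperature(logits, T):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

logits = np.array([3.0, 1.0, 0.5, -1.0])
p_low = softmax_with_temperature(logits, 0.1)   # near-greedy: training set copies
p_high = softmax_with_temperature(logits, 5.0)  # near-uniform: diverse, riskier
```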

The authors also demonstrated iterative inverse design for HOMO-LUMO gap optimization: starting from QMugs data, they iteratively fine-tuned GPT-3 to generate molecules with progressively larger bandgaps (>5 eV), successfully shifting the distribution over four generations. This worked even when extrapolating beyond the training distribution (e.g., training only on molecules with gaps <3.5 eV, then generating molecules with gaps >4.0 eV).
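The generate-filter-refinetune loop can be sketched with toy stand-ins. The finetune and generate functions below are hypothetical stubs (the paper fine-tunes GPT-3 via the OpenAI API); here a "model" is just a distribution mean that the loop shifts upward.

```python
import random

# Toy sketch of iterative inverse design: fine-tune, generate candidates,
# keep the best, fine-tune again. `finetune` and `generate` are hypothetical
# stand-ins; a "model" here is simply the mean of its training targets.

def finetune(training_gaps):
    """Stub model: remembers the mean of its training targets."""
    return sum(training_gaps) / len(training_gaps)

def generate(model_mean, n=200, rng=None):
    """Stub sampler: candidate HOMO-LUMO gaps around the model's mean."""
    rng = rng or random.Random(0)
    return [rng.gauss(model_mean, 0.5) for _ in range(n)]

gaps = [2.0, 2.5, 3.0, 3.2]           # initial training data (eV, illustrative)
for generation in range(4):
    model = finetune(gaps)
    candidates = generate(model, rng=random.Random(generation))
    gaps = sorted(candidates)[-50:]   # keep the largest gaps for re-training

# After four generations the training pool has drifted toward larger gaps.
```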

Coarse-Grained Polymer Design

A striking test involved coarse-grained dispersant polymers with four monomer types and chain lengths of 16-48 units. GPT-3 had no prior knowledge of these abstract representations, yet it outperformed dedicated models for adsorption free energy prediction and successfully performed inverse design, generating monomer sequences with a mean percentage error of about 22% for the desired property.

Key Findings and Limitations

Key Findings

  1. Low-data advantage: Fine-tuned GPT-3 consistently shows the largest advantages over conventional ML in low-data regimes (tens to hundreds of data points), which is precisely where experimental chemistry datasets typically fall.

  2. Representation agnostic: The model works with IUPAC names, SMILES, SELFIES, and even invented abstract representations, removing the need for chemistry-specific tokenization.

  3. No feature engineering: The approach requires no domain-specific descriptors, fingerprints, or architectural modifications, making it accessible to researchers without ML expertise.

  4. Bidirectional design: Inverse design is achieved by simply reversing the question format, with no architectural changes or separate generative model needed.

  5. Extrapolation capability: The model can generate molecules with properties outside the training distribution, as demonstrated by the HOMO-LUMO gap extrapolation experiments.

Limitations

  • In the high-data regime, conventional ML models with chemistry-specific features often catch up to or surpass GPT-3, as the inductive biases encoded in GPT-3 become less necessary with sufficient data.
  • Regression is inherently limited by the discretization of continuous values into tokens. This requires more data than classification and introduces quantization error.
  • The approach relies on the OpenAI API, introducing cost and reproducibility concerns (model versions may change). The authors partially address this by providing open-source alternatives via chemlift.
  • The authors acknowledge that identified correlations may not represent causal relationships. GPT-3 finding predictive patterns does not guarantee that the patterns are chemically meaningful.
  • No optimization of prompts, tokenization, or hyperparameters was performed, suggesting room for improvement but also making it difficult to assess the ceiling of this approach.

Reproducibility Details

Data

All datasets are publicly available and were obtained from published benchmarks.

Purpose        | Dataset                          | Size              | Notes
---------------|----------------------------------|-------------------|--------------------------------
Classification | HEA phase (Pei et al.)           | 1,252 alloys      | Single-phase vs. multi-phase
Regression     | FreeSolv                         | 643 molecules     | Hydration free energies
Regression     | ESOL                             | 1,128 molecules   | Aqueous solubility
Regression     | QMugs                            | 665,000 molecules | HOMO-LUMO gaps via GFN2-xTB
Classification | Lipophilicity (ChEMBL)           | Varies            | LogP classification
Classification | OPV PCE                          | Varies            | Organic photovoltaic efficiency
Regression     | MOF Henry coefficients           | Varies            | CO2/CH4 adsorption
Inverse design | Photoswitches (Griffiths et al.) | 392 molecules     | Transition wavelengths

Algorithms

  • Fine-tuning via OpenAI API: 8 epochs, learning rate multiplier 0.02
  • GPT-3 ada variant (smallest model) used for all main results
  • In-context learning also tested with larger GPT-3 models and GPT-4
  • Open-source alternative: GPT-J-6B with LoRA + 8-bit quantization
  • Learning curves fit with power laws for the data-efficiency comparison
  • Validity checked using RDKit via GuacaMol’s is_valid method

Models

  • GPT-3 ada (OpenAI API, proprietary)
  • GPT-J-6B (open-source, fine-tunable on consumer hardware)

Evaluation

Metric                   | Task                       | Notes
-------------------------|----------------------------|------------------------------------
Accuracy                 | HEA phase                  | Classification
$F_1$ macro              | All classification tasks   | Class-balanced
Cohen’s $\kappa$         | Classification             | Used for learning curve thresholds
MAE / MAPE               | Regression, inverse design | Property prediction accuracy
Validity rate            | Inverse design             | Fraction of parseable SMILES
Fréchet ChemNet distance | Inverse design             | Distribution similarity
SA score                 | Inverse design             | Synthetic accessibility

Hardware

  • Fine-tuning via OpenAI API (cloud compute, not user-specified)
  • Open-source experiments: consumer GPU hardware with 8-bit quantization
  • Quantum chemistry validation: GFN2-xTB for HOMO-LUMO calculations

Artifacts

Artifact          | Type | License | Notes
------------------|------|---------|-------------------------------------
gptchem           | Code | MIT     | All experiments with OpenAI API
chemlift          | Code | MIT     | Open-source LLM fine-tuning support
Zenodo (gptchem)  | Code | MIT     | Archived release
Zenodo (chemlift) | Code | MIT     | Archived release

Paper Information

Citation: Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A., & Smit, B. (2024). Leveraging large language models for predictive chemistry. Nature Machine Intelligence, 6(2), 161-169. https://doi.org/10.1038/s42256-023-00788-1

@article{jablonka2024leveraging,
  title={Leveraging large language models for predictive chemistry},
  author={Jablonka, Kevin Maik and Schwaller, Philippe and Ortega-Guerrero, Andres and Smit, Berend},
  journal={Nature Machine Intelligence},
  volume={6},
  number={2},
  pages={161--169},
  year={2024},
  publisher={Nature Publishing Group},
  doi={10.1038/s42256-023-00788-1}
}