Paper Information

Citation: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., & Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. arXiv preprint arXiv:2209.01712. https://doi.org/10.48550/arXiv.2209.01712

Publication: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)

Additional Resources:

What kind of paper is this?

This is primarily a Methodological paper with a secondary Resource contribution.

It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine “how well” these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models, which are pretrained on a massive 77M-compound corpus.

Key methodological indicators:

  • Baseline comparison: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (ChemBERTa-1) in prominent benchmark tables
  • Ablation studies: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size
  • Scaling analysis: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance

What is the motivation?

The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a “chemical foundation model”.

Key motivations:

  • Label scarcity: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant
  • Scaling hypothesis: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP
  • Efficiency: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and multi-task regression (MTR) pretraining approaches

What is the novelty here?

Scale: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for SMILES-BERT).

Pipeline optimization: A direct, controlled comparison of Masked Language Modeling (MLM) vs. Multi-Task Regression (MTR) pretraining objectives on identical datasets.

Proxy selection: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.

What experiments were performed?

Pretraining Setup

Datasets: Subsets of PubChem containing 5M, 10M, and 77M unique SMILES.

Tasks:

  • MLM: Masking 15% of tokens (following RoBERTa procedure)
  • MTR: Predicting 200 calculated molecular properties (via RDKit) simultaneously (a label-generation sketch follows this list)
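
A minimal sketch of how MTR labels of this kind can be generated, assuming RDKit's built-in descriptor list stands in for the paper's 200 properties and that mean-normalization is read as per-column standardization (both are assumptions; the exact descriptor set and normalization constants are not reproduced here):

```python
# Sketch: compute RDKit descriptors as MTR labels, then mean-normalize.
# Assumptions: RDKit's full descriptor list (~200 entries) stands in for the
# paper's 200 properties, and "mean-normalized" is read as standardization.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # toy corpus

def descriptor_vector(smi: str):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None  # skip unparsable SMILES
    # Descriptors.descList is a list of (name, function) pairs.
    return [fn(mol) for _, fn in Descriptors.descList]

labels = np.array([v for v in (descriptor_vector(s) for s in smiles) if v is not None])

# Normalize each property across the corpus so regression targets are comparable.
mean = labels.mean(axis=0)
std = labels.std(axis=0) + 1e-8  # guard against constant columns
normalized = (labels - mean) / std
print(normalized.shape)  # (n_molecules, n_descriptors)
```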

Hyperparameter search: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.

Downstream Validation

Finetuning: Evaluated on 8 tasks from MoleculeNet (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).
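
For reference, loading one of these MoleculeNet tasks with a scaffold split is a one-liner in DeepChem (its MoleculeNet loaders default to 80/10/10); the featurizer choice below is an assumption, and the SMILES strings are exposed through the dataset `ids`:

```python
# Sketch: MoleculeNet task loaded through DeepChem with a scaffold split.
import deepchem as dc

tasks, (train, valid, test), transformers = dc.molnet.load_bbbp(
    featurizer="Raw", splitter="scaffold"
)

print(tasks)                              # e.g. ['p_np'] for BBBP
print(len(train), len(valid), len(test))  # roughly 80/10/10
train_smiles, train_labels = train.ids, train.y
```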

Analysis: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.
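
A minimal sketch of that embedding analysis using umap-learn; `embeddings` and `labels` are random placeholders standing in for ChemBERTa or ECFP features and MoleculeNet labels:

```python
# Sketch: 2-D UMAP projection of frozen (non-finetuned) embeddings, colored
# by task label. The arrays below are placeholders, not real model outputs.
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))  # placeholder embedding matrix
labels = rng.integers(0, 2, size=500)     # placeholder binary labels

coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("UMAP of pretrained embeddings, colored by task label")
plt.show()
```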

What were the outcomes and conclusions drawn?

SOTA performance: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks.

MTR superiority: Models pretrained with Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained with MLM.

Scaling: Increasing the pretraining dataset from 5M to 77M compounds reduced pretraining loss by 25-35%.

Transfer learning: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.

Computational cost: MTR is substantially slower than MLM due to the large label vectors, but MLM loss can be used as a proxy for architecture search.

Reproducibility Details

Data

The pretraining corpus is derived from PubChem.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Pretraining | PubChem | 77M SMILES | Canonicalized and globally shuffled. Subsets of 5M and 10M were also used. |
| Validation | PubChem | 100k SMILES | A fixed set held out from the 77M corpus. |
| MTR labels | RDKit | 200 props | 200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. |
| Finetuning | MoleculeNet | 1.5k - 8k | Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter. |

Algorithms

Pretraining Objectives:

  1. Masked Language Modeling (MLM): Follows the RoBERTa procedure; masks 15% of tokens; maximum sequence length 512 (a minimal masking sketch follows this list)
  2. Multi-Task Regression (MTR): Predicting 200 RDKit properties. Labels are mean-normalized.
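
A minimal masking sketch using HuggingFace's standard MLM collator; the checkpoint name is an assumption about the released ChemBERTa-2 tokenizer on the HuggingFace Hub:

```python
# Sketch: RoBERTa-style 15% masking with HuggingFace's standard collator.
# The checkpoint name is an assumption (released ChemBERTa-2 tokenizer).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["CCO", "c1ccccc1O"], truncation=True, max_length=512)
# The collator pads the batch, masks 15% of tokens, and fills `labels` so the
# MLM loss is only computed at masked positions.
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```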

Tokenizer:

  • Dictionary of common SMILES characters (a regex-based tokenization sketch follows this list)
  • Maximum vocabulary size: 591 tokens
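
For illustration, a widely used regex-based SMILES tokenization scheme (bracketed atoms, two-letter elements, bonds, and ring closures as single tokens); this shows the general idea of a character-level SMILES vocabulary and is not claimed to be the exact tokenizer behind the 591-token vocabulary:

```python
# Sketch: regex-based SMILES tokenization (illustrative only).
import re

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```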

Optimization:

  • Early stopping: Patience set to the equivalent of one full pass through the dataset, ensuring every example is seen at least once
  • Hyperparameter search: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate (a sampling sketch follows this list)
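
A sketch of such a random search; the value ranges are illustrative assumptions, and only the list of varied hyperparameters and the 50-configuration budget come from the paper:

```python
# Sketch: random search over the hyperparameters listed above.
# Ranges are illustrative assumptions, not the paper's search space.
import random

SEARCH_SPACE = {
    "hidden_size": [256, 384, 512, 768],
    "num_attention_heads": [4, 6, 8, 12],
    "num_hidden_layers": [3, 6, 12],
    "intermediate_size": [512, 1024, 2048, 3072],
    "dropout": [0.0, 0.1, 0.2],
    "learning_rate": [1e-5, 5e-5, 1e-4, 5e-4],
}

def sample_config(rng: random.Random) -> dict:
    cfg = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
    # hidden_size must be divisible by the number of attention heads
    while cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        cfg["num_attention_heads"] = rng.choice(SEARCH_SPACE["num_attention_heads"])
    return cfg

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(50)]  # 50 candidate configurations
print(configs[0])
```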

Models

  • Architecture: Based on RoBERTa (HuggingFace implementation); an instantiation sketch follows this list
  • Parameter scale: Models ranged between 5M and 46M parameters
  • Selection: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset
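
As a rough sketch of how configurations land in this parameter range, a small RoBERTa encoder can be instantiated from a config and its parameters counted; the config values below are illustrative, not the selected ChemBERTa-2 configurations:

```python
# Sketch: instantiate a small RoBERTa encoder and count parameters.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=591,               # matches the reported maximum vocabulary size
    max_position_embeddings=515,  # 512-token sequences plus position offsets
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1024,
)
model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```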

Evaluation

Benchmarks were performed on MoleculeNet using DeepChem.

| Metric | Tasks | Baseline | Notes |
| --- | --- | --- | --- |
| RMSE ($\downarrow$) | Delaney, Lipo, BACE (Reg), Clearance | D-MPNN | ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.7). |
| ROC-AUC ($\uparrow$) | BBBP, ClinTox, HIV, Tox21, BACE (Cls) | D-MPNN | ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697. |
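
Both metrics are standard; for completeness, a minimal scikit-learn computation on placeholder predictions:

```python
# Sketch: the two reported metrics computed on placeholder predictions
# (RMSE for regression tasks, ROC-AUC for classification tasks).
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Regression (e.g. Delaney solubility): lower RMSE is better.
y_true_reg = np.array([-0.77, -3.30, -2.06])
y_pred_reg = np.array([-0.90, -3.10, -2.40])
print(f"RMSE: {mean_squared_error(y_true_reg, y_pred_reg) ** 0.5:.3f}")

# Classification (e.g. BBBP): higher ROC-AUC is better.
y_true_cls = np.array([0, 1, 1, 0, 1])
y_score_cls = np.array([0.2, 0.8, 0.6, 0.4, 0.9])
print(f"ROC-AUC: {roc_auc_score(y_true_cls, y_score_cls):.3f}")
```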

Hardware

  • Compute: AWS EC2 instances with Nvidia T4 GPUs
  • Strategy: AWS Spot instances were used to reduce cost; implemented checkpointing to handle interruptions
  • Note: For MTR, the authors wrote a custom data-loader wrapper around HuggingFace’s text loader to parse the CSV label files efficiently (a streaming-load sketch follows this list)
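
The paper does not describe that wrapper in detail; as one plausible approach, the `datasets` library can stream a large CSV so the 77M-row corpus never has to fit in memory (the file name and column layout below are hypothetical):

```python
# Sketch: streaming a large SMILES + labels CSV with the `datasets` library.
# File and column names are hypothetical; the paper only notes that a custom
# wrapper around HuggingFace's loader was used.
from datasets import load_dataset

stream = load_dataset(
    "csv",
    data_files="pubchem_77m_with_rdkit_props.csv",  # hypothetical path
    streaming=True,
    split="train",
)

for row in stream.take(2):
    smiles = row["smiles"]  # assumed SMILES column name
    labels = [v for k, v in row.items() if k != "smiles"]  # the 200 properties
    print(smiles, len(labels))
```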

Citation

@misc{ahmadChemBERTa2ChemicalFoundation2022,
  title = {{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}},
  shorttitle = {{{ChemBERTa-2}}},
  author = {Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  year = 2022,
  month = sep,
  number = {arXiv:2209.01712},
  eprint = {2209.01712},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2209.01712},
  urldate = {2025-12-25},
  archiveprefix = {arXiv}
}