Paper Information

Citation: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., & Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. arXiv preprint arXiv:2209.01712. https://doi.org/10.48550/arXiv.2209.01712

Publication: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)

Additional Resources:

What kind of paper is this?

This is primarily a Methodological paper with a secondary Resource contribution.

It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine “how well” these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models, which are pretrained on a massive 77M-compound corpus.

Key methodological indicators:

  • Baseline comparison: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (ChemBERTa-1) in prominent benchmark tables
  • Ablation studies: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size
  • Scaling analysis: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance

What is the motivation?

The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a “chemical foundation model”.

Key motivations:

  • Label scarcity: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant
  • Scaling hypothesis: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP
  • Efficiency: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and multi-task regression (MTR) pretraining approaches

What is the novelty here?

Scale: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for SMILES-BERT).

Pipeline optimization: A direct, controlled comparison of Masked Language Modeling (MLM) vs. Multi-Task Regression (MTR) pretraining objectives on identical datasets.

Proxy selection: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.

What experiments were performed?

Pretraining Setup

Datasets: Subsets of PubChem containing 5M, 10M, and 77M unique SMILES.

Tasks:

  • MLM: Masking 15% of tokens (following RoBERTa procedure)
  • MTR: Predicting 200 calculated molecular properties (via RDKit) simultaneously (a label-generation sketch follows this list)
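
A minimal sketch of how MTR labels of this kind can be generated, assuming RDKit's built-in descriptor list stands in for the paper's 200 properties and that mean-normalization is read as per-column standardization (both are assumptions; the exact descriptor set and normalization constants are not reproduced here):

```python
# Sketch: compute RDKit descriptors as MTR labels, then mean-normalize.
# Assumptions: RDKit's full descriptor list (~200 entries) stands in for the
# paper's 200 properties, and "mean-normalized" is read as standardization.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # toy corpus

def descriptor_vector(smi: str):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None  # skip unparsable SMILES
    # Descriptors.descList is a list of (name, function) pairs.
    return [fn(mol) for _, fn in Descriptors.descList]

labels = np.array([v for v in (descriptor_vector(s) for s in smiles) if v is not None])

# Normalize each property across the corpus so regression targets are comparable.
mean = labels.mean(axis=0)
std = labels.std(axis=0) + 1e-8  # guard against constant columns
normalized = (labels - mean) / std
print(normalized.shape)  # (n_molecules, n_descriptors)
```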

Hyperparameter search: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.

Downstream Validation

Finetuning: Evaluated on 8 tasks from MoleculeNet (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).
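
For reference, loading one of these MoleculeNet tasks with a scaffold split is a one-liner in DeepChem (its MoleculeNet loaders default to 80/10/10); the featurizer choice below is an assumption, and the SMILES strings are exposed through the dataset `ids`:

```python
# Sketch: MoleculeNet task loaded through DeepChem with a scaffold split.
import deepchem as dc

tasks, (train, valid, test), transformers = dc.molnet.load_bbbp(
    featurizer="Raw", splitter="scaffold"
)

print(tasks)                              # e.g. ['p_np'] for BBBP
print(len(train), len(valid), len(test))  # roughly 80/10/10
train_smiles, train_labels = train.ids, train.y
```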

Analysis: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.
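
A minimal sketch of that embedding analysis using umap-learn; `embeddings` and `labels` are random placeholders standing in for ChemBERTa or ECFP features and MoleculeNet labels:

```python
# Sketch: 2-D UMAP projection of frozen (non-finetuned) embeddings, colored
# by task label. The arrays below are placeholders, not real model outputs.
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))  # placeholder embedding matrix
labels = rng.integers(0, 2, size=500)     # placeholder binary labels

coords = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="coolwarm")
plt.title("UMAP of pretrained embeddings, colored by task label")
plt.show()
```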

What were the outcomes and conclusions drawn?

SOTA performance: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks.

MTR superiority: Models pretrained with Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained with MLM.

Scaling: Increasing the pretraining dataset from 5M to 77M compounds reduced pretraining loss by 25-35%.

Transfer learning: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.

Computational cost: MTR is substantially slower than MLM due to the large label vectors, but MLM loss can be used as a proxy for architecture search.

Reproducibility Details

Data

The pretraining corpus is derived from PubChem.

| Purpose | Dataset | Size | Notes |
| --- | --- | --- | --- |
| Pretraining | PubChem | 77M SMILES | Canonicalized and globally shuffled. Subsets of 5M and 10M were also used. |
| Validation | PubChem | 100k SMILES | A fixed set held out from the 77M corpus. |
| MTR labels | RDKit | 200 props | 200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. |
| Finetuning | MoleculeNet | 1.5k - 8k | Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter. |

Algorithms

Pretraining Objectives:

  1. Masked Language Modeling (MLM): Follows the RoBERTa procedure; masks 15% of tokens; maximum sequence length 512 (a minimal masking sketch follows this list)
  2. Multi-Task Regression (MTR): Predicting 200 RDKit properties. Labels are mean-normalized.
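
A minimal masking sketch using HuggingFace's standard MLM collator; the checkpoint name is an assumption about the released ChemBERTa-2 tokenizer on the HuggingFace Hub:

```python
# Sketch: RoBERTa-style 15% masking with HuggingFace's standard collator.
# The checkpoint name is an assumption (released ChemBERTa-2 tokenizer).
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["CCO", "c1ccccc1O"], truncation=True, max_length=512)
# The collator pads the batch, masks 15% of tokens, and fills `labels` so the
# MLM loss is only computed at masked positions.
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```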

Tokenizer:

  • Dictionary of common SMILES characters (a regex-based tokenization sketch follows this list)
  • Maximum vocabulary size: 591 tokens
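
For illustration, a widely used regex-based SMILES tokenization scheme (bracketed atoms, two-letter elements, bonds, and ring closures as single tokens); this shows the general idea of a character-level SMILES vocabulary and is not claimed to be the exact tokenizer behind the 591-token vocabulary:

```python
# Sketch: regex-based SMILES tokenization (illustrative only).
import re

SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list:
    return SMILES_TOKEN_PATTERN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']
```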

Optimization:

  • Early stopping: Patience set to the equivalent of one full pass through the dataset, ensuring every example is seen at least once
  • Hyperparameter search: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate (a sampling sketch follows this list)
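
A sketch of such a random search; the value ranges are illustrative assumptions, and only the list of varied hyperparameters and the 50-configuration budget come from the paper:

```python
# Sketch: random search over the hyperparameters listed above.
# Ranges are illustrative assumptions, not the paper's search space.
import random

SEARCH_SPACE = {
    "hidden_size": [256, 384, 512, 768],
    "num_attention_heads": [4, 6, 8, 12],
    "num_hidden_layers": [3, 6, 12],
    "intermediate_size": [512, 1024, 2048, 3072],
    "dropout": [0.0, 0.1, 0.2],
    "learning_rate": [1e-5, 5e-5, 1e-4, 5e-4],
}

def sample_config(rng: random.Random) -> dict:
    cfg = {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}
    # hidden_size must be divisible by the number of attention heads
    while cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        cfg["num_attention_heads"] = rng.choice(SEARCH_SPACE["num_attention_heads"])
    return cfg

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(50)]  # 50 candidate configurations
print(configs[0])
```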

Models

  • Architecture: Based on RoBERTa (HuggingFace implementation); an instantiation sketch follows this list
  • Parameter scale: Models ranged between 5M and 46M parameters
  • Selection: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset
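
As a rough sketch of how configurations land in this parameter range, a small RoBERTa encoder can be instantiated from a config and its parameters counted; the config values below are illustrative, not the selected ChemBERTa-2 configurations:

```python
# Sketch: instantiate a small RoBERTa encoder and count parameters.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=591,               # matches the reported maximum vocabulary size
    max_position_embeddings=515,  # 512-token sequences plus position offsets
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1024,
)
model = RobertaForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```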

Evaluation

Benchmarks were performed on MoleculeNet using DeepChem.

| Metric | Tasks | Baseline | Notes |
| --- | --- | --- | --- |
| RMSE ($\downarrow$) | Delaney, Lipo, BACE (Reg), Clearance | D-MPNN | ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.7). |
| ROC-AUC ($\uparrow$) | BBBP, ClinTox, HIV, Tox21, BACE (Cls) | D-MPNN | ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697. |
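
Both metrics are standard; for completeness, a minimal scikit-learn computation on placeholder predictions:

```python
# Sketch: the two reported metrics computed on placeholder predictions
# (RMSE for regression tasks, ROC-AUC for classification tasks).
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Regression (e.g. Delaney solubility): lower RMSE is better.
y_true_reg = np.array([-0.77, -3.30, -2.06])
y_pred_reg = np.array([-0.90, -3.10, -2.40])
print(f"RMSE: {mean_squared_error(y_true_reg, y_pred_reg) ** 0.5:.3f}")

# Classification (e.g. BBBP): higher ROC-AUC is better.
y_true_cls = np.array([0, 1, 1, 0, 1])
y_score_cls = np.array([0.2, 0.8, 0.6, 0.4, 0.9])
print(f"ROC-AUC: {roc_auc_score(y_true_cls, y_score_cls):.3f}")
```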

Hardware

  • Compute: AWS EC2 instances with Nvidia T4 GPUs
  • Strategy: AWS Spot instances were used to reduce cost; implemented checkpointing to handle interruptions
  • Note: For MTR, the authors wrote a custom data-loader wrapper around HuggingFace’s text loader to parse the CSV label files efficiently (a streaming-load sketch follows this list)
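
The paper does not describe that wrapper in detail; as one plausible approach, the `datasets` library can stream a large CSV so the 77M-row corpus never has to fit in memory (the file name and column layout below are hypothetical):

```python
# Sketch: streaming a large SMILES + labels CSV with the `datasets` library.
# File and column names are hypothetical; the paper only notes that a custom
# wrapper around HuggingFace's loader was used.
from datasets import load_dataset

stream = load_dataset(
    "csv",
    data_files="pubchem_77m_with_rdkit_props.csv",  # hypothetical path
    streaming=True,
    split="train",
)

for row in stream.take(2):
    smiles = row["smiles"]  # assumed SMILES column name
    labels = [v for k, v in row.items() if k != "smiles"]  # the 200 properties
    print(smiles, len(labels))
```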

Citation

@misc{ahmadChemBERTa2ChemicalFoundation2022,
  title = {{{ChemBERTa-2}}: {{Towards Chemical Foundation Models}}},
  shorttitle = {{{ChemBERTa-2}}},
  author = {Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  year = 2022,
  month = sep,
  number = {arXiv:2209.01712},
  eprint = {2209.01712},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2209.01712},
  urldate = {2025-12-25},
  archiveprefix = {arXiv}
}