Paper Information
Citation: Ahmad, W., Simon, E., Chithrananda, S., Grand, G., & Ramsundar, B. (2022). ChemBERTa-2: Towards Chemical Foundation Models. arXiv preprint arXiv:2209.01712. https://doi.org/10.48550/arXiv.2209.01712
Publication: arXiv 2022 (Presented at 2021 ELLIS ML for Molecule Discovery Workshop)
Additional Resources:
What kind of paper is this?
This is primarily a Methodological paper with a secondary Resource contribution.
It fits the Method classification because it focuses on optimizing the architecture and pretraining pipeline for molecular transformers. The authors perform extensive ablation studies (varying dataset size from 5M to 77M, comparing MLM vs. MTR objectives) to determine “how well” these strategies work compared to baselines. The secondary Resource classification applies because they open-source the trained models and establish a benchmark on a massive 77M compound dataset.
Key methodological indicators:
- Baseline comparison: The paper explicitly compares ChemBERTa-2 against standard baselines (D-MPNN, Random Forest, GCN) and its predecessor (ChemBERTa-1) in prominent benchmark tables
- Ablation studies: Extensive experiments comparing multi-task and self-supervised pretraining by varying hyperparameters and pretraining dataset size
- Scaling analysis: Systematic investigation of whether larger datasets (up to 77M compounds) yield better performance
What is the motivation?
The authors aim to bridge the gap between NLP success stories (like GPT-3) and molecular machine learning by developing a “chemical foundation model”.
Key motivations:
- Label scarcity: Experimental labels for molecular properties are rare and expensive, but unlabeled SMILES strings are abundant
- Scaling hypothesis: Testing if scaling pretraining data (up to 77M compounds) yields consistent downstream improvements, similar to scaling laws in NLP
- Efficiency: Optimizing the pretraining process introduced in the original ChemBERTa by comparing self-supervised (MLM) and multi-task regression (MTR) pretraining approaches
What is the novelty here?
Scale: Training on 77M unique SMILES from PubChem, which is one of the largest molecular pretraining datasets used to date (compared to 10M for ChemBERTa-1 or 18.7M for SMILES-BERT).
Pipeline optimization: A direct, controlled comparison of Masked Language Modeling (MLM) vs. Multi-Task Regression (MTR) pretraining objectives on identical datasets.
Proxy selection: The finding that MLM loss correlates well with MTR loss, allowing the cheaper MLM task to be used for hyperparameter tuning before running the expensive MTR pretraining.
What experiments were performed?
Pretraining Setup
Datasets: Subsets of PubChem containing 5M, 10M, and 77M unique SMILES.
Tasks:
- MLM: Masking 15% of tokens (following RoBERTa procedure)
- MTR: Predicting 200 calculated molecular properties (via RDKit) simultaneously
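As a rough illustration of the MTR label generation, the sketch below computes RDKit descriptors for a few SMILES and normalizes them; the paper's exact 200-property list and normalization constants are not reproduced here, so the use of RDKit's built-in `Descriptors.descList` and the standard-deviation scaling are assumptions.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors

def mtr_labels(smiles_list):
    """Compute RDKit descriptors per molecule and mean-normalize them.

    Sketch only: RDKit's full descriptor list stands in for the paper's
    200 selected properties.
    """
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        rows.append([fn(mol) for _, fn in Descriptors.descList])
    labels = np.asarray(rows, dtype=np.float64)
    # "Mean-normalized" is interpreted here as zero-mean, unit-variance
    # per property (an assumption about the exact scaling).
    labels = (labels - labels.mean(axis=0)) / (labels.std(axis=0) + 1e-8)
    names = [name for name, _ in Descriptors.descList]
    return names, labels

names, labels = mtr_labels(["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"])
print(len(names), labels.shape)  # roughly 200 descriptors per molecule
```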
Hyperparameter search: Ran 50 random configurations on the 5M dataset; selected the top 5 to scale up to 10M and 77M.
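A minimal sketch of that random search over the varied dimensions; the candidate values below are illustrative placeholders, not the ranges used in the paper.

```python
import random

# Illustrative search space covering the dimensions varied in the paper;
# the concrete candidate values are placeholders.
SEARCH_SPACE = {
    "hidden_size": [256, 384, 512, 768],
    "num_attention_heads": [4, 6, 8, 12],
    "hidden_dropout_prob": [0.1, 0.15, 0.2],
    "intermediate_size": [512, 1024, 2048],
    "num_hidden_layers": [3, 6, 12],
    "learning_rate": [1e-5, 5e-5, 1e-4],
}

def sample_config(rng):
    while True:
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        # Keep the config valid for a RoBERTa encoder.
        if cfg["hidden_size"] % cfg["num_attention_heads"] == 0:
            return cfg

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(50)]  # 50 random configurations
# Each config is pretrained on the 5M subset; the top 5 by validation loss
# are then scaled up to the 10M and 77M corpora.
```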
Downstream Validation
Finetuning: Evaluated on 8 tasks from MoleculeNet (BACE, BBBP, ClinTox, Delaney, etc.) using scaffold splits (80/10/10).
Analysis: Used UMAP to visualize embeddings from MLM, MTR, and ECFP to check for clustering by label without finetuning.
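A sketch of this embedding analysis, assuming a released checkpoint name on the HuggingFace Hub and mean-pooling over the last hidden state (the paper does not pin down either choice here):

```python
import torch
import umap  # umap-learn
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is an assumption; substitute the MLM- or MTR-pretrained
# weights you want to inspect.
MODEL_NAME = "DeepChem/ChemBERTa-77M-MTR"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed(smiles_batch):
    """Mean-pool the final hidden states into one vector per molecule."""
    enc = tokenizer(smiles_batch, padding=True, truncation=True,
                    return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state          # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# In the paper this is done over full MoleculeNet tasks and colored by label;
# the tiny SMILES list here is only a placeholder.
embeddings = embed(["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"])
coords = umap.UMAP(n_components=2, n_neighbors=2).fit_transform(embeddings)
```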
What were the outcomes and conclusions drawn?
SOTA performance: ChemBERTa-2 outperforms the D-MPNN baseline (chemprop) on 6 out of 8 MoleculeNet tasks.
MTR superiority: Models pretrained on Multi-Task Regression (MTR) consistently perform better on downstream tasks than those pretrained on MLM.
Scaling laws: Increasing dataset size from 5M to 77M improved pretraining loss by 25-35%.
Transfer learning: The correlation between pretraining loss and downstream performance is task-dependent; it is strong for Lipophilicity but weaker for BACE classification.
Computational cost: MTR is substantially slower than MLM due to the large label vectors, but MLM loss can be used as a proxy for architecture search.
Reproducibility Details
Data
The pretraining corpus is derived from PubChem.
| Purpose | Dataset | Size | Notes |
|---|---|---|---|
| Pretraining | PubChem | 77M SMILES | Canonicalized and globally shuffled. Subsets of 5M and 10M were also used. |
| Validation | PubChem | 100k SMILES | A fixed set held out from the 77M corpus. |
| MTR Labels | RDKit | 200 props | 200 molecular properties calculated from SMILES using RDKit. Labels are mean-normalized. |
| Finetuning | MoleculeNet | 1.5k - 8k | Tasks: BACE, Clearance, Delaney, Lipophilicity, BBBP, ClinTox, HIV, Tox21. Split 80/10/10 via scaffold splitter. |
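As a sketch of the finetuning data setup, DeepChem's MoleculeNet loaders expose the scaffold splitter directly; the specific loader and featurizer below are illustrative choices, not confirmed details from the paper.

```python
import deepchem as dc

# One of the eight tasks as an example; loader and featurizer are illustrative.
tasks, (train, valid, test), transformers = dc.molnet.load_bace_classification(
    featurizer="ECFP",      # any featurizer works; the SMILES live in dataset.ids
    splitter="scaffold",    # Bemis-Murcko scaffold split, 80/10/10 by default
)

train_smiles, train_labels = train.ids, train.y  # inputs for ChemBERTa finetuning
print(len(train.ids), len(valid.ids), len(test.ids))
```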
Algorithms
Pretraining Objectives:
- Masked Language Modeling (MLM): Follows the RoBERTa procedure, masking 15% of tokens; max sequence length 512 (a masking sketch follows this list)
- Multi-Task Regression (MTR): Predicting 200 RDKit properties. Labels are mean-normalized.
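A sketch of the 15% masking setup for the MLM objective, using HuggingFace's standard MLM collator; the tokenizer checkpoint name is an assumption.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Tokenizer checkpoint is an assumption; any SMILES tokenizer with a mask
# token behaves the same way here.
tokenizer = AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MLM")

# RoBERTa-style dynamic masking: 15% of tokens are selected per batch.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

enc = tokenizer("CC(=O)Oc1ccccc1C(=O)O", truncation=True, max_length=512)
batch = collator([{"input_ids": enc["input_ids"]}])
# batch["input_ids"] now contains mask tokens; batch["labels"] keeps the
# original ids at masked positions and -100 everywhere else.
```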
Tokenizer:
- Dictionary of common SMILES characters
- Maximum vocabulary size: 591 tokens
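The exact vocabulary construction is not reproduced in this summary; the sketch below uses the standard SMILES token regex that SmilesTokenizer-style tokenizers rely on (whether ChemBERTa-2 uses precisely this pattern is an assumption), with the resulting token set capped at 591 entries.

```python
import re
from collections import Counter

# Standard SMILES token regex used by SmilesTokenizer-style tokenizers;
# exact correspondence to the paper's tokenizer is an assumption.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str):
    return SMILES_REGEX.findall(smiles)

def build_vocab(smiles_corpus, max_size=591):
    counts = Counter(tok for smi in smiles_corpus for tok in tokenize_smiles(smi))
    return [tok for tok, _ in counts.most_common(max_size)]

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```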
Optimization:
- Patience: Early-stopping patience set to one full pass through the dataset, so the entire corpus is seen at least once (a wiring sketch follows this list)
- Hyperparameter search: Random search (50 configs) varying hidden size, attention heads, dropout, intermediate size, hidden layers, and learning rate
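A sketch of how the early stopping could be wired up with the HuggingFace Trainer stack; the evaluation cadence and patience value are placeholders, not the paper's settings.

```python
from transformers import EarlyStoppingCallback, TrainingArguments

# Placeholder cadence and patience; the paper sets patience so that training
# makes at least one full pass through the pretraining corpus.
args = TrainingArguments(
    output_dir="chemberta2-pretrain",
    evaluation_strategy="steps",
    eval_steps=10_000,
    save_steps=10_000,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
stopper = EarlyStoppingCallback(early_stopping_patience=5)
# Both are passed to transformers.Trainer(args=args, callbacks=[stopper], ...).
```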
Models
- Architecture: Based on RoBERTa (HuggingFace implementation)
- Parameter scale: Models ranged between 5M and 46M parameters
- Selection: Top 5 configurations from the 5M-dataset random search were trained on the full 77M dataset
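For scale intuition, a minimal sketch instantiating a RoBERTa encoder in the reported parameter range; the hyperparameters below are illustrative, not one of the paper's top-5 configurations.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Illustrative configuration, not one of the paper's selected models.
config = RobertaConfig(
    vocab_size=591,               # SMILES vocabulary cap
    max_position_embeddings=514,  # max sequence length 512 + RoBERTa's offset
    hidden_size=384,
    num_hidden_layers=6,
    num_attention_heads=6,
    intermediate_size=1024,
    hidden_dropout_prob=0.1,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # within the reported 5M-46M range
```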
Evaluation
Benchmarks were performed on MoleculeNet using DeepChem.
| Metric | Tasks | Baseline | Notes |
|---|---|---|---|
| RMSE ($\downarrow$) | Delaney, Lipo, BACE (Reg), Clearance | D-MPNN | ChemBERTa-2 outperformed D-MPNN on Delaney (0.889 vs 1.105) and Clearance (48.5 vs 49.7). |
| ROC-AUC ($\uparrow$) | BBBP, ClinTox, HIV, Tox21, BACE (Cls) | D-MPNN | ChemBERTa-2 generally competitive; MTR-77M achieved 0.728 on BBBP vs D-MPNN 0.697. |
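The two metrics can be computed with standard scikit-learn calls, as in the generic sketch below (toy arrays only, not the paper's predictions or evaluation code).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

# Toy placeholders, not values from the paper.
y_true_reg, y_pred_reg = np.array([1.2, 0.4, -0.3]), np.array([1.0, 0.5, -0.1])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # regression tasks, lower is better

y_true_cls, y_score_cls = np.array([0, 1, 1, 0]), np.array([0.2, 0.8, 0.6, 0.4])
auc = roc_auc_score(y_true_cls, y_score_cls)                # classification tasks, higher is better
print(rmse, auc)
```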
Hardware
- Compute: AWS EC2 instances with Nvidia T4 GPUs
- Strategy: AWS Spot instances were used to reduce cost; implemented checkpointing to handle interruptions
- Note: For MTR, the authors wrote a custom data-loader wrapper around HuggingFace's text loader to parse the property-labeled CSV files efficiently
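A minimal stand-in for such a loader, assuming a CSV with a `smiles` column followed by the property columns (the column layout and checkpoint name are assumptions, not details from the paper):

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class MTRCsvDataset(Dataset):
    """Reads SMILES plus precomputed property columns from a CSV and yields
    tokenized inputs with a float label vector for multi-task regression."""

    def __init__(self, csv_path, tokenizer, max_length=512):
        self.df = pd.read_csv(csv_path)
        self.label_cols = [c for c in self.df.columns if c != "smiles"]
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row = self.df.iloc[i]
        enc = self.tokenizer(row["smiles"], truncation=True,
                             max_length=self.max_length)
        enc["labels"] = torch.tensor(
            row[self.label_cols].to_numpy(dtype="float32"))
        return enc

# Hypothetical usage:
# tokenizer = transformers.AutoTokenizer.from_pretrained("DeepChem/ChemBERTa-77M-MTR")
# dataset = MTRCsvDataset("pubchem_mtr_labels.csv", tokenizer)
```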
Citation
```bibtex
@misc{ahmadChemBERTa2ChemicalFoundation2022,
  title         = {{ChemBERTa-2}: Towards Chemical Foundation Models},
  shorttitle    = {{ChemBERTa-2}},
  author        = {Ahmad, Walid and Simon, Elana and Chithrananda, Seyone and Grand, Gabriel and Ramsundar, Bharath},
  year          = {2022},
  month         = sep,
  number        = {arXiv:2209.01712},
  eprint        = {2209.01712},
  primaryclass  = {cs},
  publisher     = {arXiv},
  doi           = {10.48550/arXiv.2209.01712},
  urldate       = {2025-12-25},
  archiveprefix = {arXiv}
}
```
