Paper Information

Citation: Singh, R. et al. (2025). ChemBERTa-3: An Open Source Training Framework for Chemical Foundation Models. ChemRxiv. https://doi.org/10.26434/chemrxiv-2025-4glrl-v2

Publication: ChemRxiv 2025

Additional Resources:

What kind of paper is this?

This is primarily a Resource ($\Psi_{\text{Resource}}$) paper, with secondary Method ($\Psi_{\text{Method}}$) contributions.

  • Resource Basis: The core contribution is “ChemBERTa-3,” an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.
  • Method Basis: While it trains models like “c3-MoLFormer,” the goal is to reproduce and validate the infrastructure rather than propose a novel architecture.

What is the motivation?

  • Scalability Challenges: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.
  • Proprietary Barriers: Many state-of-the-art chemical foundation models (e.g., the full MoLFormer-XL) are partially closed-source or difficult to reproduce.
  • Benchmarking Inconsistencies: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results obtained with differing scaffold splitting algorithms, making those comparisons unreliable.

What is the novelty here?

  • Unified Infrastructure: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.
  • Standardized Benchmarking: Identification that MoLFormer’s scaffold splitting algorithm differs from the standard DeepChem/MoleculeNet splitter, and the subsequent standardization of these benchmarks for fair comparison.
  • New DeepChem Tools: Introduction of the ModularTorchModel class for flexible loss computation and HuggingFaceModel wrappers to bridge ecosystems.
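
As a rough illustration of the HuggingFaceModel bridge, the sketch below wraps a tiny RoBERTa masked-language model so it can be driven through DeepChem’s fit()/predict() interface. The import path and the model/tokenizer/task arguments are assumptions based on the paper’s description (verify against the installed DeepChem version); the tokenizer repo is the original ChemBERTa SMILES BPE tokenizer, chosen only for illustration.

from transformers import AutoTokenizer, RobertaConfig, RobertaForMaskedLM
from deepchem.models.torch_models import HuggingFaceModel  # assumed import path

# Original ChemBERTa SMILES BPE tokenizer, used here purely as an example.
tokenizer = AutoTokenizer.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")

# Tiny RoBERTa configuration for illustration (not the paper's hyperparameters).
config = RobertaConfig(vocab_size=tokenizer.vocab_size, hidden_size=128,
                       num_hidden_layers=2, num_attention_heads=2)
hf_model = RobertaForMaskedLM(config)

# Assumed wrapper signature: exposes the HuggingFace model through DeepChem's
# standard fit()/predict() API with a masked-language-modeling objective.
model = HuggingFaceModel(model=hf_model, tokenizer=tokenizer, task="mlm",
                         batch_size=32)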

What experiments were performed?

  • Architecture Comparison: Benchmarked Transformers (ChemBERTa, MoLFormer) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).
  • Pretraining Scale:
    • Transformers were pretrained on subsets ranging from 10M to 1.1B molecules (the largest combining ZINC20 and PubChem).
    • Graph models were limited to 250K molecule subsets due to scaling difficulties.
  • Reproducibility Validation: Trained “c3-MoLFormer” (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.
  • Scaffold Split Analysis: Compared performance metrics using “DeepChem scaffold splits” vs. “MoLFormer scaffold splits” to quantify the impact of data leakage/overlap.
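
One way to quantify the overlap being measured here is the mean Tanimoto similarity of each test molecule to its nearest training neighbor; the sketch below (plain RDKit, not the paper’s exact protocol) computes that statistic for any train/test split.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def nearest_neighbor_similarity(train_smiles, test_smiles, radius=2, n_bits=2048):
    """Mean Tanimoto similarity of each test molecule to its nearest training neighbor."""
    fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(s), radius, nBits=n_bits)
    train_fps = [fp(s) for s in train_smiles]
    best = [max(DataStructs.BulkTanimotoSimilarity(fp(s), train_fps))
            for s in test_smiles]
    return float(np.mean(best))

# Toy example; in practice train/test come from the two splitters being compared.
train = ["CCO", "CCN", "c1ccccc1", "CC(=O)O"]
test = ["CCCO", "c1ccccc1C"]
print(nearest_neighbor_similarity(train, test))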

What were the outcomes and conclusions drawn?

  • Scaling Transformers vs. Graphs: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.
  • Benchmarking Sensitivity: MoLFormer’s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap (lower Tanimoto distance) between train and test sets than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.
  • Infrastructure Viability: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.
  • Open Source Release: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.

Reproducibility Details

Data

  • Pretraining:
    • Source: ZINC20 (1.4B compounds) and PubChem.
    • Scale: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.
  • Fine-tuning:
    • Suite: MoleculeNet.
    • Tasks: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).
    • Splits: A critical distinction is drawn between “DeepChem scaffold splits” (80/10/10) and “MoLFormer scaffold splits”; the paper notes the two algorithms are not equivalent.
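
For concreteness, the DeepChem-style 80/10/10 scaffold split can be produced as in the sketch below. Loader and splitter names follow DeepChem’s MoleculeNet API; the splitter=None behavior is assumed from the DeepChem documentation, and the MoLFormer splitting algorithm is not reproduced here.

import deepchem as dc

# Built-in MoleculeNet loader using DeepChem's scaffold splitter (80/10/10 by default).
tasks, (train, valid, test), transformers = dc.molnet.load_bace_classification(
    featurizer="ECFP", splitter="scaffold")

# The same split applied explicitly: splitter=None returns the unsplit dataset
# (assumed behavior; check the installed DeepChem version).
tasks, (full,), transformers = dc.molnet.load_bace_classification(
    featurizer="ECFP", splitter=None)
train, valid, test = dc.splits.ScaffoldSplitter().train_valid_test_split(
    full, frac_train=0.8, frac_valid=0.1, frac_test=0.1)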

Algorithms

  • Framework: DeepChem integrated with Ray for distributed training.
  • Modular Training: Uses ModularTorchModel to allow loss computation from intermediate values and flexible component connection.
  • Training Stability:
    • Schedule: Linear learning rate scheduler with warmup.
    • Instability Handling: Noted loss spikes during warmup; strategy involved checkpointing frequently and restarting from the last stable state upon spikes.
    • Numerical Issues: Addressed NaN values by first pretraining on a small dataset with a low learning rate before scaling up (a sketch of the warmup schedule and spike-recovery loop follows below).
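
A minimal PyTorch sketch of that stability recipe, with illustrative (not paper-reported) warmup length, checkpoint interval, and spike threshold:

import copy
import torch

def linear_warmup_decay(step, warmup_steps=1_000, total_steps=10_000):
    """LR multiplier: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

model = torch.nn.Linear(16, 1)                          # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, linear_warmup_decay)
loss_fn = torch.nn.MSELoss()

last_good = {"model": copy.deepcopy(model.state_dict()),
             "optim": copy.deepcopy(optimizer.state_dict())}
running_loss, spike_factor = None, 3.0

for step in range(2_000):
    x, y = torch.randn(32, 16), torch.randn(32, 1)      # dummy batch
    loss = loss_fn(model(x), y)
    if running_loss is not None and loss.item() > spike_factor * running_loss:
        # Loss spike: roll back to the last stable checkpoint instead of diverging.
        model.load_state_dict(last_good["model"])
        optimizer.load_state_dict(last_good["optim"])
        continue
    optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
    running_loss = loss.item() if running_loss is None else 0.9 * running_loss + 0.1 * loss.item()
    if step % 200 == 0:                                  # checkpoint frequently (to disk in practice)
        last_good = {"model": copy.deepcopy(model.state_dict()),
                     "optim": copy.deepcopy(optimizer.state_dict())}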

Models

  • ChemBERTa: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR).
  • MoLFormer (c3-MoLFormer): Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention).
    • Tokenizer: ibm/MoLFormer-XL-both-10pct (a loading sketch follows after this list).
  • Graph Models:
    • GROVER: Graph Transformer with node/edge/graph level self-supervision.
    • InfoGraph: Maximizes mutual information between graph-level and substructure representations.
    • InfoMax3D: Incorporates 3D conformer data (via RDKit ETKDGv2) into contrastive pretraining.
    • DMPNN: Directed Message Passing Neural Network (Chemprop variant).
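
A minimal loading sketch for the tokenizer named above; trust_remote_code is assumed to be required because the MoLFormer tokenizer code ships with the Hub repository rather than with the transformers package.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ibm/MoLFormer-XL-both-10pct", trust_remote_code=True)

# Tokenize a small batch of SMILES strings.
batch = tokenizer(["CCO", "c1ccccc1O"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)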

Evaluation

  • Metrics: ROC-AUC for classification; RMSE for regression (MAE for QM9).
  • Baselines: Random Forest, GCN, and DMPNN trained on the fine-tuning splits only (no pretraining).
  • Protocol: Three independent runs per configuration to report mean and standard deviation.
  • Key Results:
    • c3-MoLFormer-1.1B achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (MoLFormer splits).
    • Graph models (InfoGraph, GROVER) performed comparably to Transformers on small data but were harder to scale.
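
The aggregation itself is straightforward; the toy snippet below scores three runs with scikit-learn’s ROC-AUC and reports mean ± standard deviation (the numbers are made up, not the paper’s predictions).

import numpy as np
from sklearn.metrics import roc_auc_score

# Predictions from three independent runs on the same test labels (toy values).
y_true = np.array([0, 1, 1, 0, 1, 0])
run_preds = [np.array([0.2, 0.8, 0.7, 0.3, 0.9, 0.1]),
             np.array([0.3, 0.7, 0.35, 0.4, 0.8, 0.2]),
             np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.3])]

scores = [roc_auc_score(y_true, p) for p in run_preds]
print(f"ROC-AUC: {np.mean(scores):.3f} ± {np.std(scores):.3f}")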

Hardware

  • Cloud (AWS):
    • Compute: 40 NVIDIA T4 GPUs (g4dn.2xlarge and g4dn.12xlarge spot instances).
    • Cost: ~$4000 for MoLFormer 1.1B pretraining.
    • Time: ~10 days (260 hours) for 1.1B model pretraining.
  • On-Premise HPC:
    • Compute: 16 nodes, each with 4 AMD MI300A APUs.
    • Environment: Ray multi-node multi-GPU framework.
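
For reference, a multi-node, multi-GPU job of this kind is roughly what Ray Train’s TorchTrainer expresses; the sketch below mirrors the 40-GPU worker count of the AWS setup, but the training-loop body is a placeholder rather than the paper’s actual job definition.

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Placeholder: each worker would build its data shard, wrap the model with
    # ray.train.torch.prepare_model(), and run the pretraining loop here.
    pass

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-4, "batch_size": 64},   # illustrative values only
    scaling_config=ScalingConfig(num_workers=40, use_gpu=True),
)
# result = trainer.fit()  # launches one worker per GPU across the Ray cluster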

Citation

@misc{singhChemBERTa3OpenSource2025,
  title = {{{ChemBERTa-3}}: {{An Open Source Training Framework}} for {{Chemical Foundation Models}}},
  shorttitle = {{{ChemBERTa-3}}},
  author = {Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath},
  year = 2025,
  month = aug,
  publisher = {ChemRxiv},
  doi = {10.26434/chemrxiv-2025-4glrl-v2},
  archiveprefix = {ChemRxiv}
}