Core Contribution: An Open-Source Framework

This is primarily a Resource ($\Psi_{\text{Resource}}$) paper, with secondary Method ($\Psi_{\text{Method}}$) contributions.

  • Resource Basis: The core contribution is “ChemBERTa-3,” an open-source framework integrated into DeepChem that standardizes the pretraining and benchmarking of chemical foundation models. The authors focus heavily on infrastructure (AWS/Ray integration) and correcting benchmarking inconsistencies in the field.
  • Method Basis: Trains models such as “c3-MoLFormer” to validate the infrastructure by reproducing published results.

The Pretraining Scalability Challenge

  • Scalability Challenges: Building robust molecular models is difficult due to the vast size of chemical space and the computational intensity of pretraining on large datasets.
  • Proprietary Barriers: Many high-performing chemical foundation models (e.g., the full MoLFormer-XL) are partially closed-source or difficult to reproduce.
  • Benchmarking Inconsistencies: There is a lack of systematic comparison between architectures (e.g., Graph vs. Transformer) using unified protocols. Specifically, previous comparisons relied on reported results obtained with differing scaffold splitting algorithms, making direct comparisons unreliable.

Unified Infrastructure & Standardized Benchmarking

  • Unified Infrastructure: Integration of DeepChem with Ray for distributed, scalable pretraining and fine-tuning of both graph and transformer models.
  • Standardized Benchmarking: Identification that MoLFormer’s scaffold splitting algorithm differs from the standard DeepChem/MoleculeNet splitter, and the subsequent standardization of these benchmarks for fair comparison.
  • New DeepChem Tools: Introduction of the ModularTorchModel class for flexible loss computation and HuggingFaceModel wrappers to bridge ecosystems.
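The ModularTorchModel idea above (loss computed from intermediate values, with freely connectable components) can be illustrated without reproducing the actual DeepChem API. The class and attribute names below (`ModularModel`, `components`, `loss_fn`) are illustrative assumptions, not DeepChem's interface; this is a minimal sketch of the pattern in plain Python:

```python
# Sketch of the ModularTorchModel pattern: named components form a pipeline,
# and the loss function can read *intermediate* activations, not just the
# final output. Names are illustrative, not the real DeepChem API.

class ModularModel:
    def __init__(self, components, loss_fn):
        self.components = components  # ordered dict: name -> callable
        self.loss_fn = loss_fn        # sees all intermediates

    def forward(self, x):
        intermediates = {"input": x}
        h = x
        for name, fn in self.components.items():
            h = fn(h)
            intermediates[name] = h
        return h, intermediates

    def loss(self, x, target):
        # The loss can combine the head output with, e.g., a penalty on the
        # encoder's intermediate representation.
        _, intermediates = self.forward(x)
        return self.loss_fn(intermediates, target)


# Toy instance: "encoder" doubles the inputs, "head" sums them, and the loss
# adds a small L2 penalty on the encoder's intermediate output.
model = ModularModel(
    components={"encoder": lambda x: [v * 2 for v in x],
                "head": lambda h: sum(h)},
    loss_fn=lambda inter, t: (inter["head"] - t) ** 2
                             + 0.01 * sum(v * v for v in inter["encoder"]),
)
```

The point of the pattern is that swapping a component or changing how the loss taps intermediates requires no change to the training loop.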

Benchmarking Transformers vs. Graph Models

  • Architecture Comparison: Benchmarked Transformers (ChemBERTa, MoLFormer) against Graph models (GROVER, InfoGraph, InfoMax3D, DMPNN, GCN) and baselines (Random Forest).
  • Pretraining Scale Disparity:
    • Transformers were pretrained on ZINC20 subsets ranging from 10M to 1.1B molecules (combining ZINC and PubChem).
    • Graph models were limited to 250K molecule subsets due to memory and computational overhead of message passing on large graphs. While this highlights the superior scalability of Transformer architectures, comparing a 1.1B-trained Transformer to a 250K-trained Graph model provides an unbalanced evaluation of architectural capacity.
  • Reproducibility Validation: Trained “c3-MoLFormer” (a reproduction of MoLFormer) on 1.1B molecules using two distinct hardware setups: AWS spot instances (Ray) and a local HPC cluster.
  • Scaffold Split Analysis: Compared performance metrics using “DeepChem scaffold splits” vs. “MoLFormer scaffold splits” to quantify the impact of data leakage/overlap.

Overcoming Scaffold Splitting Inconsistencies

  • Scaling Transformers vs. Graphs: Transformer-based models are significantly easier to scale to large datasets than current graph-based approaches, though performance is comparable at small scales.
  • Benchmarking Sensitivity: MoLFormer’s reported superiority over baselines was partly inflated by its specific scaffold splitting method, which had higher structural overlap between train and test sets (yielding a lower Tanimoto distance, generally quantified via $1 - \frac{|A \cap B|}{|A \cup B|}$) than DeepChem splits. When standardized, baselines like DMPNN perform more competitively.
  • Infrastructure Viability: The framework successfully replicated large-scale training (MoLFormer-1.1B) on both cloud and on-premise HPC, confirming reproducibility.
  • Open Source Release: All code, configurations, and the c3-MoLFormer-1.1B model weights are released to facilitate future research.
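The Tanimoto distance quoted above follows directly from set representations of molecular fingerprints. In practice one would compare hashed Morgan/ECFP bit sets (e.g., via RDKit), but a set-based sketch captures the quantity being measured:

```python
# Tanimoto distance between two molecules represented as sets of fingerprint
# bits (or substructure keys): 1 - |A ∩ B| / |A ∪ B|. A lower train/test
# distance means more structural overlap, i.e. an easier benchmark split.

def tanimoto_distance(a: set, b: set) -> float:
    if not a and not b:
        return 0.0  # two empty fingerprints: define as identical
    return 1.0 - len(a & b) / len(a | b)
```

For example, bit sets `{1, 2, 3}` and `{2, 3, 4}` share 2 of 4 bits, giving a distance of 0.5.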

Reproducibility Details

Data

  • Pretraining:
    • Source: ZINC20 (1.4B compounds) and PubChem.
    • Scale: Subsets of 10M, 100M, and 1.1B (100% ZINC20 + 100% PubChem) were used for Transformers. Graph models used a 250K subset.
  • Fine-tuning:
    • Suite: MoleculeNet.
    • Tasks: Classification (BACE, BBBP, Tox21, HIV, SIDER, ClinTox) and Regression (ESOL, FreeSolv, Lipophilicity, QM9).
    • Splits: Critical distinction made between “DeepChem scaffold splits” (80/10/10) and “MoLFormer scaffold splits” (which can be downloaded from https://ibm.ent.box.com/v/MoLFormer-data). The paper notes these algorithms differ.
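The splitting discrepancy above comes down to algorithmic details of scaffold-based splitting. As a hedged sketch (not the DeepChem or MoLFormer implementation), assuming Bemis-Murcko scaffolds have already been computed per molecule, a greedy scaffold split looks like this; real splitters differ precisely in scaffold definition and group ordering, which is the source of the discrepancy:

```python
from collections import defaultdict

def scaffold_split(smiles, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split: group molecules by scaffold string, then place
    whole groups (largest first) into train/valid/test so no scaffold is
    shared across splits. Illustrative sketch, not a specific library's
    implementation."""
    groups = defaultdict(list)
    for i, scaf in enumerate(scaffolds):
        groups[scaf].append(i)
    # Largest scaffold groups first; ties broken deterministically by index.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g))
    n = len(smiles)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train += group
        elif len(valid) + len(group) <= frac_valid * n:
            valid += group
        else:
            test += group
    return train, valid, test
```

Because entire scaffold groups move between splits, small ordering differences between implementations can shift many molecules at once, which is why the two splitters yield measurably different benchmarks.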

Algorithms

  • Framework: DeepChem integrated with Ray for distributed training. To recreate the environment, the repository relies on a nightly version of DeepChem (pip install --pre deepchem) and specific dependencies found within the requirements.txt. Pretraining scripts are available in the chemberta3_benchmarking/pretraining directory of the repository.
  • Data Preparation: Featurization workflows (e.g., CircularFingerprint, RDKitConformer) are documented under chemberta3_benchmarking/data/data_preprocessing/ in the codebase.
  • Modular Training: Uses ModularTorchModel to allow loss computation from intermediate values and flexible component connection.
  • Training Brittleness:
    • Optimizer: Linear learning rate scheduler with warmup.
    • Instability Handling: The authors observed significant loss spikes during warmup. Their primary mitigation strategy involved checkpointing frequently and restarting from the last stable state upon a spike, highlighting a persistent brittleness in optimizing these large chemical foundation models.
    • Numerical Issues: Addressed NaN values by pretraining on a small dataset with low LR before scaling up.
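The warmup behavior above can be made concrete. The paper specifies a linear scheduler with warmup; the post-warmup linear decay to zero below is an assumption for illustration:

```python
def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given step: linear ramp from 0 to peak_lr over
    warmup_steps, then (assumed) linear decay to 0 at total_steps. Loss
    spikes reported by the authors occurred during the ramp phase, hence
    their frequent-checkpoint-and-restart mitigation."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```

With `warmup_steps=10` and `total_steps=110`, the rate climbs to the peak at step 10 and falls back to half-peak by step 60.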

Models

  • ChemBERTa: RoBERTa-based architecture trained with Masked Language Modeling (MLM) and Multitask Regression (MTR). Specific model identifiers (e.g., DeepChem/ChemBERTa-100M-MLM) are hosted on Hugging Face so researchers can pull them directly via the transformers library. The core pretraining objective minimized the standard MLM loss: $$ \mathcal{L}_{\text{MLM}} = - \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \log \hat{y}_{i} $$ where $\mathcal{M}$ is the set of masked SMILES token indices and $\hat{y}_{i}$ is the model’s predicted probability for the correct token given the corrupted sequence context.
  • MoLFormer (c3-MoLFormer): Re-implementation of the MoLFormer architecture (Rotary embeddings, linear attention). Specific model identifiers (e.g., DeepChem/MoLFormer-c3-1.1B) are similarly available on Hugging Face.
    • Tokenizer: ibm/MoLFormer-XL-both-10pct tokenizer.
  • Graph Models:
    • GROVER: Graph Transformer with node/edge/graph level self-supervision.
    • InfoGraph: Maximizes mutual information between graph-level and substructure representations.
    • InfoMax3D: Incorporates 3D conformer data (via RDKit ETKDGv2) into contrastive pretraining.
    • DMPNN: Directed Message Passing Neural Network (Chemprop variant).
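The MLM objective quoted for ChemBERTa is just the average negative log-likelihood over masked positions, which is easy to write out numerically:

```python
import math

def mlm_loss(probs_at_masked):
    """L_MLM = -(1/|M|) * sum over masked positions of log p_i, where each
    entry of probs_at_masked is the model's predicted probability of the
    *correct* token at one masked SMILES position."""
    return -sum(math.log(p) for p in probs_at_masked) / len(probs_at_masked)
```

A perfect model (probability 1.0 at every masked position) yields a loss of 0; probabilities of e^-1 and e^-3 at two masked positions average to a loss of 2.0.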

Evaluation

  • Metrics: ROC-AUC for classification; RMSE for regression (MAE for QM9).
  • Baselines: Random Forest, GCN, DMPNN trained on fine-tuning splits only.
  • Protocol: Three independent runs per configuration to report mean and range (not a confidence interval), with the exception of the compute-heavy QM9 dataset, which only received a single run. Benchmarking execution scripts (e.g., GCN, RF, DMPNN, ChemBERTa) are stored in the repo under chemberta3_benchmarking/models_benchmarking/ and contain the specific fine-tuning hyperparameters and optimizer configurations used for each downstream task.
  • Key Results:
    • c3-MoLFormer-1.1B achieved ~0.848 ROC-AUC on BACE and ~0.900 on BBBP (using MoLFormer splits). This closely matches the original IBM MoLFormer metrics, validating the reproducibility of the open-source framework.
    • When constrained to the equivalent 250K subset, Graph models (InfoGraph, GROVER) performed comparably to Transformers, indicating that Transformer superiority in chemistry is largely driven by data scalability rather than an inherent architectural advantage at small scales.
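Since the protocol reports mean and range (rather than a confidence interval) over three runs, the summary statistic is simply:

```python
def mean_and_range(scores):
    """Summarize repeated runs as (mean, range). The range is max - min,
    which is what the protocol reports instead of a confidence interval."""
    return sum(scores) / len(scores), max(scores) - min(scores)
```

For three hypothetical ROC-AUC runs of 0.84, 0.85, and 0.86, this reports 0.85 ± a range of 0.02.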

Hardware

  • Cloud (AWS):
    • Compute: 40 NVIDIA T4 GPUs (g4dn.12xlarge spot instances for pretraining, g4dn.2xlarge for benchmarking).
    • Cost: ~$4000 for MoLFormer 1.1B pretraining.
    • Time: ~10 days (260 hours) for 1.1B model pretraining.
    • Setup: Setup scripts for single-node and multi-node spot EC2 clusters are provided in the GitHub repository’s infra/ and spot/ folders.
  • On-Premise HPC:
    • Compute: 16 nodes (AMD EPYC), each with 4 AMD MI300A APUs.
    • Environment: Ray multi-node multi-GPU framework.

Artifacts

| Artifact | Type | License | Notes |
| --- | --- | --- | --- |
| ChemBERTa-3 GitHub Repository | Code | Unknown | Training, fine-tuning, and benchmarking framework |
| DeepChem/MoLFormer-c3-1.1B | Model | Unknown | MoLFormer re-implementation pretrained on 1.1B molecules |
| DeepChem/ChemBERTa-100M-MLM | Model | Unknown | ChemBERTa pretrained on 100M ZINC molecules |
| DeepChem/MoLFormer-c3-100M | Model | Unknown | MoLFormer pretrained on 100M molecules |
| DeepChem/MoLFormer-c3-550M | Model | Unknown | MoLFormer pretrained on 550M molecules |

Paper Information

Citation: Singh, R. et al. (2026). ChemBERTa-3: an open source training framework for chemical foundation models. Digital Discovery, 5, 662-685. https://doi.org/10.1039/D5DD00348B

Publication: Digital Discovery 2026

BibTeX:

@article{singhChemBERTa3OpenSource2026,
  author = {Singh, Riya and Barsainyan, Aryan Amit and Irfan, Rida and Amorin, Connor Joseph and He, Stewart and Davis, Tony and Thiagarajan, Arun and Sankaran, Shiva and Chithrananda, Seyone and Ahmad, Walid and Jones, Derek and McLoughlin, Kevin and Kim, Hyojin and Bhutani, Anoushka and Sathyanarayana, Shreyas Vinaya and Viswanathan, Venkat and Allen, Jonathan E. and Ramsundar, Bharath},
  title = {{{ChemBERTa-3}}: an open source training framework for chemical foundation models},
  journal = {Digital Discovery},
  year = {2026},
  volume = {5},
  pages = {662-685},
  publisher = {The Royal Society of Chemistry},
  doi = {10.1039/D5DD00348B},
  url = {https://doi.org/10.1039/D5DD00348B}
}