SPMM: A Bidirectional Molecular Foundation Model

A Multimodal Foundation Model for Structure-Property Comprehension

This Method paper introduces the Structure-Property Multi-Modal foundation model (SPMM), a transformer-based architecture that treats SMILES strings and molecular property vectors (PVs) as two separate modalities and learns to align them in a shared embedding space. The primary contribution is bidirectional generation from a single pre-trained model: given a property vector, SPMM can generate molecules (inverse-QSAR), and given a SMILES string, it can predict all 53 properties simultaneously. The model also transfers to unimodal downstream tasks, including MoleculeNet benchmarks and reaction prediction.

Bridging the Gap Between Molecular Structure and Properties

Existing chemical pre-trained models typically learn representations from a single modality (SMILES, graphs, or fingerprints) and fine-tune for specific downstream tasks. While some approaches have attempted multimodal learning by combining SMILES with graph representations or InChI strings, these modalities encode nearly identical structural information, limiting the potential for emergent cross-modal knowledge.

The key gap SPMM addresses is the lack of multimodal pre-training that incorporates genuinely complementary modalities. Prior conditional molecule generation methods could typically control only a small number of properties simultaneously and required retraining when target properties changed. The authors draw on successes in vision-language pre-training (VLP), where aligning image and text modalities has enabled rich bidirectional understanding, and apply similar ideas to molecular structure and property domains.

Treating Property Vectors as a Language

The core innovation in SPMM is treating a collection of 53 RDKit-computed molecular properties as a “language” where each property value is analogous to a word token. This design allows the model to attend to individual properties independently rather than treating the entire property vector as a single fixed-length condition.

Dual-Stream Architecture

SPMM follows the dual-stream VLP architecture. The model has three components:

SMILES Encoder: 6 BERT-base layers that encode tokenized SMILES (using a 300-subword BPE vocabulary) via self-attention
PV Encoder: 6 BERT-base layers that encode the 53 normalized property values (each passed through a linear layer) with learnable positional embeddings
Fusion Encoder: 6 BERT-base layers with cross-attention that combines both modalities, using one modality’s features as queries and the other as keys/values
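To make the PV encoder input concrete, the following minimal sketch turns a raw property vector into a sequence of token embeddings. It assumes a single scalar-to-vector linear layer shared across properties and z-score normalization; the source only states that each normalized value passes through a linear layer and receives a learnable positional embedding, so the function name, the sharing choice, and the tiny dimensions are illustrative assumptions.

```python
def embed_property_vector(values, means, stds, weight, bias, pos_emb):
    """Map a raw property vector to one d-dimensional token per property.

    values, means, stds: lists of length n_props
    weight, bias: lists of length d (one shared linear layer; an assumption)
    pos_emb: n_props x d table standing in for learnable positional embeddings
    """
    tokens = []
    for i, (v, mu, sd) in enumerate(zip(values, means, stds)):
        z = (v - mu) / sd  # normalize the property value (z-score assumed)
        # linear layer on the scalar, then add the positional embedding
        tokens.append([w * z + b + p for w, b, p in zip(weight, bias, pos_emb[i])])
    return tokens
```

Because every property becomes its own token, the fusion encoder's cross-attention can weight individual properties rather than one fixed-length condition vector.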

Pre-training Objectives

The model is pre-trained with four complementary losses:

Contrastive Learning aligns SMILES and PV features in a shared embedding space. For [CLS] token outputs $\mathbf{S}_{cls}$ and $\mathbf{P}_{cls}$:

$$ \text{sim}(\mathbf{S}, \mathbf{P}) = \left(h_{S}(\mathbf{S}_{cls})\right)^{\top} h_{P}(\mathbf{P}_{cls}) $$

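Here $h_{S}$ and $h_{P}$ are projection heads that map each [CLS] output into the shared embedding space. The intermodal similarities are then normalized with a softmax over the $N$ candidates, using a learnable temperature $\tau$: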

$$ s_{s2p} = \frac{\exp(\text{sim}(\mathbf{S}, \mathbf{P}) / \tau)}{\sum_{n=1}^{N} \exp(\text{sim}(\mathbf{S}, \mathbf{P}_{n}) / \tau)} $$

The remaining similarities $s_{p2s}$, $s_{s2s}$, and $s_{p2p}$ are defined analogously. The contrastive loss uses cross-entropy $H$ with one-hot labels (1 for same-molecule pairs):

$$ L_{\text{contrastive}} = \frac{1}{2}\left(H(y_{s2p}, s_{s2p}) + H(y_{p2s}, s_{p2s}) + H(y_{s2s}, s_{s2s}) + H(y_{p2p}, s_{p2p})\right) $$
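The four-term loss can be sketched in plain Python as follows. This is a simplified illustration, not the authors' implementation: it uses only in-batch candidates (no queue), a fixed temperature of 0.07 in place of the learnable $\tau$, hard one-hot labels, and assumes the features have already been projected. All function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_row(scores, tau):
    # numerically stable softmax of scores / tau
    m = max(s / tau for s in scores)
    exps = [math.exp(s / tau - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def directional_loss(queries, keys, tau):
    """Mean cross-entropy where query i should match key i (one-hot label)."""
    loss = 0.0
    for i, q in enumerate(queries):
        probs = softmax_row([dot(q, k) for k in keys], tau)
        loss += -math.log(probs[i])
    return loss / len(queries)

def contrastive_loss(s_feats, p_feats, tau=0.07):
    # s2p, p2s, s2s, p2p terms, combined with the 1/2 factor from the text
    terms = [
        directional_loss(s_feats, p_feats, tau),
        directional_loss(p_feats, s_feats, tau),
        directional_loss(s_feats, s_feats, tau),
        directional_loss(p_feats, p_feats, tau),
    ]
    return 0.5 * sum(terms)
```

When the SMILES and PV features of the same molecule point in the same direction, the loss is close to zero; swapping the pairing makes the two cross-modal terms large.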

Next Word Prediction (NWP) trains autoregressive SMILES generation conditioned on the PV:

$$ L_{NWP} = \sum_{i=1}^{n} H\left(y_{i}^{NWP}, p^{NWP}(s_{i} \mid s_{0:i-1}, \mathbf{P})\right) $$

Next Property Prediction (NPP) applies the same autoregressive concept to property values, using mean-square-error loss:

$$ L_{NPP} = \sum_{i=1}^{n} \left(p_{i} - \hat{p}_{i}(p_{0:i-1}, \mathbf{S})\right)^{2} $$
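Both autoregressive losses reduce to a sum over sequence positions. The sketch below assumes the per-step predictions have already been produced by the model under teacher forcing (each step conditioned on the ground-truth prefix and the other modality); the function names and inputs are illustrative, not taken from the SPMM code.

```python
import math

def nwp_loss(step_probs, target_ids):
    """Sum of per-token cross-entropies.

    step_probs[i] is the predicted distribution over the vocabulary for
    token i, given the preceding tokens and the property vector.
    """
    return sum(-math.log(probs[t]) for probs, t in zip(step_probs, target_ids))

def npp_loss(predicted, target):
    """Sum of squared errors between predicted and true property values."""
    return sum((t - p) ** 2 for p, t in zip(predicted, target))
```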

SMILES-PV Matching (SPM) is a binary classification loss predicting whether a SMILES-PV pair originated from the same molecule, trained with hard-negative mining.

The overall pre-training loss combines all four:

$$ L = \widetilde{L}_{\text{contrastive}} + \widetilde{L}_{NWP} + L_{NPP} + L_{SPM} $$

where tildes indicate the use of momentum teacher distillation to soften one-hot labels, acknowledging that multiple valid SMILES-PV pairings may exist.
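Momentum distillation can be illustrated with three small helpers. The EMA coefficient (0.995) and the warm-up of the mixing weight from 0 to 0.4 over the first epoch come from the reproducibility details in this summary; the exact mixing rule shown here, a convex combination of the one-hot label and the teacher's prediction, is an assumption borrowed from common vision-language pre-training practice, and all names are hypothetical.

```python
def ema_update(teacher, student, lam=0.995):
    """Exponential moving average of parameters: teacher tracks the student slowly."""
    return [lam * t + (1.0 - lam) * s for t, s in zip(teacher, student)]

def soft_targets(one_hot, teacher_probs, alpha):
    """Soften one-hot labels with the momentum teacher's distribution (assumed rule)."""
    return [(1.0 - alpha) * y + alpha * q for y, q in zip(one_hot, teacher_probs)]

def alpha_schedule(step, steps_in_first_epoch, alpha_max=0.4):
    """Linear warm-up of the mixing weight over the first epoch, then constant."""
    return alpha_max * min(1.0, step / steps_in_first_epoch)
```

With `alpha = 0` the targets are the original one-hot labels, so training starts as ordinary supervised learning and leans on the teacher only as its predictions become informative.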

Random Property Masking

During pre-training, 50% of property values are randomly replaced with a special [UNK] token. This serves three purposes: preventing overfitting to specific properties, augmenting data, and enabling flexible inference where users can specify any subset of the 53 properties as generation conditions. The model can handle all $2^{53}$ possible property combinations at inference time despite never seeing most of them during training.
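A minimal sketch of the masking step and of building a partially specified condition at inference time is shown below. It assumes each property is masked independently with probability 0.5, which is one reading of "50% of property values"; masking exactly half of the entries would be another. `UNK`, `mask_properties`, and `condition_on` are illustrative names.

```python
import random

UNK = None  # stand-in for the special [UNK] token

def mask_properties(pv, mask_prob=0.5, rng=None):
    """Randomly replace property values with UNK during pre-training."""
    rng = rng or random.Random()
    return [UNK if rng.random() < mask_prob else v for v in pv]

def condition_on(subset, n_props=53):
    """Build an inference-time PV where only the given {index: value} entries are set."""
    pv = [UNK] * n_props
    for idx, value in subset.items():
        pv[idx] = value
    return pv
```

For example, `condition_on({0: 150.0})` specifies a single property and leaves the other 52 as [UNK], while `condition_on({})` corresponds to unconditional generation. Which index maps to which property is not specified here.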

Experiments Across Bidirectional and Unimodal Tasks

PV-to-SMILES Generation (Conditional Molecule Design)

The authors evaluate SPMM on multiple generation scenarios using 1000 unseen PubChem PVs:

Sampling	Input PV	Validity	Uniqueness	Novelty	Norm. RMSE
Deterministic	1000 unseen PVs	0.995	0.999	0.961	0.216
Stochastic	Full PV (molecule 1)	0.974	0.905	0.998	0.185
Stochastic	Molar mass = 150	0.974	0.945	0.872	0.192
Stochastic	4 properties controlled	0.998	0.981	0.952	0.257
Stochastic	No control (all [UNK])	0.971	0.991	0.950	-

The normalized RMSE of 0.216 across 53 properties indicates that generated molecules closely match the input property conditions. The model can also perform unconditional generation (all properties masked) where outputs follow the pre-training distribution. The authors report that SPMM outperforms benchmark models including MolGAN, GraphVAE, and scaffold-based graph generative models in both conditional and unconditional settings (Supplementary Table 1).
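The four metrics in the table can be computed along the following lines. Definitions vary between papers, so this sketch states its assumptions: uniqueness is measured among valid molecules, novelty among unique valid molecules absent from the training set, and normalized RMSE divides each property error by a per-property standard deviation. The validity check is passed in as a predicate (in practice typically a cheminformatics parser), and SMILES canonicalization is omitted.

```python
import math

def generation_metrics(generated, is_valid, training_set):
    """Validity, uniqueness, and novelty for a list of generated SMILES strings."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated),
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

def normalized_rmse(target, achieved, stds):
    """RMSE between target and achieved properties, each scaled by its std."""
    sq = [((t - a) / sd) ** 2 for t, a, sd in zip(target, achieved, stds)]
    return math.sqrt(sum(sq) / len(sq))
```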

SMILES-to-PV Generation (Multi-Property Prediction)

When given 1000 unseen ZINC15 molecules, SPMM predicts all 53 properties autoregressively with a mean $r^{2}$ of 0.924 across all properties.
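For reference, a mean $r^{2}$ over properties can be computed as below. The sketch assumes $r^{2}$ means the coefficient of determination; if the squared Pearson correlation was used instead, the values would differ slightly. Function names are illustrative.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination for one property across molecules."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mean_r_squared(true_by_property, pred_by_property):
    """Average r^2 over properties; each entry is a list of values per molecule."""
    scores = [r_squared(t, p) for t, p in zip(true_by_property, pred_by_property)]
    return sum(scores) / len(scores)
```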

MoleculeNet Benchmarks

Using only the SMILES encoder (6 BERT layers), SPMM achieves best or competitive performance on 9 MoleculeNet tasks:

Dataset	Metric	SPMM	Best Baseline	Baseline Model
ESOL	RMSE	0.817	0.798	ChemRL-GEM
LIPO	RMSE	0.681	0.660	ChemRL-GEM
FreeSolv	RMSE	1.868	1.877	ChemRL-GEM
BACE (reg)	RMSE	1.041	1.047	MolFormer
Clearance	RMSE	42.607	43.175	MolFormer
BBBP	AUROC	75.1%	73.6%	MolFormer
BACE (cls)	AUROC	84.4%	86.3%	MolFormer
ClinTox	AUROC	92.7%	91.2%	MolFormer
SIDER	AUROC	66.9%	67.2%	ChemRL-GEM

SPMM achieved best performance on 5 of 9 tasks, with notable gains on BBBP (75.1% vs. 73.6%) and ClinTox (92.7% vs. 91.2%). Without pre-training, all scores dropped substantially.

DILI Classification

On Drug-Induced Liver Injury prediction, SPMM achieved 92.6% AUROC, outperforming the 5-ensemble model of Ai et al. (90.4% AUROC) while using a single model.

Reaction Prediction

On USPTO-480k forward reaction prediction, SPMM achieved 91.5% top-1 accuracy, the highest among all models tested (including Chemformer at 91.3%). On USPTO-50k retro-reaction prediction, SPMM reached 53.4% top-1 accuracy, second only to Chemformer (54.3%) among string-based models.

Bidirectional Generation From a Single Pre-trained Model

SPMM demonstrates that multimodal pre-training with genuinely complementary modalities (structure and properties, rather than structurally redundant representations) enables a single foundation model to handle both generation directions and downstream unimodal tasks. Key findings include:

Flexible conditional generation: The [UNK] masking strategy allows controlling any subset of 53 properties at inference time without retraining, a capability not demonstrated by prior methods.
Interpretable cross-attention: Attention visualizations show that the model learns chemically meaningful structure-property relationships (e.g., hydrogen bonding properties attend to oxygen and nitrogen atoms; ring count properties attend to ring tokens).
Competitive unimodal transfer: Despite using only 6 BERT layers and 50M pre-training molecules (fewer than ChemBERTa-2’s 77M or Chemformer’s 100M), the SMILES encoder alone achieves the best result on 5 of 9 MoleculeNet tasks and the highest forward reaction prediction accuracy among tested models.

Limitations

The authors acknowledge several limitations:

SMILES representation constraints: Implicit connectivity information in SMILES means small structural changes can cause drastic string changes. Graph representations could be a complementary alternative.
Stereochemistry blindness: All 53 RDKit properties used are invariant to stereochemistry, meaning different stereoisomers produce identical PVs. The contrastive loss then forces their SMILES encoder outputs to converge, which the authors identify as the primary factor limiting MoleculeNet performance on stereo-sensitive tasks.
No wet-lab validation: Generated molecules and predicted properties are not experimentally verified.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Pre-training	PubChem	50M molecules	SMILES + 53 RDKit properties
Property prediction	MoleculeNet (9 tasks)	642-4200 per task	Scaffold split via DeepChem (8:1:1)
DILI classification	Ai et al. dataset	Not specified	Following published preparation
Forward reaction	USPTO-480k	479,035 pairs	Reactant-product pairs
Retro reaction	USPTO-50k	50,037 pairs	Product-reactant pairs, no reaction types used
SMILES-to-PV test	ZINC15	1000 molecules	Not in pre-training set

Algorithms

Tokenization: BPE with 300-subword dictionary
Property masking: 50% random replacement with [UNK] during pre-training
Momentum distillation: EMA parameter $\lambda = 0.995$, soft-label mixing $\alpha$ linearly warmed from 0 to 0.4 over first epoch
Contrastive queue: Size $k = 24{,}576$ for storing recent SMILES and PV instances
Beam search: $k = 2$ for PV-to-SMILES generation
SMILES augmentation: Random non-canonical augmentation with probability 0.5 for reaction tasks
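The beam search setting above ($k = 2$) can be illustrated with a generic decoder loop. This is a textbook-style sketch, not the SPMM decoding code: `next_log_probs` is a hypothetical callback that returns a token-to-log-probability mapping for a given prefix (in SPMM this role would be played by the fusion encoder conditioned on the PV), and finished hypotheses are simply set aside without length normalization.

```python
def beam_search(next_log_probs, beam_width=2, max_len=10, eos="<eos>"):
    """Return the highest-scoring token sequence found with a fixed-width beam."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        # keep only the top beam_width extensions
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:
            break
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]
```

Compared with greedy decoding, keeping two hypotheses lets the decoder recover a sequence whose first token is not the single most likely one but whose overall probability is higher.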

Models

Architecture: 6 BERT-base encoder layers each for SMILES encoder, PV encoder, and fusion encoder (18 total layers)
Vocabulary: 300 BPE subwords for SMILES; 53 property tokens for PV
Pre-trained weights: Available via GitHub

Evaluation

Task	Metric	Value	Notes
PV-to-SMILES (deterministic)	Validity	99.5%	1000 unseen PubChem PVs
PV-to-SMILES (deterministic)	Normalized RMSE	0.216	Across 53 properties
SMILES-to-PV	Mean $r^{2}$	0.924	1000 ZINC15 molecules
Forward reaction (USPTO-480k)	Top-1 accuracy	91.5%	Best among all tested models
Retro reaction (USPTO-50k)	Top-1 accuracy	53.4%	Second-best string-based
DILI classification	AUROC	92.6%	Single model vs. 5-ensemble

Hardware

Pre-training: 8 NVIDIA A100 GPUs, approximately 52,000 batch iterations, roughly 12 hours
Batch size: 96
Optimizer: AdamW with weight decay 0.02
Learning rate: Warmed up to $10^{-4}$, cosine decay to $10^{-5}$
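The learning-rate schedule can be written as a small function. The peak ($10^{-4}$), floor ($10^{-5}$), and total of roughly 52,000 iterations come from the details above; the linear warm-up shape and the `warmup_steps` value are assumptions, since the summary does not state them.

```python
import math

def learning_rate(step, total_steps=52_000, warmup_steps=1_000,
                  peak=1e-4, floor=1e-5):
    """Linear warm-up to `peak`, then cosine decay to `floor`."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return floor + (peak - floor) * cosine
```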

Artifacts

Artifact	Type	License	Notes
SPMM Source Code	Code	Apache-2.0	Official implementation with experimental scripts
SPMM Zenodo Archive	Code	Apache-2.0	Archived version for reproducibility
PubChem	Dataset	Public domain	50M molecules for pre-training
MoleculeNet	Dataset	Varies	Benchmark datasets via DeepChem

Paper Information

Citation: Chang, J., & Ye, J. C. (2024). Bidirectional generation of structure and properties through a single molecular foundation model. Nature Communications, 15, 2323. https://doi.org/10.1038/s41467-024-46440-3

@article{chang2024bidirectional,
  title={Bidirectional generation of structure and properties through a single molecular foundation model},
  author={Chang, Jinho and Ye, Jong Chul},
  journal={Nature Communications},
  volume={15},
  pages={2323},
  year={2024},
  doi={10.1038/s41467-024-46440-3}
}
