PharMolixFM: Multi-Modal All-Atom Molecular Models

A Unified Framework for All-Atom Molecular Foundation Models

PharMolixFM is a Method paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.

Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.

First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.

Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.

PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.

The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.

The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:

$$ p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle) $$

The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:

$$ q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})}) , q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})}) , q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})}) $$

A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This “Fix” mechanism enables multiple training tasks:

Docking ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.
Structure-based drug design ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.
Robustness augmentation ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.

Three Generative Model Variants

Multi-modal diffusion (PharMolixFM-Diff) uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:

$$ q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\sqrt{\alpha_{i,j}} , \mathbf{X}_{0,j}, (1 - \alpha_{i,j}) \mathbf{I}), \quad \alpha_{i,j} = \prod_{k=1}^{i}(1 - \sigma_{i,j}^{(\mathbf{X})}) $$

$$ q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T $$

The training loss combines coordinate MSE with cross-entropy for discrete variables:

$$ \mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} | \tilde{\mathbf{X}}_0 - \mathbf{X}_0 |_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right] $$

Multi-modal flow matching (PharMolixFM-Flow) constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.

Bayesian flow networks (PharMolixFM-BFN) perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:

$$ p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})} $$

Network Architecture

The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.

Docking and Drug Design Experiments

Protein-Small-Molecule Docking

PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) using the holo docking setting with a known protein structure and 10 Angstrom binding pocket. The metric is the ratio of predictions with RMSD < 2 Angstrom.

Method	Self-Ranking (%)	Oracle-Ranking (%)
DiffDock	38.0	-
RFAA	42.0	-
Vina	52.3	-
UniMol-Docking V2	77.6	-
SurfDock	78.0	-
AlphaFold3	90.4	-
PocketXMol (50 repeats)	82.2	95.3
PharMolixFM-Diff (50 repeats)	83.4	96.0
PharMolixFM-Flow (50 repeats)	73.4	93.7
PharMolixFM-BFN (50 repeats)	78.5	93.5
PharMolixFM-Diff (500 repeats)	83.9	98.1

PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol by 1.7% absolute but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU compared to approximately 249.0 seconds for AlphaFold3 (a 54x speedup). Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.

Structure-Based Drug Design

Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).

Method	Vina Score (Avg/Med)	QED	SA
Pocket2Mol	-5.14 / -4.70	0.57	0.76
TargetDiff	-5.47 / -6.30	0.48	0.58
DecompDiff	-5.67 / -6.04	0.45	0.61
MolCRAFT	-6.61 / -8.14	0.46	0.62
PharMolixFM-Diff	-6.18 / -6.44	0.50	0.73
PharMolixFM-Flow	-6.34 / -6.47	0.49	0.74
PharMolixFM-BFN	-6.38 / -6.45	0.48	0.64

PharMolixFM achieves a better balance between binding affinity and drug-like properties compared to baselines. While MolCRAFT achieves the best Vina scores, PharMolixFM-Diff and Flow variants show notably higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62), which are important for downstream validation and in-vivo application.

Inference Scaling Law

The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:

$$ \text{Acc} = a \log(bR + c) + d $$

where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.

Competitive Docking with Faster Inference, but Limited Task Scope

PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:

Diffusion outperforms flow matching and BFN for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.
Oracle-ranking reveals untapped potential: the gap between self-ranking (83.4%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.
The three variants show similar performance for drug design, suggesting that model architecture and training data may matter more than the generative framework for generation tasks.
Inference scaling laws hold for molecular generative models, paralleling findings in NLP.

Limitations include that the framework is only evaluated on two tasks (docking and SBDD), and the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, which are part of AlphaFold3’s scope. The BFN variant underperforms the diffusion model, which the authors attribute to smaller noise scales at early sampling steps making training less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Training	PDBBind, Binding MOAD, CrossDocked2020, PepBDB	Not specified	Filtered by PocketXMol criteria
Docking eval	PoseBusters benchmark	428 complexes	Holo docking with known protein
SBDD eval	CrossDocked test set	100 pockets	100 molecules per pocket

Algorithms

Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks
Task-specific noise via Fix mechanism (0, 0.7, or 1.0)
Training tasks selected with equal probability per sample
AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$
Linear warmup to learning rate 0.001 over 1000 steps
180K training steps with batch size 40

Models

Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)
kNN graph construction for protein and protein-molecule interactions
Independent prediction heads for coordinates, atom types, bond types
Confidence heads for self-ranking during inference

Evaluation

Metric	PharMolixFM-Diff	AlphaFold3	Notes
RMSD < 2A self-ranking	83.4% (50 rep)	90.4%	PoseBusters docking
RMSD < 2A oracle-ranking	98.1% (500 rep)	-	PoseBusters docking
Inference time (per complex)	~4.6s	~249.0s	Single A800 GPU
Vina score (avg)	-6.18	-	CrossDocked SBDD

Hardware

Training: 4x 80GB A800 GPUs
Inference benchmarked on single A800 GPU

Artifacts

Artifact	Type	License	Notes
OpenBioMed (GitHub)	Code	MIT	Official implementation

Paper Information

Citation: Luo, Y., Wang, J., Fan, S., & Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. arXiv preprint arXiv:2503.21788.

@article{luo2025pharmolixfm,
  title={PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation},
  author={Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing},
  journal={arXiv preprint arXiv:2503.21788},
  year={2025}
}

A Unified Framework for All-Atom Molecular Foundation Models#

Challenges in Multi-Modal Atomic Modeling#

Multi-Modal Denoising with Task-Specific Priors#

Three Generative Model Variants#

Network Architecture#

Docking and Drug Design Experiments#

Protein-Small-Molecule Docking#

Structure-Based Drug Design#

Inference Scaling Law#

Competitive Docking with Faster Inference, but Limited Task Scope#

Reproducibility Details#

Data#

Algorithms#

Models#

Evaluation#

Hardware#

Artifacts#

Paper Information#