<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Neural Network Potentials and Modern Methods on Hunter Heidenreich | ML Research Scientist</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/</link><description>Recent content in Neural Network Potentials and Modern Methods on Hunter Heidenreich | ML Research Scientist</description><image><title>Hunter Heidenreich | ML Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 12 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/index.xml" rel="self" type="application/rss+xml"/><item><title>MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/</guid><description>MB-nrg decomposes polyalanine into n-mer building blocks fit to DLPNO-CCSD(T) references, reaching coupled-cluster accuracy for gas-phase peptide dynamics.</description><content:encoded><![CDATA[<h2 id="a-modular-mb-nrg-method-for-biomolecular-potentials">A Modular MB-nrg Method for Biomolecular Potentials</h2>
<p>This is a <strong>Method</strong> paper. Zhou and colleagues extend the MB-nrg (many-body energy) formalism to covalently bonded biomolecules and build the first coupled-cluster-accurate potential energy function (PEF) for polyalanine in the gas phase. The contribution has three parts: a generalization of the MB-nrg decomposition from whole-molecule 1-mers to functional-group &ldquo;natural building blocks,&rdquo; a DLPNO-CCSD(T)/aug-cc-pVTZ training protocol driven by parallel-bias metadynamics sampling, and a demonstration that the resulting PEF reproduces alanine dipeptide energetics and AceAla$_9$Nme secondary-structure dynamics more faithfully than the Amber ff14SB and ff19SB force fields.</p>
<h2 id="why-empirical-force-fields-fall-short-for-protein-dynamics">Why Empirical Force Fields Fall Short for Protein Dynamics</h2>
<p>Protein dynamics span femtosecond vibrations to millisecond conformational changes, and capturing them at atomic resolution is central to understanding catalysis, allostery, and ligand binding. Classical force fields such as CHARMM, OPLS, and Amber approximate the potential energy surface with pairwise-additive analytical terms. This functional form struggles with the many-body interactions that shape disordered regions of proteins, including exchange-repulsion, charge transfer, charge penetration, and cooperative hydrogen bonding. Polarizable force fields add induced dipoles but remain empirically parameterized and fail to capture short-range many-body effects from electron-density overlap.</p>
<p>Quantum-mechanical methods avoid this, but <a href="https://en.wikipedia.org/wiki/Coupled_cluster">coupled cluster theory</a> scales as $\mathcal{O}(N^7)$ in the number of electrons and even DFT remains $\mathcal{O}(N^3)$ to $\mathcal{O}(N^4)$, ruling out direct ab initio molecular dynamics for biomolecules. Fragmentation methods like molecular fractionation with conjugate caps (MFCC) mitigate the cost, but they truncate the many-body expansion at two bodies and miss long-range hydrogen bonding. <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">Machine-learned force fields (MLFFs)</a> reach near-QM accuracy at lower cost, yet they typically train on DFT data (inheriting delocalization errors and poor dispersion), struggle with interpretability, and extrapolate unreliably. Existing permutationally invariant polynomial (PIP) approaches scale factorially in the number of atoms, capping direct applicability at roughly ten to fifteen atoms per fragment.</p>
<p>MB-nrg PEFs based on the many-body expansion and PIPs have successfully modeled water, halides in water, carbon dioxide, methane, ammonia, dinitrogen pentoxide, and N-methylacetamide. Extending them to covalently bonded biomolecules requires rethinking what counts as a &ldquo;body.&rdquo;</p>
<h2 id="building-polyalanine-from-functional-group-n-mers">Building Polyalanine from Functional-Group n-mers</h2>
<p>The MB-nrg formalism starts from the many-body expansion of the total energy,</p>
<p>$$
E_N(1, \dots, N) = \sum_{i=1}^{N} \varepsilon^{1\mathrm{B}}(i) + \sum_{i&lt;j}^{N} \varepsilon^{2\mathrm{B}}(i,j) + \sum_{i&lt;j&lt;k}^{N} \varepsilon^{3\mathrm{B}}(i,j,k) + \dots + \varepsilon^{N\mathrm{B}}(1, \dots, N)
$$</p>
<p>where each $n$-body contribution is defined recursively as the $n$-mer energy minus all lower-order terms. The full PEF combines physics-based and data-driven components,</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}} = V_{\mathrm{ML}} + V_{\mathrm{phys}}
$$</p>
<p>with $V_{\mathrm{ML}} = V_{\mathrm{ML}}^{1\mathrm{B}} + V_{\mathrm{ML}}^{2\mathrm{B}} + V_{\mathrm{ML}}^{3\mathrm{B}}$ capturing short-range quantum-mechanical interactions, and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}} + V_{\mathrm{rep}}$ supplying electrostatics, dispersion, and repulsion. Dispersion follows a Tang-Toennies damped $C_6/R^6$ form with XDM-derived coefficients; electrostatics uses a Thole-modified self-consistent polarization model inherited from MB-pol; the repulsion term is a Lennard-Jones $R^{-12}$ contribution borrowed from Amber ff14SB, activated only for non-bonded atom pairs not covered by a PIP.</p>
<p>Each data-driven $n$-body term is expressed as</p>
<p>$$
V_{\mathrm{ML}}^{n\mathrm{B}} = \sum_{\mathrm{M}_1 &lt; \dots &lt; \mathrm{M}_n}^{N} s^{n\mathrm{B}}(\mathrm{M}_1, \dots, \mathrm{M}_n)\, V_{\mathrm{PIP}}^{n\mathrm{B}}(\mathrm{M}_1, \dots, \mathrm{M}_n)
$$</p>
<p>where $V_{\mathrm{PIP}}^{n\mathrm{B}}$ is a permutationally invariant polynomial in Morse-like variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$ and $s^{n\mathrm{B}}$ is a switching function.</p>
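<p>To make the structure of one switched PIP term concrete, here is a minimal numpy sketch (not the authors' code). The Morse-like variables follow the definition above; the cosine switch borrows the form given later for the peptide-water cross term, and the plain power basis, the 4-6 Å cutoffs, and the decay constant are placeholders standing in for the fitted, permutationally symmetrized quantities.</p>
<pre><code class="language-python">import numpy as np

def morse_variables(coords_a, coords_b, k_decay):
    """Morse-like variables xi_ij = exp(-k * R_ij) over intermolecular atom pairs."""
    diffs = coords_a[:, None, :] - coords_b[None, :, :]
    r = np.linalg.norm(diffs, axis=-1)              # pairwise distances R_ij (Angstrom)
    return np.exp(-k_decay * r).ravel()

def switching(r, r_in=4.0, r_out=6.0):
    """Cosine switch: 1 inside r_in, 0 beyond r_out, smooth in between."""
    x = np.clip((r - r_in) / (r_out - r_in), 0.0, 1.0)
    return 0.5 * (1.0 + np.cos(np.pi * x))

def switched_pip_2b(coords_a, coords_b, r_sep, k_decay, coeffs, degree=3):
    """Toy 'polynomial in Morse variables' for one dimer.

    The real PIPs use permutationally invariant monomials rather than the plain
    per-variable powers assembled here; r_sep is the separation the switch acts on.
    """
    xi = morse_variables(coords_a, coords_b, k_decay)
    basis = np.concatenate([xi**p for p in range(1, degree + 1)])
    return switching(r_sep) * float(coeffs @ basis)
</code></pre>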
<p>The key extension in this paper, building on earlier work on linear alkanes, is to treat functional groups (not whole molecules) as 1-mers. An Ace-capped, Nme-capped polyalanine chain decomposes into three distinct 1-mer types (-CH-, CH$_3$-, -CONH-), five distinct 2-mer types, and six distinct 3-mer types, for 14 unique PIPs that cover every $n$-mer appearing in any AceAla$_n$Nme chain. Cleaving covalent bonds between 1-mers would produce radicals, so the authors cap dangling valences with &ldquo;ghost&rdquo; hydrogen atoms at fixed C-H (1.14 Å) and N-H (1.09 Å) distances. Each $n$-mer energy is then referenced to its own optimized H-capped structure,</p>
<p>$$
E_n(1, \dots, n) = E_n^{\mathrm{H\text{-}capped}}(1, \dots, n) - E_n^{\mathrm{H\text{-}capped,opt}}(1, \dots, n).
$$</p>
<p>In the current implementation, only covalently bonded $n$-mers receive PIPs, the 2-body contribution from a dimer with one intervening 1-mer is folded into the corresponding 3-body term, and non-bonded 1-mers interact through the Lennard-Jones repulsion alone. Crucially, no whole-chain polyalanine data enters any stage of training: every PIP is parameterized on isolated $n$-mer configurations, and the total energy is reconstructed through the many-body expansion.</p>
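<p>The recursion behind these definitions is easy to state in code. The sketch below uses hypothetical referenced $n$-mer energies and, for simplicity, treats every pair uniformly rather than folding the non-adjacent dimer into the corresponding trimer as the paper does.</p>
<pre><code class="language-python">from itertools import combinations

def two_body(e1, e2):
    """eps_2B(i, j) = E_2(i, j) - eps_1B(i) - eps_1B(j)."""
    return {(i, j): e2[(i, j)] - e1[i] - e1[j] for (i, j) in e2}

def three_body(e1, e2, e3):
    """eps_3B(i, j, k) = E_3(i, j, k) minus all embedded 1- and 2-body terms."""
    eps2 = two_body(e1, e2)
    eps3 = {}
    for (i, j, k), e_trimer in e3.items():
        lower = e1[i] + e1[j] + e1[k]
        lower += sum(eps2[pair] for pair in combinations((i, j, k), 2))
        eps3[(i, j, k)] = e_trimer - lower
    return eps3

# Hypothetical H-capped n-mer energies (kcal/mol), each already referenced
# to its own optimized capped structure as in the equation above.
e1 = {0: 0.8, 1: 1.1, 2: 0.6}
e2 = {(0, 1): 1.4, (0, 2): 1.5, (1, 2): 1.3}
e3 = {(0, 1, 2): 2.9}
print(three_body(e1, e2, e3))   # the genuinely 3-body residual
</code></pre>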
<h2 id="training-on-dlpno-ccsdt-with-metadynamics-sampling">Training on DLPNO-CCSD(T) with Metadynamics Sampling</h2>
<p>Training sets are generated for each of the 14 $n$-mer types using <a href="https://en.wikipedia.org/wiki/Metadynamics">parallel-bias metadynamics (PBMetaD)</a> with partitioned families, biasing heavy-atom bonds, angles, and dihedrals across 300 K, 500 K, and 700 K in LAMMPS interfaced with PLUMED and modified OPLS/CM1A and Amber ff14SB force fields. For each $n$-mer, 200,000 candidate configurations are sampled, then reduced to roughly 10,000-20,000 training configurations (and about 1,000 test configurations) through Mini-batch K-means clustering on chemically equivalent pairwise distances. Reference energies are computed at the DLPNO-CCSD(T)/aug-cc-pVTZ level in ORCA.</p>
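<p>The clustering-based down-selection could look roughly like the scikit-learn sketch below, assuming a feature matrix of chemically equivalent pairwise distances per configuration; the cluster count and the choice of keeping the configuration nearest each centroid are illustrative rather than taken from the paper.</p>
<pre><code class="language-python">import numpy as np
from sklearn.cluster import MiniBatchKMeans

def select_diverse_configs(features, n_select=15_000, seed=0):
    """Pick one representative configuration per cluster.

    features: (n_configs, n_distances) array of chemically equivalent
              pairwise distances, one row per sampled configuration.
    """
    km = MiniBatchKMeans(n_clusters=n_select, random_state=seed, init_size=3 * n_select)
    labels = km.fit_predict(features)
    selected = []
    for c in range(n_select):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        # keep the member closest to the cluster centre
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected)
</code></pre>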
<p>Each PIP minimizes a weighted, ridge-regularized sum of squared errors,</p>
<p>$$
\chi^2 = \sum_{k \in \mathcal{S}} w_k \left[ V^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}(k) \right]^2 + \Gamma^2 \sum_l c_l^2
$$</p>
<p>with $\Gamma = 0.0005$ throughout and low-energy bias weights</p>
<p>$$
w_k = \left( \frac{\delta E}{\varepsilon^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}_{\min} + \delta E} \right)^2.
$$</p>
<p>MB-Fit handles the fit, combining simplex optimization for non-linear parameters $k_{\tau(ij)}$ with ridge regression for the linear coefficients $c_l$.</p>
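<p>The linear half of that fit is an ordinary weighted ridge solve. A minimal sketch in the notation above, assuming the PIP basis matrix has been evaluated for fixed Morse decay constants (the simplex search over those non-linear parameters would wrap around this); the $\delta E$ default echoes the 40 kcal/mol quoted for the peptide-water fits later in this series.</p>
<pre><code class="language-python">import numpy as np

def low_energy_weights(eps, delta_e):
    """w_k = (dE / (eps_k - eps_min + dE))^2, emphasizing low-energy configurations."""
    return (delta_e / (eps - eps.min() + delta_e)) ** 2

def fit_linear_coeffs(basis, eps, delta_e=40.0, gamma=5e-4):
    """Solve min_c sum_k w_k (A_k c - eps_k)^2 + gamma^2 ||c||^2.

    basis: (n_configs, n_terms) PIP basis values per configuration.
    eps:   (n_configs,) reference n-body energies (kcal/mol).
    """
    w = low_energy_weights(eps, delta_e)
    aw = np.sqrt(w)[:, None] * basis
    bw = np.sqrt(w) * eps
    lhs = aw.T @ aw + gamma**2 * np.eye(basis.shape[1])
    return np.linalg.solve(lhs, aw.T @ bw)
</code></pre>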
<p>Table 1 in the paper reports, for each of the 14 PIPs, the polynomial degree (5 for the smaller -CH- and CH$_3$- 1-mers, 3 for the larger -CONH- 1-mer and for all 2-mers and 3-mers), the number of symmetrized monomials (ranging from 635 for the -CH- and CH$_3$- 1-mers to 2871 for the -CONH-CH-CONH- 3-mer), the training-set size, and RMSDs for the train and test splits. All training RMSDs stay below 0.4 kcal/mol and all test RMSDs below 0.5 kcal/mol, with the smallest errors for the -CH- and CH$_3$- 1-mers (0.05 kcal/mol train, 0.14 kcal/mol test) and the largest test RMSD (0.47 kcal/mol) for the -CONH-CH- 2-mer.</p>
<p>MD validations run in LAMMPS interfaced with MBX and PLUMED. For alanine dipeptide metadynamics, bias potentials on the backbone $\varphi$ and $\psi$ angles are deposited every 500 steps with a 1.0 kJ/mol height and 11.46° width over 10 ns trajectories in the NVT ensemble, using the velocity-Verlet integrator with a 0.5 fs time step. Analogous MetaD runs with Amber ff14SB and ff19SB are performed in Amber23. The longer AceAla$_9$Nme trajectories start from fully extended structures and run in a 100 Å × 100 Å × 100 Å gas-phase box.</p>
<h2 id="ccsdt-energy-landscapes-free-energy-surfaces-and-helix-dynamics">CCSD(T) Energy Landscapes, Free-Energy Surfaces, and Helix Dynamics</h2>
<p><strong>Alanine dipeptide 2D PES.</strong> Alanine dipeptide geometries are optimized on a <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">Ramachandran</a> grid with 10° spacing at the RI-MP2/def2-TZVP level and then evaluated at DLPNO-CCSD(T)/aug-cc-pVTZ. Despite never seeing the whole alanine dipeptide during training, MB-nrg closely matches the reference locations and relative energies of four minima ($m_1$ to $m_4$), three maxima ($M_1$ to $M_3$), and one saddle point ($X$). Amber ff14SB and ff19SB capture the minima reasonably well but badly overshoot the barriers: at $M_1$, MB-nrg underestimates the reference by only 2.41 kcal/mol, while ff14SB and ff19SB overestimate it by 7.50 and 7.83 kcal/mol. The authors also note that ff19SB incorrectly orders the secondary minima, predicting $m_3$ lower than $m_2$.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>RMSD overall (kcal/mol)</th>
          <th>RMSD $\leq 10$ kcal/mol</th>
          <th>RMSD $&gt; 10$ kcal/mol</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MB-nrg</td>
          <td>1.27</td>
          <td>1.18</td>
          <td>1.59</td>
      </tr>
      <tr>
          <td>Amber ff14SB</td>
          <td>6.33</td>
          <td>5.72</td>
          <td>8.44</td>
      </tr>
      <tr>
          <td>Amber ff19SB</td>
          <td>5.23</td>
          <td>4.79</td>
          <td>6.81</td>
      </tr>
  </tbody>
</table>
<p>The authors attribute MB-nrg&rsquo;s residual high-energy error to terminal methyl groups approaching the backbone in conformations where non-bonded 1-mer interactions are modeled by the simple LJ repulsion rather than an explicit PIP.</p>
<p><strong>Harmonic vibrations.</strong> Normal modes for the $m_1$ and $m_4$ alanine dipeptide conformers, computed by diagonalizing the Hessian, match RI-MP2/def2-TZVP references with mean deviations of 17.41 cm$^{-1}$ and 21.07 cm$^{-1}$ across all 60 modes. The authors acknowledge that some of this discrepancy reflects differences in theoretical levels (MB-nrg is trained to CCSD(T)/aug-cc-pVTZ, while the reference normal modes are computed at RI-MP2/def2-TZVP).</p>
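<p>For reference, extracting harmonic frequencies from a Cartesian Hessian is a few lines of numpy. The sketch below is generic rather than the authors' script; it assumes a Hessian in kcal mol$^{-1}$ Å$^{-2}$ and masses in amu, with the unit conversion written out explicitly.</p>
<pre><code class="language-python">import numpy as np

# conversion: sqrt(kcal mol^-1 amu^-1 Angstrom^-2) -&gt; wavenumbers (cm^-1)
KCAL, NA, AMU, ANG = 4184.0, 6.02214076e23, 1.66053907e-27, 1e-10
C_CM = 2.99792458e10                      # speed of light in cm/s
FREQ_CM = np.sqrt(KCAL / NA / AMU / ANG**2) / (2.0 * np.pi * C_CM)   # about 108.6

def harmonic_frequencies(hessian, masses):
    """Harmonic frequencies (cm^-1) from a Cartesian Hessian.

    hessian: (3N, 3N) second derivatives in kcal mol^-1 Angstrom^-2.
    masses:  (N,) atomic masses in amu.
    """
    m = np.repeat(masses, 3)
    mw = hessian / np.sqrt(np.outer(m, m))          # mass-weighted Hessian
    evals = np.linalg.eigvalsh(mw)
    evals = evals[evals &gt; 1e-6]                   # drop translations/rotations
    return FREQ_CM * np.sqrt(evals)
</code></pre>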
<p><strong>Free-energy surfaces.</strong> Well-tempered metadynamics at 300 K produces 2D free-energy surfaces over $(\varphi, \psi)$. MB-nrg yields a smoother FES whose extrema line up with the DLPNO-CCSD(T) reference PES. Amber ff14SB and ff19SB remain reasonable near the low-energy $m_1$ and $m_2$ minima but systematically overestimate the barriers near $M_1$, $M_2$, and $M_3$, which the authors argue artificially confines the dipeptide and suppresses conformational transitions.</p>
<p><strong>Secondary structure in AceAla$_9$Nme.</strong> In 600 ps NVT MD starting from a fully extended structure, the <a href="https://en.wikipedia.org/wiki/STRIDE_(algorithm)">STRIDE algorithm</a> tracks residue-level secondary structures. Amber ff14SB and ff19SB collapse into $\alpha$-helices at roughly 40 ps and 80 ps, respectively, with ff19SB remaining especially rigid. MB-nrg takes about 100 ps before helices begin to form and then exhibits continuous oscillations between $3_{10}$- and $\alpha$-helical conformations. Ramachandran plots over the nine alanine residues show MB-nrg exploring the &ldquo;bridge&rdquo; region ($\varphi &lt; 0°$, $-20° \leq \psi \leq 20°$) associated with $3_{10}$-helices and sampling the left-handed $\alpha_L$ region that Amber rarely visits. The authors tie this flexibility to experimental observations of alanine-rich peptides in the gas phase and to similar predictions from GEMS and MACE-OFF.</p>
<h2 id="transferability-without-whole-chain-training-data">Transferability Without Whole-Chain Training Data</h2>
<p>The paper demonstrates that a modular, bottom-up PEF built from functional-group $n$-mers can reach CCSD(T) accuracy for polyalanine in the gas phase without ever training on whole-chain data. Truncating explicit data-driven terms at the 3-body level appears to balance cost and fidelity, with long-range effects handled by many-body polarization in $V_{\mathrm{elec}}$ and by Amber-derived repulsion between distant 1-mers. The 2D PES, harmonic frequencies, free-energy surface, and secondary-structure dynamics each validate a different facet of the model.</p>
<p>The authors are explicit about limitations. The current PEF applies only to gas-phase polyalanine; solvent effects and other amino acids remain open. The Lennard-Jones repulsion for non-bonded 1-mers is a placeholder for eventual 2-body PIPs that should capture short-range interactions during folding. Long-range hydrogen bonding in compact secondary structures (π-helices, $3_{10}$-helices, $\alpha$-helices) may produce non-negligible higher-order many-body contributions that the current 3-body truncation omits. The 2-body contribution from a dimer with one intervening monomer is currently folded into the 3-body term because of steric conflicts between capping hydrogens, and a systematic fix is flagged for future work. The authors position this paper as the first in a series (the &ldquo;I.&rdquo; in the title refers to &ldquo;Polyalanine in the Gas Phase&rdquo;) that will extend MB-nrg to broader biomolecular systems under physiological conditions. The follow-up, <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/">MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs</a>, adds explicit 1-mer/water 2-body PIPs and benchmarks alanine dipeptide solvation.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>Per $n$-mer pools from PBMetaD in LAMMPS/PLUMED</td>
          <td>200,000 configurations each, reduced to ~10-20k via Mini-batch K-means</td>
          <td>OPLS/CM1A and Amber ff14SB sampled at 300 K, 500 K, 700 K</td>
      </tr>
      <tr>
          <td>Training labels</td>
          <td>DLPNO-CCSD(T)/aug-cc-pVTZ in ORCA</td>
          <td>14 unique $n$-mer types</td>
          <td>Domain-based local pair natural orbital approximation to canonical CCSD(T)</td>
      </tr>
      <tr>
          <td>Test</td>
          <td>Held-out $n$-mer configurations</td>
          <td>~1,000 per $n$-mer</td>
          <td>Same clustering protocol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide benchmark</td>
          <td>Ramachandran grid at 10° spacing, RI-MP2/def2-TZVP geometries</td>
          <td>1,296 grid points (approximate)</td>
          <td>Single-point energies at DLPNO-CCSD(T)/aug-cc-pVTZ, ff14SB, ff19SB, MB-nrg</td>
      </tr>
      <tr>
          <td>AceAla$_9$Nme dynamics</td>
          <td>600 ps NVT MD from fully extended start</td>
          <td>Single trajectory per model</td>
          <td>STRIDE for secondary-structure assignment</td>
      </tr>
  </tbody>
</table>
<p>Per the Data Availability statement, &ldquo;any data generated and analyzed in this study are available from the authors upon request.&rdquo; No public release is announced in the text.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Many-body expansion of the energy with 1-, 2-, and 3-body data-driven terms.</li>
<li>Permutationally invariant polynomials in Morse-exponential variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$, symmetrized over chemically equivalent atoms.</li>
<li>&ldquo;Ghost&rdquo; H-capping at cleaved covalent bonds, with fixed C-H (1.14 Å) and N-H (1.09 Å) bond lengths and per-$n$-mer optimized-structure referencing.</li>
<li>Non-linear parameters fit by simplex minimization, linear coefficients by ridge regression with $\Gamma = 0.0005$.</li>
<li>Low-energy weighting in the loss through $w_k = (\delta E / (\varepsilon^{n\mathrm{B}}(k) - \varepsilon^{n\mathrm{B}}_{\min} + \delta E))^2$.</li>
<li>Tang-Toennies damped dispersion with XDM-derived $C_6$ and damping parameters, Thole-modified many-body polarization, and LJ repulsion borrowed from Amber ff14SB.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>14 PIPs total covering three 1-mer types, five 2-mer types, and six 3-mer types. Polynomial degree is 5 for the -CH- and CH$_3$- 1-mers, and 3 for the -CONH- 1-mer together with all 2-mers and 3-mers. Term counts range from 635 (-CH-, CH$_3$-) to 2871 (-CONH-CH-CONH-).</li>
<li>MB-nrg PEF implemented in the MBX code and exercised through LAMMPS and PLUMED.</li>
<li>Training set sizes per $n$-mer range from roughly 12,000 to 47,000 configurations (the -CONH- 1-mer dataset is the largest at 47,438).</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MB-nrg</th>
          <th>Amber ff14SB</th>
          <th>Amber ff19SB</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>$n$-mer training RMSD</td>
          <td>$\leq 0.35$ kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>$n$-mer test RMSD</td>
          <td>$\leq 0.47$ kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide 2D PES RMSD (overall)</td>
          <td>1.27 kcal/mol</td>
          <td>6.33 kcal/mol</td>
          <td>5.23 kcal/mol</td>
      </tr>
      <tr>
          <td>Same, $\leq 10$ kcal/mol region</td>
          <td>1.18 kcal/mol</td>
          <td>5.72 kcal/mol</td>
          <td>4.79 kcal/mol</td>
      </tr>
      <tr>
          <td>Same, $&gt; 10$ kcal/mol region</td>
          <td>1.59 kcal/mol</td>
          <td>8.44 kcal/mol</td>
          <td>6.81 kcal/mol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide $m_1$ normal-mode mean deviation vs RI-MP2/def2-TZVP</td>
          <td>17.41 cm$^{-1}$</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide $m_4$ normal-mode mean deviation vs RI-MP2/def2-TZVP</td>
          <td>21.07 cm$^{-1}$</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>AceAla$_9$Nme helix-formation onset (from extended start)</td>
          <td>~100 ps ($\alpha$/$3_{10}$ mix)</td>
          <td>~40 ps ($\alpha$)</td>
          <td>~80 ps ($\alpha$)</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Computational resources came from the Air Force Office of Scientific Research (FA9550-20-1-0351), NSF award 2311260, the DoD High Performance Computing Modernization Program, the San Diego Supercomputer Center via ACCESS allocation CHE240114, and NERSC (contract DE-AC02-05CH11231, award BES-ERCAP0030920). Specific wall-clock and node-hour figures are not reported in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, R., Bull-Vulpe, E. F., Pan, Y., &amp; Paesani, F. (2025). Data-Driven Many-Body Simulations of Biomolecules with CCSD(T) Accuracy: I. Polyalanine in the Gas Phase. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv-2025-b05k5">https://doi.org/10.26434/chemrxiv-2025-b05k5</a></p>
<p><strong>Publication</strong>: ChemRxiv preprint, 25 March 2025.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/paesanilab/MBX">MBX software (Paesani group)</a></li>
<li><a href="https://github.com/paesanilab/MB-Fit">MB-Fit (training pipeline)</a></li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{zhou2025data,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Data-Driven Many-Body Simulations of Biomolecules with CCSD(T) Accuracy: I. Polyalanine in the Gas Phase}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhou, Ruihan and Bull-Vulpe, Ethan F. and Pan, Yuanhui and Paesani, Francesco}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2025-b05k5}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">howpublished</span>=<span style="color:#e6db74">{\url{https://doi.org/10.26434/chemrxiv-2025-b05k5}}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MB-nrg in Solution: Polyalanine in Water with CCSD(T) PEFs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/</link><pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-water/</guid><description>Zhou and Paesani extend MB-nrg to peptide-water interactions, training 1-mer-water 2-body PIPs on DLPNO-CCSD(T) and benchmarking alanine dipeptide solvation.</description><content:encoded><![CDATA[<h2 id="extending-mb-nrg-from-gas-phase-polyalanine-to-aqueous-solution">Extending MB-nrg from Gas-Phase Polyalanine to Aqueous Solution</h2>
<p>This is a <strong>Method</strong> paper, the second installment in Zhou and Paesani&rsquo;s MB-nrg-for-biomolecules series. Paper I (covered in <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/">MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</a>) decomposed gas-phase polyalanine into functional-group $n$-mers and fit permutationally invariant polynomials (PIPs) to DLPNO-CCSD(T)/aug-cc-pVTZ reference data. This sequel adds the missing piece: explicit, machine-learned 2-body interactions between every polyalanine functional-group 1-mer and a water molecule, trained on the same <a href="https://en.wikipedia.org/wiki/Coupled_cluster">coupled-cluster</a> reference. The resulting PEF couples the gas-phase intramolecular MB-nrg term, the MB-pol water model, and a new MB-nrg ala-water cross term within a single modular many-body decomposition.</p>
<h2 id="why-empirical-force-fields-struggle-with-hydrated-peptides">Why Empirical Force Fields Struggle with Hydrated Peptides</h2>
<p>Biomolecular function in water emerges from a coupling of intramolecular flexibility with solvent-mediated interactions, including hydrogen-bond networks, cooperative polarization, dispersion, and short-range exchange-repulsion. Empirical force fields such as AMBER, CHARMM, and OPLS approximate the multidimensional PES with pairwise-additive analytical terms whose parameters are tuned to experimental observables or low-level quantum data. The authors note that this functional form leads to systematic errors in predicted conformational ensembles for short peptides and <a href="https://en.wikipedia.org/wiki/Intrinsically_disordered_proteins">intrinsically disordered proteins (IDPs)</a>, with reported overpopulation of polyproline II (pPII) basins and antiparallel $\beta$ regions for alanine residues, plus underrepresentation of the transitional $\beta$ basin compared to experiment.</p>
<p>Polarizable force fields recover dielectric and hydration trends through induced dipoles, but still lean on empirical functional forms and miss short-range quantum effects (charge transfer, charge penetration, exchange-repulsion) that arise from electron-density overlap. <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">Machine-learned force fields</a> like MACE-OFF, GEMS, and FeNNix-Bio1 have improved bio-organic accuracy, but they still depend critically on the diversity and quality of training data, struggle to decompose energies into physically interpretable components, and most rely on DFT references that inherit delocalization errors and incomplete long-range correlation. Local descriptors common to MLFFs also limit treatment of long-range electrostatics and many-body correlations, both essential for biomolecular solvation.</p>
<p>The MB-nrg formalism, originally developed for water and small molecules and recently extended to alkanes and gas-phase polyalanine, offers an alternative: a rigorous many-body expansion (MBE) of the energy combined with both data-driven $n$-body PIPs and physics-based long-range terms. Paper II asks whether this modular gas-phase scaffold can be cleanly extended to aqueous environments by adding only short-range peptide-water 2-body PIPs.</p>
<h2 id="a-modular-mb-nrg-pef-for-polyalanine-in-water">A Modular MB-nrg PEF for Polyalanine in Water</h2>
<p>The MBE writes the total energy of a system of $N$ 1-mers as</p>
<p>$$
E_N(1, \dots, N) = \sum_{i=1}^{N} \varepsilon^{1\mathrm{B}}(i) + \sum_{i&lt;j}^{N} \varepsilon^{2\mathrm{B}}(i,j) + \sum_{i&lt;j&lt;k}^{N} \varepsilon^{3\mathrm{B}}(i,j,k) + \dots + \varepsilon^{N\mathrm{B}}(1, \dots, N)
$$</p>
<p>with each $n$-body term defined recursively as the $n$-mer energy minus all lower-order contributions. The MBE converges quickly for insulating molecular systems with large electronic band gaps (such as water and peptides), so explicit PIP corrections are typically truncated at $n \leq 4$, with higher-order effects absorbed into many-body polarization.</p>
<p>For polyalanine in water, the total potential is partitioned into three modular blocks:</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}}^{\mathrm{tot}} = V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}} + V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}} + V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}
$$</p>
<p>where $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}}$ is the gas-phase intramolecular polyalanine PEF from Paper I, $V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}}$ is the MB-pol water model, and $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}$ is the new peptide-water cross term. The cross term itself follows the MB-nrg recipe of splitting machine-learned and physics-based contributions:</p>
<p>$$
V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}} = V_{\mathrm{ML}} + V_{\mathrm{phys}}
$$</p>
<p>with $V_{\mathrm{ML}} = V_{\mathrm{ML}}^{2\mathrm{B}}$ (only 2-body PIPs in this implementation) and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}}$. The 2-body machine-learned term sums switched PIPs over every (1-mer, water) dimer:</p>
<p>$$
V_{\mathrm{ML}}^{2\mathrm{B}} = \sum_{i=1}^{N} s^{2\mathrm{B}}(\mathrm{M}_i, \mathrm{WAT})\, V_{\mathrm{PIP}}^{2\mathrm{B}}(\mathrm{M}_i, \mathrm{WAT})
$$</p>
<p>where $\mathrm{M}_i$ is the $i$-th polyalanine functional-group 1-mer (-CH-, CH$_3$-, or -CONH-), WAT is a water molecule, and $s^{2\mathrm{B}}$ is a cosine switching function</p>
<p>$$
s^{2\mathrm{B}}(x) = \begin{cases} 1 &amp; x &lt; 0 \\ (1 + \cos(\pi x))/2 &amp; 0 \leq x &lt; 1 \\ 0 &amp; 1 \leq x \end{cases}, \quad x = \frac{R - R_{\mathrm{in}}}{R_{\mathrm{out}} - R_{\mathrm{in}}}
$$</p>
<p>that smoothly attenuates the short-range PIP beyond a defined distance to preserve energy conservation in MD. The physics-based block uses a Thole-modified self-consistent polarization model (inherited from MB-pol) for $V_{\mathrm{elec}}$ and a Tang-Toennies damped dispersion sum</p>
<p>$$
V_{\mathrm{disp}} = -\sum_{\substack{\alpha \in 1\text{-mers} \\ \beta \in \mathrm{water}}} f(\mathrm{b}_{\alpha\beta} R_{\alpha\beta})\, \frac{C_{6, \alpha\beta}}{R_{\alpha\beta}^{6}}
$$</p>
<p>with $C_{6, \alpha\beta}$ coefficients and atomic polarizabilities derived from the exchange-hole dipole moment (XDM) method, and atomic charges fit to reproduce the permanent multipole moments of each $n$-mer&rsquo;s optimized structure.</p>
<p>The authors stress that explicit 3-body and higher peptide-water PIPs are deliberately omitted in this first implementation; their effects are absorbed into the classical polarization term. They flag that strongly hydrogen-bonded or cooperative configurations may benefit from adding higher-body corrections in future work, following the precedent of MB-pol(2023) for water.</p>
<h2 id="training-set-generation-and-dlpno-ccsdt-reference-data">Training Set Generation and DLPNO-CCSD(T) Reference Data</h2>
<p>Training pools for the three 1-mer-water dimers (CH$_3$-H$_2$O, -CH&ndash;H$_2$O, -CONH&ndash;H$_2$O) extend the <a href="https://en.wikipedia.org/wiki/Metadynamics">parallel-bias metadynamics with partitioned families (PBMetaD+PFs)</a> protocol from Paper I. Covalent boundaries are capped with &ldquo;ghost&rdquo; hydrogens at fixed C-H (1.14 Å) and N-H (1.09 Å) distances to preserve closed-shell character; each 2-body energy is referenced to the corresponding optimized capped 1-mer-water geometry to remove constant offsets.</p>
<p>PBMetaD simulations are run in LAMMPS interfaced with PLUMED, using Amber ff14SB for the alanine 1-mers and TIP4P/2005f for water. Collective variables span all heavy-atom bonds, angles, and dihedrals in each dimer. To target distinct interaction regimes, three separate biased runs apply upper and lower walls on the 1-mer/water center-of-mass distance: 0-4 Å (short-range repulsion), 4-7 Å (mid-range attraction), and 7-10 Å (long-range orientation-dependent interactions). Each dimer yields about 600,000 configurations, reduced to roughly 40,000 training and 2,000 test configurations per type by K-means clustering.</p>
<p>Reference 2-body energies are computed at the DLPNO-CCSD(T)/aug-cc-pVTZ level in ORCA, using the aug-cc-pVTZ/C auxiliary basis, the RIJCOSX approximation, TightSCF, TightPNO, and the PModel pair-selection option. The counterpoise method corrects every 2-body energy for <a href="https://en.wikipedia.org/wiki/Basis_set_superposition_error">basis set superposition error</a>.</p>
<p>Each PIP minimizes a weighted, ridge-regularized least-squares objective:</p>
<p>$$
\chi^2 = \sum_{k \in \mathcal{S}} w_k \left[ V^{2\mathrm{B}}(k) - \varepsilon^{2\mathrm{B}}(k) \right]^2 + \Gamma^2 \sum_l c_l^2
$$</p>
<p>with $\Gamma = 0.0005$ throughout. Training weights bias the fit toward low-energy configurations,</p>
<p>$$
w_k = \left( \frac{\delta E}{\varepsilon^{2\mathrm{B}}(k) - \varepsilon_{\mathrm{min}}^{2\mathrm{B}} + \delta E} \right)^2
$$</p>
<p>with $\delta E = 40$ kcal/mol for all 1-mer-water pairs. MB-Fit handles the optimization, combining simplex minimization for non-linear parameters (Morse decay constants) with ridge regression for the linear coefficients.</p>
<p>Table 1 reports the PIP specifications. All three PIPs use polynomial degree 3 with a complete, unscreened basis. The -CH- and CH$_3$- dimers each require 710 symmetrized terms; the chemically richer -CONH- dimer requires 1,267 terms to capture its dipolar character and directional hydrogen bonding. Training-set sizes range from 41,781 to 43,174 configurations.</p>
<table>
  <thead>
      <tr>
          <th>1-mer type</th>
          <th>PIP degree</th>
          <th>PIP terms</th>
          <th>Training configs</th>
          <th>Train RMSD (kcal/mol)</th>
          <th>Test RMSD (kcal/mol)</th>
          <th>Train MAE</th>
          <th>Test MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>-CH-</td>
          <td>3</td>
          <td>710</td>
          <td>43,174</td>
          <td>0.07</td>
          <td>0.08</td>
          <td>0.06</td>
          <td>0.06</td>
      </tr>
      <tr>
          <td>CH$_3$-</td>
          <td>3</td>
          <td>710</td>
          <td>43,172</td>
          <td>0.08</td>
          <td>0.08</td>
          <td>0.05</td>
          <td>0.05</td>
      </tr>
      <tr>
          <td>-CONH-</td>
          <td>3</td>
          <td>1,267</td>
          <td>41,781</td>
          <td>0.18</td>
          <td>0.20</td>
          <td>0.13</td>
          <td>0.16</td>
      </tr>
  </tbody>
</table>
<p>All RMSDs sit below 0.20 kcal/mol on both train and test splits, well below chemical accuracy (1 kcal/mol).</p>
<h2 id="validation-dimer-scans-free-energy-surfaces-and-hydration">Validation: Dimer Scans, Free-Energy Surfaces, and Hydration</h2>
<p>The authors stage four validation studies of increasing complexity, each touching a distinct facet of the new PEF.</p>
<p><strong>Alanine dipeptide-water dimer scans.</strong> One-dimensional scans probe the interaction energy along four hydrogen-bonding coordinates of an alanine dipeptide-water dimer: O$_1$-H$_w$, H$_1$-O$_w$, O$_2$-H$_w$, and H$_2$-O$_w$, where subscripts 1 and 2 mark the acetyl and N-methyl termini. The dipeptide is constrained to four representative <a href="https://en.wikipedia.org/wiki/Ramachandran_plot">Ramachandran</a> conformations: C5 ($\varphi = -150°$, $\psi = 150°$), pPII ($\varphi = -80°$, $\psi = 150°$), C7$_{\mathrm{eq}}$ ($\varphi = -80°$, $\psi = 70°$), and right-handed $\alpha$-helix $\alpha_R$ ($\varphi = -80°$, $\psi = -30°$). MB-nrg closely tracks the DLPNO-CCSD(T)/aug-cc-pVTZ reference curves across all 16 (4 conformation $\times$ 4 site) scans, despite never seeing the full dipeptide-water surface during training. Amber ff14SB/TIP3P and ff19SB/OPC underestimate hydrogen-bond depths and miss curvature near equilibrium, with the ff14SB/TIP3P combination yielding slightly better overall agreement than ff19SB/OPC even though TIP3P is the less accurate water model.</p>
<p>Two specific failure modes of the empirical force fields stand out. In the pPII conformation, both ff14SB and ff19SB predict significantly deeper interaction wells than the reference, overstabilizing several hydrogen bonds. In the H$_2$-O$_w$ scan of the $\alpha_R$ conformation, both empirical FFs exhibit a spurious 2.5-4.0 Å energy barrier that the authors trace to the simple Lennard-Jones repulsion between the acetyl carbonyl oxygen and water; MB-nrg and DLPNO-CCSD(T) instead show a smoothly decaying profile. The one MB-nrg deviation noted is the C5 H$_1$-O$_w$ scan in the 1.5-2.5 Å range, where MB-nrg predicts a slightly more attractive interaction than the reference. Here the H$_1$-O$_2$ distance is 2.3 Å and water acts simultaneously as acceptor at H$_1$ and donor to O$_2$, a cooperative pattern the authors expect would require explicit 2-mer-water or 3-mer-water terms to fully reproduce.</p>
<p><strong>Free-energy surface in explicit MB-pol water.</strong> Four-walker well-tempered metadynamics (WT-MetaD) simulations explore the conformational landscape of alanine dipeptide as a function of $(\varphi, \psi)$, biasing the central alanine residue&rsquo;s backbone dihedrals every 500 steps with 1.0 kJ/mol Gaussians of 11.46° width. The free-energy section reports 2.5 ns per replica across four parallel walkers (10 ns aggregate, matching the Figure 6 caption); the methods section states 8 ns total, an internal inconsistency in the paper. The MB-nrg FES recovers all major low-energy conformers identified by NMR and prior MP2/DFT studies: a global minimum at $\alpha_R$, additional local minima in C5, $\beta_2$, and $\alpha_L$, and a metastable pPII basin. The C7$_{\mathrm{eq}}$ minimum that dominates the gas-phase Ramachandran surface in Paper I is significantly destabilized in solution, consistent with experiment.</p>
<p>Quantitatively, MB-nrg predicts $\alpha_R$ and $\beta_2$ as isoenergetic global minima, with C5 about 3 kcal/mol higher in free energy. Prior DFT-with-implicit-solvation studies (Mironov et al., Yang and Honig) report C5, $\alpha_R$, and $\beta_2$ as nearly isoenergetic, and the authors note that the discrepancy may reflect the explicit MB-pol water treatment, residual DFT errors in the reference, or both. They flag a planned systematic benchmarking of MB-nrg PEFs for diverse polypeptides against both DFT and DLPNO-CCSD(T) data in future work. The Amber FESs over-stabilize pPII relative to C5/$\alpha_R$, contradicting experimental and DFT benchmarks; ff19SB/OPC also exhibits a spurious C7$_{\mathrm{eq}}$ minimum that is absent from MB-nrg.</p>
<p><strong>Hydration radial distribution functions.</strong> Site-site RDFs at 300 K for the same hydrogen-bond contacts (O$_1$-H$_w$, O$_2$-H$_w$, H$_1$-O$_w$, H$_2$-O$_w$) are computed from NVT MD trajectories. All three models reproduce well-defined first-shell peaks near 2.0 Å. For the O-H$_w$ pairs, MB-nrg shows a broader, slightly right-shifted second-shell peak, indicating less rigid water structure beyond the first shell. The amide-hydrogen RDFs are nearly identical between ff14SB/TIP3P and ff19SB/OPC, while MB-nrg reveals subtle first-shell shifts (shorter H$_1$-O$_w$, longer H$_2$-O$_w$) and weaker, less-defined second-shell features near 3.7-3.8 Å that are absent from the empirical force fields and consistent with prior ab initio MD on alanine dipeptide.</p>
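<p>For readers reproducing this kind of analysis, a site-site RDF reduces to distance histograms normalized by ideal-gas shell counts. A plain-numpy sketch (cubic box, one solute site against all water sites per frame; not the authors' analysis code):</p>
<pre><code class="language-python">import numpy as np

def site_site_rdf(solute_site, water_sites, box, r_max=6.0, n_bins=120):
    """g(r) between one solute site and a set of water sites.

    solute_site: (n_frames, 3) position of the solute site per frame.
    water_sites: (n_frames, n_water, 3) positions of the water sites per frame.
    box:         cubic box edge length (Angstrom), for minimum-image distances.
    """
    edges = np.linspace(0.0, r_max, n_bins + 1)
    hist = np.zeros(n_bins)
    n_frames, n_water = water_sites.shape[0], water_sites.shape[1]
    for a, b in zip(solute_site, water_sites):
        d = b - a
        d -= box * np.round(d / box)                 # minimum-image convention
        hist += np.histogram(np.linalg.norm(d, axis=1), bins=edges)[0]
    shell_vol = 4.0 / 3.0 * np.pi * np.diff(edges**3)
    density = n_water / box**3                       # ideal-gas reference density
    g = hist / (n_frames * density * shell_vol)
    return 0.5 * (edges[1:] + edges[:-1]), g
</code></pre>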
<h2 id="a-modular-path-to-chemically-accurate-biomolecular-simulations">A Modular Path to Chemically Accurate Biomolecular Simulations</h2>
<p>Across the four benchmarks, the same picture emerges: a modular, bottom-up MB-nrg PEF built from functional-group $n$-mers and trained only on isolated 1-mer-water dimers can reach DLPNO-CCSD(T) accuracy for both energetic and structural observables of alanine dipeptide in explicit water. The decomposition into a gas-phase intramolecular term, an MB-pol water model, and an MB-nrg cross term keeps each piece interpretable and individually replaceable; the gas-phase polyalanine PEF from Paper I drops in unchanged, and the new ala-water PIPs were fit without ever seeing the full alanine dipeptide-water PES.</p>
<p>The authors are explicit about limitations:</p>
<ul>
<li>The cross term currently includes only 2-body PIPs (one 1-mer with one water). Higher-body peptide-water terms ($n &gt; 2$) are folded into the classical polarization, which the authors expect will be inadequate for strongly cooperative configurations such as the C5 H$_1$-O$_w$ scan where one water bridges H$_1$ and O$_2$.</li>
<li>Quantitative differences between the MB-nrg FES and prior implicit-solvation DFT studies (relative depths of $\alpha_R$, $\beta_2$, and C5) remain to be reconciled through systematic benchmarking against higher-level reference data.</li>
<li>Only polyalanine is considered. The framework is designed to generalize to other amino acids and side-chain-water interactions, but sequence- and side-chain-specific PIPs are still to be fit.</li>
<li>No public release of the parameterized PEF or training data is announced; the data availability statement says &ldquo;available from the authors upon request.&rdquo;</li>
</ul>
<p>The paper positions MB-nrg as a transferable, interpretable strategy for chemically accurate biomolecular simulations in solution, with future work aimed at heteropolypeptides and explicit higher-order many-body cross terms.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training pools</td>
          <td>PBMetaD+PFs in LAMMPS/PLUMED</td>
          <td>~600,000 configs per dimer, reduced to ~40,000</td>
          <td>ff14SB for alanine 1-mers, TIP4P/2005f for water; 300 K, 500 K, 700 K</td>
      </tr>
      <tr>
          <td>Distance regimes</td>
          <td>Walls on 1-mer/water COM distance</td>
          <td>0-4, 4-7, and 7-10 Å</td>
          <td>Short-range repulsion, mid-range attraction, long-range orientation</td>
      </tr>
      <tr>
          <td>Training labels</td>
          <td>DLPNO-CCSD(T)/aug-cc-pVTZ in ORCA</td>
          <td>3 unique 1-mer-water dimer types</td>
          <td>RIJCOSX, TightSCF, TightPNO, PModel; counterpoise BSSE correction</td>
      </tr>
      <tr>
          <td>Test sets</td>
          <td>Held-out clustered configs</td>
          <td>~2,000 per dimer</td>
          <td>Same K-means clustering protocol</td>
      </tr>
      <tr>
          <td>Alanine dipeptide-water scans</td>
          <td>1D scans along 4 H-bond coordinates in 4 conformations</td>
          <td>16 scans total</td>
          <td>C5, pPII, C7$_{\mathrm{eq}}$, and $\alpha_R$ conformations</td>
      </tr>
      <tr>
          <td>Alanine dipeptide FES</td>
          <td>WT-MetaD on $\varphi$, $\psi$ in MB-pol water</td>
          <td>4 walkers, 2.5 ns each (10 ns total per the results section and Figure 6 caption; methods section states 8 ns)</td>
          <td>1.0 kJ/mol height, 11.46° width, deposition every 500 steps</td>
      </tr>
      <tr>
          <td>Hydration RDFs</td>
          <td>NVT MD at 300 K</td>
          <td>Single trajectory per model</td>
          <td>Same H-bond sites as the dimer scans</td>
      </tr>
  </tbody>
</table>
<p>Per the data availability statement, &ldquo;any data generated and analyzed in this study, including the MB-nrg PEF, are available from the authors upon request.&rdquo; The MBX engine is publicly available on <a href="https://github.com/paesanilab/MBX">GitHub</a> under a UC Regents custom license that grants free use for educational, research, and non-profit purposes but restricts commercial use. No public release of the new ala-water PIPs is announced in the text.</p>
<h4 id="artifacts-table">Artifacts table</h4>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/paesanilab/MBX">MBX</a></td>
          <td>Code</td>
          <td>UC Regents custom (academic/non-profit only; no SPDX-recognized OSS license)</td>
          <td>C++ many-body potential engine; runs the MB-nrg PEF via LAMMPS and PLUMED</td>
      </tr>
      <tr>
          <td><a href="https://github.com/paesanilab/MB-Fit">MB-Fit</a></td>
          <td>Code</td>
          <td>Check repo</td>
          <td>Training pipeline for PIP fitting; used to fit the new 1-mer-water PIPs</td>
      </tr>
      <tr>
          <td>MB-nrg ala-water PIPs (this paper)</td>
          <td>Model</td>
          <td>Not released</td>
          <td>&ldquo;Available from the authors upon request&rdquo; per the data availability statement</td>
      </tr>
      <tr>
          <td>DLPNO-CCSD(T) training/test sets</td>
          <td>Dataset</td>
          <td>Not released</td>
          <td>Same statement; ~600,000 raw configs per dimer reduced to ~40,000 train + ~2,000 test</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Many-body expansion of the energy partitioned into three modular blocks: $V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala}} + V_{\mathrm{MB\text{-}pol}}^{\mathrm{wat}} + V_{\mathrm{MB\text{-}nrg}}^{\mathrm{ala\text{-}wat}}$.</li>
<li>Cross term split into $V_{\mathrm{ML}}^{2\mathrm{B}}$ (PIPs over every 1-mer-water dimer) and $V_{\mathrm{phys}} = V_{\mathrm{elec}} + V_{\mathrm{disp}}$.</li>
<li>Permutationally invariant polynomials in Morse-exponential variables $\xi_{ij} = \exp(-k_{\tau(ij)} R_{ij})$, symmetrized over chemically equivalent atoms; same construction as the NMA-water PIPs.</li>
<li>Cosine switching function $s^{2\mathrm{B}}$ smoothly attenuates short-range PIPs between user-defined inner and outer cutoffs.</li>
<li>Dispersion: Tang-Toennies damped $C_6/R^6$ with XDM-derived coefficients and damping parameters.</li>
<li>Electrostatics: modified Thole model with self-consistent induced dipoles for many-body polarization; per-atom charges fit to reproduce permanent multipole moments of each $n$-mer&rsquo;s optimized structure.</li>
<li>Ghost-H capping at cleaved covalent boundaries with fixed C-H (1.14 Å) and N-H (1.09 Å) distances; per-dimer optimized-structure referencing.</li>
<li>Training with simplex minimization for non-linear parameters and ridge regression for linear coefficients via MB-Fit, with low-energy weighting and $\Gamma = 0.0005$, $\delta E = 40$ kcal/mol.</li>
<li>WT-MetaD with four parallel walkers for the alanine dipeptide FES.</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Three new 1-mer-water 2-body PIPs covering -CH-/H$_2$O, CH$_3$-/H$_2$O, and -CONH-/H$_2$O dimers.</li>
<li>All three PIPs use polynomial degree 3 with a complete, unscreened basis (no term screening).</li>
<li>Term counts: 710 for -CH-/H$_2$O and CH$_3$-/H$_2$O, 1,267 for -CONH-/H$_2$O.</li>
<li>Combined with the gas-phase polyalanine MB-nrg PEF from Paper I and the MB-pol water model, exercised through MBX, LAMMPS, and PLUMED.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>MB-nrg</th>
          <th>Amber ff14SB/TIP3P</th>
          <th>Amber ff19SB/OPC</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>-CH-/H$_2$O 2-body train/test RMSD</td>
          <td>0.07 / 0.08 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>CH$_3$-/H$_2$O 2-body train/test RMSD</td>
          <td>0.08 / 0.08 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>-CONH-/H$_2$O 2-body train/test RMSD</td>
          <td>0.18 / 0.20 kcal/mol</td>
          <td>n/a</td>
          <td>n/a</td>
      </tr>
      <tr>
          <td>Alanine dipeptide-water 1D scans (qualitative)</td>
          <td>Tracks DLPNO-CCSD(T) curves across 16 scans</td>
          <td>Underestimates H-bond depths; spurious $\alpha_R$ H$_2$-O$_w$ barrier</td>
          <td>Same shape as ff14SB/TIP3P</td>
      </tr>
      <tr>
          <td>Alanine dipeptide FES global minima</td>
          <td>Isoenergetic $\alpha_R$ and $\beta_2$; C5 ~3 kcal/mol higher</td>
          <td>Over-stabilizes pPII</td>
          <td>Over-stabilizes pPII; spurious C7$_{\mathrm{eq}}$ minimum</td>
      </tr>
      <tr>
          <td>O-H$_w$ second shell</td>
          <td>Broader, right-shifted; finer detail consistent with prior AIMD</td>
          <td>Sharper, less detail</td>
          <td>Sharper, less detail</td>
      </tr>
      <tr>
          <td>H-O$_w$ second shell</td>
          <td>Weak features near 3.7-3.8 Å</td>
          <td>Absent</td>
          <td>Absent</td>
      </tr>
  </tbody>
</table>
<p>Quantitative RMSD or KL-divergence values for the FES and RDF benchmarks are not reported in the main text.</p>
<h3 id="hardware">Hardware</h3>
<p>The authors acknowledge support from the Air Force Office of Scientific Research (FA9550-20-1-0351, theoretical development) and NSF (award 2311260, MBX implementation). Computational resources came from the DoD High Performance Computing Modernization Program, the San Diego Supercomputer Center via ACCESS allocation CHE240114, and NERSC (contract DE-AC02-05CH11231, award BES-ERCAP0030920). Specific wall-clock and node-hour figures are not reported in the main text.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Zhou, R., &amp; Paesani, F. (2025). Toward Chemical Accuracy in Biomolecular Simulations through Data-Driven Many-Body Potentials: II. Polyalanine in Water. <em>ChemRxiv</em>. <a href="https://doi.org/10.26434/chemrxiv-2025-j6cwv-v2">https://doi.org/10.26434/chemrxiv-2025-j6cwv-v2</a></p>
<p><strong>Publication</strong>: ChemRxiv preprint (version 2), 10 October 2025.</p>
<p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://github.com/paesanilab/MBX">MBX software (Paesani group)</a></li>
<li><a href="https://github.com/paesanilab/MB-Fit">MB-Fit (training pipeline)</a></li>
<li>Companion paper: <a href="/notes/chemistry/molecular-simulation/ml-potentials/mb-nrg-polyalanine-ccsdt/">MB-nrg: CCSD(T)-Accurate Potentials for Polyalanine</a> (Paper I)</li>
</ul>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{zhou2025toward,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Toward Chemical Accuracy in Biomolecular Simulations through Data-Driven Many-Body Potentials: II. Polyalanine in Water}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Zhou, Ruihan and Paesani, Francesco}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{ChemRxiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.26434/chemrxiv-2025-j6cwv-v2}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Conformation Autoencoder for 3D Molecules</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/conformation-autoencoder/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/conformation-autoencoder/</guid><description>An autoencoder that maps 3D molecular conformations to a continuous latent space using internal coordinates and graph attention networks.</description><content:encoded><![CDATA[<h2 id="a-method-for-learning-conformation-embeddings">A Method for Learning Conformation Embeddings</h2>
<p>This is a <strong>Method</strong> paper that introduces an autoencoder architecture for molecular conformations. The model converts the discrete 3D spatial arrangement of atoms (a conformation) in a given molecular graph into a continuous, fixed-size latent representation and back. The approach uses <a href="https://en.wikipedia.org/wiki/Z-matrix_(chemistry)">internal coordinates</a> (bond lengths, bond angles, dihedral angles) as input rather than Cartesian coordinates, making the representation inherently invariant to rigid translations and rotations.</p>
<h2 id="why-3d-structure-matters-for-molecular-modeling">Why 3D Structure Matters for Molecular Modeling</h2>
<p>Most deep learning methods for molecules operate on 2D representations: molecular graphs (atoms as nodes, bonds as edges) or <a href="/notes/chemistry/molecular-representations/notations/smiles/">SMILES</a> strings. These representations capture connectivity and atom types but do not encode the 3D spatial arrangement of atoms. Many important molecular properties, such as the ability to fit inside a protein binding pocket or the shape-dependent pharmacological effect, depend on the molecule&rsquo;s possible energetically stable spatial arrangements (conformations).</p>
<p>Prior work has addressed either property prediction from fixed conformations (SchNet, Schütt et al., 2018) or conformation generation for a given molecular graph (Mansimov et al., 2019; Simm and Hernández-Lobato, 2019). This paper addresses a different gap: learning a continuous, fixed-size embedding of a conformation that is independent of molecule size and atom ordering, enabling both reconstruction and generation.</p>
<h2 id="internal-coordinates-and-set-based-encoding">Internal Coordinates and Set-Based Encoding</h2>
<p>The core innovation is a two-part architecture: a conformation-independent graph neural network and a conformation-dependent encoder/decoder that operates on internal coordinates.</p>
<h3 id="internal-coordinate-representation">Internal Coordinate Representation</h3>
<p>Instead of Cartesian coordinates, conformations are represented as a set of internal coordinates:</p>
<p>$$
\Xi = (\mathcal{D}, \Phi, \Psi)
$$</p>
<p>where $\mathcal{D} = \{d_1, \ldots, d_{N_\mathcal{D}}\}$ are bond lengths, $\Phi = \{\phi_1, \ldots, \phi_{N_\Phi}\}$ are bond angles, and $\Psi = \{\psi_1, \ldots, \psi_{N_\Psi}\}$ are dihedral angles. This representation is invariant to rotations and rigid translations and can always be converted to and from Cartesian coordinates.</p>
<h3 id="molecular-graph-encoder">Molecular Graph Encoder</h3>
<p>A Graph Neural Network extracts conformation-independent node embeddings from the molecular graph. The molecular graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ uses node features $v_i \in \mathbb{R}^{F_v}$ encoding atom properties (element type, charge) and edge features $\mathbf{e}_{i,j} \in \mathbb{R}^{F_e}$ encoding bond type (single, double, triple, or aromatic). The architecture combines an edge-conditioned convolution (EConv) layer to encode bond-type information with multiple Graph Attention Network (GAT) layers:</p>
<p>$$
\mathbf{h}_i^l = \mathbf{GAT}^{l-1} \circ \cdots \circ \mathbf{GAT}^1 \circ \text{EConv}(\mathbf{h}_i^0)
$$</p>
<p>where $\mathbf{h}_i^0 = v_i \in \mathbb{R}^{F_v}$ are the initial atom features. The GAT attention coefficients are:</p>
<p>$$
\alpha_{i,j} = \frac{\exp\left(\sigma\left(\mathbf{a}^T [\boldsymbol{\Theta}\mathbf{h}_i | \boldsymbol{\Theta}\mathbf{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}(i) \cup \{i\}} \exp\left(\sigma\left(\mathbf{a}^T [\boldsymbol{\Theta}\mathbf{h}_i | \boldsymbol{\Theta}\mathbf{h}_k]\right)\right)}
$$</p>
<p>Each GAT layer updates node embeddings using the attention weights:</p>
<p>$$
\mathbf{h}'_i = \alpha_{i,i}\boldsymbol{\Theta}\mathbf{h}_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\boldsymbol{\Theta}\mathbf{h}_j
$$</p>
<p>The EConv layer incorporates edge (bond-type) information via a learned filter:</p>
<p>$$
\mathbf{h}'_i = \boldsymbol{\Theta}\mathbf{h}_i + \sum_{j \in \mathcal{N}(i)} \mathbf{h}_j \cdot \mathrm{f}_{\boldsymbol{\Theta}}(\mathbf{e}_{i,j})
$$</p>
<p>where $\mathrm{f}_{\boldsymbol{\Theta}}$ is a multi-layer perceptron.</p>
<h3 id="permutation-invariant-conformation-encoder">Permutation-Invariant Conformation Encoder</h3>
<p>The conformation encoder uses a Deep Sets-style architecture (Zaheer et al., 2017) to achieve permutation invariance. Three separate neural networks encode each type of internal coordinate, conditioned on the corresponding node embeddings:</p>
<p>$$
z_\Xi = \frac{1}{N_\mathcal{D} + N_\Phi + N_\Psi} \left(\sum_{d \in \mathcal{D}} \rho_\Theta^{(\mathcal{D})}(\mathcal{H}, d) + \sum_{\phi \in \Phi} \rho_\Theta^{(\Phi)}(\mathcal{H}, \phi) + \sum_{\psi \in \Psi} \rho_\Theta^{(\Psi)}(\mathcal{H}, \psi)\right)
$$</p>
<p>Each encoding function $\rho_\Theta$ takes both the internal coordinate value and the node embeddings of the involved atoms as input. The resulting conformation embedding $z_\Xi \in \mathbb{R}^{F_z}$ has a fixed dimensionality regardless of molecule size.</p>
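<p>A hedged PyTorch sketch of this Deep-Sets pooling follows: each $\rho$ network consumes a coordinate value together with the embeddings of the atoms that define it, and the mean over all coordinates gives the fixed-size $z_\Xi$. The index-tensor interface and hidden sizes are assumptions made for illustration, not the paper's implementation.</p>
<pre><code class="language-python">import torch
from torch import nn

class ConformationEncoder(nn.Module):
    def __init__(self, node_dim, latent_dim):
        super().__init__()
        def rho(n_atoms):   # one network per internal-coordinate type
            return nn.Sequential(
                nn.Linear(n_atoms * node_dim + 1, latent_dim), nn.ReLU(),
                nn.Linear(latent_dim, latent_dim),
            )
        self.rho_d, self.rho_phi, self.rho_psi = rho(2), rho(3), rho(4)

    def forward(self, h, bond_idx, d, angle_idx, phi, dihedral_idx, psi):
        # h: (n_atoms, node_dim); *_idx: long tensors of shape (N, 2), (N, 3), (N, 4)
        d_in   = torch.cat([h[bond_idx].flatten(1),     d.unsqueeze(1)],   dim=1)
        phi_in = torch.cat([h[angle_idx].flatten(1),    phi.unsqueeze(1)], dim=1)
        psi_in = torch.cat([h[dihedral_idx].flatten(1), psi.unsqueeze(1)], dim=1)
        parts = torch.cat(
            [self.rho_d(d_in), self.rho_phi(phi_in), self.rho_psi(psi_in)], dim=0)
        return parts.mean(dim=0)   # z_Xi: fixed size regardless of molecule size
</code></pre>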
<h3 id="conformation-decoder-and-loss">Conformation Decoder and Loss</h3>
<p>Three decoder networks $\delta_\Theta^{(\mathcal{D})}$, $\delta_\Theta^{(\Phi)}$, and $\delta_\Theta^{(\Psi)}$ reconstruct internal coordinates from the conformation embedding, conditioned on the node embeddings. The reconstruction loss is:</p>
<p>$$
\mathcal{C}_\Xi = \frac{1}{N_\mathcal{D}} \sum_{d \in \mathcal{D}} |d - \hat{d}|_2^2 + \frac{1}{N_\Phi} \sum_{\phi \in \Phi} |\phi - \hat{\phi}|_2^2 + \frac{1}{N_\Psi} \sum_{\psi \in \Psi} \min\left(|\psi - \hat{\psi}|_2^2, 2\pi - |\psi - \hat{\psi}|_2^2\right)
$$</p>
<p>The dihedral angle loss uses a periodic distance to account for angular periodicity. The model can be extended to a <a href="/notes/machine-learning/generative-models/autoencoding-variational-bayes/">variational autoencoder (VAE)</a> by applying the reparameterization trick from Kingma and Welling (2013).</p>
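<p>A minimal sketch of this loss in PyTorch, assuming predicted and target internal coordinates are already matched one-to-one:</p>
<pre><code class="language-python">import torch

def conformation_loss(d, d_hat, phi, phi_hat, psi, psi_hat):
    loss_d = ((d - d_hat) ** 2).mean()
    loss_phi = ((phi - phi_hat) ** 2).mean()
    # Dihedrals live on a circle: take the shorter of the two arc distances.
    diff = torch.abs(psi - psi_hat)
    loss_psi = (torch.minimum(diff, 2 * torch.pi - diff) ** 2).mean()
    return loss_d + loss_phi + loss_psi
</code></pre>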
<h2 id="conformer-generation-and-spatial-optimization-experiments">Conformer Generation and Spatial Optimization Experiments</h2>
<h3 id="dataset-and-training">Dataset and Training</h3>
<p>The model was trained on the PubChem3D dataset (Bolton et al., 2011), which contains organic molecules with up to 50 heavy atoms, each with multiple conformations generated by the OMEGA conformer-generation software.</p>
<h3 id="reconstruction-quality">Reconstruction Quality</h3>
<p>Upon convergence, the model reconstructs conformations with low RMSD to the input. The median energetic difference between input and reconstructed conformations is approximately 80 kcal/mol (evaluated using the <a href="https://en.wikipedia.org/wiki/Merck_molecular_force_field">MMFF94</a> forcefield via <a href="https://en.wikipedia.org/wiki/RDKit">RDKit</a>), indicating reconstructions that sit somewhat off the input local minima but contain no atom clashes.</p>
<h3 id="latent-space-structure">Latent Space Structure</h3>
<p>The learned latent space exhibits meaningful clustering: similar conformations map to nearby points, while distinct conformations separate. Principal component analysis of 200 conformations of a small molecule reveals clear conformational clusters in the first two principal components.</p>
<h3 id="conformer-generation-via-vae">Conformer Generation via VAE</h3>
<p>The variational autoencoder variant can sample diverse conformers from the learned distribution. Comparing the average inter-conformer RMSD (icRMSD) for 200 sampled conformers per molecule against the ETKDG algorithm (Riniker and Landrum, 2015) implemented in RDKit, the model achieves comparable diversity, with an average icRMSD only 0.07 Å higher than ETKDG&rsquo;s.</p>
<h3 id="multi-objective-molecular-optimization">Multi-Objective Molecular Optimization</h3>
<p>By combining the conformation embedding with a continuous molecular structure embedding (<a href="/notes/chemistry/molecular-representations/encoders/cddd-translation-molecular-descriptors/">CDDD</a>, Winter et al., 2019), the model enables joint optimization over both molecular graph and conformation. Using <a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">particle swarm optimization</a> (Kennedy and Eberhart, 1995) to maximize QED (drug-likeness, values between 0 and 1) and asphericity (deviation from spherical shape, values between 0 and 1), starting from aspirin (combined score 0.76), the method finds molecules with a combined score of 1.82 after 50 iterations.</p>
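<p>A minimal sketch of the combined scoring function with RDKit (the PSO loop over the joint CDDD/conformation embedding is omitted, and the exact scoring setup is an assumption):</p>
<pre><code class="language-python">from rdkit import Chem
from rdkit.Chem import AllChem, QED, Descriptors3D

def combined_score(smiles):
    """QED + asphericity, both in [0, 1], evaluated on one embedded conformation."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    AllChem.EmbedMolecule(mol, randomSeed=0)
    return QED.qed(mol) + Descriptors3D.Asphericity(mol)

print(combined_score("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin; the paper reports 0.76 as its starting score
</code></pre>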
<h2 id="compact-conformation-encoding-with-practical-applications">Compact Conformation Encoding with Practical Applications</h2>
<p>The conformation autoencoder produces fixed-size latent representations of molecular 3D structures that are invariant to molecule size, atom ordering, and rigid transformations. The key findings are:</p>
<ol>
<li><strong>Meaningful latent space</strong>: Conformational similarity is preserved in the embedding space, enabling clustering and interpolation.</li>
<li><strong>Diverse conformer generation</strong>: The VAE variant generates conformer ensembles with diversity comparable to established force-field-based methods.</li>
<li><strong>Joint optimization</strong>: Combining conformation and structure embeddings enables multi-objective optimization over both molecular graph and spatial arrangement.</li>
</ol>
<p>Limitations include the limited scope of the energy evaluation (MMFF94 only, with no comparison against quantum-mechanical references) and the proof-of-concept nature of the spatial optimization experiments. The approach also relies on the quality of the internal coordinate representation, which may lose information about ring conformations and other constrained geometries.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PubChem3D</td>
          <td>Multiple conformations per molecule</td>
          <td>Organic molecules, up to 50 heavy atoms</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>PubChem3D holdout</td>
          <td>Subset</td>
          <td>Same distribution as training</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Graph Neural Network: EConv + multiple GAT layers</li>
<li>Conformation encoder: Deep Sets architecture with three coordinate-specific encoders</li>
<li>VAE: Reparameterization trick for probabilistic sampling</li>
<li>Optimization: Particle Swarm Optimization for multi-objective design</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Conformation-independent: EConv + GAT layers for node embeddings</li>
<li>Conformation-dependent: Three encoder/decoder feed-forward networks per coordinate type</li>
<li>Latent dimension $F_z$ is fixed (exact value not specified in the workshop paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>Value</th>
          <th>Baseline</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Median energy difference</td>
          <td>~80 kcal/mol</td>
          <td>Input conformations</td>
          <td>MMFF94 forcefield</td>
      </tr>
      <tr>
          <td>icRMSD difference vs ETKDG</td>
          <td>+0.07 Angstrom</td>
          <td>ETKDG (RDKit)</td>
          <td>200 conformers per molecule</td>
      </tr>
      <tr>
          <td>Combined QED+asphericity</td>
          <td>1.82</td>
          <td>0.76 (aspirin)</td>
          <td>After 50 optimization iterations</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Hardware details are not specified in the workshop paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem3D</a></td>
          <td>Dataset</td>
          <td>Public domain</td>
          <td>NIH public database; conformations generated by OMEGA (Hawkins et al., 2010)</td>
      </tr>
      <tr>
          <td><a href="https://arxiv.org/abs/2101.01618">arXiv preprint</a></td>
          <td>Paper</td>
          <td>arXiv license</td>
          <td>6-page workshop paper, open access</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status: Partially Reproducible.</strong> The training dataset (PubChem3D) is publicly available, and the architecture is described in sufficient detail for reimplementation. No source code, pre-trained weights, or exact hyperparameters (latent dimension $F_z$, learning rate, number of GAT layers) are released. The workshop paper format (6 pages) limits the level of experimental detail provided.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Winter, R., Noé, F., &amp; Clevert, D.-A. (2020). Auto-Encoding Molecular Conformations. <em>Machine Learning for Molecules Workshop, NeurIPS 2020</em>.</p>
<p><strong>Publication</strong>: Machine Learning for Molecules Workshop at NeurIPS 2020</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{winter2021auto,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Auto-Encoding Molecular Conformations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Winter, Robin and No\&#39;{e}, Frank and Clevert, Djork-Arn\&#39;{e}}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2021}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2101.01618}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{cs.LG}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Atom-Density Representations for Machine Learning</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/atom-density-representations-ml/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/atom-density-representations-ml/</guid><description>A unified bra-ket framework connecting SOAP, Behler-Parrinello, and other atom-density representations for ML on molecules and materials.</description><content:encoded><![CDATA[<h2 id="a-unified-theory-of-atom-density-representations">A Unified Theory of Atom-Density Representations</h2>
<p>This is a <strong>Theory</strong> paper that provides a formal, basis-independent framework for constructing structural representations of atomic systems for machine learning. Rather than proposing a new representation, Willatt, Musil, and Ceriotti show that many popular approaches (SOAP power spectra, Behler-Parrinello symmetry functions, $n$-body kernels, and tensorial SOAP) are special cases of a single abstract construction based on smoothed atom densities and <a href="https://en.wikipedia.org/wiki/Haar_measure">Haar integration</a> over symmetry groups.</p>
<h2 id="the-challenge-of-representing-atomic-structures">The Challenge of Representing Atomic Structures</h2>
<p>Machine learning models for predicting molecular and materials properties require input representations that are (1) complete enough to distinguish structurally distinct configurations and (2) invariant to physical symmetries (translations, rotations, and permutations of identical atoms). This has led to a large and growing set of competing approaches: <a href="/posts/molecular-descriptor-coulomb-matrix/">Coulomb matrices</a>, symmetry functions, <a href="https://en.wikipedia.org/wiki/Radial_distribution_function">radial distribution functions</a>, wavelets, invariant polynomials, and many more.</p>
<p>The proliferation of representations makes it difficult to compare them on equal footing or to identify which design choices are fundamental and which are incidental. Internal-coordinate approaches (e.g., Coulomb matrices) are automatically translation- and rotation-invariant but require additional symmetrization over permutations, which can introduce derivative discontinuities when done via sorting. Density-based approaches such as radial distribution functions and SOAP avoid these discontinuities by working with smooth density fields, but their theoretical connections to one another have not been made explicit.</p>
<h2 id="dirac-notation-for-atomic-environments">Dirac Notation for Atomic Environments</h2>
<p>The core innovation is to describe an atomic configuration $\mathcal{A}$ as a ket $|\mathcal{A}\rangle$ in a Hilbert space, formed by placing smooth functions $g(\mathbf{r})$ (typically Gaussians) on each atom and decorating them with orthonormal element kets $|\alpha\rangle$:</p>
<p>$$
\langle \mathbf{r} | \mathcal{A} \rangle = \sum_{i} g(\mathbf{r} - \mathbf{r}_{i}) | \alpha_{i} \rangle
$$</p>
<p>This ket is basis-independent, which is the reason for adopting the Dirac notation. The same abstract object can be projected onto position space, reciprocal space, or a basis of radial functions and spherical harmonics, yielding different concrete representations that all encode the same structural information.</p>
<h3 id="symmetrization-via-haar-integration">Symmetrization via Haar Integration</h3>
<p>To impose translational invariance, the ket is averaged over the translation group. Averaging the raw density $|\mathcal{A}\rangle$ directly (first order, $\nu = 1$) discards all geometric information and retains only atom counts per element. The solution is to first take tensor products and then average:</p>
<p>$$
\left| \mathcal{A}^{(\nu)} \right\rangle_{\hat{t}} = \int \mathrm{d}\hat{t}\, \underbrace{\hat{t}|\mathcal{A}\rangle \otimes \hat{t}|\mathcal{A}\rangle \otimes \cdots \otimes \hat{t}|\mathcal{A}\rangle}_{\nu}
$$</p>
<p>For $\nu = 2$, this yields a translationally-invariant ket that encodes pairwise distance information between atoms, and naturally decomposes into atom-centered contributions:</p>
<p>$$
\left| \mathcal{A}^{(2)} \right\rangle_{\hat{t}} = \sum_{j} |\alpha_{j}\rangle |\mathcal{X}_{j}\rangle
$$</p>
<p>where $|\mathcal{X}_{j}\rangle$ is the environment ket centered on atom $j$, defined with a smooth cutoff function $f_{c}(r_{ij})$ that restricts each environment to a spherical neighborhood (justified by the nearsightedness principle of electronic matter). This decomposition is what justifies the widely used additive kernel between structures (a sum of kernels between environments).</p>
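<p>A minimal sketch of the resulting additive kernel between two structures, given per-environment feature vectors such as SOAP power spectra (normalization and the kernel power $\zeta$ are illustrative):</p>
<pre><code class="language-python">import numpy as np

def structure_kernel(env_feats_a, env_feats_b, zeta=2):
    """Sum of environment kernels (raised to the power zeta) between structures A and B."""
    # env_feats_*: [n_atoms, n_features], rows L2-normalized
    k_env = env_feats_a @ env_feats_b.T      # pairwise environment kernels
    return np.sum(k_env ** zeta)             # additive kernel over all environment pairs
</code></pre>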
<h3 id="rotational-invariance-and-body-order-correlations">Rotational Invariance and Body-Order Correlations</h3>
<p>The same Haar integration procedure over the $SO(3)$ rotation group produces rotationally invariant representations:</p>
<p>$$
\left| \mathcal{X}^{(\nu)} \right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\, \prod_{\aleph=1}^{\nu} \otimes \hat{R}\hat{U}_{\aleph}|\mathcal{X}_{j}^{\aleph}\rangle
$$</p>
<p>The order $\nu$ of the tensor product before symmetrization determines the body-order of correlations captured: $\nu$ corresponds to $(\nu + 1)$-body correlations. The $\nu = 1$ invariant ket retains only radial (distance) information (two-body). The $\nu = 2$ ket encodes three-body correlations (two distances and an angle), and is argued to be sufficient for unique reconstruction of a configuration (up to inversion symmetry), based on extensive numerical experiments. Using nonlinear kernels (tensor products of the symmetrized ket, parameterized by $\zeta$) allows the model to incorporate higher body-order correlations beyond those explicitly in the feature vector.</p>
<h2 id="recovering-soap-symmetry-functions-and-tensorial-extensions">Recovering SOAP, Symmetry Functions, and Tensorial Extensions</h2>
<p>By projecting the abstract invariant kets onto specific basis sets, the authors recover several well-known frameworks as special cases.</p>
<h3 id="behler-parrinello-symmetry-functions">Behler-Parrinello Symmetry Functions</h3>
<p>In the $\delta$-function limit of the atomic density, the $\nu = 1$ and $\nu = 2$ invariant kets in real space directly correspond to the 2-body and 3-body correlation functions. Behler-Parrinello symmetry functions are projections of these correlation functions onto suitable test functions $G$:</p>
<p>$$
\langle \alpha \beta G_{2} | \mathcal{X}_{j} \rangle = \langle \alpha | \alpha_{j} \rangle \int \mathrm{d}r\, G_{2}(r)\, r \left\langle \beta r \middle| \mathcal{X}_{j}^{(1)} \right\rangle_{\hat{R},\, h \to \delta}
$$</p>
<p>where the $h \to \delta$ subscript indicates the Dirac delta limit of the atomic density.</p>
<h3 id="soap-power-spectrum">SOAP Power Spectrum</h3>
<p>Expanding the environmental ket in a basis of radial functions $R_{n}(r)$ and <a href="https://en.wikipedia.org/wiki/Spherical_harmonics">spherical harmonics</a> $Y_{m}^{l}(\hat{\mathbf{r}})$, the $\nu = 2$ invariant ket is the SOAP power spectrum:</p>
<p>$$
\left\langle \alpha n \alpha' n' l \middle| \mathcal{X}_{j}^{(2)} \right\rangle_{\hat{R}} \propto \frac{1}{\sqrt{2l+1}} \sum_{m} \langle \alpha n l m | \mathcal{X}_{j} \rangle^{\star} \langle \alpha' n' l m | \mathcal{X}_{j} \rangle
$$</p>
<p>This identity shows that the SOAP kernel, which can be expressed as a scalar product between truncated power spectrum vectors, is a natural consequence of the inner product between invariant kets. The $\nu = 3$ case yields the <a href="https://en.wikipedia.org/wiki/Bispectrum">bispectrum</a>, used as a four-body feature vector in both SOAP and Spectral Neighbor Analysis Potentials (SNAP), where its high resolution enables accurate interatomic potentials through linear regression:</p>
<p>$$
\langle \alpha_{1} n_{1} l_{1}, \alpha_{2} n_{2} l_{2}, \alpha n l | \mathcal{X}_{j}^{(3)} \rangle_{\hat{R}} \propto \frac{1}{\sqrt{2l+1}} \sum_{m, m_{1}, m_{2}} \langle \mathcal{X}_{j} | \alpha n l m \rangle \langle \alpha_{1} n_{1} l_{1} m_{1} | \mathcal{X}_{j} \rangle \langle \alpha_{2} n_{2} l_{2} m_{2} | \mathcal{X}_{j} \rangle \langle l_{1} m_{1} l_{2} m_{2} | l m \rangle
$$</p>
<p>where $\langle l_{1} m_{1} l_{2} m_{2} | l m \rangle$ is a <a href="https://en.wikipedia.org/wiki/Clebsch%E2%80%93Gordan_coefficients">Clebsch-Gordan coefficient</a>.</p>
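<p>A minimal NumPy sketch of assembling the $\nu = 2$ invariant (the power spectrum) from density expansion coefficients of a single environment; array shapes and normalization are assumptions, and production codes (e.g. DScribe, librascal) handle the radial basis and cutoffs:</p>
<pre><code class="language-python">import numpy as np

def power_spectrum(c):
    """c: complex coefficients of shape [n_species, n_max, l_max + 1, 2*l_max + 1];
    entries with abs(m) larger than l are assumed zero-padded."""
    p = np.einsum("anlm,bklm->abnkl", np.conj(c), c).real   # sum over m; indices [alpha, alpha', n, n', l]
    l_plus_1 = c.shape[2]
    for l in range(l_plus_1):
        p[..., l] /= np.sqrt(2 * l + 1)
    return p.reshape(-1)  # flattened feature vector for this environment
</code></pre>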
<h3 id="tensorial-soap-lambda-soap">Tensorial SOAP ($\lambda$-SOAP)</h3>
<p>The tensorial extension of SOAP incorporates an angular momentum ket $|\lambda \mu\rangle$ into the tensor product before symmetrization:</p>
<p>$$
\left| \mathcal{X}^{(\nu)} \lambda \mu \right\rangle_{\hat{R}} = \int \mathrm{d}\hat{R}\, \hat{R}|\lambda \mu\rangle \prod_{\aleph=1}^{\nu} \otimes \hat{R}|\mathcal{X}_{j}\rangle
$$</p>
<p>This construction is rotationally invariant in the full product space but covariant in the subspace of atomic environments, enabling models for tensorial properties (e.g., polarizability tensors, chemical shielding).</p>
<h3 id="distributions-vs-sorted-vectors">Distributions vs. Sorted Vectors</h3>
<p>The paper also connects density-based and sorted-vector approaches. Given a set of structural descriptors $\{a_{i}\}$, the sorted vector is equivalent to the inverse cumulative distribution function of the histogram of values. The Euclidean distance between sorted vectors is the $\mathcal{L}^{2}$ norm of the difference between the inverse CDFs, and the $\mathcal{L}^{1}$ norm corresponds to the <a href="https://en.wikipedia.org/wiki/Earth_mover%27s_distance">earth mover&rsquo;s distance</a>. This highlights that different symmetrization strategies encode essentially the same structural information.</p>
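<p>A quick numerical check of the sorted-vector / earth-mover correspondence for two equal-size sets of scalar descriptors (placeholder random data):</p>
<pre><code class="language-python">import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
a, b = rng.normal(size=100), rng.normal(loc=0.5, size=100)

sorted_l1 = np.abs(np.sort(a) - np.sort(b)).mean()   # L1 distance between sorted vectors
emd = wasserstein_distance(a, b)                      # earth mover's distance between the samples
print(sorted_l1, emd)                                 # the two values coincide
</code></pre>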
<h2 id="generalized-operators-for-tuning-representations">Generalized Operators for Tuning Representations</h2>
<p>The framework becomes especially powerful through the introduction of a linear Hermitian operator $\hat{U}$ that transforms the density ket before symmetrization. This operator must commute with rotations:</p>
<p>$$
\langle \alpha n l m | \hat{U} | \alpha' n' l' m' \rangle = \delta_{ll'} \delta_{mm'} \langle \alpha n l | \hat{U} | \alpha' n' l' \rangle
$$</p>
<p>Several practical modifications to standard representations can be understood as choices of $\hat{U}$:</p>
<h3 id="dimensionality-reduction">Dimensionality Reduction</h3>
<p>A low-rank expansion of $\hat{U}$ via PCA on the spherical-harmonic covariance matrix of environments identifies linearly independent components, enabling compression of the feature vector. For a given $l$, the covariance matrix between spherical expansion coefficients is:</p>
<p>$$
C_{\alpha n \alpha' n'}^{(l)} = \frac{1}{N} \sum_{j} \sum_{m} \langle \alpha n l m | \mathcal{X}_{j} \rangle^{\star} \langle \mathcal{X}_{j} | \alpha' n' l m \rangle = \frac{\sqrt{2l+1}}{N} \sum_{j} \left\langle \alpha n \alpha' n' l \middle| \mathcal{X}_{j}^{(2)} \right\rangle_{\hat{R}}
$$</p>
<p>The eigenvectors of $\mathbf{C}^{(l)}$ provide the mixing coefficients for a compressed representation, retaining only components with significant eigenvalues.</p>
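<p>A minimal NumPy sketch of this compression step, assuming precomputed expansion coefficients for a set of environments at a fixed $l$:</p>
<pre><code class="language-python">import numpy as np

def compressed_basis(coeffs_l, n_keep=32):
    """coeffs_l: complex array [n_envs, n_species * n_max, 2*l + 1] for one angular channel l.
    Returns mixing coefficients for the n_keep leading principal components."""
    n_envs = coeffs_l.shape[0]
    cov = np.einsum("jam,jbm->ab", np.conj(coeffs_l), coeffs_l).real / n_envs
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -n_keep:]              # keep the components with the largest eigenvalues
</code></pre>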
<h3 id="radial-scaling">Radial Scaling</h3>
<p>In systems with relatively uniform atom density, the overlap kernel is dominated by the region farthest from the center. A radial scaling operator $u(r)$ (diagonal in position space) downweights distant contributions:</p>
<p>$$
\langle \alpha \mathbf{r} | \hat{U} | \mathcal{X}_{j} \rangle = u(r)\, \psi_{\mathcal{X}_{j}}^{\alpha}(\mathbf{r})
$$</p>
<p>This recovers the multi-scale kernels that are known to improve predictions in practice, and connects to the two-body features of Faber et al.</p>
<h3 id="alchemical-kernels">Alchemical Kernels</h3>
<p>An operator that acts only in chemical-element space introduces correlations between different elements. The &ldquo;alchemical&rdquo; projection:</p>
<p>$$
\langle J \mathbf{r} | \mathcal{X}_{j} \rangle = \sum_{\alpha} u_{J\alpha}\, \psi_{\mathcal{X}_{j}}^{\alpha}(\mathbf{r})
$$</p>
<p>reduces the dimensionality from $O(n_{\mathrm{sp}}^{2})$ to $O(d_{J}^{2})$ and has been shown to produce a low-dimensional representation of elemental space that shares similarities with periodic-table groupings.</p>
<h3 id="non-factorizable-operators">Non-Factorizable Operators</h3>
<p>For more complex modifications (e.g., distance- and angle-dependent scaling of three-body correlations), the operator must act on the full product space rather than factoring into independent components. The authors show that the three-body scaling function of Faber et al. corresponds to a diagonal non-factorizable operator in the real-space representation.</p>
<h2 id="implications-and-future-directions">Implications and Future Directions</h2>
<p>The main conclusions are:</p>
<ol>
<li>
<p><strong>Unification</strong>: SOAP, Behler-Parrinello symmetry functions, $\lambda$-SOAP, bispectrum descriptors, and sorted-vector approaches all emerge from the same abstract construction, differing only in the choice of basis set, body order $\nu$, and kernel power $\zeta$.</p>
</li>
<li>
<p><strong>Systematic improvability</strong>: The $\hat{U}$ operator framework provides a principled way to tune representations, from simple radial scaling to full alchemical and non-factorizable couplings, with clear connections to existing heuristic modifications.</p>
</li>
<li>
<p><strong>Completeness hierarchy</strong>: The body-order parameter $\nu$ and kernel power $\zeta$ together control the trade-off between completeness and computational cost. The $\nu = 2$ (three-body) representation appears to be sufficient for unique structural identification, while higher orders can be recovered through nonlinear kernels.</p>
</li>
</ol>
<p><strong>Limitations</strong>: The paper is primarily theoretical and does not include extensive numerical benchmarks comparing the different instantiations of the framework. Optimization of the $\hat{U}$ operator (especially in its general form) carries a risk of overfitting that the authors acknowledge but do not resolve. The connection to neural network-based representations (message-passing networks, equivariant architectures) is not explored.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p>This is a theoretical paper that does not introduce new benchmark results. No training or test datasets are used. The ethanol molecule in Figure 2 serves as a visualization example for three-body correlation functions.</p>
<h3 id="algorithms">Algorithms</h3>
<p>The paper defines abstract constructions (Haar integration, tensor products, operator transformations) and shows how they reduce to concrete algorithms:</p>
<ul>
<li><strong>SOAP power spectrum</strong>: Expand atom density in radial functions and spherical harmonics, compute $\nu = 2$ invariant ket (Eq. 33).</li>
<li><strong>Alchemical projection</strong>: Contract element channels via learned or PCA-derived mixing coefficients (Eq. 53-55).</li>
<li><strong>Dimensionality reduction</strong>: PCA on the covariance matrix $C_{\alpha n \alpha&rsquo; n&rsquo;}^{(l)}$ of spherical expansion coefficients (Eq. 45).</li>
</ul>
<h3 id="models">Models</h3>
<p>No trained models are presented. The framework applies to kernel ridge regression / Gaussian process regression models that use these representations as inputs.</p>
<h3 id="evaluation">Evaluation</h3>
<p>No quantitative benchmarks are reported. The contribution is the theoretical framework itself, connecting and generalizing existing representations.</p>
<h3 id="hardware">Hardware</h3>
<p>Not applicable (theoretical work).</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Willatt, M. J., Musil, F., &amp; Ceriotti, M. (2019). Atom-density representations for machine learning. <em>The Journal of Chemical Physics</em>, 150(15), 154110. <a href="https://doi.org/10.1063/1.5090481">https://doi.org/10.1063/1.5090481</a></p>
<p><strong>Publication</strong>: The Journal of Chemical Physics, 2019</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{willatt2019atom,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Atom-density representations for machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Willatt, Michael J. and Musil, F{\&#39;e}lix and Ceriotti, Michele}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{The Journal of Chemical Physics}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{150}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">number</span>=<span style="color:#e6db74">{15}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{154110}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2019}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1063/1.5090481}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{1807.00408}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archiveprefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryclass</span>=<span style="color:#e6db74">{physics.chem-ph}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Ewald Message Passing for Molecular Graphs</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/ewald-message-passing-molecular-graphs/</link><pubDate>Tue, 07 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/ewald-message-passing-molecular-graphs/</guid><description>Ewald message passing augments GNNs with Fourier-space long-range interactions, improving energy predictions by 10-16% on OC20 and OE62 benchmarks.</description><content:encoded><![CDATA[<h2 id="a-fourier-space-long-range-correction-for-molecular-gnns">A Fourier-Space Long-Range Correction for Molecular GNNs</h2>
<p>This is a <strong>Method</strong> paper that introduces Ewald message passing (Ewald MP), a general framework for incorporating long-range interactions into message passing neural networks (MPNNs) for molecular <a href="/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/">potential energy surface</a> prediction. The key contribution is a nonlocal Fourier-space message passing scheme, grounded in the classical <a href="https://en.wikipedia.org/wiki/Ewald_summation">Ewald summation</a> technique from computational physics, that complements the short-range message passing of existing GNN architectures.</p>
<h2 id="the-long-range-interaction-problem-in-molecular-gnns">The Long-Range Interaction Problem in Molecular GNNs</h2>
<p>Standard MPNNs for molecular property prediction rely on a spatial distance cutoff to define atomic neighborhoods. While this locality assumption enables favorable scaling with system size and provides a useful inductive bias, it fundamentally limits the model&rsquo;s ability to capture long-range interactions such as electrostatic forces and van der Waals (<a href="https://en.wikipedia.org/wiki/London_dispersion_force">London dispersion</a>) interactions. These interactions decay slowly with distance (e.g., electrostatic energy follows a $1/r$ power law), and truncating them with a distance cutoff can introduce severe artifacts in thermochemical predictions.</p>
<p>This problem is well-known in molecular dynamics, where empirical force fields explicitly separate bonded (short-range) and non-bonded (long-range) energy terms. The Ewald summation technique addresses this by decomposing interactions into a short-range part that converges quickly with a distance cutoff and a long-range part whose Fourier transform converges quickly with a frequency cutoff. The authors propose bringing this same strategy into the GNN paradigm.</p>
<h2 id="from-ewald-summation-to-learnable-fourier-space-messages">From Ewald Summation to Learnable Fourier-Space Messages</h2>
<p>The core insight is a formal analogy between the continuous-filter convolution used in MPNNs and the electrostatic potential computation in Ewald summation. In a standard continuous-filter convolution, the message sum for atom $i$ is:</p>
<p>$$
M_i^{(l+1)} = \sum_{j \in \mathcal{N}(i)} h_j^{(l)} \cdot \Phi^{(l)}(| \mathbf{x}_i - \mathbf{x}_j |)
$$</p>
<p>where $h_j^{(l)}$ are atom embeddings and $\Phi^{(l)}$ is a learned radial filter. Comparing this to the electrostatic potential $V_i^{\text{es}}(\mathbf{x}_i) = \sum_{j \neq i} q_j \cdot \Phi^{\text{es}}(| \mathbf{x}_i - \mathbf{x}_j |)$ reveals a direct correspondence: atom embeddings play the role of partial charges, and learned filters replace the $1/r$ kernel.</p>
<p>Ewald MP decomposes the learned filter into short-range and long-range components. The short-range part is handled by any existing GNN architecture with a distance cutoff. The long-range part is computed as a sum over Fourier frequencies:</p>
<p>$$
M^{\text{lr}}(\mathbf{x}_i) = \sum_{\mathbf{k}} \exp(i \mathbf{k}^T \mathbf{x}_i) \cdot s_{\mathbf{k}} \cdot \hat{\Phi}^{\text{lr}}(| \mathbf{k} |)
$$</p>
<p>where $s_{\mathbf{k}}$ are <strong><a href="https://en.wikipedia.org/wiki/Structure_factor">structure factor</a> embeddings</strong>, computed as:</p>
<p>$$
s_{\mathbf{k}} = \sum_{j \in \mathcal{S}} h_j \exp(-i \mathbf{k}^T \mathbf{x}_j)
$$</p>
<p>These structure factor embeddings are a Fourier-space representation of the atom embedding distribution, and truncating to low frequencies effectively coarse-grains the hidden model state while preserving long-range information. The frequency filters $\hat{\Phi}^{\text{lr}}$ are learned, making the entire scheme data-driven rather than tied to a fixed physical functional form.</p>
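<p>A minimal sketch of the Fourier-space message sum, assuming a precomputed set of $\mathbf{k}$-vectors and a learned per-frequency filter (names and shapes are illustrative, not taken from the released code):</p>
<pre><code class="language-python">import torch

def ewald_long_range_messages(h, x, k_vectors, freq_filter):
    """h: [N, F] atom embeddings; x: [N, 3] positions;
    k_vectors: [K, 3]; freq_filter: [K] learned filter values phi_lr(|k|)."""
    phase = x @ k_vectors.T                         # [N, K] dot products k . x_j
    basis = torch.exp(-1j * phase)                  # exp(-i k . x_j)
    s_k = basis.T @ h.to(basis.dtype)               # [K, F] structure factor embeddings
    m_lr = (torch.conj(basis) * freq_filter) @ s_k  # exp(+i k . x_i) weighting, summed over k
    return m_lr.real                                # [N, F] long-range message sum
</code></pre>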
<p>The method handles both <strong>periodic</strong> systems (where the <a href="https://en.wikipedia.org/wiki/Reciprocal_lattice">reciprocal lattice</a> provides a natural frequency discretization) and <strong>aperiodic</strong> systems (where the Fourier domain is discretized using a cubic voxel grid with SVD-based rotation alignment to preserve rotation invariance). The combined embedding update becomes:</p>
<p>$$
h_i^{(l+1)} = \frac{1}{\sqrt{3}} \left[ h_i^{(l)} + f_{\text{upd}}^{\text{sr}}(M_i^{\text{sr}}) + f_{\text{upd}}^{\text{lr}}(M_i^{\text{lr}}) \right]
$$</p>
<p>The computational complexity is $\mathcal{O}(N_{\text{at}} N_{\text{k}})$, and by fixing the number of frequency vectors $N_{\text{k}}$, linear scaling $\mathcal{O}(N_{\text{at}})$ is achievable.</p>
<h2 id="experiments-across-four-gnn-architectures-and-two-datasets">Experiments Across Four GNN Architectures and Two Datasets</h2>
<p>The authors test Ewald MP as an augmentation on four baseline architectures: <a href="/notes/chemistry/datasets/marcel/">SchNet, PaiNN, DimeNet++, and GemNet-T</a>. Two datasets are used:</p>
<ul>
<li><strong>OC20</strong> (Chanussot et al., 2021): ~265M periodic structures of adsorbate-catalyst systems with DFT-computed energies and forces. The OC20-2M subsplit is used for training.</li>
<li><strong>OE62</strong> (Stuke et al., 2020): ~62,000 large aperiodic organic molecules with DFT-computed energies that include a DFT-D3 dispersion correction for London dispersion interactions.</li>
</ul>
<p>All baselines use a 6 Å distance cutoff and 50 maximum neighbors. The Ewald modification is minimal: the long-range message sum is added as an additional skip connection term in each interaction block. Comparison studies include: (1) increasing the distance cutoff to match the computational cost of Ewald MP, (2) replacing the Ewald block with a SchNet interaction block at increased cutoff, and (3) increasing atom embedding dimensions to match Ewald MP&rsquo;s parameter count.</p>
<h3 id="key-energy-mae-results-on-oe62">Key Energy MAE Results on OE62</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Baseline (meV)</th>
          <th>Ewald MP (meV)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>133.5</td>
          <td>79.2</td>
          <td>40.7%</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>61.4</td>
          <td>57.9</td>
          <td>5.7%</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>51.2</td>
          <td>46.5</td>
          <td>9.2%</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>51.5</td>
          <td>47.4</td>
          <td>8.0%</td>
      </tr>
  </tbody>
</table>
<h3 id="key-energy-mae-results-on-oc20-averaged-across-test-splits">Key Energy MAE Results on OC20 (Averaged Across Test Splits)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Baseline (meV)</th>
          <th>Ewald MP (meV)</th>
          <th>Improvement</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>895</td>
          <td>830</td>
          <td>7.3%</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>448</td>
          <td>393</td>
          <td>12.3%</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>496</td>
          <td>445</td>
          <td>10.4%</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>346</td>
          <td>307</td>
          <td>11.3%</td>
      </tr>
  </tbody>
</table>
<h2 id="robust-long-range-improvements-and-dispersion-recovery">Robust Long-Range Improvements and Dispersion Recovery</h2>
<p>Ewald MP achieves consistent improvements across all models and both datasets, averaging 16.1% on OE62 and 10.3% on OC20. Several findings stand out:</p>
<ol>
<li>
<p><strong>Robustness</strong>: Unlike the increased-cutoff and SchNet-LR alternatives, Ewald MP never produces detrimental effects in any tested configuration. The increased cutoff setting hurts SchNet and PaiNN on OE62, and the SchNet-LR block fails to improve DimeNet++ and GemNet-T.</p>
</li>
<li>
<p><strong>Long-range specificity</strong>: A binning analysis on OE62 groups molecules by the magnitude of their DFT-D3 dispersion correction. Ewald MP shows an outsize improvement for structures with large long-range energy contributions. It recovers or surpasses a &ldquo;cheating&rdquo; baseline that receives the exact DFT-D3 ground truth as an additional input.</p>
</li>
<li>
<p><strong>Efficiency on periodic systems</strong>: Ewald MP achieves similar relative improvements on OC20 at roughly half the relative computational cost compared to OE62, suggesting periodic structures as a particularly attractive application domain.</p>
</li>
<li>
<p><strong>Force predictions</strong>: Improvements in <a href="/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/">force MAEs</a> are consistent but small, which is expected since the frequency truncation removes high-frequency contributions to the potential energy surface.</p>
</li>
<li>
<p><strong>Ablation studies</strong>: Results are robust across different frequency cutoffs, voxel resolutions, and filtering strategies, with the non-radial periodic filtering scheme outperforming radial alternatives on out-of-distribution generalization.</p>
</li>
</ol>
<p>Limitations include the current focus on scalar (invariant) embeddings only (PaiNN&rsquo;s equivariant vector embeddings are not augmented), and the potential for a &ldquo;gap&rdquo; of medium-range interactions when $N_{\text{k}}$ is fixed for linear scaling. The authors suggest adapting more efficient Ewald summation variants (e.g., particle mesh Ewald with $\mathcal{O}(N \log N)$ scaling) as future work.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training (periodic)</td>
          <td>OC20-2M</td>
          <td>~2M structures</td>
          <td>Subsplit of OC20; PBC; DFT energies and forces</td>
      </tr>
      <tr>
          <td>Training (aperiodic)</td>
          <td>OE62</td>
          <td>~62,000 molecules</td>
          <td>Large organic molecules; DFT energies with D3 correction</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>OC20-test (4 splits: ID, OOD-ads, OOD-cat, OOD-both)</td>
          <td>Varies</td>
          <td>Evaluated via submission to OC20 evaluation server</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>OE62-val, OE62-test</td>
          <td>~6,000 each</td>
          <td>Direct evaluation</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Ewald message passing is integrated as an additional skip connection term in each interaction block</li>
<li>For periodic systems: non-radial filtering with fixed reciprocal lattice positions ($N_x, N_y, N_z$ hyperparameters)</li>
<li>For aperiodic systems: radial Gaussian basis function filtering with frequency cutoff $c_k$ and voxel resolution $\Delta = 0.2$ Å$^{-1}$</li>
<li>SVD-based coordinate alignment for rotation invariance in the aperiodic case</li>
<li>Bottleneck dimension $N_\downarrow = 16$ (GemNet-T) or $N_\downarrow = 8$ (others)</li>
<li>Update function: dense layer + $N_{\text{hidden}}$ residual layers ($N_{\text{hidden}} = 3$, except PaiNN with $N_{\text{hidden}} = 0$)</li>
</ul>
<h3 id="models">Models</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Embedding Size (OE62)</th>
          <th>Interaction Blocks</th>
          <th>Ewald Params (OE62)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>SchNet</td>
          <td>512</td>
          <td>4</td>
          <td>12.2M total</td>
      </tr>
      <tr>
          <td>PaiNN</td>
          <td>512</td>
          <td>4</td>
          <td>15.7M total</td>
      </tr>
      <tr>
          <td>DimeNet++</td>
          <td>256</td>
          <td>3</td>
          <td>4.8M total</td>
      </tr>
      <tr>
          <td>GemNet-T</td>
          <td>256</td>
          <td>3</td>
          <td>16.1M total</td>
      </tr>
  </tbody>
</table>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Primary metric: Energy mean absolute error (EMAE) in meV</li>
<li>Secondary metric: Force MAE in meV/Å (OC20 only)</li>
<li>Loss: Linear combination of energy and force MAEs (Eq. 15) with model-specific force multipliers</li>
<li>Optimizer: Adam with weight decay ($\lambda = 0.01$)</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li>All runtime measurements on NVIDIA A100 GPUs</li>
<li>Runtimes measured after 50 warmup batches, averaged over 500 batches, minimum of 3 repetitions</li>
<li>Code: <a href="https://github.com/arthurkosmala/EwaldMP">EwaldMP</a> (Hippocratic License 3.0)</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/arthurkosmala/EwaldMP">EwaldMP</a></td>
          <td>Code</td>
          <td>Hippocratic License 3.0 (new files) / MIT (OC20 base)</td>
          <td>Official implementation built on the Open Catalyst Project codebase</td>
      </tr>
      <tr>
          <td><a href="https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md">OC20</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>~265M periodic adsorbate-catalyst structures with DFT energies and forces</td>
      </tr>
      <tr>
          <td><a href="https://doi.org/10.1038/s41597-020-0385-y">OE62</a></td>
          <td>Dataset</td>
          <td>CC-BY-4.0</td>
          <td>~62,000 large organic molecules with DFT energies including D3 correction</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Source code, both datasets, and detailed hyperparameters (including per-model learning rates, batch sizes, and Ewald-specific settings) are all publicly available. Pre-trained model weights are not provided.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kosmala, A., Gasteiger, J., Gao, N., &amp; Günnemann, S. (2023). Ewald-based Long-Range Message Passing for Molecular Graphs. In <em>Proceedings of the 40th International Conference on Machine Learning (ICML 2023)</em>.</p>
<p><strong>Publication</strong>: ICML 2023</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kosmala2023ewald,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Ewald-based Long-Range Message Passing for Molecular Graphs}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kosmala, Arthur and Gasteiger, Johannes and Gao, Nicholas and G{\&#34;u}nnemann, Stephan}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 40th International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2023}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{202}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>PharMolixFM: Multi-Modal All-Atom Molecular Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</link><pubDate>Sat, 28 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/pharmolixfm-all-atom-foundation-models/</guid><description>PharMolixFM unifies diffusion, flow matching, and Bayesian flow networks for all-atom molecular modeling and generation with task-specific denoising priors.</description><content:encoded><![CDATA[<h2 id="a-unified-framework-for-all-atom-molecular-foundation-models">A Unified Framework for All-Atom Molecular Foundation Models</h2>
<p>PharMolixFM is a <strong>Method</strong> paper that introduces a unified framework for constructing all-atom foundation models for molecular modeling and generation. The primary contribution is the systematic implementation of three multi-modal generative model variants (diffusion, flow matching, and Bayesian flow networks) within a single architecture, along with a task-unifying denoising formulation that enables training on multiple structural biology tasks simultaneously. The framework achieves competitive performance on protein-small-molecule docking and structure-based drug design while providing the first empirical analysis of inference scaling laws for molecular generative models.</p>
<h2 id="challenges-in-multi-modal-atomic-modeling">Challenges in Multi-Modal Atomic Modeling</h2>
<p>Existing all-atom foundation models such as AlphaFold3, RoseTTAFold All-Atom, and ESM-AA face two core challenges that limit their generalization across molecular modeling and generation tasks.</p>
<p>First, atomic data is inherently multi-modal: each atom comprises both a discrete atom type and continuous 3D coordinates. This poses challenges for structure models that need to jointly capture and predict both modalities. Unlike text or image data that exhibit a single modality, molecular structures require generative models that can handle discrete categorical variables (atom types, bond types) and continuous variables (coordinates) simultaneously.</p>
<p>Second, there has been no comprehensive analysis of how different training objectives and sampling strategies impact the performance of all-atom foundation models. Prior work has focused on individual model architectures without systematically comparing generative frameworks or studying how inference-time compute scaling affects prediction quality.</p>
<p>PharMolixFM addresses both challenges by providing a unified framework that implements three state-of-the-art multi-modal generative models and formulates all downstream tasks as a generalized denoising process with task-specific priors.</p>
<h2 id="multi-modal-denoising-with-task-specific-priors">Multi-Modal Denoising with Task-Specific Priors</h2>
<p>The core innovation of PharMolixFM is the formulation of molecular tasks as a generalized denoising process where task-specific priors control which parts of the molecular system are noised during training. The framework decomposes a biomolecular system into $N$ atoms represented as a triplet $\bar{\mathbf{S}}_0 = \langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle$, where $\mathbf{X}_0 \in \mathbb{R}^{N \times 3}$ are atom coordinates, $\mathbf{A}_0 \in \mathbb{Z}^{N \times D_1}$ are one-hot atom types, and $\mathbf{E}_0 \in \mathbb{Z}^{N \times N \times D_2}$ are one-hot bond types.</p>
<p>The generative model estimates the density $p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)$ subject to SE(3) invariance:</p>
<p>$$
p_\theta(\langle \mathbf{R}\mathbf{X}_0 + \mathbf{t}, \mathbf{A}_0, \mathbf{E}_0 \rangle) = p_\theta(\langle \mathbf{X}_0, \mathbf{A}_0, \mathbf{E}_0 \rangle)
$$</p>
<p>The variational lower bound is optimized over latent variables $S_1, \ldots, S_T$ obtained by adding independent noise to different modalities and atoms:</p>
<p>$$
q(S_{1:T} \mid S_0) = \prod_{i=1}^{T} \prod_{j=1}^{N} q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}, \sigma_{i,j}^{(\mathbf{X})})\, q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}, \sigma_{i,j}^{(\mathbf{A})})\, q(\mathbf{E}_{i,j} \mid \mathbf{E}_{0,j}, \sigma_{i,j}^{(\mathbf{E})})
$$</p>
<p>A key design choice is the noise schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$, where $\text{fix}_j^{(\mathcal{M})}$ is a scaling factor between 0 and 1 that controls which atoms and modalities receive noise. This &ldquo;Fix&rdquo; mechanism enables multiple training tasks:</p>
<ul>
<li><strong>Docking</strong> ($\text{Fix} = 1$ for protein and molecular graph, $\text{Fix} = 0$ for molecule coordinates): predicts binding pose given known atom/bond types.</li>
<li><strong>Structure-based drug design</strong> ($\text{Fix} = 1$ for protein, $\text{Fix} = 0$ for all molecule properties): generates novel molecules for a given pocket.</li>
<li><strong>Robustness augmentation</strong> ($\text{Fix} = 0.7$ for 15% randomly selected atoms, $\text{Fix} = 0$ for rest): simulates partial structure determination.</li>
</ul>
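<p>A mechanical sketch of the stated schedule $\sigma_{i,j}^{(\mathcal{M})} = \frac{i}{T} \cdot \text{fix}_j^{(\mathcal{M})}$ with per-atom fix factors (tensor shapes are assumptions):</p>
<pre><code class="language-python">import torch

def sigma_schedule(step, total_steps, fix):
    """step: int in [1, T]; fix: [N] per-atom scaling factors in [0, 1] for one modality."""
    return (step / total_steps) * fix

fix = torch.tensor([1.0, 1.0, 0.7, 0.0])                     # four atoms, different fix factors
print(sigma_schedule(step=250, total_steps=1000, fix=fix))   # tensor([0.2500, 0.2500, 0.1750, 0.0000])
</code></pre>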
<h3 id="three-generative-model-variants">Three Generative Model Variants</h3>
<p><strong>Multi-modal diffusion (PharMolixFM-Diff)</strong> uses a Markovian forward process. Continuous coordinates follow Gaussian diffusion while discrete variables use a D3PM categorical transition:</p>
<p>$$
q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}\left(\sqrt{\alpha_{i,j}}\, \mathbf{X}_{0,j},\ (1 - \alpha_{i,j}) \mathbf{I}\right), \quad \alpha_{i,j} = \prod_{k=1}^{i}\left(1 - \sigma_{k,j}^{(\mathbf{X})}\right)
$$</p>
<p>$$
q(\mathbf{A}_{i,j} \mid \mathbf{A}_{0,j}) = \text{Cat}(\mathbf{A}_{0,j} \bar{Q}_{i,j}^{(\mathbf{A})}), \quad Q_{i,j}^{(\mathbf{A})} = (1 - \sigma_{i,j}^{(\mathbf{A})}) \mathbf{I} + \frac{\sigma_{i,j}^{(\mathbf{A})}}{D_1} \mathbb{1}\mathbb{1}^T
$$</p>
<p>The training loss combines coordinate MSE with cross-entropy for discrete variables:</p>
<p>$$
\mathcal{L} = \mathbb{E}_{S_0, i, S_i} \left[ \lambda_i^{(\mathbf{X})} | \tilde{\mathbf{X}}_0 - \mathbf{X}_0 |_2^2 + \lambda_i^{(\mathbf{A})} \mathcal{L}_{CE}(\tilde{\mathbf{A}}_0, \mathbf{A}_0) + \lambda_i^{(\mathbf{E})} \mathcal{L}_{CE}(\tilde{\mathbf{E}}_0, \mathbf{E}_0) \right]
$$</p>
<p><strong>Multi-modal flow matching (PharMolixFM-Flow)</strong> constructs a direct mapping between data and prior distributions using conditional vector fields. For coordinates, the conditional flow uses a Gaussian path $q(\mathbf{X}_{i,j} \mid \mathbf{X}_{0,j}) = \mathcal{N}((1 - \sigma_{i,j}^{(\mathbf{X})}) \mathbf{X}_{0,j}, (\sigma_{i,j}^{(\mathbf{X})})^2 \mathbf{I})$, while discrete variables use the same D3PM Markov chain. Sampling proceeds by solving an ODE via Euler integration.</p>
<p><strong>Bayesian flow networks (PharMolixFM-BFN)</strong> perform generative modeling in the parameter space of the data distribution rather than the data space. The Bayesian flow distribution for coordinates is:</p>
<p>$$
p_F(\tilde{\mathbf{X}}_{i,j}^{(\theta)} \mid \mathbf{X}_{0,j}) = \mathcal{N}(\gamma_{i,j} \mathbf{X}_{0,j}, \gamma_{i,j}(1 - \gamma_{i,j}) \mathbf{I}), \quad \gamma_{i,j} = 1 - \alpha^{2(1 - \sigma_{i,j}^{(\mathbf{X})})}
$$</p>
<h3 id="network-architecture">Network Architecture</h3>
<p>The architecture follows PocketXMol with a dual-branch SE(3)-equivariant graph neural network. A protein branch (4-layer GNN with kNN graph) processes pocket atoms, then representations are passed to a molecule branch (6-layer GNN) that captures protein-molecule interactions. Independent prediction heads reconstruct atom coordinates, atom types, and bond types, with additional confidence heads for self-ranking during inference.</p>
<h2 id="docking-and-drug-design-experiments">Docking and Drug Design Experiments</h2>
<h3 id="protein-small-molecule-docking">Protein-Small-Molecule Docking</h3>
<p>PharMolixFM is evaluated on the PoseBusters benchmark (428 protein-small-molecule complexes) using the holo docking setting with a known protein structure and 10 Angstrom binding pocket. The metric is the ratio of predictions with RMSD &lt; 2 Angstrom.</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Self-Ranking (%)</th>
          <th>Oracle-Ranking (%)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>DiffDock</td>
          <td>38.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>RFAA</td>
          <td>42.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>Vina</td>
          <td>52.3</td>
          <td>-</td>
      </tr>
      <tr>
          <td>UniMol-Docking V2</td>
          <td>77.6</td>
          <td>-</td>
      </tr>
      <tr>
          <td>SurfDock</td>
          <td>78.0</td>
          <td>-</td>
      </tr>
      <tr>
          <td>AlphaFold3</td>
          <td>90.4</td>
          <td>-</td>
      </tr>
      <tr>
          <td>PocketXMol (50 repeats)</td>
          <td>82.2</td>
          <td>95.3</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (50 repeats)</td>
          <td>83.4</td>
          <td>96.0</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow (50 repeats)</td>
          <td>73.4</td>
          <td>93.7</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN (50 repeats)</td>
          <td>78.5</td>
          <td>93.5</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff (500 repeats)</td>
          <td>83.9</td>
          <td>98.1</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM-Diff achieves the second-best self-ranking result (83.4%), outperforming PocketXMol (82.2%) but trailing AlphaFold3 (90.4%). The key advantage is inference speed: approximately 4.6 seconds per complex on a single A800 GPU compared to approximately 249.0 seconds for AlphaFold3 (a 54x speedup). Under oracle-ranking with 500 repeats, PharMolixFM-Diff reaches 98.1%, suggesting that better ranking strategies could further improve practical performance.</p>
<h3 id="structure-based-drug-design">Structure-Based Drug Design</h3>
<p>Evaluation uses the CrossDocked test set (100 protein pockets, 100 molecules generated per pocket), measuring Vina binding affinity scores and drug-likeness properties (QED and SA).</p>
<table>
  <thead>
      <tr>
          <th>Method</th>
          <th>Vina Score (Avg/Med)</th>
          <th>QED</th>
          <th>SA</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pocket2Mol</td>
          <td>-5.14 / -4.70</td>
          <td>0.57</td>
          <td>0.76</td>
      </tr>
      <tr>
          <td>TargetDiff</td>
          <td>-5.47 / -6.30</td>
          <td>0.48</td>
          <td>0.58</td>
      </tr>
      <tr>
          <td>DecompDiff</td>
          <td>-5.67 / -6.04</td>
          <td>0.45</td>
          <td>0.61</td>
      </tr>
      <tr>
          <td>MolCRAFT</td>
          <td>-6.61 / -8.14</td>
          <td>0.46</td>
          <td>0.62</td>
      </tr>
      <tr>
          <td>PharMolixFM-Diff</td>
          <td>-6.18 / -6.44</td>
          <td>0.50</td>
          <td>0.73</td>
      </tr>
      <tr>
          <td>PharMolixFM-Flow</td>
          <td>-6.34 / -6.47</td>
          <td>0.49</td>
          <td>0.74</td>
      </tr>
      <tr>
          <td>PharMolixFM-BFN</td>
          <td>-6.38 / -6.45</td>
          <td>0.48</td>
          <td>0.64</td>
      </tr>
  </tbody>
</table>
<p>PharMolixFM achieves a better balance between binding affinity and drug-like properties compared to baselines. While MolCRAFT achieves the best Vina scores, PharMolixFM-Diff and Flow variants show notably higher QED (0.49-0.50 vs. 0.45-0.48) and SA (0.73-0.74 vs. 0.58-0.62), which are important for downstream validation and in-vivo application.</p>
<h3 id="inference-scaling-law">Inference Scaling Law</h3>
<p>The paper explores whether inference-time scaling holds for molecular generative models, fitting the relationship:</p>
<p>$$
\text{Acc} = a \log(bR + c) + d
$$</p>
<p>where $R$ is the number of sampling repeats. All three PharMolixFM variants exhibit logarithmic improvement in docking accuracy with increased sampling repeats, analogous to inference scaling laws observed in NLP. Performance plateaus eventually due to distributional differences between training and test sets.</p>
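<p>A minimal sketch of fitting this curve to (repeats, accuracy) pairs with SciPy; the data points below are placeholders, not values from the paper:</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import curve_fit

def scaling_law(R, a, b, c, d):
    return a * np.log(b * R + c) + d

repeats = np.array([1, 5, 10, 50, 100, 500], dtype=float)
accuracy = np.array([0.60, 0.70, 0.74, 0.81, 0.82, 0.84])   # placeholder accuracies

params, _ = curve_fit(scaling_law, repeats, accuracy,
                      p0=[0.05, 1.0, 1.0, 0.6], bounds=(1e-6, np.inf), maxfev=10000)
print(dict(zip("abcd", params)))
</code></pre>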
<h2 id="competitive-docking-with-faster-inference-but-limited-task-scope">Competitive Docking with Faster Inference, but Limited Task Scope</h2>
<p>PharMolixFM demonstrates that multi-modal generative models can achieve competitive all-atom molecular modeling with substantial inference speed advantages over AlphaFold3. The key findings are:</p>
<ol>
<li><strong>Diffusion outperforms flow matching and BFN</strong> for docking under standard sampling budgets. The stochastic nature of diffusion sampling appears beneficial compared to the deterministic ODE integration of flow matching.</li>
<li><strong>Oracle-ranking reveals untapped potential</strong>: the gap between self-ranking (83.9%) and oracle-ranking (98.1%) at 500 repeats indicates that confidence-based ranking is a bottleneck. Better ranking methods could close the gap with AlphaFold3.</li>
<li><strong>The three variants show similar performance for drug design</strong>, suggesting that model architecture and training data may matter more than the generative framework for generation tasks.</li>
<li><strong>Inference scaling laws hold</strong> for molecular generative models, paralleling findings in NLP.</li>
</ol>
<p>Limitations: the framework is evaluated on only two tasks (docking and SBDD), and the paper does not address protein structure prediction, protein-protein interactions, or nucleic acid modeling, all of which fall within AlphaFold3&rsquo;s scope. The BFN variant underperforms the diffusion model, which the authors attribute to the smaller noise scales at early sampling steps making the training task less challenging. The paper also does not compare against concurrent work on inference-time scaling for molecular models.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training</td>
          <td>PDBBind, Binding MOAD, CrossDocked2020, PepBDB</td>
          <td>Not specified</td>
          <td>Filtered by PocketXMol criteria</td>
      </tr>
      <tr>
          <td>Docking eval</td>
          <td>PoseBusters benchmark</td>
          <td>428 complexes</td>
          <td>Holo docking with known protein</td>
      </tr>
      <tr>
          <td>SBDD eval</td>
          <td>CrossDocked test set</td>
          <td>100 pockets</td>
          <td>100 molecules per pocket</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Three generative variants: multi-modal diffusion (D3PM), flow matching, Bayesian flow networks</li>
<li>Task-specific noise via Fix mechanism (0, 0.7, or 1.0)</li>
<li>Training tasks selected with equal probability per sample</li>
<li>AdamW optimizer: weight decay 0.001, $\beta_1 = 0.99$, $\beta_2 = 0.999$ (see the sketch after this list)</li>
<li>Linear warmup to learning rate 0.001 over 1000 steps</li>
<li>180K training steps with batch size 40</li>
</ul>
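<p>A minimal PyTorch sketch of the stated optimizer and warmup settings; the <code>model</code> placeholder and the <code>LambdaLR</code>-based linear warmup are assumptions for illustration, not the authors&rsquo; training code.</p>
<pre><code class="language-python">import torch

model = torch.nn.Linear(16, 16)  # stand-in for the actual dual-branch GNN

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                # target learning rate reached after warmup
    betas=(0.99, 0.999),    # beta_1, beta_2 as reported
    weight_decay=1e-3,
)

warmup_steps = 1000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),  # linear ramp, then constant
)
</code></pre>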
<h3 id="models">Models</h3>
<ul>
<li>Dual-branch SE(3)-equivariant GNN (protein: 4-layer, molecule: 6-layer)</li>
<li>kNN graph construction for protein and protein-molecule interactions</li>
<li>Independent prediction heads for coordinates, atom types, bond types</li>
<li>Confidence heads for self-ranking during inference</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Metric</th>
          <th>PharMolixFM-Diff</th>
          <th>AlphaFold3</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
<td>RMSD &lt; 2 Å self-ranking</td>
          <td>83.4% (50 rep)</td>
          <td>90.4%</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
<td>RMSD &lt; 2 Å oracle-ranking</td>
          <td>98.1% (500 rep)</td>
          <td>-</td>
          <td>PoseBusters docking</td>
      </tr>
      <tr>
          <td>Inference time (per complex)</td>
          <td>~4.6s</td>
          <td>~249.0s</td>
          <td>Single A800 GPU</td>
      </tr>
      <tr>
          <td>Vina score (avg)</td>
          <td>-6.18</td>
          <td>-</td>
          <td>CrossDocked SBDD</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li>Training: 4x 80GB A800 GPUs</li>
<li>Inference benchmarked on single A800 GPU</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/PharMolix/OpenBioMed">OpenBioMed (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, Y., Wang, J., Fan, S., &amp; Nie, Z. (2025). PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation. <em>arXiv preprint arXiv:2503.21788</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{luo2025pharmolixfm,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{PharMolixFM: All-Atom Foundation Models for Molecular Modeling and Generation}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Yizhen and Wang, Jiashuo and Fan, Siqi and Nie, Zaiqing}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2503.21788}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MAT: Graph-Augmented Transformer for Molecules (2020)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/molecule-attention-transformer/</guid><description>MAT augments the Transformer self-attention mechanism with inter-atomic distances and molecular graph adjacency for molecular property prediction.</description><content:encoded><![CDATA[<h2 id="a-graph-augmented-transformer-for-molecular-property-prediction">A Graph-Augmented Transformer for Molecular Property Prediction</h2>
<p>This is a <strong>Method</strong> paper that proposes the Molecule Attention Transformer (MAT), a Transformer-based architecture adapted for molecular property prediction. The primary contribution is a modified self-attention mechanism that incorporates inter-atomic distances and molecular graph structure alongside the standard query-key attention. Combined with self-supervised pretraining on 2 million molecules from ZINC15, MAT achieves competitive performance across seven diverse molecular property prediction tasks while requiring minimal hyperparameter tuning.</p>
<h2 id="challenges-in-deep-learning-for-molecular-properties">Challenges in Deep Learning for Molecular Properties</h2>
<p>Predicting molecular properties is central to drug discovery and materials design, yet deep neural networks have struggled to consistently outperform shallow methods like random forests and SVMs on these tasks. Wu et al. (2018) demonstrated through the MoleculeNet benchmark that graph neural networks do not reliably beat classical models. Two recurring problems compound this:</p>
<ol>
<li><strong>Underfitting</strong>: Graph neural networks tend to underfit training data, with performance failing to scale with model complexity (Ishiguro et al., 2019).</li>
<li><strong>Hyperparameter sensitivity</strong>: Deep models for molecule property prediction require extensive hyperparameter search (often 500+ configurations) to achieve competitive results, making them impractical for many practitioners.</li>
</ol>
<p>Concurrent work explored using vanilla Transformers on SMILES string representations of molecules (Honda et al., 2019; Wang et al., 2019), but these approaches discard the explicit structural information encoded in molecular graphs and 3D conformations. The motivation for MAT is to combine the flexibility of the Transformer architecture with domain-specific inductive biases from molecular structure.</p>
<h2 id="molecule-self-attention-combining-attention-distance-and-graph-structure">Molecule Self-Attention: Combining Attention, Distance, and Graph Structure</h2>
<p>The core innovation is the Molecule Self-Attention layer, which replaces standard Transformer self-attention. In a standard Transformer, head $i$ computes:</p>
<p>$$
\mathcal{A}^{(i)} = \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) \mathbf{V}_{i}
$$</p>
<p>MAT augments this with two additional information sources. Let $\mathbf{A} \in \{0, 1\}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the molecular graph adjacency matrix and $\mathbf{D} \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{atoms}}}$ denote the inter-atomic distance matrix. The modified attention becomes:</p>
<p>$$
\mathcal{A}^{(i)} = \left(\lambda_{a} \rho\left(\frac{\mathbf{Q}_{i} \mathbf{K}_{i}^{T}}{\sqrt{d_{k}}}\right) + \lambda_{d} \, g(\mathbf{D}) + \lambda_{g} \, \mathbf{A}\right) \mathbf{V}_{i}
$$</p>
<p>where $\lambda_{a}$, $\lambda_{d}$, and $\lambda_{g}$ are scalar hyperparameters weighting each component, and $g$ is either a row-wise softmax or an element-wise exponential decay $g(d) = \exp(-d)$.</p>
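<p>A minimal PyTorch sketch of a single molecule self-attention head following the equation above; the shapes, the exponential distance kernel, and the equal default weights are illustrative assumptions rather than the authors&rsquo; implementation.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def molecule_attention_head(Q, K, V, A, D, lam_a=0.33, lam_d=0.33, lam_g=0.33):
    """One MAT-style self-attention head (illustrative sketch).

    Q, K, V : (n_atoms, d_k) projected queries, keys, values
    A       : (n_atoms, n_atoms) 0/1 molecular graph adjacency matrix
    D       : (n_atoms, n_atoms) inter-atomic distance matrix
    """
    d_k = Q.shape[-1]
    attn = F.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # standard scaled dot-product term
    dist = torch.exp(-D)                             # element-wise distance kernel g(d) = exp(-d)
    mixed = lam_a * attn + lam_d * dist + lam_g * A  # weighted sum of the three information sources
    return mixed @ V
</code></pre>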
<p>Key architectural details:</p>
<ul>
<li><strong>Atom embedding</strong>: Each atom is represented as a 26-dimensional vector encoding atomic identity (one-hot over B, N, C, O, F, P, S, Cl, Br, I, dummy, other), number of heavy neighbors, number of hydrogens, formal charge, ring membership, and aromaticity.</li>
<li><strong>Dummy node</strong>: An artificial disconnected node (distance $10^{6}$ from all atoms) is added to each molecule, allowing the model to &ldquo;skip&rdquo; attention heads when no relevant pattern exists, similar to how BERT uses the separation token.</li>
<li><strong>3D conformers</strong>: Distance matrices are computed from RDKit-generated 3D conformers optimized with the Universal Force Field (UFF); a short RDKit sketch follows this list.</li>
<li><strong>Pretraining</strong>: Node-level masked atom prediction on 2 million ZINC15 molecules (following Hu et al., 2019), where 15% of atom features are masked and the model predicts them.</li>
</ul>
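<p>The conformer-derived distance matrix can be reproduced with standard RDKit calls, as sketched below; the SMILES string is an arbitrary example and the call sequence is a plausible reconstruction of the described preprocessing, not the released pipeline.</p>
<pre><code class="language-python">from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin, chosen arbitrarily
AllChem.EmbedMolecule(mol, randomSeed=0)   # generate an initial 3D conformer
AllChem.UFFOptimizeMolecule(mol)           # relax it with the Universal Force Field
D = Chem.Get3DDistanceMatrix(mol)          # (n_atoms, n_atoms) inter-atomic distances in Angstrom
</code></pre>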
<h2 id="benchmark-evaluation-and-ablation-studies">Benchmark Evaluation and Ablation Studies</h2>
<h3 id="experimental-setup">Experimental setup</h3>
<p>MAT is evaluated on seven molecular property prediction datasets spanning regression and classification:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Task</th>
          <th>Size</th>
          <th>Metric</th>
          <th>Split</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>FreeSolv</td>
          <td>Regression (hydration free energy)</td>
          <td>642</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>ESOL</td>
          <td>Regression (log solubility)</td>
          <td>1,128</td>
          <td>RMSE</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>BBBP</td>
          <td>Classification (BBB permeability)</td>
          <td>2,039</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-alpha</td>
          <td>Classification (receptor activity)</td>
          <td>2,398</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>Estrogen-beta</td>
          <td>Classification (receptor activity)</td>
          <td>1,961</td>
          <td>ROC AUC</td>
          <td>Scaffold</td>
      </tr>
      <tr>
          <td>MetStab-high</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
      <tr>
          <td>MetStab-low</td>
          <td>Classification (metabolic stability)</td>
          <td>2,127</td>
          <td>ROC AUC</td>
          <td>Random</td>
      </tr>
  </tbody>
</table>
<p>Baselines include GCN, Weave, EAGCN, Random Forest (RF), and SVM. Each model receives the same hyperparameter search budget (150 or 500 evaluations). Results are averaged over 6 random train/validation/test splits.</p>
<h3 id="main-results">Main results</h3>
<p>MAT achieves the best average rank across all seven tasks:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Avg. Rank (500 budget)</th>
          <th>Avg. Rank (150 budget)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT</td>
          <td>2.42</td>
          <td>2.71</td>
      </tr>
      <tr>
          <td>RF</td>
          <td>3.14</td>
          <td>3.14</td>
      </tr>
      <tr>
          <td>SVM</td>
          <td>3.57</td>
          <td>3.28</td>
      </tr>
      <tr>
          <td>GCN</td>
          <td>3.57</td>
          <td>3.71</td>
      </tr>
      <tr>
          <td>Weave</td>
          <td>3.71</td>
          <td>3.57</td>
      </tr>
      <tr>
          <td>EAGCN</td>
          <td>4.14</td>
          <td>4.14</td>
      </tr>
  </tbody>
</table>
<p>With self-supervised pretraining, Pretrained MAT achieves an average rank of 1.57, outperforming both Pretrained EAGCN (4.0) and SMILES Transformer (4.29). Pretrained MAT requires tuning only the learning rate (7 values tested), compared to 500 hyperparameter combinations for the non-pretrained models.</p>
<h3 id="ablation-results">Ablation results</h3>
<p>Ablation studies on BBBP, ESOL, and FreeSolv reveal:</p>
<table>
  <thead>
      <tr>
          <th>Variant</th>
          <th>BBBP (AUC)</th>
          <th>ESOL (RMSE)</th>
          <th>FreeSolv (RMSE)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MAT (full)</td>
          <td>.723</td>
          <td>.286</td>
          <td>.250</td>
      </tr>
      <tr>
          <td>- Graph</td>
          <td>.716</td>
          <td>.316</td>
          <td>.276</td>
      </tr>
      <tr>
          <td>- Distance</td>
          <td>.729</td>
          <td>.281</td>
          <td>.281</td>
      </tr>
      <tr>
          <td>- Attention</td>
          <td>.692</td>
          <td>.306</td>
          <td>.329</td>
      </tr>
      <tr>
          <td>- Dummy node</td>
          <td>.714</td>
          <td>.317</td>
          <td>.249</td>
      </tr>
      <tr>
          <td>+ Edge features</td>
          <td>.683</td>
          <td>.314</td>
          <td>.358</td>
      </tr>
  </tbody>
</table>
<p>Removing any single component degrades performance on at least one task, supporting the value of combining all three information sources. Adding edge features does not help, suggesting the adjacency and distance matrices already capture sufficient bond-level information.</p>
<h3 id="interpretability-analysis">Interpretability analysis</h3>
<p>Individual attention heads in the first layer learn chemically meaningful functions. Six heads were identified that focus on specific chemical patterns: 2-neighbored aromatic carbons, sulfur atoms, non-ring nitrogens, carbonyl oxygens, 3-neighbored aromatic atoms (substitution positions), and aromatic ring nitrogens. Statistical validation using Kruskal-Wallis tests confirmed that atoms matching these SMARTS patterns receive significantly higher attention weights ($p &lt; 0.001$ for all patterns).</p>
<h2 id="findings-limitations-and-future-directions">Findings, Limitations, and Future Directions</h2>
<p>MAT demonstrates that augmenting Transformer self-attention with molecular graph structure and 3D distance information produces a model that performs consistently well across diverse property prediction tasks. The key practical finding is that self-supervised pretraining dramatically reduces the hyperparameter tuning burden: Pretrained MAT matches or exceeds the performance of extensively tuned models while requiring only learning rate selection.</p>
<p>Several limitations are acknowledged:</p>
<ul>
<li><strong>Fingerprint-based models still win on some tasks</strong>: RF and SVM with extended-connectivity fingerprints outperform MAT on metabolic stability and Estrogen-beta tasks, suggesting that incorporating fingerprint representations could improve MAT further.</li>
<li><strong>Single conformer</strong>: Only one pre-computed 3D conformer is used per molecule. More sophisticated conformer sampling or ensemble strategies were not explored.</li>
<li><strong>Limited pretraining exploration</strong>: Only the masked atom prediction task from Hu et al. (2019) was used. The authors note that exploring additional pretraining objectives is a promising direction.</li>
<li><strong>Scalability</strong>: The pretrained model uses 1024-dimensional embeddings with 8 layers and 16 attention heads, the largest configuration that fit in GPU memory; larger models were not explored.</li>
</ul>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Pretraining</td>
          <td>ZINC15</td>
          <td>2M molecules</td>
          <td>Sampled from ZINC database</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>FreeSolv</td>
          <td>642</td>
          <td>Hydration free energy regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>ESOL</td>
          <td>1,128</td>
          <td>Log solubility regression</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>BBBP</td>
          <td>2,039</td>
          <td>Blood-brain barrier classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>Estrogen-alpha/beta</td>
          <td>2,398 / 1,961</td>
          <td>Receptor activity classification</td>
      </tr>
      <tr>
          <td>Evaluation</td>
          <td>MetStab-high/low</td>
          <td>2,127 each</td>
          <td>Metabolic stability classification</td>
      </tr>
  </tbody>
</table>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam with Noam learning rate scheduler (warmup then inverse square-root decay; see the sketch after this list)</li>
<li>Pretraining: 8 epochs, learning rate 0.001, batch size 256, binary cross-entropy loss</li>
<li>Fine-tuning: 100 epochs, batch size 32, learning rate selected from {1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6}</li>
<li>Distance kernel: exponential decay $g(d) = \exp(-d)$ for pretrained model</li>
<li>Lambda weights: $\lambda_{a} = \lambda_{d} = 0.33$ for pretrained model</li>
</ul>
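<p>For reference, a generic Noam schedule (warmup followed by inverse-square-root decay) looks as follows; the scaling factor and warmup length here are illustrative defaults, not values reported for MAT.</p>
<pre><code class="language-python">def noam_lr(step, d_model=1024, warmup=4000, factor=1.0):
    """Noam learning-rate schedule: linear warmup, then inverse square-root decay."""
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
</code></pre>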
<h3 id="models">Models</h3>
<ul>
<li>Pretrained MAT: 1024-dim embeddings, 8 layers, 16 attention heads, 1 feed-forward layer per block</li>
<li>Dropout: 0.0, weight decay: 0.0 for pretrained model</li>
<li>Atom featurization: 26-dimensional one-hot encoding (Table 1 in paper)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li>Regression: RMSE (FreeSolv, ESOL)</li>
<li>Classification: ROC AUC (BBBP, Estrogen-alpha/beta, MetStab-high/low)</li>
<li>All experiments repeated 6 times with different train/validation/test splits</li>
<li>Scaffold split for BBBP, Estrogen, random split for others</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>The paper does not specify exact hardware details. The pretrained model is described as &ldquo;the largest model that still fits the GPU memory.&rdquo;</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/gmum/MAT">gmum/MAT</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation with pretrained weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Maziarka, Ł., Danel, T., Mucha, S., Rataj, K., Tabor, J., &amp; Jastrzębski, S. (2020). Molecule Attention Transformer. <em>arXiv preprint arXiv:2002.08264</em>.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@article</span>{maziarka2020molecule,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Molecule Attention Transformer}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Maziarka, {\L}ukasz and Danel, Tomasz and Mucha, S{\l}awomir and Rataj, Krzysztof and Tabor, Jacek and Jastrz{\k{e}}bski, Stanis{\l}aw}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">journal</span>=<span style="color:#e6db74">{arXiv preprint arXiv:2002.08264}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2020}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>MOFFlow: Flow Matching for MOF Structure Prediction</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mofflow/</link><pubDate>Sat, 20 Dec 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/mofflow/</guid><description>A Riemannian flow matching framework for generating Metal-Organic Framework structures by treating building blocks as rigid bodies.</description><content:encoded><![CDATA[<h2 id="methodological-contribution-mofflow-architecture">Methodological Contribution: MOFFlow Architecture</h2>
<p>This is a <strong>Methodological Paper</strong> ($\Psi_{\text{Method}}$).</p>
<p>It introduces <strong>MOFFlow</strong>, a generative architecture and training framework designed specifically for the structure prediction of Metal-Organic Frameworks (MOFs). The paper focuses on the algorithmic innovation of decomposing the problem into rigid-body assembly on a Riemannian manifold, validates this through comparison against existing baselines, and performs ablation studies to justify architectural choices. While it leverages the theory of flow matching, its primary contribution is the application-specific architecture and the handling of modular constraints.</p>
<h2 id="motivation-scaling-limits-of-atom-level-generation">Motivation: Scaling Limits of Atom-Level Generation</h2>
<p>The primary motivation is to overcome the scalability and accuracy limitations of existing methods for MOF structure prediction.</p>
<ul>
<li><strong>Computational Cost of DFT:</strong> Conventional approaches rely on <em>ab initio</em> calculations (DFT) combined with random search, which are computationally prohibitive for large, complex systems like MOFs.</li>
<li><strong>Failure of General CSP:</strong> Existing deep generative models for general Crystal Structure Prediction (CSP) operate on an atom-by-atom basis. They fail to scale to MOFs, which often contain hundreds or thousands of atoms per unit cell, and do not exploit the inherent modular nature (building blocks) of MOFs.</li>
<li><strong>Tunability:</strong> MOFs have applications in carbon capture and drug delivery due to their tunable porosity, making automated design tools valuable.</li>
</ul>
<h2 id="core-innovation-rigid-body-flow-matching-on-se3">Core Innovation: Rigid-Body Flow Matching on SE(3)</h2>
<p>MOFFlow introduces a <strong>hierarchical, rigid-body flow matching framework</strong> tailored for MOFs.</p>
<ul>
<li><strong>Rigid Body Decomposition:</strong> MOFFlow treats metal nodes and organic linkers as rigid bodies, reducing the search space from $3N$ (atoms) to $6M$ (roto-translation of $M$ blocks) compared to atom-based methods.</li>
<li><strong>Riemannian Flow Matching on $SE(3)$:</strong> It is the first end-to-end model to jointly generate block-level rotations ($SO(3)$), translations ($\mathbb{R}^3$), and lattice parameters using <a href="/notes/machine-learning/generative-models/flow-matching-for-generative-modeling/">Riemannian flow matching</a>.</li>
<li><strong>MOFAttention:</strong> A custom attention module designed to encode the geometric relationships between building blocks, lattice parameters, and rotational constraints.</li>
<li><strong>Constraint Handling:</strong> It incorporates domain knowledge by operating on a mean-free system for translation invariance and using canonicalized coordinates for rotation invariance.</li>
</ul>
<h2 id="experimental-setup-and-baselines">Experimental Setup and Baselines</h2>
<p>The authors evaluated MOFFlow on structure prediction accuracy, physical property preservation, and scalability.</p>
<ul>
<li><strong>Dataset:</strong> The <strong>Boyd et al. (2019)</strong> dataset consisting of 324,426 hypothetical MOF structures, decomposed into building blocks using the <strong>MOFid</strong> algorithm. Filtered to structures with &lt;200 blocks, yielding 308,829 structures (247,066 train / 30,883 val / 30,880 test). Structures contain up to approximately 2,400 atoms per unit cell.</li>
<li><strong>Baselines:</strong>
<ul>
<li><em>Optimization-based:</em> Random Search (RS) and Evolutionary Algorithm (EA) using CrySPY and CHGNet.</li>
<li><em>Deep Learning:</em> DiffCSP (deep generative model for general crystals).</li>
<li><em>Self-Assembly:</em> A heuristic algorithm used in MOFDiff (adapted for comparison).</li>
</ul>
</li>
<li><strong>Metrics:</strong>
<ul>
<li><strong>Match Rate (MR):</strong> Percentage of generated structures matching ground truth within tolerance.</li>
<li><strong>RMSE:</strong> Root mean squared displacement normalized by average free length per atom.</li>
<li><strong>Structural Properties:</strong> Volumetric/Gravimetric Surface Area (VSA/GSA), Pore Limiting Diameter (PLD), Void Fraction, etc., calculated via Zeo++.</li>
<li><strong>Scalability:</strong> Performance vs. number of atoms and building blocks.</li>
</ul>
</li>
</ul>
<h2 id="results-and-generative-performance">Results and Generative Performance</h2>
<p>MOFFlow outperformed all baselines in accuracy and efficiency, particularly for large structures.</p>
<ul>
<li><strong>Accuracy:</strong> With a single sample, MOFFlow achieved a <strong>31.69% match rate</strong> (stol=0.5) and <strong>87.46%</strong> (stol=1.0) on the full test set (30,880 structures). With 5 samples, these rose to <strong>44.75%</strong> (stol=0.5) and <strong>100.0%</strong> (stol=1.0). RS and EA (evaluated on only 100 and 15 test structures, respectively, due to computational cost, with 20 candidates generated per structure) achieved 0.00% MR at both tolerance levels. DiffCSP reached 0.09% (stol=0.5) and 23.12% (stol=1.0) with 1 sample.</li>
<li><strong>Speed:</strong> Inference took <strong>1.94 seconds</strong> per structure, compared to 5.37s for DiffCSP, 332s for RS, and 1,959s for EA.</li>
<li><strong>Scalability:</strong> MOFFlow preserved high match rates across all system sizes, while DiffCSP&rsquo;s match rate dropped sharply beyond 200 atoms.</li>
<li><strong>Property Preservation:</strong> The distributions of physical properties (e.g., surface area, void fraction) for MOFFlow-generated structures closely matched the ground truth. DiffCSP frequently reduced volumetric surface area and void fraction to zero.</li>
<li><strong>Self-Assembly Comparison:</strong> In a controlled comparison where the self-assembly (SA) algorithm received MOFFlow&rsquo;s predicted translations and lattice, MOFFlow (MR=31.69%, RMSE=0.2820) outperformed SA (MR=30.04%, RMSE=0.3084), confirming the value of the learned rotational vector fields. In an extended scalability comparison, SA scaled better for structures with many building blocks, but MOFFlow achieved higher overall match rate (31.69% vs. 27.14%).</li>
<li><strong>Batch Implementation:</strong> A refactored Batch version achieves improved results: <strong>32.73% MR</strong> (stol=0.5), RMSE of 0.2743, inference in <strong>0.19s</strong> per structure (10x faster), and training in roughly 1/3 the GPU hours.</li>
</ul>
<h3 id="limitations">Limitations</h3>
<p>The paper identifies three key limitations:</p>
<ol>
<li><strong>Hypothetical-only evaluation:</strong> All experiments use the Boyd et al. hypothetical database. Evaluation on more challenging real-world datasets remains needed.</li>
<li><strong>Rigid-body assumption:</strong> The model assumes that local building block structures are known, which may be impractical for rare building blocks whose structural information is missing from existing libraries or is inaccurate.</li>
<li><strong>Periodic invariance:</strong> The model is not invariant to periodic transformations of the input. Explicitly modeling periodic invariance could further improve performance.</li>
</ol>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<ul>
<li><strong>Source:</strong> MOF dataset by Boyd et al. (2019).</li>
<li><strong>Preprocessing:</strong> Structures were decomposed using the metal-oxo decomposition algorithm from <strong>MOFid</strong>.</li>
<li><strong>Filtering:</strong> Structures with fewer than 200 building blocks were used, yielding 308,829 structures.</li>
<li><strong>Splits:</strong> Train/Validation/Test ratio of 8:1:1 (247,066 / 30,883 / 30,880).</li>
<li><strong>Availability:</strong> Pre-processed dataset is available on <a href="https://zenodo.org/records/15187230">Zenodo</a>.</li>
<li><strong>Representations:</strong>
<ul>
<li><em>Atom-level:</em> Tuple $(X, a, l)$ (coordinates, types, lattice).</li>
<li><em>Block-level:</em> Tuple $(\mathcal{B}, q, \tau, l)$ (blocks, rotations, translations, lattice).</li>
</ul>
</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Framework:</strong> Riemannian Flow Matching.</li>
<li><strong>Objective:</strong> Conditional Flow Matching (CFM) loss regressing to clean data $q_1, \tau_1, l_1$ (a minimal SO(3) sketch of the rotation term follows this list).
$$
\begin{aligned}
\mathcal{L}(\theta) = \mathbb{E}_{t, \mathcal{S}^{(1)}} \left[ \frac{1}{(1-t)^2} \left( \lambda_1 |\log_{q_t}(\hat{q}_1) - \log_{q_t}(q_1)|^2 + \dots \right) \right]
\end{aligned}
$$</li>
<li><strong>Priors:</strong>
<ul>
<li>Rotations ($q$): Uniform on $SO(3)$.</li>
<li>Translations ($\tau$): Standard normal on $\mathbb{R}^3$.</li>
<li>Lattice ($l$): Log-normal for lengths, Uniform(60, 120) for angles (Niggli reduced).</li>
</ul>
</li>
<li><strong>Inference:</strong> ODE solver with <strong>50 integration steps</strong>.</li>
<li><strong>Local Coordinates:</strong> Defined using PCA axes, corrected for symmetry to ensure consistency.</li>
</ul>
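<p>A minimal SciPy sketch of the SO(3) pieces that appear in the objective above: the log map and the geodesic interpolation used to form $q_t$. The scaling of the regression target by $1/(1-t)$ mirrors the standard conditional flow-matching construction and is illustrative, not the authors&rsquo; code.</p>
<pre><code class="language-python">import numpy as np
from scipy.spatial.transform import Rotation as R

def so3_log(base, target):
    """Rotation vector of target relative to base (log map expressed in the body frame)."""
    return (base.inv() * target).as_rotvec()

q0, q1 = R.random(), R.random()                  # prior and data rotations for one building block
t = 0.3
q_t = q0 * R.from_rotvec(t * so3_log(q0, q1))    # geodesic interpolation between q0 and q1

# rotation part of the CFM regression target, log_{q_t}(q_1), scaled by 1 / (1 - t)
u_rot = so3_log(q_t, q1) / (1.0 - t)
print(u_rot)
</code></pre>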
<h3 id="models">Models</h3>
<ul>
<li><strong>Architecture:</strong> Hierarchical structure with two key modules.
<ul>
<li><strong>Atom-level Update Layers:</strong> 4-layer EGNN-like structure to encode building block features $h_m$ from atomic graphs (cutoff 5Å).</li>
<li><strong>Block-level Update Layers:</strong> 6 layers that iteratively update $q, \tau, l$ using the <strong>MOFAttention</strong> module.</li>
</ul>
</li>
<li><strong>MOFAttention:</strong> Modified Invariant Point Attention (IPA) that incorporates lattice parameters as offsets to the attention matrix.</li>
<li><strong>Hyperparameters:</strong>
<ul>
<li>Node dimension: 256 (block-level), 64 (atom-level).</li>
<li>Attention heads: 24.</li>
<li>Loss coefficients: $\lambda_1=1.0$ (rot), $\lambda_2=2.0$ (trans), $\lambda_3=0.1$ (lattice).</li>
</ul>
</li>
<li><strong>Checkpoints:</strong> Pre-trained weights and models are openly provided on <a href="https://zenodo.org/records/15187230">Zenodo</a>.</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Metrics:</strong>
<ul>
<li><strong>Match Rate:</strong> Using <code>StructureMatcher</code> from <code>pymatgen</code> (see the usage sketch after this list). Tolerances: <code>stol=0.5/1.0</code>, <code>ltol=0.3</code>, <code>angle_tol=10.0</code>.</li>
<li><strong>RMSE:</strong> Normalized by average free length per atom.</li>
</ul>
</li>
<li><strong>Tools:</strong> <strong>Zeo++</strong> for structural property calculations (Surface Area, Pore Diameter, etc.).</li>
</ul>
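<p>The match-rate tolerances map directly onto pymatgen&rsquo;s <code>StructureMatcher</code>; the sketch below shows the corresponding call, with placeholder CIF filenames.</p>
<pre><code class="language-python">from pymatgen.core import Structure
from pymatgen.analysis.structure_matcher import StructureMatcher

# placeholder filenames; any pair of pymatgen Structure objects works here
predicted = Structure.from_file("predicted_mof.cif")
reference = Structure.from_file("reference_mof.cif")

matcher = StructureMatcher(ltol=0.3, stol=0.5, angle_tol=10.0)
print("match:", matcher.fit(reference, predicted))          # True if the structures match within tolerance
print("rms:", matcher.get_rms_dist(reference, predicted))   # (normalized rms displacement, max distance) or None
</code></pre>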
<table>
  <thead>
      <tr>
          <th style="text-align: left">Metric</th>
          <th style="text-align: left">MOFFlow</th>
          <th style="text-align: left">DiffCSP</th>
          <th style="text-align: left">RS (20 cands)</th>
          <th style="text-align: left">EA (20 cands)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left">MR (stol=0.5, k=1)</td>
          <td style="text-align: left"><strong>31.69%</strong></td>
          <td style="text-align: left">0.09%</td>
          <td style="text-align: left">0.00%</td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=1.0, k=1)</td>
          <td style="text-align: left"><strong>87.46%</strong></td>
          <td style="text-align: left">23.12%</td>
          <td style="text-align: left">0.00%</td>
          <td style="text-align: left">0.00%</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=0.5, k=5)</td>
          <td style="text-align: left"><strong>44.75%</strong></td>
          <td style="text-align: left">0.34%</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">MR (stol=1.0, k=5)</td>
          <td style="text-align: left"><strong>100.0%</strong></td>
          <td style="text-align: left">38.94%</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">RMSE (stol=0.5, k=1)</td>
          <td style="text-align: left"><strong>0.2820</strong></td>
          <td style="text-align: left">0.3961</td>
          <td style="text-align: left">-</td>
          <td style="text-align: left">-</td>
      </tr>
      <tr>
          <td style="text-align: left">Avg. time per structure</td>
          <td style="text-align: left"><strong>1.94s</strong></td>
          <td style="text-align: left">5.37s</td>
          <td style="text-align: left">332s</td>
          <td style="text-align: left">1,959s</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training Hardware:</strong> 8 $\times$ NVIDIA RTX 3090 (24GB VRAM).</li>
<li><strong>Training Time:</strong>
<ul>
<li><em>TimestepBatch version (main paper):</em> ~5 days 15 hours.</li>
<li><em>Batch version:</em> ~1 day 17 hours (332.74 GPU hours). The authors also release this refactored implementation, which achieves comparable performance with faster convergence.</li>
</ul>
</li>
<li><strong>Batch Size:</strong> 160 (capped by $N^2$ where $N$ is the number of atoms, for memory management).</li>
</ul>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th style="text-align: left">Artifact</th>
          <th style="text-align: left">Type</th>
          <th style="text-align: left">License</th>
          <th style="text-align: left">Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td style="text-align: left"><a href="https://github.com/nayoung10/MOFFlow">MOFFlow (GitHub)</a></td>
          <td style="text-align: left">Code</td>
          <td style="text-align: left">MIT</td>
          <td style="text-align: left">Official implementation built on DiffDock, EGNN, MOFDiff, and protein-frame-flow</td>
      </tr>
      <tr>
          <td style="text-align: left"><a href="https://zenodo.org/records/15187230">Pre-processed dataset and checkpoints (Zenodo)</a></td>
          <td style="text-align: left">Dataset / Model</td>
          <td style="text-align: left">Unknown</td>
          <td style="text-align: left">Includes pre-processed MOF structures and trained model weights</td>
      </tr>
  </tbody>
</table>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Kim, N., Kim, S., Kim, M., Park, J., &amp; Ahn, S. (2025). MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks. <em>International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{kimMOFFlowFlowMatching2025,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Kim, Nayoung and Kim, Seongsu and Kim, Minsu and Park, Jinkyoo and Ahn, Sungsoo}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=dNT3abOsLo}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=dNT3abOsLo">OpenReview Discussion</a></li>
<li><a href="https://github.com/nayoung10/MOFFlow">Official Code Repository</a></li>
</ul>
]]></content:encoded></item><item><title>DenoiseVAE: Adaptive Noise for Molecular Pre-training</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/</link><pubDate>Sun, 24 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/denoise-vae/</guid><description>Liu et al.'s ICLR 2025 paper introducing DenoiseVAE, which learns adaptive, atom-specific noise distributions for better molecular force fields.</description><content:encoded><![CDATA[<h2 id="paper-contribution-type">Paper Contribution Type</h2>
<p>This is a <strong>method paper</strong> with a supporting theoretical component. It introduces a new pre-training framework, DenoiseVAE, that challenges the standard practice of using fixed, hand-crafted noise distributions in denoising-based molecular representation learning.</p>
<h2 id="motivation-the-inter--and-intra-molecular-variations-problem">Motivation: The Inter- and Intra-molecular Variations Problem</h2>
<p>The motivation is to create a more physically principled denoising pre-training task for 3D molecules. The core idea of denoising is to learn molecular force fields by corrupting an equilibrium conformation with noise and then learning to recover it. However, existing methods use a single, hand-crafted noise strategy (e.g., Gaussian noise of a fixed scale) for all atoms across all molecules. This is physically unrealistic for two main reasons:</p>
<ol>
<li><strong>Inter-molecular differences</strong>: Different molecules have unique Potential Energy Surfaces (PES), meaning the space of low-energy (i.e., physically plausible) conformations is highly molecule-specific.</li>
<li><strong>Intra-molecular differences (Anisotropy)</strong>: Within a single molecule, different atoms have different degrees of freedom. For instance, an atom in a rigid functional group can move much less than one connected by a single, rotatable bond.</li>
</ol>
<p>The authors argue that this &ldquo;one-size-fits-all&rdquo; noise approach leads to inaccurate force field learning because it samples many physically improbable conformations.</p>
<h2 id="novelty-a-learnable-atom-specific-noise-generator">Novelty: A Learnable, Atom-Specific Noise Generator</h2>
<p>The core novelty is a framework that learns to generate noise tailored to each specific molecule and atom. This is achieved through three key innovations:</p>
<ol>
<li><strong>Learnable Noise Generator</strong>: The authors introduce a Noise Generator module (a 4-layer Equivariant Graph Neural Network) that takes a molecule&rsquo;s equilibrium conformation $X$ as input and outputs a unique, atom-specific Gaussian noise distribution (i.e., a different variance $\sigma_i^2$ for each atom $i$). This directly addresses the issues of PES specificity and force field anisotropy.</li>
<li><strong>Variational Autoencoder (VAE) Framework</strong>: The Noise Generator (encoder) and a Denoising Module (a 7-layer EGNN decoder) are trained jointly within a VAE paradigm. The noisy conformation is sampled using the reparameterization trick:
$$
\begin{aligned}
\tilde{x}_i &amp;= x_i + \epsilon \sigma_i
\end{aligned}
$$</li>
<li><strong>Principled Optimization Objective</strong>: The training loss balances two competing goals:
$$
\begin{aligned}
\mathcal{L}_{DenoiseVAE} &amp;= \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}
\end{aligned}
$$
<ul>
<li>A denoising reconstruction loss ($\mathcal{L}_{Denoise}$) encourages the Noise Generator to produce physically plausible perturbations from which the original conformation can be recovered. This implicitly constrains the noise to respect the molecule&rsquo;s underlying force fields.</li>
<li>A KL divergence regularization term ($\mathcal{L}_{KL}$) pushes the generated noise distributions towards a predefined prior. This prevents the trivial solution of generating zero noise and encourages the model to explore a diverse set of low-energy conformations.</li>
</ul>
</li>
</ol>
<p>The authors also provide a theoretical analysis showing that optimizing their objective is equivalent to maximizing the Evidence Lower Bound (ELBO) on the log-likelihood of observing physically realistic conformations.</p>
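<p>A compact PyTorch sketch of the reparameterized sampling and the two-term objective. The assumption that the denoiser regresses the applied displacement, and the closed-form per-atom KL against an isotropic Gaussian prior, are illustrative choices rather than details taken from the released code.</p>
<pre><code class="language-python">import torch

def denoisevae_loss(x, sigma, denoiser, prior_sigma=0.1, lam=1.0):
    """Sketch of L_Denoise + lambda * L_KL for one molecule.

    x        : (n_atoms, 3) equilibrium coordinates
    sigma    : (n_atoms, 1) atom-specific standard deviations from the Noise Generator
    denoiser : assumed callable mapping noisy coordinates to a predicted displacement
    """
    eps = torch.randn_like(x)
    x_noisy = x + eps * sigma                          # reparameterization trick
    displacement_hat = denoiser(x_noisy)               # assumed to regress the applied displacement
    loss_denoise = ((displacement_hat - eps * sigma) ** 2).sum(dim=-1).mean()

    # KL( N(0, sigma_i^2 I_3) || N(0, prior_sigma^2 I_3) ) per atom, in closed form
    kl_per_dim = torch.log(prior_sigma / sigma) + sigma ** 2 / (2 * prior_sigma ** 2) - 0.5
    loss_kl = (3.0 * kl_per_dim.squeeze(-1)).mean()    # three identical coordinates per atom
    return loss_denoise + lam * loss_kl
</code></pre>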
<h2 id="methodology--experimental-baselines">Methodology &amp; Experimental Baselines</h2>
<p>The model was pretrained on the PCQM4Mv2 dataset (approximately 3.4 million organic molecules) and then evaluated on a comprehensive suite of downstream tasks to test the quality of the learned representations:</p>
<ol>
<li><strong>Molecular Property Prediction (<a href="/notes/chemistry/datasets/qm9/">QM9</a>)</strong>: The model was evaluated on 12 quantum chemical property prediction tasks for small molecules (134k molecules; 100k train, 18k val, 13k test split). DenoiseVAE achieved state-of-the-art or second-best performance on 11 of the 12 tasks, with particularly significant gains on $C_v$ (heat capacity), indicating better capture of vibrational modes.</li>
<li><strong>Force Prediction (MD17)</strong>: The task was to predict atomic forces from molecular dynamics trajectories for 8 different small molecules (9,500 train, 500 val split). DenoiseVAE was the top performer on 5 of the 8 molecules (Aspirin, Benzene, Ethanol, Naphthalene, Toluene), though it underperformed Frad on Malonaldehyde, Salicylic Acid, and Uracil by significant margins.</li>
<li><strong>Ligand Binding Affinity (PDBBind v2019)</strong>: On the PDBBind dataset with 30% and 60% protein sequence identity splits, the model showed strong generalization, outperforming baselines like Uni-Mol particularly on the more stringent 30% split across RMSE, Pearson correlation, and Spearman correlation.</li>
<li><strong>PCQM4Mv2 Validation</strong>: DenoiseVAE achieved a validation MAE of 0.0777 on the PCQM4Mv2 HOMO-LUMO gap prediction task with only 1.44M parameters, competitive with models 10-40x larger (e.g., GPS++ at 44.3M params achieves 0.0778).</li>
<li><strong>Ablation Studies</strong>: The authors analyzed the sensitivity to key hyperparameters, namely the prior&rsquo;s standard deviation ($\sigma$) and the KL-divergence weight ($\lambda$), confirming that $\lambda=1$ and $\sigma=0.1$ are optimal. Removing the KL term leads to trivial solutions (near-zero noise). An additional ablation on the Noise Generator depth found 4 EGNN layers optimal over 2 layers. A comparison of independent (diagonal) versus non-independent (full covariance) noise sampling showed comparable results, suggesting the EGNN already captures inter-atomic dependencies implicitly.</li>
<li><strong>Case Studies</strong>: Visualizations of the learned noise variances for different molecules confirmed that the model learns chemically intuitive noise patterns. For example, it applies smaller perturbations to atoms in a rigid bicyclic norcamphor derivative and larger ones to atoms in flexible functional groups of a cyclopropane derivative. Even identical functional groups (e.g., hydroxyl) receive different noise scales in different molecular contexts.</li>
</ol>
<h2 id="key-findings-on-force-field-learning">Key Findings on Force Field Learning</h2>
<ul>
<li><strong>Primary Conclusion</strong>: Learning a <strong>molecule-adaptive and atom-specific</strong> noise distribution is a superior strategy for denoising-based pre-training compared to using fixed, hand-crafted heuristics. This more physically-grounded approach leads to representations that better capture molecular force fields.</li>
<li><strong>Strong Benchmark Performance</strong>: DenoiseVAE achieves best or second-best results on 11 of 12 QM9 tasks, 5 of 8 MD17 molecules, and leads on the stringent 30% LBA split. Performance is mixed on some MD17 molecules (Malonaldehyde, Salicylic Acid, Uracil), where it trails Frad.</li>
<li><strong>Effective Framework</strong>: The proposed VAE-based framework, which jointly trains a Noise Generator and a Denoising Module, is an effective and theoretically sound method for implementing this adaptive noise strategy. The interplay between the reconstruction loss and the KL-divergence regularization is key to its success.</li>
<li><strong>Limitation and Future Direction</strong>: The method is based on classical force field assumptions. The authors note that integrating more accurate force fields represents a promising direction for future work.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a></td>
          <td>Code</td>
          <td>Unknown</td>
          <td>Official implementation</td>
      </tr>
  </tbody>
</table>
<h3 id="reproducibility-status">Reproducibility Status</h3>
<ul>
<li><strong>Source Code</strong>: The authors have released their code at <a href="https://github.com/Serendipity-r/DenoiseVAE">Serendipity-r/DenoiseVAE</a> on GitHub. No license is specified in the repository.</li>
<li><strong>Implementation</strong>: Hyperparameters and architectures are detailed in the paper&rsquo;s appendix (A.14), and the repository provides reference implementations.</li>
</ul>
<h3 id="data">Data</h3>
<ul>
<li><strong>Pre-training Dataset</strong>: <a href="https://ogb.stanford.edu/docs/lsc/pcqm4mv2/">PCQM4Mv2</a> (approximately 3.4 million organic molecules)</li>
<li><strong>Property Prediction</strong>: <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.QM9.html">QM9 dataset</a> (134k molecules; 100k train, 18k val, 13k test split) for 12 quantum chemical properties</li>
<li><strong>Force Prediction</strong>: <a href="http://www.sgdml.org/#datasets">MD17 dataset</a> (9,500 train, 500 val split) for 8 different small molecules</li>
<li><strong>Ligand Binding Affinity</strong>: PDBBind v2019 (4,463 protein-ligand complexes) with 30% and 60% sequence identity splits</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li><strong>Noise Generator</strong>: 4-layer Equivariant Graph Neural Network (EGNN) that outputs atom-specific Gaussian noise distributions</li>
<li><strong>Denoising Module</strong>: 7-layer EGNN decoder</li>
<li><strong>Training Objective</strong>: $\mathcal{L}_{DenoiseVAE} = \mathcal{L}_{Denoise} + \lambda \mathcal{L}_{KL}$ with $\lambda=1$</li>
<li><strong>Noise Sampling</strong>: Reparameterization trick with $\tilde{x}_i = x_i + \epsilon \sigma_i$</li>
<li><strong>Prior Distribution</strong>: Standard deviation $\sigma=0.1$</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li><strong>Model Size</strong>: 1.44M parameters total</li>
<li><strong>Fine-tuning Protocol</strong>: Noise Generator discarded after pre-training; only the pre-trained Denoising Module (7-layer EGNN) is retained for downstream fine-tuning</li>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate decay (max LR of 0.0005)</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>System Training</strong>: Fine-tuned end-to-end for specific tasks; force prediction involves computing the gradient of the predicted energy</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<ul>
<li><strong>Ablation Studies</strong>: Sensitivity analysis confirmed $\lambda=1$ and $\sigma=0.1$ as optimal hyperparameters; removing the KL term leads to trivial solutions (near-zero noise)</li>
<li><strong>Noise Generator Depth</strong>: 4 EGNN layers outperformed 2 layers across both QM9 and MD17 benchmarks</li>
<li><strong>Covariance Structure</strong>: Full covariance matrix (non-independent noise sampling) yielded comparable results to diagonal variance (independent sampling), likely because the EGNN already integrates neighboring atom information</li>
<li><strong>O(3) Invariance</strong>: The method satisfies O(3) probabilistic invariance, meaning the noise distribution is unchanged under rotations and reflections</li>
</ul>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>GPU Configuration</strong>: Experiments were conducted on a single NVIDIA RTX 3090 GPU; the authors state that 6 GPUs with 144 GB of total memory are sufficient for full reproduction</li>
<li><strong>CPU</strong>: Intel Xeon Gold 5318Y @ 2.10GHz</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Liu, Y., Chen, J., Jiao, R., Li, J., Huang, W., &amp; Su, B. (2025). DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training. <em>The Thirteenth International Conference on Learning Representations (ICLR)</em>.</p>
<p><strong>Publication</strong>: ICLR 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{liu2025denoisevae,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{DenoiseVAE: Learning Molecule-Adaptive Noise Distributions for Denoising-based 3D Molecular Pre-training}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Yurou Liu and Jiahao Chen and Rui Jiao and Jiangmeng Li and Wenbing Huang and Bing Su}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The Thirteenth International Conference on Learning Representations}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://openreview.net/forum?id=ym7pr83XQr}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://iclr.cc/virtual/2025/poster/27701">ICLR 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=ym7pr83XQr">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=ym7pr83XQr">PDF on OpenReview</a></li>
</ul>
]]></content:encoded></item><item><title>eSEN: Smooth Interatomic Potentials (ICML Spotlight)</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/learning-smooth-interatomic-potentials/</guid><description>Fu et al. propose energy conservation as a key MLIP diagnostic and introduce eSEN, bridging test accuracy and real performance.</description><content:encoded><![CDATA[<h2 id="paper-overview">Paper Overview</h2>
<p>This is a <strong>method paper</strong>. It addresses a critical disconnect in the evaluation of Machine Learning Interatomic Potentials (MLIPs) and introduces a novel architecture, <strong>eSEN</strong>, designed based on insights from this analysis. The paper proposes a new standard for evaluating MLIPs beyond simple test-set errors.</p>
<h2 id="the-energy-conservation-gap-in-mlip-evaluation">The Energy Conservation Gap in MLIP Evaluation</h2>
<p>The motivation addresses a well-known but under-addressed problem in the field: improvements in standard MLIP metrics (lower energy/force MAE on static test sets) do not reliably translate to better performance on complex downstream tasks like molecular dynamics (MD) simulations, materials stability prediction, or phonon calculations. The authors seek to understand why this gap exists and how to design models that are both accurate on test sets and physically reliable in practical scientific workflows.</p>
<h2 id="the-esen-architecture-and-continuous-representation">The eSEN Architecture and Continuous Representation</h2>
<p>The novelty is twofold, spanning both a conceptual framework for evaluation and a new model architecture:</p>
<ol>
<li>
<p><strong>Energy Conservation as a Diagnostic Test</strong>: The core conceptual contribution is using an MLIP&rsquo;s ability to conserve energy in out-of-distribution MD simulations as a crucial diagnostic test. The authors demonstrate that for models passing this test, a strong correlation between test-set error and downstream task performance is restored.</p>
</li>
<li>
<p><strong>The eSEN Architecture</strong>: The paper introduces the <strong>equivariant Smooth Energy Network (eSEN)</strong>, designed with specific choices to ensure a smooth and well-behaved Potential Energy Surface (PES):</p>
<ul>
<li><strong>Strictly Conservative Forces</strong>: Forces are computed exclusively as the negative gradient of the energy ($F = -\nabla E$), using conservative force prediction instead of faster direct-force prediction heads (a minimal autograd sketch follows this list).</li>
<li><strong>Continuous Representations</strong>: Maintains strict equivariance and smoothness by using equivariant gated non-linearities instead of discretizing spherical harmonic representations during nodewise processing.</li>
<li><strong>Smooth PES Construction</strong>: Critical design choices include using distance cutoffs, polynomial envelope functions ensuring derivatives go to zero at cutoffs, and limited radial basis functions to avoid overly sensitive PES.</li>
</ul>
</li>
<li>
<p><strong>Efficient Training Strategy</strong>: A two-stage training regimen with fast pre-training using a non-conservative direct-force model, followed by fine-tuning to enforce energy conservation. This captures the efficiency of direct-force training while ensuring physical robustness.</p>
</li>
</ol>
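<p>The conservative-force computation referenced above reduces to a single autograd call, as in the generic sketch below; <code>energy_model</code> is a stand-in callable, not the eSEN API.</p>
<pre><code class="language-python">import torch

def conservative_forces(energy_model, positions):
    """Compute F = -dE/dx by differentiating a scalar energy prediction.

    energy_model : callable mapping an (n_atoms, 3) tensor to a scalar energy (stand-in)
    positions    : (n_atoms, 3) atomic coordinates
    """
    pos = positions.detach().clone().requires_grad_(True)
    energy = energy_model(pos)
    forces = -torch.autograd.grad(energy, pos, create_graph=True)[0]
    return energy, forces
</code></pre>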
<h2 id="evaluating-ood-energy-conservation-and-physical-properties">Evaluating OOD Energy Conservation and Physical Properties</h2>
<p>The paper presents a comprehensive experimental validation:</p>
<ol>
<li>
<p><strong>Ablation Studies on Energy Conservation</strong>: MD simulations on out-of-distribution systems (TM23 and MD22 datasets) systematically tested key design choices (direct-force vs. conservative, representation discretization, neighbor limits, envelope functions). This empirically demonstrated which choices lead to energy drift despite negligible impact on test-set MAE.</p>
</li>
<li>
<p><strong>Physical Property Prediction Benchmarks</strong>: The eSEN model was evaluated on challenging downstream tasks:</p>
<ul>
<li><strong>Matbench-Discovery</strong>: Materials stability and thermal conductivity prediction, where eSEN achieved the highest F1 score among compliant models and excelled at both metrics simultaneously.</li>
<li><strong>MDR Phonon Benchmark</strong>: Predicting phonon properties that test accurate second and third-order derivatives of the PES. eSEN achieved state-of-the-art results, particularly outperforming direct-force models.</li>
<li><strong>SPICE-MACE-OFF</strong>: Standard energy and force prediction on organic molecules, demonstrating that physical plausibility design choices enhanced raw accuracy.</li>
</ul>
</li>
<li>
<p><strong>Correlation Analysis</strong>: Explicit plots of test-set energy MAE versus performance on downstream benchmarks showed weak overall correlation that becomes strong and predictive when restricted to models passing the energy conservation test.</p>
</li>
</ol>
<h2 id="outcomes-and-conclusions">Outcomes and Conclusions</h2>
<ul>
<li>
<p><strong>Primary Conclusion</strong>: Energy conservation is a critical, practical property for MLIPs. Using it as a filter re-establishes test-set error as a reliable proxy for model development, dramatically accelerating the innovation cycle. Models that are not conservative, even with low test error, are unreliable for many critical scientific applications.</p>
</li>
<li>
<p><strong>Model Performance</strong>: The eSEN architecture outperforms base models across diverse tasks, from energy/force prediction to geometry optimization, phonon calculations, and thermal conductivity prediction.</p>
</li>
<li>
<p><strong>Actionable Design Principles</strong>: The paper provides experimentally validated architectural choices that promote physical plausibility. Seemingly minor details, such as how atomic neighbors are selected, can have profound impacts on a model&rsquo;s utility in simulations.</p>
</li>
<li>
<p><strong>Efficient Path to Robust Models</strong>: The direct-force pre-training plus conservative fine-tuning strategy offers a practical method for developing physically robust models without incurring the full computational cost of conservative training from scratch.</p>
</li>
</ul>
<hr>
<h2 id="reproducibility">Reproducibility</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/facebookresearch/fairchem">fairchem (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation within FAIR Chemistry framework</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/facebook/OMAT24">OMAT24 (Hugging Face)</a></td>
          <td>Model</td>
          <td>FAIR Acceptable Use Policy</td>
          <td>Pre-trained eSEN-30M-MP and eSEN-30M-OAM checkpoints</td>
      </tr>
      <tr>
          <td><a href="https://openreview.net/forum?id=R0PBjxIbgm">OpenReview</a></td>
          <td>Paper</td>
          <td>CC BY 4.0</td>
          <td>ICML 2025 camera-ready paper</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The eSEN architecture builds on components from <strong>eSCN</strong> (Equivariant Spherical Channel Network) and <strong>Equiformer</strong>, combining them with design choices that prioritize smoothness and energy conservation. The implementation integrates into the standard <code>fairchem</code> Open Catalyst experimental framework.</p>
<h4 id="layer-structure">Layer Structure</h4>
<ul>
<li><strong>Edgewise Convolution</strong>: Uses <code>SO2</code> convolution layers (from eSCN) with an envelope function applied. Source and target embeddings are concatenated before convolution.</li>
<li><strong>Nodewise Feed-Forward</strong>: Two equivariant linear layers with an intermediate <strong>SiLU-based gated non-linearity</strong> (from Equiformer).</li>
<li><strong>Normalization</strong>: Equivariant Layer Normalization (from Equiformer).</li>
</ul>
<h4 id="smoothness-design-choices">Smoothness Design Choices</h4>
<p>Several architectural decisions distinguish eSEN from prior work:</p>
<ul>
<li><strong>No Grid Projection</strong>: eSEN performs operations directly in the spherical harmonic space to maintain equivariance and energy conservation, bypassing the projection of spherical harmonics to spatial grids for non-linearity.</li>
<li><strong>Distance Cutoff for Graph Construction</strong>: Uses a strict distance cutoff (6 Å for MPTrj models, 5 Å for SPICE models) rather than a max-neighbor limit, since neighbor limits introduce discontinuities that break energy conservation.</li>
<li><strong>Polynomial Envelope Functions</strong>: Ensures derivatives go to zero smoothly at the cutoff radius.</li>
</ul>
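<p>The sketch below shows one common form of polynomial envelope: its value and low-order derivatives vanish at the cutoff, so pair contributions switch off smoothly. This is an assumption for illustration (a DimeNet-style polynomial with exponent $p$); the exact envelope used in eSEN is not reproduced here.</p>
<pre><code class="language-python"># Sketch of a polynomial envelope (assumed DimeNet-style form): equal to 1
# near r = 0, with value and derivatives going smoothly to zero at the cutoff.
# The exponent p and cutoff radius are illustrative, not eSEN's exact values.
import numpy as np

def polynomial_envelope(r: np.ndarray, cutoff: float, p: int = 6) -> np.ndarray:
    d = r / cutoff
    env = (1.0
           - (p + 1) * (p + 2) / 2.0 * d**p
           + p * (p + 2) * d**(p + 1)
           - p * (p + 1) / 2.0 * d**(p + 2))
    return np.where(d &lt; 1.0, env, 0.0)  # exactly zero beyond the cutoff

r = np.linspace(0.0, 7.0, 8)
print(polynomial_envelope(r, cutoff=6.0))  # decays smoothly to 0 at 6 Å
</code></pre>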
<h3 id="algorithms">Algorithms</h3>
<h4 id="two-stage-training-esen-30m-mp">Two-Stage Training (eSEN-30M-MP)</h4>
<ol>
<li><strong>Direct-Force Pre-training</strong> (60 epochs): Uses <strong>DeNS</strong> (Denoising Non-equilibrium Structures) to reduce overfitting. This stage is fast because it does not require backpropagation through energy gradients.</li>
<li><strong>Conservative Fine-tuning</strong> (40 epochs): The direct-force head is removed, and forces are calculated via gradients ($F = -\nabla E$). This enforces energy conservation.</li>
</ol>
<p><strong>Important</strong>: DeNS is used exclusively during the direct-force pre-training stage, with a noising probability of 0.5, a standard deviation of 0.1 Å for the added Gaussian noise, and a DeNS loss coefficient of 10. The fine-tuning strategy reduces the wall-clock time for model training by 40%.</p>
<h4 id="optimization">Optimization</h4>
<ul>
<li><strong>Optimizer</strong>: AdamW with cosine learning rate scheduler</li>
<li><strong>Max Learning Rate</strong>: $4 \times 10^{-4}$</li>
<li><strong>Batch Size</strong>: 512 (for MPTrj models)</li>
<li><strong>Weight Decay</strong>: $1 \times 10^{-3}$</li>
<li><strong>Gradient Clipping</strong>: Norm of 100</li>
<li><strong>Warmup</strong>: 0.1 epochs with a factor of 0.2</li>
</ul>
<h4 id="loss-function">Loss Function</h4>
<p>A composite loss combining per-atom energy MAE, force $L_2$ loss, and stress MAE:</p>
<p>$$
\begin{aligned}
\mathcal{L} = \lambda_{\text{e}} \frac{1}{N} \sum_{i=1}^N \lvert E_{i} - \hat{E}_{i} \rvert + \lambda_{\text{f}} \frac{1}{3N} \sum_{i=1}^N \lVert \mathbf{F}_{i} - \hat{\mathbf{F}}_{i} \rVert_2^2 + \lambda_{\text{s}} \lVert \mathbf{S} - \hat{\mathbf{S}} \rVert_1
\end{aligned}
$$</p>
<p>For MPTrj-30M, the weighting coefficients are set to $\lambda_{\text{e}} = 20$, $\lambda_{\text{f}} = 20$, and $\lambda_{\text{s}} = 5$.</p>
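<p>A minimal sketch of this composite loss is shown below, assuming a single structure with per-structure energy and stress and per-atom forces; batching and unit handling are simplified.</p>
<pre><code class="language-python"># Minimal sketch of the composite loss above: per-atom energy MAE, force L2
# term, and stress MAE, with the MPTrj-30M coefficients quoted in the text.
# Single-structure tensors; batching is omitted for clarity.
import torch

def composite_loss(e_pred, e_ref, f_pred, f_ref, s_pred, s_ref,
                   lam_e=20.0, lam_f=20.0, lam_s=5.0):
    n_atoms = f_ref.shape[0]
    loss_e = torch.abs(e_pred - e_ref) / n_atoms                # per-atom energy MAE
    loss_f = torch.sum((f_pred - f_ref) ** 2) / (3 * n_atoms)   # force L2 term
    loss_s = torch.sum(torch.abs(s_pred - s_ref))               # stress L1 (MAE) term
    return lam_e * loss_e + lam_f * loss_f + lam_s * loss_s

loss = composite_loss(torch.tensor(-12.3), torch.tensor(-12.1),
                      torch.randn(8, 3), torch.randn(8, 3),
                      torch.randn(3, 3), torch.randn(3, 3))
print(loss)
</code></pre>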
<h3 id="data">Data</h3>
<h4 id="training-data">Training Data</h4>
<ul>
<li><strong>Inorganic</strong>: MPTrj (Materials Project Trajectory) dataset</li>
<li><strong>Organic</strong>: SPICE-MACE-OFF dataset</li>
</ul>
<h4 id="test-data-construction">Test Data Construction</h4>
<ul>
<li><strong>MPTrj Testing</strong>: Since MPTrj lacks an official test split, the authors created a test set using 5,000 random samples from the <strong>subsampled Alexandria (sAlex)</strong> dataset to ensure fair comparison.</li>
<li><strong>Out-of-Distribution Conservation Testing</strong>:
<ul>
<li><em>Inorganic</em>: <strong>TM23</strong> dataset (transition metal defects). Simulation: 100 ps, 5 fs timestep.</li>
<li><em>Organic</em>: <strong>MD22</strong> dataset (large molecules). Simulation: 100 ps, 1 fs timestep.</li>
</ul>
</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p>Compute for training operations predominantly utilizes <strong>80GB NVIDIA A100 GPUs</strong>.</p>
<h4 id="inference-efficiency">Inference Efficiency</h4>
<p>For a periodic system of <strong>216 atoms</strong> on a single A100 (PyTorch 2.4.0, CUDA 12.1, no compile/torchscript), the 2-layer eSEN models achieve approximately <strong>0.4 million steps per day</strong> (3.2M parameters) and <strong>0.8 million steps per day</strong> (6.5M parameters), comparable to MACE-OFF-L at 0.7 million steps per day.</p>
<h3 id="evaluation">Evaluation</h3>
<p>The paper evaluated eSEN across three major benchmark tasks. Key evaluation metrics included energy MAE (meV/atom), force MAE (meV/Å), stress MAE (meV/Å/atom), F1 score for stability prediction, $\kappa_{\text{SRME}}$ for thermal conductivity, and phonon frequency accuracy.</p>
<h4 id="ablation-test-set-mae-table-1">Ablation Test-Set MAE (Table 1)</h4>
<p>Design choices that dramatically affect energy conservation have negligible impact on static test-set MAE, which is precisely why test-set error alone is misleading. All models are 2-layer with 3.2M parameters, $L_{\text{max}} = 2$, $M_{\text{max}} = 2$:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Energy MAE</th>
          <th>Force MAE</th>
          <th>Stress MAE</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>eSEN (default)</td>
          <td>17.02</td>
          <td>43.96</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, direct-force</td>
          <td>18.66</td>
          <td>43.62</td>
          <td>0.16</td>
      </tr>
      <tr>
          <td>eSEN, neighbor limit</td>
          <td>17.30</td>
          <td>44.11</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, no envelope</td>
          <td>17.60</td>
          <td>44.69</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, $N_{\text{basis}} = 512$</td>
          <td>19.87</td>
          <td>48.29</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>eSEN, Bessel</td>
          <td>17.65</td>
          <td>44.83</td>
          <td>0.15</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=6</td>
          <td>17.05</td>
          <td>43.10</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=10</td>
          <td>17.11</td>
          <td>43.13</td>
          <td>0.14</td>
      </tr>
      <tr>
          <td>eSEN, discrete, res=14</td>
          <td>17.12</td>
          <td>43.09</td>
          <td>0.14</td>
      </tr>
  </tbody>
</table>
<p>Energy MAE in meV/atom. Force MAE in meV/Å. Stress MAE in meV/Å/atom.</p>
<h4 id="matbench-discovery-tables-2-and-3">Matbench-Discovery (Tables 2 and 3)</h4>
<p><strong>Compliant models</strong> (trained only on MPTrj or its subset), unique prototype split:</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>F1</th>
          <th>DAF</th>
          <th>$\kappa_{\text{SRME}}$</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-MP</strong></td>
          <td><strong>0.831</strong></td>
          <td><strong>5.260</strong></td>
          <td><strong>0.340</strong></td>
          <td><strong>0.0752</strong></td>
      </tr>
      <tr>
          <td>eqV2-S-DeNS</td>
          <td>0.815</td>
          <td>5.042</td>
          <td>1.676</td>
          <td>0.0757</td>
      </tr>
      <tr>
          <td>MatRIS-MP</td>
          <td>0.809</td>
          <td>5.049</td>
          <td>0.861</td>
          <td>0.0773</td>
      </tr>
      <tr>
          <td>AlphaNet-MP</td>
          <td>0.799</td>
          <td>4.863</td>
          <td>1.31</td>
          <td>0.1067</td>
      </tr>
      <tr>
          <td>DPA3-v2-MP</td>
          <td>0.786</td>
          <td>4.822</td>
          <td>0.959</td>
          <td>0.0823</td>
      </tr>
      <tr>
          <td>ORB v2 MPtrj</td>
          <td>0.765</td>
          <td>4.702</td>
          <td>1.725</td>
          <td>0.1007</td>
      </tr>
      <tr>
          <td>SevenNet-13i5</td>
          <td>0.760</td>
          <td>4.629</td>
          <td>0.550</td>
          <td>0.0847</td>
      </tr>
      <tr>
          <td>GRACE-2L-MPtrj</td>
          <td>0.691</td>
          <td>4.163</td>
          <td>0.525</td>
          <td>0.0897</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>0.669</td>
          <td>3.777</td>
          <td>0.647</td>
          <td>0.0915</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>0.613</td>
          <td>3.361</td>
          <td>1.717</td>
          <td>0.0949</td>
      </tr>
      <tr>
          <td>M3GNet</td>
          <td>0.569</td>
          <td>2.882</td>
          <td>1.412</td>
          <td>0.1117</td>
      </tr>
  </tbody>
</table>
<p>eSEN-30M-MP excels at both F1 and $\kappa_{\text{SRME}}$ simultaneously, whereas previous models achieve strong results on only one of the two metrics.</p>
<p><strong>Non-compliant models</strong> (trained on additional datasets):</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>F1</th>
          <th>$\kappa_{\text{SRME}}$</th>
          <th>RMSD</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-OAM</strong></td>
          <td><strong>0.925</strong></td>
          <td><strong>0.170</strong></td>
          <td><strong>0.0608</strong></td>
      </tr>
      <tr>
          <td>eqV2-M-OAM</td>
          <td>0.917</td>
          <td>1.771</td>
          <td>0.0691</td>
      </tr>
      <tr>
          <td>ORB v3</td>
          <td>0.905</td>
          <td>0.210</td>
          <td>0.0750</td>
      </tr>
      <tr>
          <td>SevenNet-MF-ompa</td>
          <td>0.901</td>
          <td>0.317</td>
          <td>0.0639</td>
      </tr>
      <tr>
          <td>DPA3-v2-OpenLAM</td>
          <td>0.890</td>
          <td>0.687</td>
          <td>0.0679</td>
      </tr>
      <tr>
          <td>GRACE-2L-OAM</td>
          <td>0.880</td>
          <td>0.294</td>
          <td>0.0666</td>
      </tr>
      <tr>
          <td>MatterSim-v1-5M</td>
          <td>0.862</td>
          <td>0.574</td>
          <td>0.0733</td>
      </tr>
      <tr>
          <td>MACE-MPA-0</td>
          <td>0.852</td>
          <td>0.412</td>
          <td>0.0731</td>
      </tr>
  </tbody>
</table>
<p>The eSEN-30M-OAM model is pre-trained on the OMat24 dataset, then fine-tuned on the subsampled Alexandria (sAlex) dataset and MPTrj dataset.</p>
<h4 id="mdr-phonon-benchmark-table-4">MDR Phonon Benchmark (Table 4)</h4>
<p>Metrics: maximum phonon frequency MAE($\omega_{\text{max}}$) in K, vibrational entropy MAE($S$) in J/K/mol, Helmholtz free energy MAE($F$) in kJ/mol, heat capacity MAE($C_V$) in J/K/mol.</p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>MAE($\omega_{\text{max}}$)</th>
          <th>MAE($S$)</th>
          <th>MAE($F$)</th>
          <th>MAE($C_V$)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>eSEN-30M-MP</strong></td>
          <td><strong>21</strong></td>
          <td><strong>13</strong></td>
          <td><strong>5</strong></td>
          <td><strong>4</strong></td>
      </tr>
      <tr>
          <td>SevenNet-13i5</td>
          <td>26</td>
          <td>28</td>
          <td>10</td>
          <td>5</td>
      </tr>
      <tr>
          <td>GRACE-2L (r6)</td>
          <td>40</td>
          <td>25</td>
          <td>9</td>
          <td>5</td>
      </tr>
      <tr>
          <td>SevenNet-0</td>
          <td>40</td>
          <td>48</td>
          <td>19</td>
          <td>9</td>
      </tr>
      <tr>
          <td>MACE</td>
          <td>61</td>
          <td>60</td>
          <td>24</td>
          <td>13</td>
      </tr>
      <tr>
          <td>CHGNet</td>
          <td>89</td>
          <td>114</td>
          <td>45</td>
          <td>21</td>
      </tr>
      <tr>
          <td>M3GNet</td>
          <td>98</td>
          <td>150</td>
          <td>56</td>
          <td>22</td>
      </tr>
  </tbody>
</table>
<p>Direct-force models show dramatically worse performance at the standard 0.01 Å displacement (e.g., eqV2-S-DeNS: 280/224/54/94) but improve at larger displacements (0.2 Å: 58/26/8/8), revealing that their PES is rough near energy minima.</p>
<h4 id="spice-mace-off-table-5">SPICE-MACE-OFF (Table 5)</h4>
<p>Test set MAE for organic molecule energy/force prediction. Energy MAE in meV/atom, force MAE in meV/Å:</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>MACE-4.7M (E/F)</th>
          <th>EscAIP-45M* (E/F)</th>
          <th>eSEN-3.2M (E/F)</th>
          <th>eSEN-6.5M (E/F)</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PubChem</td>
          <td>0.88 / 14.75</td>
          <td>0.53 / 5.86</td>
          <td>0.22 / 6.10</td>
          <td><strong>0.15</strong> / <strong>4.21</strong></td>
      </tr>
      <tr>
          <td>DES370K M.</td>
          <td>0.59 / 6.58</td>
          <td>0.41 / 3.48</td>
          <td>0.17 / 1.85</td>
          <td><strong>0.13</strong> / <strong>1.24</strong></td>
      </tr>
      <tr>
          <td>DES370K D.</td>
          <td>0.54 / 6.62</td>
          <td>0.38 / 2.18</td>
          <td>0.20 / 2.77</td>
          <td><strong>0.15</strong> / <strong>2.12</strong></td>
      </tr>
      <tr>
          <td>Dipeptides</td>
          <td>0.42 / 10.19</td>
          <td>0.31 / 5.21</td>
          <td>0.10 / 3.04</td>
          <td><strong>0.07</strong> / <strong>2.00</strong></td>
      </tr>
      <tr>
          <td>Sol. AA</td>
          <td>0.98 / 19.43</td>
          <td>0.61 / 11.52</td>
          <td>0.30 / 5.76</td>
          <td><strong>0.25</strong> / <strong>3.68</strong></td>
      </tr>
      <tr>
          <td>Water</td>
          <td>0.83 / 13.57</td>
          <td>0.72 / 10.31</td>
          <td>0.24 / 3.88</td>
          <td><strong>0.15</strong> / <strong>2.50</strong></td>
      </tr>
      <tr>
          <td>QMugs</td>
          <td>0.45 / 16.93</td>
          <td>0.41 / 8.74</td>
          <td>0.16 / 5.70</td>
          <td><strong>0.12</strong> / <strong>3.78</strong></td>
      </tr>
  </tbody>
</table>
<p>*EscAIP-45M is a direct-force model. eSEN-6.5M outperforms MACE-OFF-L and EscAIP on all test splits. The smaller eSEN-3.2M has inference efficiency comparable to MACE-4.7M while achieving lower MAE.</p>
<hr>
<h2 id="why-these-design-choices-matter">Why These Design Choices Matter</h2>
<h3 id="bounded-energy-derivatives-and-the-verlet-integrator">Bounded Energy Derivatives and the Verlet Integrator</h3>
<p>The theoretical foundation for why smoothness matters comes from Theorem 5.1 of Hairer et al. (2003). For the Verlet integrator (the standard NVE integrator), the total energy drift satisfies:</p>
<p>$$
|E(\mathbf{r}_T, \mathbf{a}) - E(\mathbf{r}_0, \mathbf{a})| \leq C \Delta t^2 + C_N \Delta t^N T
$$</p>
<p>where $T$ is the total simulation time ($T \leq \Delta t^{-N}$), $N$ is the highest order for which the $N$th derivative of $E$ is continuously differentiable with bounded derivative, and $C$, $C_N$ are constants independent of $T$ and $\Delta t$. The first term is a time-independent fluctuation of $O(\Delta t^2)$; the second term governs long-term conservation. This means the PES must be continuously differentiable to high order, with bounded derivatives, for energy conservation in long-time simulations.</p>
<h3 id="architectural-choices-that-break-conservation">Architectural Choices That Break Conservation</h3>
<p>The authors provide theoretical justification for why specific architectural choices break energy conservation:</p>
<ul>
<li><strong>Max Neighbor Limit (KNN)</strong>: Introduces discontinuities in the PES. If a neighbor at distance $r$ moves to $r + \epsilon$ and drops out of the top-$K$, the energy changes discontinuously (a toy numerical demonstration follows this list).</li>
<li><strong>Grid Discretization</strong>: Projecting spherical harmonics to a spatial grid introduces discretization errors in energy gradients that break conservation. This can be mitigated with higher-resolution grids but not eliminated.</li>
<li><strong>Direct-Force Prediction</strong>: Imposes no mathematical constraint that forces must be the gradient of an energy scalar field. In other words, $\nabla \times \mathbf{F} \neq 0$ is permitted, violating the requirement for a conservative force field.</li>
</ul>
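<p>The toy demonstration below, assuming a simple species-weighted pairwise energy, shows the top-$K$ failure mode named in the first bullet: an infinitesimal displacement swaps which atom is in the neighbor set, and the energy jumps by a finite amount. This is not eSEN code, only an illustration of the discontinuity.</p>
<pre><code class="language-python"># Toy demonstration (not eSEN code): with a top-K neighbor rule, a tiny move
# can swap which atom is counted as a neighbor, producing a finite jump in the
# energy; a pure distance cutoff with a smooth envelope avoids this.
import numpy as np

SPECIES_WEIGHT = {"O": 2.0, "H": 1.0}  # toy per-species contribution

def energy_topk(neighbors, k):
    """neighbors: list of (species, distance); keep only the K nearest."""
    kept = sorted(neighbors, key=lambda n: n[1])[:k]
    return sum(SPECIES_WEIGHT[s] * np.exp(-d) for s, d in kept)

eps = 1e-3
before = [("H", 1.0), ("H", 2.0 - eps), ("O", 2.0)]
after  = [("H", 1.0), ("H", 2.0 + eps), ("O", 2.0)]  # only the second H moved

print(energy_topk(before, k=2))  # neighbor set {H@1.0, H@1.999}: ~0.503
print(energy_topk(after, k=2))   # neighbor set {H@1.0, O@2.000}: ~0.639, a finite jump
</code></pre>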
<h3 id="displacement-sensitivity-in-phonon-calculations">Displacement Sensitivity in Phonon Calculations</h3>
<p>An important empirical finding concerns how displacement values affect phonon predictions. Conservative models (eSEN, MACE) show convergent phonon band structures as displacement decreases toward zero. In contrast, direct-force models (eqV2-S-DeNS) fail to converge, exhibiting missing acoustic branches and spurious imaginary frequencies at small displacements. While direct-force models achieve competitive thermodynamic property accuracy at large displacements (0.2 Å), this is deceptive: the underlying phonon band structures remain inaccurate, and the apparent accuracy comes from Boltzmann-weighted integrals smoothing over errors.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Fu, X., Wood, B. M., Barroso-Luque, L., Levine, D. S., Gao, M., Dzamba, M., &amp; Zitnick, C. L. (2025). Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, PMLR 267:17875–17893.</p>
<p><strong>Publication</strong>: ICML 2025 (Spotlight)</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{fu2025learning,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Fu, Xiang and Wood, Brandon M. and Barroso-Luque, Luis and Levine, Daniel S. and Gao, Meng and Dzamba, Misko and Zitnick, C. Lawrence}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{17875--17893}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45302">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=R0PBjxIbgm">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=R0PBjxIbgm">PDF on OpenReview</a></li>
<li><a href="https://huggingface.co/facebook/OMAT24">OMAT24 model on Hugging Face</a></li>
<li><a href="https://github.com/facebookresearch/fairchem">Code on GitHub (fairchem)</a></li>
</ul>
]]></content:encoded></item><item><title>Efficient DFT Hamiltonian Prediction via Adaptive Sparsity</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/efficient-dft-hamiltonian-predicton-sphnet/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/efficient-dft-hamiltonian-predicton-sphnet/</guid><description>Luo et al. introduce SPHNet, using adaptive sparsity to achieve up to 7x speedup in SE(3)-equivariant Hamiltonian prediction.</description><content:encoded><![CDATA[<h2 id="core-innovation-adaptive-sparsity-in-se3-networks">Core Innovation: Adaptive Sparsity in SE(3) Networks</h2>
<p>This is a <strong>methodological paper</strong> introducing a novel architecture and training curriculum to solve efficiency bottlenecks in Geometric Deep Learning. It directly tackles the primary computational bottleneck in modern SE(3)-equivariant graph neural networks (the tensor product operation) and proposes a generalizable solution through adaptive network sparsification.</p>
<h2 id="the-computational-bottleneck-in-dft-hamiltonian-prediction">The Computational Bottleneck in DFT Hamiltonian Prediction</h2>
<p>SE(3)-equivariant networks are accurate but unscalable for DFT Hamiltonian prediction due to two key bottlenecks:</p>
<ul>
<li><strong>Atom Scaling</strong>: Tensor Product (TP) operations grow quadratically with the number of atoms ($N^2$).</li>
<li><strong>Basis Set Scaling</strong>: Computational complexity grows with the sixth power of the angular momentum order ($L^6$). Larger basis sets (e.g., def2-TZVP) require higher orders ($L=6$), making them prohibitively slow.</li>
</ul>
<p>Existing SE(3)-equivariant models cannot handle large molecules (40-100 atoms) with high-quality basis sets, limiting their practical applicability in computational chemistry.</p>
<h2 id="sphnet-architecture-and-the-three-phase-sparsity-scheduler">SPHNet Architecture and the Three-Phase Sparsity Scheduler</h2>
<p><strong>SPHNet</strong> introduces <strong>Adaptive Sparsity</strong> to prune redundant computations at two levels:</p>
<ol>
<li><strong>Sparse Pair Gate</strong>: Learns which atom pairs to include in message passing, adapting the interaction graph based on importance.</li>
<li><strong>Sparse TP Gate</strong>: Filters which spherical harmonic triplets $(l_1, l_2, l_3)$ are computed in tensor product operations, pruning higher-order combinations that contribute less to accuracy.</li>
<li><strong>Three-Phase Sparsity Scheduler</strong>: A training curriculum (Random → Adaptive → Fixed) that enables stable convergence to high-performing sparse subnetworks.</li>
</ol>
<p>Key insight: The Sparse Pair Gate learns to preserve long-range interactions (16-25 Å) at higher rates than short-range ones. Short-range pairs are abundant and easier to learn, while rare long-range interactions require more samples for accurate representation, making them more critical to retain.</p>
<h2 id="benchmarks-and-ablation-studies">Benchmarks and Ablation Studies</h2>
<p>The authors evaluated SPHNet on three datasets (MD17, QH9, and PubChemQH) with varying molecule sizes and basis set complexities. Baselines include SchNOrb, PhiSNet, QHNet, and WANet. SchNOrb and PhiSNet results are limited to MD17, as those models are designed for trajectory datasets. WANet was not open-sourced, so only partial metrics from its paper are reported.</p>
<h3 id="evaluation-metrics">Evaluation Metrics</h3>
<ul>
<li><strong>Hamiltonian MAE ($H$)</strong>: Mean absolute error between predicted and DFT-computed Hamiltonian matrices, in Hartrees ($E_h$)</li>
<li><strong>Occupied Orbital Energy MAE ($\epsilon$)</strong>: Mean absolute error of all occupied molecular orbital energies derived from the predicted Hamiltonian</li>
<li><strong>Orbital Coefficient Similarity ($\psi$)</strong>: Cosine similarity of occupied molecular orbital coefficients between predicted and reference wavefunctions</li>
</ul>
<h3 id="ablation-studies">Ablation Studies</h3>
<p><strong>Sparse Gates</strong> (on PubChemQH):</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Both gates</td>
          <td>97.31</td>
          <td>5.62</td>
          <td>7.09x</td>
      </tr>
      <tr>
          <td>Pair Gate only</td>
          <td>87.70</td>
          <td>6.98</td>
          <td>2.73x</td>
      </tr>
      <tr>
          <td>TP Gate only</td>
          <td>94.31</td>
          <td>8.04</td>
          <td>3.98x</td>
      </tr>
      <tr>
          <td>Neither gate</td>
          <td>86.35</td>
          <td>10.91</td>
          <td>1.73x</td>
      </tr>
  </tbody>
</table>
<p>Relative to using the TP Gate alone, adding the Sparse Pair Gate contributes a further 78% speedup with a 30% memory reduction; relative to using the Pair Gate alone, adding the Sparse TP Gate (which prunes 70% of combinations) yields a further 160% speedup. Both gates together achieve the highest speedup, though accuracy decreases slightly compared to no gating.</p>
<p><strong>Three-Phase Scheduler</strong>: Removing the random phase causes convergence to local optima ($112.68 \pm 10.75$ vs $97.31 \pm 0.52$). Removing the adaptive phase increases variance and lowers accuracy ($122.79 \pm 19.02$). Removing the fixed phase has minimal accuracy impact but reduces speedup from 7.09x to 5.45x due to dynamic graph overhead.</p>
<p><strong>Sparsity Rate</strong>: The critical sparsity threshold scales with system complexity: 30% for MD17 (small molecules), 40% for QH9 (medium), and 70% for PubChemQH (large). Beyond the threshold, MAE increases sharply. Computational cost decreases approximately linearly with sparsity rate.</p>
<h3 id="transferability-to-other-models">Transferability to Other Models</h3>
<p>To demonstrate the speedup is architecture-agnostic, the authors applied the Sparse Pair Gate and Sparse TP Gate to the QHNet baseline on PubChemQH:</p>
<table>
  <thead>
      <tr>
          <th>Configuration</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QHNet baseline</td>
          <td>123.74</td>
          <td>22.50</td>
          <td>1.00x</td>
      </tr>
      <tr>
          <td>+ TP Gate</td>
          <td>128.16</td>
          <td>12.68</td>
          <td>2.04x</td>
      </tr>
      <tr>
          <td>+ Pair Gate</td>
          <td>126.27</td>
          <td>10.07</td>
          <td>1.66x</td>
      </tr>
      <tr>
          <td>+ Both gates</td>
          <td>128.89</td>
          <td>8.46</td>
          <td>3.30x</td>
      </tr>
  </tbody>
</table>
<p>The gates reduced QHNet&rsquo;s memory by 62% and improved speed by 3.3x with modest accuracy trade-off, confirming the gates are portable modules applicable to other SE(3)-equivariant architectures.</p>
<h2 id="performance-results">Performance Results</h2>
<h3 id="qh9-134k-molecules-leq-20-atoms">QH9 (134k molecules, $\leq$ 20 atoms)</h3>
<p>SPHNet achieves 3.3x to 4.0x speedup over QHNet across all four QH9 splits, with improved Hamiltonian MAE and orbital energy MAE. Memory drops to 0.23 GB/sample (33% of QHNet&rsquo;s 0.70 GB). On the stable-iid split, Hamiltonian MAE improves from 76.31 to 45.48 ($10^{-6} E_h$).</p>
<h3 id="pubchemqh-50k-molecules-40-100-atoms">PubChemQH (50k molecules, 40-100 atoms)</h3>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>$H$ [$10^{-6} E_h$] $\downarrow$</th>
          <th>$\epsilon$ [$E_h$] $\downarrow$</th>
          <th>$\psi$ [$10^{-2}$] $\uparrow$</th>
          <th>Memory [GB] $\downarrow$</th>
          <th>Speedup $\uparrow$</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>QHNet</td>
          <td>123.74</td>
          <td>3.33</td>
          <td>2.32</td>
          <td>22.5</td>
          <td>1.0x</td>
      </tr>
      <tr>
          <td>WANet</td>
          <td>99.98</td>
          <td><strong>1.17</strong></td>
          <td><strong>3.13</strong></td>
          <td>15.0</td>
          <td>2.4x</td>
      </tr>
      <tr>
          <td>SPHNet</td>
          <td><strong>97.31</strong></td>
          <td>2.16</td>
          <td>2.97</td>
          <td><strong>5.62</strong></td>
          <td><strong>7.1x</strong></td>
      </tr>
  </tbody>
</table>
<p>SPHNet achieves the best Hamiltonian MAE and efficiency, though WANet outperforms on orbital energy MAE and coefficient similarity. The higher speedup on PubChemQH (vs QH9) reflects greater computational redundancy in larger systems with higher-order basis sets ($L_{max} = 6$ for def2-TZVP vs $L_{max} = 4$ for def2-SVP).</p>
<h3 id="md17-small-molecule-trajectories">MD17 (Small Molecule Trajectories)</h3>
<p>SPHNet achieves accuracy comparable to QHNet and PhiSNet on four MD17 molecules (water, ethanol, malondialdehyde, uracil; 3-12 atoms). MD17 represents a simpler task where baseline models already perform well, leaving limited room for improvement. For water (3 atoms), the number of interaction combinations is inherently small, limiting the benefit of adaptive sparsification.</p>
<h3 id="scaling-limit">Scaling Limit</h3>
<p>SPHNet can train on systems with approximately 3000 atomic orbitals on a single A6000 GPU; the QHNet baseline runs out of memory at approximately 1800 orbitals. Memory consumption scales more favorably as molecule size increases.</p>
<h3 id="key-findings">Key Findings</h3>
<ul>
<li><strong>Adaptive sparsity scales with system complexity</strong>: The method is most effective for large systems where redundancy is high. For small molecules (e.g., water with only 3 atoms), every interaction is critical, so pruning hurts accuracy and yields negligible speedup.</li>
<li><strong>Long-range pair preservation</strong>: The Sparse Pair Gate selects long-range pairs (16-25 Å) at higher rates than short-range ones. Short-range pairs are numerous and easier to learn, while rare long-range interactions are harder to represent and thus more critical to retain.</li>
<li><strong>Generalizable components</strong>: The sparsification techniques are portable modules, demonstrated by successful integration into QHNet with 3.3x speedup.</li>
<li><strong>Architecture ablation</strong>: Removing one Vectorial Node Interaction block or Spherical Node Interaction block significantly hurts accuracy, confirming the importance of the progressive order-increase design. Removing one Pair Construction block has less impact, suggesting room for further speedup.</li>
</ul>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/microsoft/SPHNet">SPHNet (GitHub)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official implementation; archived by Microsoft (Dec 2025), read-only</td>
      </tr>
      <tr>
          <td><a href="https://huggingface.co/datasets/EperLuo/PubChemQH">PubChemQH (Hugging Face)</a></td>
          <td>Dataset</td>
          <td>MIT</td>
          <td>50k molecules, 40-100 atoms, def2-TZVP basis</td>
      </tr>
  </tbody>
</table>
<p>No pre-trained model weights are provided. MD17 and QH9 are publicly available community datasets. Training requires 4x NVIDIA A100 (80GB) GPUs; benchmarking uses a single NVIDIA RTX A6000 (46GB).</p>
<h3 id="data">Data</h3>
<p>The experiments evaluated SPHNet on three datasets with different molecular sizes and basis set complexities. All datasets use DFT calculations as ground truth, with MD17 using the PBE exchange-correlation functional and QH9/PubChemQH using B3LYP.</p>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Molecules</th>
          <th>Molecule Size</th>
          <th>Basis Set</th>
          <th>$L_{max}$</th>
          <th>Functional</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>MD17</td>
          <td>4 systems</td>
          <td>3-12 atoms (water, ethanol, malondialdehyde, uracil)</td>
          <td>def2-SVP</td>
          <td>4</td>
          <td>PBE</td>
      </tr>
      <tr>
          <td>QH9</td>
          <td>134k</td>
          <td>$\leq$ 20 atoms (Stable/Dynamic splits)</td>
          <td>def2-SVP</td>
          <td>4</td>
          <td>B3LYP</td>
      </tr>
      <tr>
          <td>PubChemQH</td>
          <td>50k</td>
          <td>40-100 atoms</td>
          <td>def2-TZVP</td>
          <td>6</td>
          <td>B3LYP</td>
      </tr>
  </tbody>
</table>
<p><strong>Data Availability</strong>:</p>
<ul>
<li><strong>MD17 &amp; QH9</strong>: Publicly available</li>
<li><strong>PubChemQH</strong>: Publicly available on Hugging Face (<a href="https://huggingface.co/datasets/EperLuo/PubChemQH">EperLuo/PubChemQH</a>)</li>
</ul>
<h3 id="algorithms">Algorithms</h3>
<p><strong>Loss Function</strong>:</p>
<p>The model learns the <strong>residual</strong> $\Delta H$:</p>
<p>$$
\begin{aligned}
\Delta H &amp;= H_{\text{ref}} - H_{\text{init}} \\
\mathcal{L} &amp;= \text{MAE}(H_{\text{ref}}, H_{\text{pred}}) + \text{MSE}(H_{\text{ref}}, H_{\text{pred}})
\end{aligned}
$$</p>
<p>where $H_{\text{init}}$ is a computationally inexpensive initial guess computed via PySCF.</p>
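<p>A minimal sketch of this residual setup is given below, assuming the predicted Hamiltonian is reconstructed as $H_{\text{init}} + \Delta H_{\text{pred}}$ before the MAE and MSE terms are evaluated; shapes and the placeholder prediction are illustrative only.</p>
<pre><code class="language-python"># Minimal sketch of the residual-learning loss described above: the network
# predicts a correction to a cheap initial guess, and the loss combines MAE
# and MSE against the reference Hamiltonian. Shapes are illustrative.
import torch

def hamiltonian_loss(delta_h_pred, h_init, h_ref):
    h_pred = h_init + delta_h_pred                 # assumed reconstruction from the residual
    mae = torch.mean(torch.abs(h_ref - h_pred))
    mse = torch.mean((h_ref - h_pred) ** 2)
    return mae + mse

n_orb = 64
h_init = torch.randn(n_orb, n_orb)                 # stand-in for the PySCF initial guess
h_ref = h_init + 0.01 * torch.randn(n_orb, n_orb)  # stand-in for the DFT reference
delta_pred = torch.zeros(n_orb, n_orb, requires_grad=True)
print(hamiltonian_loss(delta_pred, h_init, h_ref))
</code></pre>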
<p><strong>Hyperparameters</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Parameter</th>
          <th>PubChemQH</th>
          <th>QH9</th>
          <th>MD17</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Batch Size</td>
          <td>8</td>
          <td>32</td>
          <td>10 (uracil: 5)</td>
      </tr>
      <tr>
          <td>Training Steps</td>
          <td>300k</td>
          <td>260k</td>
          <td>200k</td>
      </tr>
      <tr>
          <td>Warmup Steps</td>
          <td>1k</td>
          <td>1k</td>
          <td>1k</td>
      </tr>
      <tr>
          <td>Learning Rate</td>
          <td>1e-3</td>
          <td>1e-3</td>
          <td>5e-4</td>
      </tr>
      <tr>
          <td>Sparsity Rate</td>
          <td>0.7</td>
          <td>0.4</td>
          <td>0.1-0.3</td>
      </tr>
      <tr>
          <td>TSS Epoch $t$</td>
          <td>3</td>
          <td>3</td>
          <td>3</td>
      </tr>
  </tbody>
</table>
<p><strong>Sparse Pair Gate</strong>: Adapts the interaction graph. It concatenates zero-order features and inner products of atom pairs, then passes them through a linear layer $F_p$ with sigmoid activation to learn a weight $W_p^{ij}$ for every pair. Pairs are kept only if selected by the scheduler ($U_p^{TSS}$). The overhead comes primarily from the linear layer $F_p$.</p>
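<p>A minimal sketch of a pair gate in this spirit is shown below: a learned sigmoid score per atom pair, with only the top $(1-k)$ fraction of pairs kept (the adaptive-phase rule). The feature construction and module names are simplified assumptions, not the SPHNet implementation.</p>
<pre><code class="language-python"># Minimal sketch of a sparse pair gate: score every atom pair with a linear
# layer plus sigmoid (stand-in for F_p), then keep the top (1 - k) fraction.
# Feature construction and the scheduler hookup are simplified assumptions.
import torch
import torch.nn as nn

class SparsePairGate(nn.Module):
    def __init__(self, feat_dim: int, sparsity_rate: float):
        super().__init__()
        self.score = nn.Linear(2 * feat_dim + 1, 1)  # concatenated features + inner product
        self.sparsity_rate = sparsity_rate

    def forward(self, node_feats: torch.Tensor, pairs: torch.Tensor):
        src, dst = node_feats[pairs[:, 0]], node_feats[pairs[:, 1]]
        inner = (src * dst).sum(-1, keepdim=True)    # pairwise inner product
        w = torch.sigmoid(self.score(torch.cat([src, dst, inner], dim=-1))).squeeze(-1)
        n_keep = int((1.0 - self.sparsity_rate) * len(w))
        keep = torch.topk(w, n_keep).indices         # adaptive-phase selection
        return pairs[keep], w[keep]

gate = SparsePairGate(feat_dim=16, sparsity_rate=0.7)
feats = torch.randn(10, 16)
pairs = torch.combinations(torch.arange(10), r=2)    # all candidate atom pairs
kept_pairs, weights = gate(feats, pairs)
print(kept_pairs.shape, weights.shape)
</code></pre>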
<p><strong>Sparse TP Gate</strong>: Filters triplets $(l_1, l_2, l_3)$ inside the TP operation. Higher-order combinations are more likely to be pruned. Complexity: $\mathcal{O}(L^3)$.</p>
<p><strong>Three-Phase Sparsity Scheduler</strong>: Training curriculum designed to optimize the sparse gates effectively:</p>
<ul>
<li><strong>Phase 1 (Random)</strong>: Random selection ($1-k$ probability) to ensure unbiased weight updates. Complexity: $\mathcal{O}(|U|)$.</li>
<li><strong>Phase 2 (Adaptive)</strong>: Selects top $(1-k)$ percent based on learned magnitude. Complexity: $\mathcal{O}(|U|\log|U|)$.</li>
<li><strong>Phase 3 (Fixed)</strong>: Freezes the connectivity mask for maximum inference speed. No overhead.</li>
</ul>
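<p>The selection logic of the three phases can be sketched as below. The phase boundaries (random for the first $t$ epochs, adaptive until $2t$, then frozen) and the mask bookkeeping are assumptions chosen for illustration; only the three selection rules follow the description above.</p>
<pre><code class="language-python"># Sketch of the three-phase selection over a pool of candidates (pairs or TP
# triplets) with learned weights w and sparsity rate k. Phase boundaries (t
# and 2t) and mask bookkeeping are illustrative assumptions.
import torch

def select_elements(w, k, epoch, t, fixed_mask):
    n_keep = int((1.0 - k) * len(w))
    if epoch &lt; t:                               # Phase 1: random selection
        return torch.randperm(len(w))[:n_keep], fixed_mask
    if fixed_mask is None:                       # Phase 2: adaptive top-(1 - k) by weight
        idx = torch.topk(w, n_keep).indices
        if epoch >= 2 * t:                       # Phase 3 begins: freeze the mask
            fixed_mask = idx
        return idx, fixed_mask
    return fixed_mask, fixed_mask                # Phase 3: reuse the frozen mask

w = torch.rand(100)   # learned sparsity weights (SPHNet initializes these to ones)
mask = None
for epoch in range(9):
    idx, mask = select_elements(w, k=0.7, epoch=epoch, t=3, fixed_mask=mask)
print(len(idx), mask is not None)
</code></pre>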
<p><strong>Weight Initialization</strong>: Learnable sparsity weights ($W$) initialized as all-ones vector.</p>
<h3 id="models">Models</h3>
<p>The model predicts the Hamiltonian matrix $H$ from atomic numbers $Z$ and coordinates $r$.</p>
<p><strong>Inputs</strong>: Atomic numbers ($Z$) and 3D coordinates.</p>
<p><strong>Backbone Structure</strong>:</p>
<ol>
<li><strong>Vectorial Node Interaction (x4)</strong>: Uses long-short range message passing. Extracts vectorial representations ($l=1$) without high-order TPs to save cost.</li>
<li><strong>Spherical Node Interaction (x2)</strong>: Projects features to high-order spherical harmonics (up to $L_{max}$). The first block increases the maximum order from 0 to $L_{max}$ without the Sparse Pair Gate; the second block applies the <strong>Sparse Pair Gate</strong> to filter node pairs.</li>
<li><strong>Pair Construction Block (x2)</strong>: Splits into <strong>Diagonal</strong> (self-interaction) and <strong>Non-Diagonal</strong> (cross-interaction) blocks. Both use the <strong>Sparse TP Gate</strong> to prune cross-order combinations $(l_1, l_2, l_3)$. The Non-Diagonal blocks also use the <strong>Sparse Pair Gate</strong> to filter atom pairs. The two Pair Construction blocks receive representations from the two Spherical Node Interaction blocks respectively, and their outputs are summed.</li>
<li><strong>Expansion Block</strong>: Reconstructs the full Hamiltonian matrix from the sparse irreducible representations, exploiting symmetry ($H_{ji} = H_{ij}^T$) to halve computations.</li>
</ol>
<h3 id="hardware">Hardware</h3>
<ul>
<li><strong>Training</strong>: 4x NVIDIA A100 (80GB)</li>
<li><strong>Benchmarking</strong>: Single NVIDIA RTX A6000 (46GB)</li>
</ul>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Luo, E., Wei, X., Huang, L., Li, Y., Yang, H., Xia, Z., Wang, Z., Liu, C., Shao, B., &amp; Zhang, J. (2025). Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity. <em>Proceedings of the 42nd International Conference on Machine Learning</em>, PMLR 267:41368&ndash;41390.</p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{luo2025efficient,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Luo, Erpai and Wei, Xinran and Huang, Lin and Li, Yunyang and Yang, Han and Xia, Zaishuo and Wang, Zun and Liu, Chang and Shao, Bin and Zhang, Jia}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{41368--41390}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45656">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/forum?id=K3lykWhXON">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=K3lykWhXON">PDF on OpenReview</a></li>
<li><a href="https://github.com/microsoft/SPHNet">GitHub Repository</a> <em>(Note: The official repository was archived by Microsoft in December 2025. It is available for reference but no longer actively maintained.)</em></li>
</ul>
]]></content:encoded></item><item><title>Dark Side of Forces: Non-Conservative ML Force Models</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/dark-side-of-forces/</guid><description>Bigi et al. critique non-conservative force models in ML potentials, showing their simulation failures and proposing hybrid solutions.</description><content:encoded><![CDATA[<h2 id="contribution-systematic-assessment-of-non-conservative-ml-force-models">Contribution: Systematic Assessment of Non-Conservative ML Force Models</h2>
<p>This is a <strong>Systematization</strong> paper. It systematically catalogs the exact failure modes of existing non-conservative force approaches, quantifies them with a new diagnostic metric, and proposes a hybrid Multiple Time-Stepping solution combining the speed benefits of direct force prediction with the physical correctness of conservative models.</p>
<h2 id="motivation-the-speed-accuracy-trade-off-in-ml-force-fields">Motivation: The Speed-Accuracy Trade-off in ML Force Fields</h2>
<p>Many recent machine learning interatomic potential (MLIP) architectures predict forces directly ($F_\theta(r)$). This &ldquo;non-conservative&rdquo; approach avoids the computational overhead of automatic differentiation, yielding faster inference (typically 2-3x speedup) and faster training (up to 3x). However, it sacrifices energy conservation and, in some architectures, exact rotational constraints on the forces, potentially destabilizing molecular dynamics simulations. The field lacks rigorous quantification of when this trade-off breaks down and how to mitigate the failures.</p>
<h2 id="novelty-jacobian-asymmetry-and-hybrid-architectures">Novelty: Jacobian Asymmetry and Hybrid Architectures</h2>
<p>Four key contributions:</p>
<ol>
<li>
<p><strong>Jacobian Asymmetry Metric ($\lambda$):</strong> A quantitative diagnostic for non-conservation. Since conservative forces derive from a scalar field, their Jacobian (the Hessian of energy) must be symmetric. The normalized norm of the antisymmetric part quantifies the degree of violation:
$$ \lambda = \frac{|| \mathbf{J}_{\text{anti}} ||_F}{|| \mathbf{J} ||_F} $$
where $\mathbf{J}_{\text{anti}} = (\mathbf{J} - \mathbf{J}^\top)/2$. Measured values range from $\lambda \approx 0.004$ (PET-NC) to $\lambda \approx 0.032$ (SOAP-BPNN-NC), with ORB at 0.015 and EquiformerV2 at 0.017. Notably, the pairwise $\lambda_{ij}$ approaches 1 at large interatomic distances, meaning non-conservative artifacts disproportionately affect long-range and collective interactions. (A numerical sketch of this metric follows the list below.)</p>
</li>
<li>
<p><strong>Systematic Failure Mode Catalog:</strong> First comprehensive demonstration that non-conservative models cause runaway heating in NVE ensembles (temperature drifts of ~7,000 billion K/s for PET-NC and ~10x larger for ORB) and equipartition violations in NVT ensembles where different atom types equilibrate to different temperatures, a physical impossibility.</p>
</li>
<li>
<p><strong>Theoretical Analysis of Force vs. Energy Training:</strong> Force-only training overemphasizes high-frequency vibrational modes because force labels carry per-atom gradients that are dominated by stiff, short-range interactions. Energy labels provide a more balanced representation across the frequency spectrum. Additionally, conservative models benefit from backpropagation extending the effective receptive field to approximately 2x the interaction cutoff, while direct-force models are limited to the nominal cutoff radius.</p>
</li>
<li>
<p><strong>Hybrid Training and Inference Protocol:</strong> A practical workflow that combines fast direct-force prediction with conservative corrections:</p>
<ul>
<li><strong>Training:</strong> Pre-train on direct forces, then fine-tune on energy gradients (2-4x faster than training conservative models from scratch)</li>
<li><strong>Inference:</strong> Multiple Time-Stepping (MTS) where fast non-conservative forces are periodically corrected by slower conservative forces</li>
</ul>
</li>
</ol>
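<p>As a numerical illustration of the Jacobian asymmetry metric from item 1, the sketch below differentiates a toy direct-force model with autograd and computes $\lambda$ from the antisymmetric part of the Jacobian. A real measurement would use a trained MLIP; the random network here is only a stand-in.</p>
<pre><code class="language-python"># Numerical sketch of the Jacobian asymmetry metric lambda: differentiate a
# toy direct-force model and compare the antisymmetric part of its Jacobian
# to the full Jacobian (both in Frobenius norm).
import torch

def force_model(flat_pos: torch.Tensor) -> torch.Tensor:
    """Toy direct-force predictor on flattened coordinates of shape (3N,)."""
    torch.manual_seed(0)
    w = torch.randn(flat_pos.shape[0], flat_pos.shape[0])
    return torch.tanh(w @ flat_pos)  # generally not the gradient of any scalar field

def jacobian_asymmetry(force_fn, flat_pos: torch.Tensor) -> float:
    jac = torch.autograd.functional.jacobian(force_fn, flat_pos)  # (3N, 3N)
    jac_anti = 0.5 * (jac - jac.T)
    return (torch.linalg.norm(jac_anti) / torch.linalg.norm(jac)).item()

pos = torch.randn(12)                        # 4 atoms x 3 coordinates, flattened
print(jacobian_asymmetry(force_model, pos))  # > 0: the toy model is non-conservative
</code></pre>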
<h2 id="methodology-systematic-failure-mode-analysis">Methodology: Systematic Failure Mode Analysis</h2>
<p>The evaluation systematically tests multiple state-of-the-art models across diverse simulation scenarios:</p>
<p><strong>Models tested:</strong></p>
<ul>
<li><strong>PET-C/PET-NC</strong> (Point Edge Transformer, conservative and non-conservative variants)</li>
<li><strong>PET-M</strong> (hybrid variant jointly predicting both conservative and non-conservative forces)</li>
<li><strong>ORB-v2</strong> (non-conservative, trained on Alexandria/MPtrj)</li>
<li><strong>EquiformerV2</strong> (non-conservative equivariant Transformer)</li>
<li><strong>MACE-MP-0</strong> (conservative message-passing)</li>
<li><strong>SevenNet</strong> (conservative message-passing)</li>
<li><strong>SOAP-BPNN-C/SOAP-BPNN-NC</strong> (descriptor-based baseline, both conservative and non-conservative variants)</li>
</ul>
<p><strong>Test scenarios:</strong></p>
<ol>
<li><strong>NVE stability tests</strong> on bulk liquid water, graphene, amorphous carbon, and FCC aluminum</li>
<li><strong>Thermostat artifact analysis</strong> with Langevin and GLE thermostats</li>
<li><strong>Geometry optimization</strong> on water snapshots and <a href="/notes/chemistry/datasets/qm9/">QM9</a> molecules using FIRE and L-BFGS</li>
<li><strong>MTS validation</strong> on OC20 catalysis dataset</li>
<li><strong>Species-resolved temperature measurements</strong> for equipartition testing</li>
</ol>
<p><strong>Key metrics:</strong></p>
<ul>
<li>Jacobian asymmetry ($\lambda$)</li>
<li>Kinetic temperature drift in NVE</li>
<li>Velocity-velocity correlations</li>
<li>Radial distribution functions</li>
<li>Species-resolved temperatures</li>
<li>Inference speed benchmarks</li>
</ul>
<h2 id="results-simulation-instability-and-hybrid-solutions">Results: Simulation Instability and Hybrid Solutions</h2>
<p>Purely non-conservative models are <strong>unsuitable for production simulations</strong> due to uncontrollable unphysical artifacts that no thermostat can correct. Key findings:</p>
<p><strong>Performance failures:</strong></p>
<ul>
<li>Non-conservative models exhibited catastrophic temperature drift in NVE simulations: ~7,000 billion K/s for PET-NC and ~70,000 billion K/s for ORB, with EquiformerV2 comparable to PET-NC</li>
<li>Strong Langevin thermostats ($\tau=10$ fs) damped diffusion by ~5x, negating the speed benefits of non-conservative models</li>
<li>Advanced GLE thermostats also failed to control non-conservative drift (ORB reached 1181 K vs. 300 K target)</li>
<li>Equipartition violations: under stochastic velocity rescaling, O and H atoms equilibrated at different temperatures. For ORB, H atoms reached 336 K and O atoms 230 K against a 300 K target. For PET-NC, deviations were smaller but still significant (H at 296 K, O at 310 K).</li>
<li>Geometry optimization was more fragile with non-conservative forces: inaccurate NC models (SOAP-BPNN-NC) failed catastrophically, while more accurate ones (PET-NC) could converge with FIRE but showed large force fluctuations with L-BFGS. Non-conservative models consistently had lower success rates across water and QM9 benchmarks.</li>
</ul>
<p><strong>Hybrid solution success:</strong></p>
<ul>
<li>MTS with non-conservative forces corrected every 8 steps ($M=8$) achieved conservative stability with only ~20% overhead compared to a purely non-conservative trajectory. Results were essentially indistinguishable from fully conservative simulations (a schematic sketch of the MTS loop follows this list). Higher stride values ($M=16$) became unstable due to resonances between fast degrees of freedom and integration errors.</li>
<li>Conservative fine-tuning achieved the accuracy of from-scratch training in about 1/3 the total training time (2-4x resource reduction)</li>
<li>Validated on OC20 catalysis benchmark</li>
</ul>
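<p>The schematic below sketches the multiple time-stepping idea from the first bullet: cheap non-conservative forces drive every step, and a conservative correction impulse is applied every $M$ steps. The integrator, force callables, and impulse scheme are simplified placeholders, not the i-PI implementation used in the paper.</p>
<pre><code class="language-python"># Schematic MTS sketch: a fast (non-conservative) force every step, plus a
# conservative correction impulse every M steps, scaled by the stride. This is
# a simplified impulse scheme, not the symmetric RESPA splitting used in i-PI.
import numpy as np

def mts_trajectory(x0, v0, fast_force, conservative_force, dt=5e-4, m=1.0,
                   n_steps=8_000, stride=8):
    x, v = np.array(x0, dtype=float), np.array(v0, dtype=float)
    for step in range(n_steps):
        f = fast_force(x)                                             # cheap force, every step
        if step % stride == 0:
            f = f + stride * (conservative_force(x) - fast_force(x))  # correction impulse
        v += dt * f / m
        x += dt * v
    return x, v

# toy stand-ins: a slightly "wrong" fast force and the conservative reference F = -x
fast = lambda x: -0.95 * x
cons = lambda x: -x
x_f, v_f = mts_trajectory([1.0], [0.0], fast, cons)
print(x_f, v_f)
</code></pre>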
<p><strong>Scaling caveat:</strong> The authors note that as training datasets grow and models become more expressive, non-conservative artifacts should diminish because accurate models naturally exhibit less non-conservative behavior. However, they argue the best path forward is hybrid approaches rather than waiting for scale to solve the problem.</p>
<p><strong>Recommendation:</strong> The optimal production path is hybrid architectures using direct forces for acceleration (via MTS and pre-training) while anchoring models in conservative energy surfaces. This captures computational benefits without sacrificing physical reliability.</p>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<p><strong>Primary training/evaluation:</strong></p>
<ul>
<li><strong>Bulk Liquid Water</strong> (Cheng et al., 2019): revPBE0-D3 calculations with over 250,000 force/energy targets, chosen for rigorous thermodynamic testing</li>
</ul>
<p><strong>Generalization tests:</strong></p>
<ul>
<li>Graphene, amorphous carbon, FCC aluminum (tested with general-purpose foundation models)</li>
</ul>
<p><strong>Benchmarks:</strong></p>
<ul>
<li><strong>QM9</strong>: Geometry optimization tests</li>
<li><strong>OC20</strong> (Open Catalyst): Oxygen on alloy surfaces for MTS validation</li>
</ul>
<p>All datasets publicly available through cited sources.</p>
<h3 id="models">Models</h3>
<p><strong>Point Edge Transformer (PET)</strong> variants:</p>
<ul>
<li><strong>PET-C (Conservative)</strong>: Forces via energy backpropagation</li>
<li><strong>PET-NC (Non-Conservative)</strong>: Direct force prediction head, slightly higher parameter count</li>
<li><strong>PET-M (Hybrid)</strong>: Jointly predicts both conservative and non-conservative forces, accuracy within ~10% of the best single-task models</li>
</ul>
<p><strong>Baseline comparisons:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Type</th>
          <th>Training Data</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>ORB-v2</td>
          <td>Non-conservative</td>
          <td>Alexandria/MPtrj</td>
          <td>Rotationally unconstrained</td>
      </tr>
      <tr>
          <td>EquiformerV2</td>
          <td>Non-conservative</td>
          <td>Alexandria/MPtrj</td>
          <td>Equivariant Transformer</td>
      </tr>
      <tr>
          <td>MACE-MP-0</td>
          <td>Conservative</td>
          <td>MPtrj</td>
          <td>Equivariant message-passing</td>
      </tr>
      <tr>
          <td>SevenNet</td>
          <td>Conservative</td>
          <td>MPtrj</td>
          <td>Equivariant message-passing</td>
      </tr>
      <tr>
          <td>SOAP-BPNN-C</td>
          <td>Conservative</td>
          <td>Bulk water</td>
          <td>Descriptor-based baseline</td>
      </tr>
      <tr>
          <td>SOAP-BPNN-NC</td>
          <td>Non-conservative</td>
          <td>Bulk water</td>
          <td>Descriptor-based baseline</td>
      </tr>
  </tbody>
</table>
<p><strong>Training details:</strong></p>
<ul>
<li><strong>Loss functions</strong>: PET-C uses joint Energy + Force $L^2$ loss; PET-NC uses Force-only $L^2$ loss</li>
<li><strong>Fine-tuning protocol</strong>: PET-NC converted to conservative via energy head fine-tuning</li>
<li><strong>MTS configuration</strong>: Non-conservative forces with conservative corrections every 8 steps ($M=8$)</li>
</ul>
<h3 id="evaluation">Evaluation</h3>
<p><strong>Metrics &amp; Software:</strong>
Molecular dynamics evaluations were performed using <strong>i-PI</strong>, while geometry optimizations used <strong>ASE (Atomic Simulation Environment)</strong>. Note that primary code reproducibility is provided via an archived Zenodo snapshot; the authors did not link a live, public GitHub repository.</p>
<ol>
<li><strong>Jacobian asymmetry</strong> ($\lambda$): Quantifies non-conservation via antisymmetric component</li>
<li><strong>Temperature drift</strong>: NVE ensemble stability</li>
<li><strong>Velocity-velocity correlation</strong> ($\hat{c}_{vv}(\omega)$): Thermostat artifact detection</li>
<li><strong>Radial distribution functions</strong> ($g(r)$): Structural accuracy</li>
<li><strong>Species-resolved temperature</strong>: Equipartition testing</li>
<li><strong>Inference speed</strong>: Wall-clock time per MD step</li>
</ol>
<p><strong>Key results:</strong></p>
<table>
  <thead>
      <tr>
          <th>Model</th>
          <th>Speed (ms/step)</th>
          <th>NVE Stability</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>PET-NC</td>
          <td>8.58</td>
          <td>Failed</td>
          <td>~7,000 billion K/s drift</td>
      </tr>
      <tr>
          <td>PET-C</td>
          <td>19.4</td>
          <td>Stable</td>
          <td>2.3x slower than PET-NC</td>
      </tr>
      <tr>
          <td>SevenNet</td>
          <td>52.8</td>
          <td>Stable</td>
          <td>Conservative baseline</td>
      </tr>
      <tr>
          <td><strong>PET Hybrid (MTS)</strong></td>
          <td><strong>~10.3</strong></td>
          <td><strong>Stable</strong></td>
          <td><strong>~20% overhead vs. pure NC</strong></td>
      </tr>
  </tbody>
</table>
<p><strong>Thermostat artifacts:</strong></p>
<ul>
<li>Langevin ($\tau=10$ fs) dampened diffusion by ~5x (weaker coupling at $\tau=100$ fs reduced diffusion by ~1.5x)</li>
<li>GLE thermostats also failed to control non-conservative drift</li>
<li>Equipartition violations under SVR: ORB showed H at 336 K and O at 230 K (target 300 K); PET-NC showed smaller but significant species-resolved deviations</li>
</ul>
<p><strong>Optimization failures:</strong></p>
<ul>
<li>Non-conservative models showed lower geometry optimization success rates across water and QM9 benchmarks, with inaccurate NC models failing catastrophically</li>
</ul>
<h3 id="hardware">Hardware</h3>
<p><strong>Compute resources:</strong></p>
<ul>
<li><strong>Training</strong>: From-scratch baseline models were trained on 4x NVIDIA H100 GPUs for roughly two days.</li>
<li><strong>Fine-tuning</strong>: Conservative fine-tuning ran on a single NVIDIA H100 GPU for about one day.</li>
<li>The hybrid fine-tuning approach therefore cuts compute by roughly 2-4x relative to training conservative models from scratch.</li>
</ul>
<p><strong>Reproduction resources:</strong></p>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://zenodo.org/records/14778891">Zenodo repository</a></td>
          <td>Code/Data</td>
          <td>Unknown</td>
          <td>Code and data to reproduce all results</td>
      </tr>
      <tr>
          <td><a href="https://atomistic-cookbook.org/examples/pet-mad-nc/pet-mad-nc.html">MTS inference tutorial</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Multiple time-stepping dynamics tutorial</td>
      </tr>
      <tr>
          <td><a href="https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html">Conservative fine-tuning tutorial</a></td>
          <td>Other</td>
          <td>Unknown</td>
          <td>Fine-tuning workflow tutorial</td>
      </tr>
  </tbody>
</table>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Bigi, F., Langer, M. F., &amp; Ceriotti, M. (2025). The dark side of the forces: assessing non-conservative force models for atomistic machine learning. <em>Proceedings of the 42nd International Conference on Machine Learning</em>, PMLR 267.</p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{bigi2025dark,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{The dark side of the forces: assessing non-conservative force models for atomistic machine learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Bigi, Filippo and Langer, Marcel F and Ceriotti, Michele}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">address</span>=<span style="color:#e6db74">{Vancouver, Canada}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://icml.cc/virtual/2025/poster/45458">ICML 2025 poster page</a></li>
<li><a href="https://openreview.net/pdf?id=OEl3L8osas">PDF on OpenReview</a></li>
<li><a href="https://zenodo.org/records/14778891">Zenodo repository</a></li>
<li><a href="https://atomistic-cookbook.org/examples/pet-mad-nc/pet-mad-nc.html">MTS Inference Tutorial</a></li>
<li><a href="https://atomistic-cookbook.org/examples/pet-finetuning/pet-ft-nc.html">Conservative Fine-Tuning Tutorial</a></li>
</ul>
]]></content:encoded></item><item><title>Beyond Atoms: 3D Space Modeling for Molecular Pretraining</title><link>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/chemistry/molecular-simulation/ml-potentials/beyond-atoms/</guid><description>Lu et al. introduce SpaceFormer, a Transformer that models entire 3D molecular space including atoms for superior representations.</description><content:encoded><![CDATA[<h2 id="paper-typology-and-contribution">Paper Typology and Contribution</h2>
<p>This is a <strong>Method</strong> paper. It challenges the atom-centric paradigm of molecular representation learning by proposing a novel framework that models the continuous 3D space surrounding atoms. The core contribution is <strong>SpaceFormer</strong>, a Transformer-based architecture that discretizes molecular space into grids to capture physical phenomena (electron density, electromagnetic fields) often missed by traditional point-cloud models.</p>
<h2 id="the-physical-intuition-modeling-empty-space">The Physical Intuition: Modeling &ldquo;Empty&rdquo; Space</h2>
<p><strong>The Gap</strong>: Prior 3D molecular representation models, such as Uni-Mol, treat molecules as discrete sets of atoms, essentially point clouds in 3D space. However, from a quantum physics perspective, the &ldquo;empty&rdquo; space between atoms is far from empty. It is permeated by electron density distributions and electromagnetic fields that determine molecular properties.</p>
<p><strong>The Hypothesis</strong>: Explicitly modeling this continuous 3D space alongside discrete atom positions yields superior representations for downstream tasks, particularly for computational properties that depend on electronic structure, such as HOMO/LUMO energies and energy gaps.</p>
<h2 id="a-surprising-observation-virtual-points-improve-representations">A Surprising Observation: Virtual Points Improve Representations</h2>
<p>Before proposing SpaceFormer, the authors present a simple yet revealing experiment. They augment Uni-Mol by adding randomly sampled virtual points (VPs) from the 3D space within the circumscribed cuboid of each molecule. These VPs carry no chemical information whatsoever: they are purely random noise points.</p>
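<p>The augmentation itself is almost trivially simple; a minimal sketch (the helper below is hypothetical, not the authors&rsquo; code) just draws uniform points inside the axis-aligned bounding box of the atomic coordinates and appends them as extra tokens:</p>
<pre><code class="language-python">import numpy as np

def sample_virtual_points(coords, n_vp=10, seed=0):
    """Sample n_vp uninformative points inside the circumscribed cuboid
    (axis-aligned bounding box) of the molecule's atomic coordinates."""
    rng = np.random.default_rng(seed)
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    return rng.uniform(lo, hi, size=(n_vp, 3))
</code></pre>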
<p>The result is surprising: adding just 10 random VPs already yields a noticeable improvement in validation loss. The improvement remains consistent and gradually increases as the number of VPs grows, eventually reaching a plateau. This observation holds across downstream tasks as well, with Uni-Mol + VPs improving on several quantum property predictions (LUMO, E1-CC2, E2-CC2) compared to vanilla Uni-Mol.</p>
<p>The implication is that even uninformative spatial context helps the model learn better representations, motivating a principled framework for modeling the full 3D molecular space.</p>
<h2 id="spaceformer-voxelization-and-3d-positional-encodings">SpaceFormer: Voxelization and 3D Positional Encodings</h2>
<p>The key innovation is treating the molecular representation problem as <strong>3D space modeling</strong>. SpaceFormer follows these core steps:</p>
<ol>
<li><strong>Voxelizes the entire 3D space</strong> into a grid with cells of $0.49\text{\AA}$ (based on O-H bond length to ensure at most one atom per cell).</li>
<li><strong>Uses adaptive multi-resolution grids</strong> to efficiently handle empty space, keeping it fine-grained near atoms and coarse-grained far away.</li>
<li><strong>Applies Transformers to 3D spatial tokens</strong> with custom positional encodings that achieve linear complexity.</li>
</ol>
<p>Specifically, the model utilizes two forms of 3D Positional Encoding:</p>
<p><strong>3D Directional PE (RoPE Extension)</strong>
They extend Rotary Positional Encoding (RoPE) to 3D continuous space by splitting the Query and Key vectors into three blocks (one for each spatial axis). The directional attention mechanism takes the form:</p>
<p>$$
\begin{aligned}
\mathbf{q}_{i}^{\top} \mathbf{k}_{j} = \sum_{s=1}^{3} \mathbf{q}_{i,s}^{\top} \mathbf{R}(c_{j,s} - c_{i,s}) \mathbf{k}_{j,s}
\end{aligned}
$$</p>
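<p>As a small numerical illustration (not the released model code), each axis chunk of a query or key can be rotated by an angle proportional to its coordinate along that axis, with the rotation frequencies assumed as hyperparameters; the dot product of two such rotated vectors then depends only on the relative offsets $c_{j,s} - c_{i,s}$, matching the equation above.</p>
<pre><code class="language-python">import numpy as np

def rope_3d(vec, coord, freqs):
    """Apply per-axis rotary embedding to one query/key head vector.

    vec   : (3, 2*F) array, one chunk per spatial axis
    coord : (3,) continuous coordinate of the token
    freqs : (F,) rotation frequencies (assumed hyperparameters)
    """
    out = np.empty_like(vec)
    for s in range(3):                        # x, y, z chunks
        theta = coord[s] * freqs              # rotation angles for this axis
        cos, sin = np.cos(theta), np.sin(theta)
        even, odd = vec[s, 0::2], vec[s, 1::2]
        out[s, 0::2] = cos * even - sin * odd
        out[s, 1::2] = sin * even + cos * odd
    return out

# q_rot = rope_3d(q_chunks, c_i, freqs); k_rot = rope_3d(k_chunks, c_j, freqs)
# (q_rot * k_rot).sum() then reproduces the directional attention score.
</code></pre>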
<p><strong>3D Distance PE (RFF Approximation)</strong>
To compute invariant geometric distance without incurring quadratic memory overhead, they use Random Fourier Features (RFF) to approximate a Gaussian kernel of pairwise distances:</p>
<p>$$
\begin{aligned}
\exp \left( - \frac{| \mathbf{c}_i - \mathbf{c}_j |_2^2}{2\sigma^2} \right) &amp;\approx z(\mathbf{c}_i)^\top z(\mathbf{c}_j) \\
z(\mathbf{c}_i) &amp;= \sqrt{\frac{2}{d}} \cos(\sigma^{-1} \mathbf{c}_i^\top \boldsymbol{\omega} + \mathbf{b})
\end{aligned}
$$</p>
<p>This approach enables the model to natively encode complex field-like phenomena without computing exhaustive $O(N^2)$ distance matrices.</p>
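<p>The RFF construction is easy to verify numerically. The sketch below (with an assumed feature dimension <code>d</code>; the approximation tightens as <code>d</code> grows) draws the standard random projections and compares feature inner products against the exact Gaussian kernel:</p>
<pre><code class="language-python">import numpy as np

def rff_features(coords, d=256, sigma=1.0, seed=0):
    """Random Fourier features z(c) whose inner products z(c_i) . z(c_j)
    approximate the Gaussian kernel exp(-|c_i - c_j|^2 / (2 sigma^2)).
    coords: (N, 3) cell-centre coordinates."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((3, d))            # omega ~ N(0, I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=d)      # b ~ U(0, 2*pi)
    return np.sqrt(2.0 / d) * np.cos(coords @ omega / sigma + b)

coords = np.random.default_rng(1).uniform(-3.0, 3.0, size=(5, 3))
z = rff_features(coords)
approx_kernel = z @ z.T                            # linear-memory feature map
exact_kernel = np.exp(-np.square(coords[:, None] - coords[None]).sum(-1) / 2.0)
</code></pre>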
<h2 id="experimental-setup-and-downstream-tasks">Experimental Setup and Downstream Tasks</h2>
<p><strong>Pretraining Data</strong>: 19 million unlabeled molecules from the same dataset used by Uni-Mol.</p>
<p><strong>Downstream Benchmarks</strong>: The authors propose a new benchmark of 15 tasks, motivated by known limitations of MoleculeNet: invalid structures, inconsistent chemical representations, data curation errors, and an inability to adequately distinguish model performance. The tasks split into two categories:</p>
<ol>
<li>
<p><strong>Computational Properties (Quantum Mechanics)</strong></p>
<ul>
<li>Subsets of <a href="/notes/chemistry/datasets/gdb-17/">GDB-17</a> (HOMO, LUMO, GAP energy prediction, 20K samples; E1-CC2, E2-CC2, f1-CC2, f2-CC2, 21.7K samples)</li>
<li>Cata-condensed polybenzenoid hydrocarbons (Dipole moment, adiabatic ionization potential, D3 dispersion correction, 8,678 samples)</li>
<li>Metric: Mean Absolute Error (MAE)</li>
</ul>
</li>
<li>
<p><strong>Experimental Properties (Pharma/Bio)</strong></p>
<ul>
<li>MoleculeNet tasks (BBBP, BACE for drug discovery)</li>
<li>Biogen ADME tasks (HLM, MME, Solubility)</li>
<li>Metrics: AUC for classification, MAE for regression</li>
</ul>
</li>
</ol>
<p><strong>Splitting Strategy</strong>: All datasets use 8:1:1 train/validation/test ratio with <strong>scaffold splitting</strong> to test out-of-distribution generalization.</p>
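<p>The paper&rsquo;s exact split code has not been released, but a common Bemis&ndash;Murcko scaffold-splitting recipe looks like the sketch below (using RDKit; the group-to-split assignment details may differ from the authors&rsquo; implementation):</p>
<pre><code class="language-python">from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
    """Group molecules by Bemis-Murcko scaffold, keep each group in a single
    split, and fill train, then validation, then test."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi, includeChirality=False)
        groups[scaffold].append(idx)

    n = len(smiles_list)
    train, valid, test = [], [], []
    # Assign the largest scaffold groups first so splits stay close to 8:1:1.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) > frac_train * n:
            if len(valid) + len(group) > frac_valid * n:
                test.extend(group)
            else:
                valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test
</code></pre>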
<p><strong>Training Setup</strong>:</p>
<ul>
<li><strong>Objective</strong>: Masked Auto-Encoder (MAE) with 30% random masking. The model predicts whether a cell contains an atom and, if so, regresses both the atom type and the precise offset position (see the loss sketch after this list).</li>
<li><strong>Hardware</strong>: ~50 hours on 8 NVIDIA A100 GPUs</li>
<li><strong>Optimizer</strong>: Adam ($\beta_1=0.9, \beta_2=0.99$)</li>
<li><strong>Learning Rate</strong>: Peak 1e-4 with linear decay and 0.01 warmup ratio</li>
<li><strong>Batch Size</strong>: 128</li>
<li><strong>Total Updates</strong>: 1 million</li>
</ul>
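<p>The masked-cell objective can be pictured as three heads trained jointly, roughly as in the toy loss below; the head shapes and the equal loss weights are assumptions for illustration, not released details.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def mae_pretraining_loss(pred_occ, pred_type, pred_offset,
                         tgt_occ, tgt_type, tgt_offset, atom_mask):
    """Toy masked-cell loss: predict occupancy for every masked cell, and atom
    type plus intra-cell offset only for masked cells that contain an atom.

    pred_occ    : (M,)   occupancy logits for M masked cells
    pred_type   : (M, T) atom-type logits
    pred_offset : (M, 3) predicted intra-cell offsets
    atom_mask   : (M,)   boolean, True where the masked cell holds an atom
    """
    occ_loss = F.binary_cross_entropy_with_logits(pred_occ, tgt_occ.float())
    type_loss = F.cross_entropy(pred_type[atom_mask], tgt_type[atom_mask])
    offset_loss = F.mse_loss(pred_offset[atom_mask], tgt_offset[atom_mask])
    return occ_loss + type_loss + offset_loss
</code></pre>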
<p><strong>Baseline Comparisons</strong>: GROVER (2D graph-based MPR), GEM (2D graph enhanced with 3D information), 3D Infomax (GNN with 3D information), Uni-Mol (3D MPR, primary baseline using the same pretraining dataset), and Mol-AE (extends Uni-Mol with atom-based MAE pretraining).</p>
<h2 id="results-and-analysis">Results and Analysis</h2>
<p><strong>Overall Performance</strong>: SpaceFormer ranked 1st on 10 of 15 tasks and placed in the top 2 on 14 of 15. It surpassed the runner-up models by approximately 20% on quantum property tasks (HOMO, LUMO, GAP, E1-CC2, Dipmom), supporting the claim that modeling non-atom space captures electronic structure better than atom-only representations.</p>
<h3 id="key-results-on-quantum-properties">Key Results on Quantum Properties</h3>
<table>
  <thead>
      <tr>
          <th>Task</th>
          <th>GROVER</th>
          <th>GEM</th>
          <th>3D Infomax</th>
          <th>Uni-Mol</th>
          <th>Mol-AE</th>
          <th><strong>SpaceFormer</strong></th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>HOMO (Ha)</td>
          <td>0.0075</td>
          <td>0.0068</td>
          <td>0.0065</td>
          <td>0.0052</td>
          <td>0.0050</td>
          <td><strong>0.0042</strong></td>
      </tr>
      <tr>
          <td>LUMO (Ha)</td>
          <td>0.0086</td>
          <td>0.0080</td>
          <td>0.0070</td>
          <td>0.0060</td>
          <td>0.0057</td>
          <td><strong>0.0040</strong></td>
      </tr>
      <tr>
          <td>GAP (Ha)</td>
          <td>0.0109</td>
          <td>0.0107</td>
          <td>0.0095</td>
          <td>0.0081</td>
          <td>0.0080</td>
          <td><strong>0.0064</strong></td>
      </tr>
      <tr>
          <td>E1-CC2 (eV)</td>
          <td>0.0101</td>
          <td>0.0090</td>
          <td>0.0089</td>
          <td>0.0067</td>
          <td>0.0070</td>
          <td><strong>0.0058</strong></td>
      </tr>
      <tr>
          <td>Dipmom (Debye)</td>
          <td>0.0752</td>
          <td>0.0289</td>
          <td>0.0291</td>
          <td>0.0106</td>
          <td>0.0113</td>
          <td><strong>0.0083</strong></td>
      </tr>
  </tbody>
</table>
<p>SpaceFormer&rsquo;s advantage is most pronounced on computational properties that depend on electronic structure. On experimental biological tasks (e.g., BBBP), where measurements are noisy, the advantage narrows or reverses: Uni-Mol achieves 0.9066 AUC on BBBP compared to SpaceFormer&rsquo;s 0.8605.</p>
<h3 id="ablation-studies">Ablation Studies</h3>
<p>The authors present several ablations that isolate the source of SpaceFormer&rsquo;s improvements:</p>
<p><strong>MAE vs. Denoising</strong>: SpaceFormer with MAE pretraining outperforms SpaceFormer with denoising on all four ablation tasks. The MAE objective requires predicting <em>whether</em> an atom exists in a masked voxel, which forces the model to learn global structural dependencies. In the denoising variant, only atom cells are masked so the model never needs to predict atom existence, reducing the task to coordinate regression.</p>
<p><strong>FLOPs Control</strong>: A SpaceFormer-Large model (4x width, atom-only) trained with comparable FLOPs still falls short of SpaceFormer with 1000 non-atom cells on most downstream tasks. This confirms the improvement comes from modeling 3D space, not from additional compute.</p>
<p><strong>Virtual Points vs. SpaceFormer</strong>: Adding up to 200 random virtual points to Uni-Mol improves some tasks but leaves a significant gap compared to SpaceFormer, demonstrating that principled space discretization outperforms naive point augmentation.</p>
<p><strong>Efficiency Validation</strong>: The Adaptive Grid Merging method reduces the number of cells by roughly 10x with virtually no performance degradation. The 3D positional encodings scale linearly with the number of cells, while Uni-Mol&rsquo;s pretraining cost scales quadratically.</p>
<h3 id="scope-and-future-directions">Scope and Future Directions</h3>
<p>SpaceFormer does not incorporate built-in SE(3) equivariance, relying instead on data augmentation (random rotations and random boundary padding) during training. The authors identify extending SpaceFormer to force field tasks and larger systems such as proteins and complexes as promising future directions.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="code-and-data-availability">Code and Data Availability</h3>
<ul>
<li><strong>Source Code</strong>: As of this writing, the authors have not released the official source code or pre-trained weights.</li>
<li><strong>Datasets</strong>: Pretraining utilized the same 19M unlabeled molecule dataset as Uni-Mol. Downstream tasks use a newly curated internal benchmark built from subsets of GDB-17, MoleculeNet, and Biogen ADME. The exact customized scaffold splits for these evaluations are pending the official code release.</li>
<li><strong>Compute</strong>: Pretraining the base SpaceFormer encoder (~67.8M parameters, adaptive grid merging at Level 3) required approximately 50 hours on 8 NVIDIA A100 GPUs.</li>
</ul>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Source code</td>
          <td>Code</td>
          <td>N/A</td>
          <td>Not publicly released as of March 2026</td>
      </tr>
      <tr>
          <td>Pre-trained weights</td>
          <td>Model</td>
          <td>N/A</td>
          <td>Not publicly released</td>
      </tr>
      <tr>
          <td>Pretraining data (19M molecules)</td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Same dataset as Uni-Mol; not independently released</td>
      </tr>
      <tr>
          <td>Downstream benchmark splits</td>
          <td>Dataset</td>
          <td>N/A</td>
          <td>Custom scaffold splits pending code release</td>
      </tr>
  </tbody>
</table>
<h3 id="models">Models</h3>
<p>The model treats a molecule as a 3D &ldquo;image&rdquo; via voxelization, processed by a Transformer.</p>
<p><strong>Input Representation</strong>:</p>
<ul>
<li><strong>Discretization</strong>: 3D space divided into grid cells with length <strong>$0.49\text{\AA}$</strong> (based on O-H bond length to ensure at most one atom per cell)</li>
<li><strong>Tokenization</strong>: Tokens are pairs $(t_i, c_i)$ where $t_i$ is the atom type (or NULL for empty cells) and $c_i$ is the cell coordinate; see the sketch after this list</li>
<li><strong>Embeddings</strong>: Continuous embeddings with dimension 512. Inner-cell positions discretized with $0.01\text{\AA}$ precision</li>
</ul>
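<p>A simplified sketch of this tokenization step (a hypothetical helper, not the released implementation) maps each atom to an integer cell index plus an intra-cell offset:</p>
<pre><code class="language-python">import numpy as np

CELL = 0.49  # grid spacing in Angstrom; chosen so each cell holds at most one atom

def voxelize(atom_types, coords):
    """Return (type, cell_index, offset) tokens for the occupied cells."""
    coords = np.asarray(coords, dtype=float)
    coords = coords - coords.min(axis=0)              # shift into the positive octant
    cell_idx = np.floor(coords / CELL).astype(int)    # (N, 3) integer cell coordinates
    offsets = np.round(coords - cell_idx * CELL, 2)   # intra-cell position, ~0.01 A precision
    return list(zip(atom_types, map(tuple, cell_idx), map(tuple, offsets)))

# Example: a water molecule (coordinates in Angstrom)
tokens = voxelize(["O", "H", "H"],
                  [[0.000, 0.000, 0.000],
                   [0.957, 0.000, 0.000],
                   [-0.240, 0.927, 0.000]])
</code></pre>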
<p><strong>Transformer Specifications</strong>:</p>
<table>
  <thead>
      <tr>
          <th>Component</th>
          <th>Layers</th>
          <th>Attention Heads</th>
          <th>Embedding Dim</th>
          <th>FFN Dim</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><strong>Encoder</strong></td>
          <td>16</td>
          <td>8</td>
          <td>512</td>
          <td>2048</td>
      </tr>
      <tr>
          <td><strong>Decoder</strong> (MAE)</td>
          <td>4</td>
          <td>4</td>
          <td>256</td>
          <td>1024</td>
      </tr>
  </tbody>
</table>
<p><strong>Attention Mechanism</strong>: FlashAttention for efficient handling of large sequence lengths.</p>
<p><strong>Positional Encodings</strong>:</p>
<ol>
<li><strong>3D Directional PE</strong>: Extension of Rotary Positional Embedding (RoPE) to 3D continuous space, capturing relative directionality</li>
<li><strong>3D Distance PE</strong>: Random Fourier Features (RFF) to approximate Gaussian kernel of pairwise distances with linear complexity</li>
</ol>
<h4 id="visualizing-rff-and-rope">Visualizing RFF and RoPE</h4>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-rff-rope-visualization.webp"
         alt="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         title="Four-panel visualization showing RFF distance encoding and RoPE directional encoding mechanisms"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Visual intuition for SpaceFormer&rsquo;s positional encodings: Top row shows RFF distance encoding (Gaussian-like attention decay and high-frequency feature fingerprints). Bottom row shows RoPE directional encoding (vector rotation fields and resulting attention patterns).</figcaption>
    
</figure>

<p><strong>Top Row (Distance / RFF):</strong> Shows how the model learns &ldquo;closeness.&rdquo; Distance is represented by a complex &ldquo;fingerprint&rdquo; of waves that creates a Gaussian-like force field.</p>
<ul>
<li><strong>Top Left (The Force Field):</strong> The attention score (dot product) naturally forms a Gaussian curve. It is high when atoms are close and decays to zero as they move apart. This mimics physical forces without the model needing to learn that math from scratch.</li>
<li><strong>Top Right (The Fingerprint):</strong> Each dimension oscillates at a different frequency. A specific distance (e.g., $d=2$) has a unique combination of high and low values across these dimensions, creating a unique &ldquo;fingerprint&rdquo; for that exact distance.</li>
</ul>
<p><strong>Bottom Row (Direction / RoPE):</strong> Shows how the model learns &ldquo;relative position.&rdquo; It visualizes the vector rotation and how that creates a grid-like attention pattern.</p>
<ul>
<li><strong>Bottom Left (The Rotation):</strong> This visualizes the &ldquo;X-axis chunk&rdquo; of the vector. As you move from left ($x=-3$) to right ($x=3$), the arrows rotate. The model compares angles between atoms to determine relative positions.</li>
<li><strong>Bottom Right (The Grid):</strong> The resulting attention pattern when combining X-rotations and Y-rotations. The red/blue regions show where the model pays attention relative to the center, forming a grid-like interference pattern that distinguishes relative positions (e.g., &ldquo;top-right&rdquo; vs &ldquo;bottom-left&rdquo;).</li>
</ul>
<h4 id="adaptive-grid-merging">Adaptive Grid Merging</h4>
<p>To make the 3D grid approach computationally tractable, two key strategies are employed:</p>
<ol>
<li><strong>Grid Sampling</strong>: Randomly selecting 10-20% of empty cells during training</li>
<li><strong>Adaptive Grid Merging</strong>: Recursively merging $2 \times 2 \times 2$ blocks of empty cells into larger &ldquo;coarse&rdquo; cells, creating a multi-resolution view that is fine-grained near atoms and coarse-grained in empty space (merging set to Level 3); see the sketch after this list</li>
</ol>
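<p>One way to realize such a coarsening is a greedy, octree-like pass over a boolean occupancy grid, as in the sketch below (an illustrative reconstruction; the actual merging code has not been released):</p>
<pre><code class="language-python">import numpy as np

def merge_empty_cells(occupied, max_level=3):
    """Greedy 2x2x2 coarsening of empty cells.

    occupied : boolean (L, L, L) array with L a power of two
    Returns (level, corner_index) tokens: level-0 tokens are original cells,
    higher levels are merged empty blocks of edge length 2**level.
    """
    L = occupied.shape[0]
    blocked = np.zeros_like(occupied)                 # cells already covered by a token
    tokens = []

    for level in range(max_level, 0, -1):             # try the coarsest blocks first
        size = 2 ** level
        for x in range(0, L, size):
            for y in range(0, L, size):
                for z in range(0, L, size):
                    block = (slice(x, x + size), slice(y, y + size), slice(z, z + size))
                    if not occupied[block].any() and not blocked[block].any():
                        tokens.append((level, (x, y, z)))
                        blocked[block] = True

    # Remaining cells (atoms and un-merged empties) stay at full resolution.
    for idx in zip(*np.nonzero(~blocked)):
        tokens.append((0, tuple(int(i) for i in idx)))
    return tokens
</code></pre>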
<p><strong>Visualizing Adaptive Grid Merging</strong>:</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-merging.webp"
         alt="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         title="2D simulation of adaptive grid merging for an H2O molecule showing multi-resolution cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging demonstrated on H₂O. Red cells (Level 0) contain atoms and remain at full resolution. Progressively darker blue cells represent merged empty regions at higher levels, covering the same volume with fewer tokens.</figcaption>
    
</figure>

<p>The adaptive grid process compresses empty space around molecules while maintaining high resolution near atoms:</p>
<ul>
<li><strong>Red Cells (Level 0):</strong> The smallest squares ($0.49$Å) containing atoms. These are kept at highest resolution because electron density changes rapidly here.</li>
<li><strong>Light Blue Cells (Level 0/1):</strong> Small empty regions close to atoms.</li>
<li><strong>Darker Blue Cells (Level 2/3):</strong> Large blocks of empty space further away.</li>
</ul>
<p>If we used a naive uniform grid, we would have to process thousands of empty &ldquo;Level 0&rdquo; cells containing almost zero information. By merging them into larger blocks (the dark blue squares), the model covers the same volume with significantly fewer input tokens, reducing the number of tokens by roughly <strong>10x</strong> compared to a dense grid.</p>















<figure class="post-figure center ">
    <img src="/img/notes/spaceformer-adaptive-grid-benzene.webp"
         alt="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         title="Adaptive grid merging visualization for benzene molecule showing hexagonal ring with multi-resolution grid cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Adaptive grid merging for benzene (C₆H₆). The model maintains maximum resolution (red Level 0 cells) only where atoms exist, while merging vast empty regions into large blocks (dark blue L3/L4 cells). This allows the model to focus computational power on chemically active zones.</figcaption>
    
</figure>

<p>The benzene example above demonstrates how this scales to larger molecules. The characteristic hexagonal ring of 6 carbon atoms (black) and 6 hydrogen atoms (white) occupies a small fraction of the total grid. The dark blue corners (L3, L4) represent large merged blocks of empty space, letting the model concentrate its computation on the red &ldquo;active&rdquo; cells where the chemistry actually happens.</p>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lu, S., Ji, X., Zhang, B., Yao, L., Liu, S., Gao, Z., Zhang, L., &amp; Ke, G. (2025). Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling. <em>Proceedings of the 42nd International Conference on Machine Learning (ICML)</em>, 267, 40491-40504. <a href="https://proceedings.mlr.press/v267/lu25e.html">https://proceedings.mlr.press/v267/lu25e.html</a></p>
<p><strong>Publication</strong>: ICML 2025</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lu2025beyond,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lu, Shuqi and Ji, Xiaohong and Zhang, Bohang and Yao, Lin and Liu, Siyuan and Gao, Zhifeng and Zhang, Linfeng and Ke, Guolin}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{Proceedings of the 42nd International Conference on Machine Learning}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{40491--40504}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">volume</span>=<span style="color:#e6db74">{267}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">series</span>=<span style="color:#e6db74">{Proceedings of Machine Learning Research}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">publisher</span>=<span style="color:#e6db74">{PMLR}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2025}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div><p><strong>Additional Resources</strong>:</p>
<ul>
<li><a href="https://openreview.net/forum?id=Wd9KPQCKwq">OpenReview forum</a></li>
<li><a href="https://openreview.net/pdf?id=Wd9KPQCKwq">PDF on OpenReview</a></li>
<li><a href="https://icml.cc/virtual/2025/poster/45004">ICML 2025 poster page</a></li>
</ul>
]]></content:encoded></item></channel></rss>