Paper Overview
This is a method paper. It addresses a critical disconnect in the evaluation of Machine Learning Interatomic Potentials (MLIPs) and introduces a novel architecture, eSEN, designed based on insights from this analysis. The paper proposes a new standard for evaluating MLIPs beyond simple test-set errors.
The Energy Conservation Gap in MLIP Evaluation
The paper is motivated by a well-known but under-addressed problem in the field: improvements in standard MLIP metrics (lower energy/force MAE on static test sets) do not reliably translate to better performance on complex downstream tasks like molecular dynamics (MD) simulations, materials stability prediction, or phonon calculations. The authors seek to understand why this gap exists and how to design models that are both accurate on test sets and physically reliable in practical scientific workflows.
The eSEN Architecture and Continuous Representation
The novelty is twofold, spanning both a conceptual framework for evaluation and a new model architecture:
Energy Conservation as a Diagnostic Test: The core conceptual contribution is using an MLIP’s ability to conserve energy in out-of-distribution MD simulations as a crucial diagnostic test. The authors demonstrate that for models passing this test, a strong correlation between test-set error and downstream task performance is restored.
The eSEN Architecture: The paper introduces the equivariant Smooth Energy Network (eSEN), designed with specific choices to ensure a smooth and well-behaved Potential Energy Surface (PES):
- Strictly Conservative Forces: Forces are computed exclusively as the negative gradient of the energy ($F = -\nabla E$) rather than by a faster direct-force prediction head.
- Continuous Representations: Maintains strict equivariance and smoothness by using equivariant gated non-linearities instead of discretizing spherical harmonic representations during nodewise processing.
- Smooth PES Construction: Critical design choices include using distance cutoffs, polynomial envelope functions ensuring derivatives go to zero at cutoffs, and limited radial basis functions to avoid overly sensitive PES.
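To make the "forces as the negative energy gradient" idea concrete, here is a minimal sketch on a toy system. The Morse-style `pair_energy` is a hypothetical stand-in for a learned energy model, and central finite differences stand in for the automatic differentiation a real MLIP would use; neither is taken from the paper.

```python
import math

def pair_energy(r, depth=1.0, width=1.5, r_eq=1.0):
    """Toy Morse pair energy (hypothetical stand-in for a learned E)."""
    return depth * (1.0 - math.exp(-width * (r - r_eq))) ** 2

def total_energy(positions):
    """Sum the pair energy over all unique atom pairs."""
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            energy += pair_energy(math.dist(positions[i], positions[j]))
    return energy

def conservative_forces(positions, h=1e-5):
    """F = -dE/dr by central differences (stand-in for autograd)."""
    forces = []
    for i in range(len(positions)):
        force_i = []
        for axis in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][axis] += h
            minus[i][axis] -= h
            force_i.append(-(total_energy(plus) - total_energy(minus)) / (2 * h))
        forces.append(force_i)
    return forces

positions = [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.3, 0.9, 0.1]]
forces = conservative_forces(positions)
# Because E depends only on interatomic distances, the forces sum to zero.
net_force = [sum(f[axis] for f in forces) for axis in range(3)]
```

Because every force component is derived from one scalar energy, properties such as a vanishing net force follow automatically; a separate direct-force head has no such guarantee.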
Efficient Training Strategy: A two-stage training regimen with fast pre-training using a non-conservative direct-force model, followed by fine-tuning to enforce energy conservation. This captures the efficiency of direct-force training while ensuring physical robustness.
Evaluating OOD Energy Conservation and Physical Properties
The paper presents a comprehensive experimental validation:
Ablation Studies on Energy Conservation: MD simulations on out-of-distribution systems (TM23 and MD22 datasets) systematically tested key design choices (direct-force vs. conservative, representation discretization, neighbor limits, envelope functions). This empirically demonstrated which choices lead to energy drift despite negligible impact on test-set MAE.
Physical Property Prediction Benchmarks: The eSEN model was evaluated on challenging downstream tasks:
- Matbench-Discovery: Materials stability and thermal conductivity prediction, where eSEN achieved the highest F1 score among compliant models and excelled at both metrics simultaneously.
- MDR Phonon Benchmark: Predicting phonon properties that test accurate second and third-order derivatives of the PES. eSEN achieved state-of-the-art results, particularly outperforming direct-force models.
- SPICE-MACE-OFF: Standard energy and force prediction on organic molecules, demonstrating that design choices made for physical plausibility also improved raw accuracy.
Correlation Analysis: Explicit plots of test-set energy MAE versus performance on downstream benchmarks showed weak overall correlation that becomes strong and predictive when restricted to models passing the energy conservation test.
Outcomes and Conclusions
Primary Conclusion: Energy conservation is a critical, practical property for MLIPs. Using it as a filter re-establishes test-set error as a reliable proxy for model development, dramatically accelerating the innovation cycle. Models that are not conservative, even with low test error, are unreliable for many critical scientific applications.
Model Performance: The eSEN architecture outperforms baseline models across diverse tasks, from energy/force prediction to geometry optimization, phonon calculations, and thermal conductivity prediction.
Actionable Design Principles: The paper provides experimentally-validated architectural choices that promote physical plausibility. Seemingly minor details, like how atomic neighbors are selected, can have profound impacts on a model’s utility in simulations.
Efficient Path to Robust Models: The direct-force pre-training plus conservative fine-tuning strategy offers a practical method for developing physically robust models without incurring the full computational cost of conservative training from scratch.
Reproducibility Details
Models
The eSEN architecture builds on components from eSCN (Equivariant Spherical Channel Network) and Equiformer, combining them with design choices that prioritize smoothness and energy conservation. The implementation integrates into the standard fairchem Open Catalyst experimental framework.
Layer Structure
- Edgewise Convolution: Uses SO2 convolution layers (from eSCN) with an envelope function applied. Source and target embeddings are concatenated before convolution.
- Nodewise Feed-Forward: Two equivariant linear layers with an intermediate SiLU-based gated non-linearity (from Equiformer).
- Normalization: Equivariant Layer Normalization (from Equiformer).
Smoothness Design Choices
Several architectural decisions distinguish eSEN from prior work:
- No Grid Projection: eSEN performs operations directly in the spherical harmonic space to maintain equivariance and energy conservation, bypassing the projection of spherical harmonics to spatial grids for non-linearity.
- Distance Cutoff for Graph Construction: Uses a strict distance cutoff (6 Å for MPTrj models, 5 Å for SPICE models). Neighbor limits introduce discontinuities that break energy conservation.
- Polynomial Envelope Functions: Ensures derivatives go to zero smoothly at the cutoff radius.
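As an illustration of what a smooth cutoff looks like, the sketch below implements a polynomial envelope of the form popularized by DimeNet. Whether eSEN uses exactly this polynomial or the exponent `p=5` is an assumption made here for illustration; the 6 Å cutoff in the usage lines matches the MPTrj setting above.

```python
def polynomial_envelope(r, cutoff, p=5):
    """Polynomial envelope u(d), d = r / cutoff (DimeNet-style form).

    u(0) = 1, and u, u', u'' all vanish at d = 1, so edge messages
    fade out smoothly as a neighbor approaches the cutoff radius.
    The exponent p = 5 is an illustrative assumption.
    """
    if r >= cutoff:
        return 0.0
    d = r / cutoff
    a = -(p + 1) * (p + 2) / 2.0
    b = p * (p + 2)
    c = -p * (p + 1) / 2.0
    return 1.0 + a * d**p + b * d ** (p + 1) + c * d ** (p + 2)

cutoff = 6.0  # Angstrom
at_origin = polynomial_envelope(0.0, cutoff)
near_cutoff = polynomial_envelope(5.99, cutoff)
at_cutoff = polynomial_envelope(cutoff, cutoff)
# One-sided finite-difference slope just inside the cutoff.
h = 1e-3
slope_at_cutoff = (at_cutoff - polynomial_envelope(cutoff - h, cutoff)) / h
```

Without such an envelope, a neighbor crossing the cutoff radius would switch its contribution on or off abruptly, which puts a kink or jump into the PES.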
Algorithms
Two-Stage Training (eSEN-30M-MP)
- Direct-Force Pre-training (60 epochs): Uses DeNS (Denoising Non-equilibrium Structures) to reduce overfitting. This stage is fast because it does not require backpropagation through energy gradients.
- Conservative Fine-tuning (40 epochs): The direct-force head is removed, and forces are calculated via gradients ($F = -\nabla E$). This enforces energy conservation.
Important: DeNS is used exclusively during the direct-force pre-training stage, with a noising probability of 0.5, a standard deviation of 0.1 Å for the added Gaussian noise, and a DeNS loss coefficient of 10. The fine-tuning strategy reduces the wall-clock time for model training by 40% compared to training a conservative model from scratch for the same number of total epochs.
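As a rough consistency check on the 40% figure, assume (hypothetically) a constant per-epoch cost $c_d$ for direct-force training and $c_c$ for conservative training, and ignore DeNS and data-loading overhead. The symbols $c_d$ and $c_c$ are introduced here for illustration and do not appear in the paper:

$$ 60\,c_d + 40\,c_c = (1 - 0.4) \cdot 100\,c_c \;\Longrightarrow\; c_d = \tfrac{1}{3}\,c_c $$

Under these assumptions, the reported saving would correspond to a direct-force epoch costing roughly one third of a conservative epoch.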
Optimization
- Optimizer: AdamW with cosine learning rate scheduler
- Max Learning Rate: $4 \times 10^{-4}$
- Batch Size: 512 (for MPTrj models)
- Weight Decay: $1 \times 10^{-3}$
- Gradient Clipping: Norm of 100
- Warmup: 0.1 epochs with a factor of 0.2
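The schedule implied by these settings can be sketched as below. This is one plausible reading, not the fairchem implementation: it assumes a linear warmup that starts at 0.2 times the maximum learning rate, a cosine decay to a hypothetical `min_lr` of 0, and illustrative values for `steps_per_epoch` and `total_steps`.

```python
import math

def learning_rate(step, total_steps, steps_per_epoch, max_lr=4e-4,
                  warmup_epochs=0.1, warmup_factor=0.2, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed interpretation)."""
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    if step < warmup_steps:
        # Ramp linearly from warmup_factor * max_lr up to max_lr.
        frac = step / warmup_steps
        return max_lr * (warmup_factor + (1.0 - warmup_factor) * frac)
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = 1000   # illustrative, not from the paper
total_steps = 100_000    # illustrative, not from the paper
lr_start = learning_rate(0, total_steps, steps_per_epoch)
lr_peak = learning_rate(100, total_steps, steps_per_epoch)
lr_end = learning_rate(total_steps, total_steps, steps_per_epoch)
```

In practice the AdamW optimizer would be paired with a scheduler object from the training framework; the function here only shows the shape of the curve.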
Loss Function
A composite loss combining per-atom energy MAE, force $L_2$ loss, and stress MAE:
$$ \begin{aligned} \mathcal{L} = \lambda_{\text{e}} \frac{1}{N} \sum_{i=1}^N \lvert E_{i} - \hat{E}_{i} \rvert + \lambda_{\text{f}} \frac{1}{3N} \sum_{i=1}^N \lVert \mathbf{F}_{i} - \hat{\mathbf{F}}_{i} \rVert_2^2 + \lambda_{\text{s}} \lVert \mathbf{S} - \hat{\mathbf{S}} \rVert_1 \end{aligned} $$
For MPTrj-30M, the weighting coefficients are set to $\lambda_{\text{e}} = 20$, $\lambda_{\text{f}} = 20$, and $\lambda_{\text{s}} = 5$.
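A literal, term-by-term reading of the loss above can be sketched in plain Python. The function name, the argument layout, and the choice to treat the stress as a flat list of components are assumptions for illustration; a real training loop would use batched tensors.

```python
def composite_loss(e_true, e_pred, f_true, f_pred, s_true, s_pred,
                   lambda_e=20.0, lambda_f=20.0, lambda_s=5.0):
    """Energy MAE + squared-L2 force term + L1 stress term, as written above.

    e_*: length-N lists of energies; f_*: length-N lists of 3-vectors;
    s_*: flat lists of stress components (layout is an assumption).
    """
    n = len(e_true)
    energy_term = sum(abs(a - b) for a, b in zip(e_true, e_pred)) / n
    squared_norms = sum(
        (a - b) ** 2
        for ft, fp in zip(f_true, f_pred)
        for a, b in zip(ft, fp)
    )
    force_term = squared_norms / (3 * n)
    stress_term = sum(abs(a - b) for a, b in zip(s_true, s_pred))
    return lambda_e * energy_term + lambda_f * force_term + lambda_s * stress_term

loss = composite_loss(
    e_true=[1.0, 2.0], e_pred=[1.5, 2.0],
    f_true=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    f_pred=[(0.0, 0.0, 3.0), (1.0, 0.0, 0.0)],
    s_true=[1.0] + [0.0] * 8, s_pred=[0.5] + [0.0] * 8,
)
# energy: 0.25, force: 9 / 6 = 1.5, stress: 0.5
# loss = 20 * 0.25 + 20 * 1.5 + 5 * 0.5 = 37.5
```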
Data
Training Data
- Inorganic: MPTrj (Materials Project Trajectory) dataset
- Organic: SPICE-MACE-OFF dataset
Test Data Construction
- MPTrj Testing: Since MPTrj lacks an official test split, the authors created a test set using 5,000 random samples from the subsampled Alexandria (sAlex) dataset to ensure fair comparison.
- Out-of-Distribution Conservation Testing:
  - Inorganic: TM23 dataset (transition metal defects). Simulation: 100 ps, 5 fs timestep.
  - Organic: MD22 dataset (large molecules). Simulation: 100 ps, 1 fs timestep.
Hardware
Training was run predominantly on 80GB NVIDIA A100 GPUs.
Inference Efficiency
For a periodic system of 216 atoms on a single A100 (PyTorch 2.4.0, CUDA 12.1, no compile/torchscript), the 2-layer eSEN models achieve approximately 0.8 million steps per day (3.2M parameters) and 0.4 million steps per day (6.5M parameters), comparable to MACE-OFF-L at 0.7 million steps per day.
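Throughput in steps per day is easier to compare once converted to wall-clock time per step and simulated time per day. The helper names below are hypothetical, and the 1 fs timestep is borrowed from the organic MD setting described earlier rather than from the benchmark itself.

```python
SECONDS_PER_DAY = 86_400

def ms_per_step(steps_per_day):
    """Wall-clock milliseconds spent on one MD step."""
    return 1000.0 * SECONDS_PER_DAY / steps_per_day

def ns_per_day(steps_per_day, timestep_fs):
    """Simulated nanoseconds per wall-clock day (1 ns = 1e6 fs)."""
    return steps_per_day * timestep_fs * 1e-6

# The three throughput figures quoted above, in steps per day.
for rate in (0.4e6, 0.7e6, 0.8e6):
    print(f"{rate:.1e} steps/day: {ms_per_step(rate):.1f} ms/step, "
          f"{ns_per_day(rate, timestep_fs=1.0):.2f} ns/day at 1 fs")
```

At these rates a single step costs on the order of 100 to 200 ms, so a 100 ps trajectory at 1 fs (100,000 steps) takes a few hours on one GPU.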
Evaluation
The paper evaluated eSEN across three major benchmark tasks. Key evaluation metrics included energy MAE (meV/atom), force MAE (meV/Å), stress MAE (meV/Å/atom), F1 score for stability prediction, $\kappa_{\text{SRME}}$ for thermal conductivity, and phonon frequency accuracy.
Ablation Test-Set MAE (Table 1)
Design choices that dramatically affect energy conservation have negligible impact on static test-set MAE, which is precisely why test-set error alone is misleading. All models are 2-layer with 3.2M parameters, $L_{\text{max}} = 2$, $M_{\text{max}} = 2$:
| Model | Energy MAE | Force MAE | Stress MAE |
|---|---|---|---|
| eSEN (default) | 17.02 | 43.96 | 0.14 |
| eSEN, direct-force | 18.66 | 43.62 | 0.16 |
| eSEN, neighbor limit | 17.30 | 44.11 | 0.14 |
| eSEN, no envelope | 17.60 | 44.69 | 0.14 |
| eSEN, $N_{\text{basis}} = 512$ | 19.87 | 48.29 | 0.15 |
| eSEN, Bessel | 17.65 | 44.83 | 0.15 |
| eSEN, discrete, res=6 | 17.05 | 43.10 | 0.14 |
| eSEN, discrete, res=10 | 17.11 | 43.13 | 0.14 |
| eSEN, discrete, res=14 | 17.12 | 43.09 | 0.14 |
Energy MAE in meV/atom. Force MAE in meV/Å. Stress MAE in meV/Å/atom.
Matbench-Discovery (Tables 2 and 3)
Compliant models (trained only on MPTrj or its subset), unique prototype split:
| Model | F1 | DAF | $\kappa_{\text{SRME}}$ | RMSD |
|---|---|---|---|---|
| eSEN-30M-MP | 0.831 | 5.260 | 0.340 | 0.0752 |
| eqV2-S-DeNS | 0.815 | 5.042 | 1.676 | 0.0757 |
| MatRIS-MP | 0.809 | 5.049 | 0.861 | 0.0773 |
| AlphaNet-MP | 0.799 | 4.863 | 1.31 | 0.1067 |
| DPA3-v2-MP | 0.786 | 4.822 | 0.959 | 0.0823 |
| ORB v2 MPtrj | 0.765 | 4.702 | 1.725 | 0.1007 |
| SevenNet-13i5 | 0.760 | 4.629 | 0.550 | 0.0847 |
| GRACE-2L-MPtrj | 0.691 | 4.163 | 0.525 | 0.0897 |
| MACE-MP-0 | 0.669 | 3.777 | 0.647 | 0.0915 |
| CHGNet | 0.613 | 3.361 | 1.717 | 0.0949 |
| M3GNet | 0.569 | 2.882 | 1.412 | 0.1117 |
eSEN-30M-MP excels at both F1 and $\kappa_{\text{SRME}}$ simultaneously, while all previous models only achieve SOTA on one or the other.
Non-compliant models (trained on additional datasets):
| Model | F1 | $\kappa_{\text{SRME}}$ | RMSD |
|---|---|---|---|
| eSEN-30M-OAM | 0.925 | 0.170 | 0.0608 |
| eqV2-M-OAM | 0.917 | 1.771 | 0.0691 |
| ORB v3 | 0.905 | 0.210 | 0.0750 |
| SevenNet-MF-ompa | 0.901 | 0.317 | 0.0639 |
| DPA3-v2-OpenLAM | 0.890 | 0.687 | 0.0679 |
| GRACE-2L-OAM | 0.880 | 0.294 | 0.0666 |
| MatterSim-v1-5M | 0.862 | 0.574 | 0.0733 |
| MACE-MPA-0 | 0.852 | 0.412 | 0.0731 |
The eSEN-30M-OAM model starts from eSEN-30M-OMat (trained on OMat24), then is fine-tuned for 1 epoch on a dataset combining sAlex and 8 copies of MPTrj.
MDR Phonon Benchmark (Table 4)
Metrics: maximum phonon frequency MAE($\omega_{\text{max}}$) in K, vibrational entropy MAE($S$) in J/K/mol, Helmholtz free energy MAE($F$) in kJ/mol, heat capacity MAE($C_V$) in J/K/mol.
| Model | MAE($\omega_{\text{max}}$) | MAE($S$) | MAE($F$) | MAE($C_V$) |
|---|---|---|---|---|
| eSEN-30M-MP | 21 | 13 | 5 | 4 |
| SevenNet-13i5 | 26 | 28 | 10 | 5 |
| GRACE-2L (r6) | 40 | 25 | 9 | 5 |
| SevenNet-0 | 40 | 48 | 19 | 9 |
| MACE | 61 | 60 | 24 | 13 |
| CHGNet | 89 | 114 | 45 | 21 |
| M3GNet | 98 | 150 | 56 | 22 |
Direct-force models show dramatically worse performance at the standard 0.01 Å displacement (e.g., eqV2-S-DeNS: 280/224/54/94) but improve at larger displacements (0.2 Å: 58/26/8/8), revealing that their PES is rough near energy minima.
SPICE-MACE-OFF (Table 5)
Test set MAE for organic molecule energy/force prediction. Energy MAE in meV/atom, force MAE in meV/Å:
| Dataset | MACE-4.7M (E/F) | EscAIP-45M* (E/F) | eSEN-3.2M (E/F) | eSEN-6.5M (E/F) |
|---|---|---|---|---|
| PubChem | 0.88 / 14.75 | 0.53 / 5.86 | 0.22 / 6.10 | 0.15 / 4.21 |
| DES370K M. | 0.59 / 6.58 | 0.41 / 3.48 | 0.17 / 1.85 | 0.13 / 1.24 |
| DES370K D. | 0.54 / 6.62 | 0.38 / 2.18 | 0.20 / 2.77 | 0.15 / 2.12 |
| Dipeptides | 0.42 / 10.19 | 0.31 / 5.21 | 0.10 / 3.04 | 0.07 / 2.00 |
| Sol. AA | 0.98 / 19.43 | 0.61 / 11.52 | 0.30 / 5.76 | 0.25 / 3.68 |
| Water | 0.83 / 13.57 | 0.72 / 10.31 | 0.24 / 3.88 | 0.15 / 2.50 |
| QMugs | 0.45 / 16.93 | 0.41 / 8.74 | 0.16 / 5.70 | 0.12 / 3.78 |
*EscAIP-45M is a direct-force model. eSEN-6.5M outperforms MACE-OFF-L and EscAIP on all test splits. The smaller eSEN-3.2M has inference efficiency comparable to MACE-4.7M while achieving lower MAE.
Why These Design Choices Matter
Bounded Energy Derivatives and the Verlet Integrator
The theoretical foundation for why smoothness matters comes from Theorem 5.1 of Hairer et al. (2003). For the Verlet integrator (the standard NVE integrator), the total energy drift satisfies:
$$ |E(\mathbf{r}_T, \mathbf{a}) - E(\mathbf{r}_0, \mathbf{a})| \leq C \Delta t^2 + C_N \Delta t^N T $$
where $T$ is the total simulation time ($T \leq \Delta t^{-N}$), $N$ is the highest order for which the $N$th derivative of $E$ is continuously differentiable with bounded derivative, and $C$, $C_N$ are constants independent of $T$ and $\Delta t$. The first term is a time-independent fluctuation of $O(\Delta t^2)$; the second term governs long-term conservation. This means the PES must be continuously differentiable to high order, with bounded derivatives, for energy conservation in long-time simulations.
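The $O(\Delta t^2)$ fluctuation term can be seen in a minimal velocity-Verlet sketch on a 1D harmonic oscillator. This toy potential is infinitely smooth, so only the first term of the bound is visible; the function name and parameter values are illustrative and not taken from the paper.

```python
def max_energy_deviation(dt, total_time=50.0, k=1.0, m=1.0):
    """Run velocity Verlet on E = 0.5*m*v^2 + 0.5*k*x^2 and return
    the largest |E(t) - E(0)| seen along the trajectory."""
    x, v = 1.0, 0.0
    force = -k * x
    e0 = 0.5 * m * v * v + 0.5 * k * x * x
    worst = 0.0
    for _ in range(int(round(total_time / dt))):
        v += 0.5 * dt * force / m   # half kick
        x += dt * v                 # drift
        force = -k * x              # force at the new position
        v += 0.5 * dt * force / m   # half kick
        energy = 0.5 * m * v * v + 0.5 * k * x * x
        worst = max(worst, abs(energy - e0))
    return worst

dev_coarse = max_energy_deviation(dt=0.1)
dev_fine = max_energy_deviation(dt=0.05)
ratio = dev_coarse / dev_fine  # close to 4 when the error scales as dt^2
```

Halving the timestep cuts the energy fluctuation by roughly a factor of four, and the deviation stays bounded instead of growing with simulation time. A PES with discontinuities or unbounded derivatives loses this guarantee through the second term of the bound.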
Architectural Choices That Break Conservation
The authors provide theoretical justification for why specific architectural choices break energy conservation:
- Max Neighbor Limit (KNN): Introduces discontinuity in the PES. If a neighbor at distance $r$ moves to $r + \epsilon$ and drops out of the top-$K$, the energy changes discontinuously.
- Grid Discretization: Projecting spherical harmonics to a spatial grid introduces discretization errors in energy gradients that break conservation. This can be mitigated with higher-resolution grids but not eliminated.
- Direct-Force Prediction: Imposes no mathematical constraint that forces must be the gradient of an energy scalar field. In other words, $\nabla \times \mathbf{F} \neq 0$ is permitted, violating the requirement for a conservative force field.
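The last point can be checked numerically: a conservative force does zero net work around any closed path, while a field with non-zero curl does not. In the sketch below, `direct_force` adds a small hypothetical rotational component (`swirl`) to mimic what an unconstrained force head is free to learn; it is a toy 2D field, not a model from the paper.

```python
import math

def conservative_force(x, y):
    """F = -grad E for the toy energy E = 0.5 * (x^2 + y^2)."""
    return (-x, -y)

def direct_force(x, y, swirl=0.1):
    """Same field plus a rotational part; its curl is 2 * swirl."""
    return (-x - swirl * y, -y + swirl * x)

def loop_work(force_fn, n_segments=1000, radius=1.0):
    """Work done by force_fn along a counterclockwise circle (midpoint rule)."""
    work = 0.0
    for k in range(n_segments):
        t0 = 2.0 * math.pi * k / n_segments
        t1 = 2.0 * math.pi * (k + 1) / n_segments
        x0, y0 = radius * math.cos(t0), radius * math.sin(t0)
        x1, y1 = radius * math.cos(t1), radius * math.sin(t1)
        fx, fy = force_fn(0.5 * (x0 + x1), 0.5 * (y0 + y1))
        work += fx * (x1 - x0) + fy * (y1 - y0)
    return work

work_conservative = loop_work(conservative_force)  # ~0
work_direct = loop_work(direct_force)              # ~2 * swirl * pi * radius^2
```

A non-zero loop integral means the system gains or loses energy each time it traverses the loop, which shows up as energy drift in an NVE simulation.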
Displacement Sensitivity in Phonon Calculations
An important empirical finding concerns how displacement values affect phonon predictions. Conservative models (eSEN, MACE) show convergent phonon band structures as displacement decreases toward zero. In contrast, direct-force models (eqV2-S-DeNS) fail to converge, exhibiting missing acoustic branches and spurious imaginary frequencies at small displacements. While direct-force models achieve competitive thermodynamic property accuracy at large displacements (0.2 Å), this is deceptive: the underlying phonon band structures remain inaccurate, and the apparent accuracy comes from Boltzmann-weighted integrals smoothing over errors.
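A 1D toy model shows why a rough force field is more damaging at small displacements. Here `rough_force` adds a small, rapidly varying ripple to a harmonic force; the ripple amplitude and length scale are arbitrary illustrative values, not measurements of any model in the paper. Force constants are estimated by central finite differences of the force, as in a finite-displacement phonon calculation.

```python
import math

K_TRUE = 10.0  # true force constant of the toy harmonic well

def smooth_force(x):
    """Force from a smooth harmonic PES."""
    return -K_TRUE * x

def rough_force(x, amp=0.05, length_scale=0.005):
    """Harmonic force plus a small short-wavelength ripple (toy roughness)."""
    return -K_TRUE * x + amp * math.sin(x / length_scale)

def force_constant(force_fn, displacement):
    """Central finite-difference estimate of -dF/dx at x = 0."""
    return -(force_fn(displacement) - force_fn(-displacement)) / (2.0 * displacement)

errors = {}
for disp in (0.01, 0.2):  # displacements in Angstrom
    errors[disp] = {
        "smooth": abs(force_constant(smooth_force, disp) - K_TRUE),
        "rough": abs(force_constant(rough_force, disp) - K_TRUE),
    }
```

The ripple's contribution to the estimate is bounded by `amp / displacement`, so it is far larger at 0.01 Å than at 0.2 Å, while the smooth model is unaffected. This mirrors the qualitative pattern described above without reproducing the paper's numbers.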
Paper Information
Citation: Fu, X., Wood, B. M., Barroso-Luque, L., Levine, D. S., Gao, M., Dzamba, M., & Zitnick, C. L. (2025). Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction. Proceedings of the 42nd International Conference on Machine Learning (ICML).
Publication: ICML 2025
@inproceedings{fu2025learning,
title={Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction},
author={Fu, Xiang and Wood, Brandon M. and Barroso-Luque, Luis and Levine, Daniel S. and Gao, Meng and Dzamba, Misko and Zitnick, C. Lawrence},
booktitle={Proceedings of the 42nd International Conference on Machine Learning},
year={2025}
}
