Contribution: Systematic Assessment of Non-Conservative ML Force Models

This is a systematization paper. It catalogs the failure modes of existing non-conservative force approaches, quantifies them with a new diagnostic metric, and proposes a hybrid Multiple Time-Stepping (MTS) scheme that combines the speed of direct force prediction with the physical correctness of conservative models.

Motivation: The Speed-Accuracy Trade-off in ML Force Fields

Many recent machine learning interatomic potential (MLIP) architectures predict forces directly ($F_\theta(r)$) instead of deriving them as the negative gradient of a predicted energy. This “non-conservative” approach avoids the computational overhead of automatic differentiation, yielding faster inference (typically 2-3x speedup) and faster training (up to 3x). However, it sacrifices energy conservation and, in rotationally unconstrained architectures, rotational symmetry as well, potentially destabilizing molecular dynamics simulations. The field lacks rigorous quantification of when this trade-off breaks down and how to mitigate the failures.

Novelty: Jacobian Asymmetry and Hybrid Architectures

Four key contributions:

  1. Jacobian Asymmetry Metric ($\lambda$): A quantitative diagnostic for non-conservation. Conservative forces are the negative gradient of a scalar energy, so their Jacobian $\mathbf{J} = \partial \mathbf{F} / \partial \mathbf{r}$ (the negative Hessian of the energy) must be symmetric. The normalized Frobenius norm of the antisymmetric part quantifies the degree of violation: $$ \lambda = \frac{\lVert \mathbf{J}_{\text{anti}} \rVert_F}{\lVert \mathbf{J} \rVert_F} $$ where $\mathbf{J}_{\text{anti}} = (\mathbf{J} - \mathbf{J}^\top)/2$, so $\lambda = 0$ for an exactly conservative model. Measured values range from $\lambda \approx 0.004$ (PET-NC) to $\lambda \approx 0.032$ (SOAP-BPNN-NC), with ORB at 0.015 and EquiformerV2 at 0.017. Notably, the pairwise $\lambda_{ij}$ approaches 1 at large interatomic distances, meaning non-conservative artifacts disproportionately affect long-range and collective interactions.

  2. Systematic Failure Mode Catalog: First comprehensive demonstration that non-conservative models cause runaway heating in NVE ensembles (temperature drifts of ~7,000 billion K/s, i.e. about 7 K/ps, for PET-NC and ~10x larger for ORB) and equipartition violations in NVT ensembles, where different atom types equilibrate to different temperatures, a physical impossibility.

  3. Theoretical Analysis of Force vs. Energy Training: Force-only training overemphasizes high-frequency vibrational modes because force labels carry per-atom gradients that are dominated by stiff, short-range interactions. Energy labels provide a more balanced representation across the frequency spectrum. Additionally, conservative models benefit from backpropagation extending the effective receptive field to approximately 2x the interaction cutoff, while direct-force models are limited to the nominal cutoff radius.

  4. Hybrid Training and Inference Protocol: A practical workflow that combines fast direct-force prediction with conservative corrections:

    • Training: Pre-train on direct forces, then fine-tune on energy gradients (2-4x faster than training conservative models from scratch)
    • Inference: Multiple Time-Stepping (MTS) where fast non-conservative forces are periodically corrected by slower conservative forces
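
To make the Jacobian asymmetry metric from item 1 concrete, here is a minimal sketch that evaluates $\lambda$ by central finite differences. The two toy 2D force functions, the helper names (`jacobian`, `asymmetry`), the step size `h`, and the evaluation point are all illustrative assumptions and are not taken from the paper, which may well compute the Jacobian differently (for example by automatic differentiation).

```python
import math

def jacobian(force_fn, x, h=1e-5):
    """Central-difference Jacobian J[i][j] = dF_i/dx_j of a force function."""
    n = len(x)
    J = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        f_plus, f_minus = force_fn(x_plus), force_fn(x_minus)
        for i in range(n):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return J

def asymmetry(J):
    """lambda = ||(J - J^T)/2||_F / ||J||_F."""
    n = len(J)
    anti_sq = sum(((J[i][j] - J[j][i]) / 2.0) ** 2
                  for i in range(n) for j in range(n))
    full_sq = sum(J[i][j] ** 2 for i in range(n) for j in range(n))
    return math.sqrt(anti_sq / full_sq)

# Toy forces (hypothetical, for illustration only).
def conservative_force(x):
    # F = -grad U with U = 0.5*x0^2 + x1^2 + 0.3*x0*x1
    return [-(x[0] + 0.3 * x[1]), -(2.0 * x[1] + 0.3 * x[0])]

def nonconservative_force(x, eps=0.1):
    # Adds a small rotational term that no scalar potential can produce.
    f = conservative_force(x)
    return [f[0] - eps * x[1], f[1] + eps * x[0]]

point = [0.4, -0.7]
lam_c = asymmetry(jacobian(conservative_force, point))
lam_nc = asymmetry(jacobian(nonconservative_force, point))
print(f"lambda (conservative)     = {lam_c:.2e}")
print(f"lambda (non-conservative) = {lam_nc:.4f}")
```

For the conservative toy force the metric vanishes up to round-off, while the added rotational term gives $\lambda = \sqrt{0.02/5.2} \approx 0.062$. For a real model the same two helpers would be applied to the flattened $3N$-dimensional force vector.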

Methodology: Systematic Failure Mode Analysis

The evaluation systematically tests multiple state-of-the-art models across diverse simulation scenarios:

Models tested:

  • PET-C/PET-NC (Point Edge Transformer, conservative and non-conservative variants)
  • PET-M (hybrid variant jointly predicting both conservative and non-conservative forces)
  • ORB-v2 (non-conservative, trained on Alexandria/MPtrj)
  • EquiformerV2 (non-conservative equivariant Transformer)
  • MACE-MP-0 (conservative message-passing)
  • SevenNet (conservative message-passing)
  • SOAP-BPNN-C/SOAP-BPNN-NC (descriptor-based baseline, both conservative and non-conservative variants)

Test scenarios:

  1. NVE stability tests on bulk liquid water, graphene, amorphous carbon, and FCC aluminum
  2. Thermostat artifact analysis with Langevin and GLE thermostats
  3. Geometry optimization on water snapshots and QM9 molecules using FIRE and L-BFGS
  4. MTS validation on OC20 catalysis dataset
  5. Species-resolved temperature measurements for equipartition testing

Key metrics:

  • Jacobian asymmetry ($\lambda$)
  • Kinetic temperature drift in NVE
  • Velocity-velocity correlations
  • Radial distribution functions
  • Species-resolved temperatures
  • Inference speed benchmarks

Results: Simulation Instability and Hybrid Solutions

Purely non-conservative models are unsuitable for production simulations due to uncontrollable unphysical artifacts that no thermostat can correct. Key findings:

Performance failures:

  • Non-conservative models exhibited catastrophic temperature drift in NVE simulations: ~7,000 billion K/s (about 7 K/ps) for PET-NC and ~70,000 billion K/s (about 70 K/ps) for ORB, with EquiformerV2 comparable to PET-NC
  • Strong Langevin thermostats ($\tau=10$ fs) damped diffusion by ~5x, negating the speed benefits of non-conservative models
  • Advanced GLE thermostats also failed to control non-conservative drift (ORB reached 1181 K vs. 300 K target)
  • Equipartition violations: under stochastic velocity rescaling, O and H atoms equilibrated at different temperatures. For ORB, H atoms reached 336 K and O atoms 230 K against a 300 K target. For PET-NC, deviations were smaller but still significant (H at 296 K, O at 310 K).
  • Geometry optimization was more fragile with non-conservative forces: inaccurate NC models (SOAP-BPNN-NC) failed catastrophically, while more accurate ones (PET-NC) could converge with FIRE but showed large force fluctuations with L-BFGS. Non-conservative models consistently had lower success rates across water and QM9 benchmarks.
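
The equipartition test above rests on species-resolved kinetic temperatures. The sketch below shows how such a quantity could be computed from masses and velocities using $T_s = \sum_{i \in s} m_i |v_i|^2 / (3 N_s k_B)$. The function names and the synthetic water-like system are assumptions for illustration. The velocities are constructed so that the ORB values quoted above come out by design, so this demonstrates the bookkeeping only and reproduces no simulation.

```python
import math
from collections import defaultdict

K_B = 1.380649e-23       # Boltzmann constant, J/K
AMU = 1.66053906660e-27  # atomic mass unit, kg

def species_temperatures(species, masses_amu, velocities):
    """Kinetic temperature per species, T_s = sum_i m_i |v_i|^2 / (3 N_s k_B).

    Velocities are in m/s. No centre-of-mass or constraint correction is applied.
    """
    twice_ke = defaultdict(float)
    counts = defaultdict(int)
    for s, m, v in zip(species, masses_amu, velocities):
        twice_ke[s] += m * AMU * sum(c * c for c in v)
        counts[s] += 1
    return {s: twice_ke[s] / (3.0 * counts[s] * K_B) for s in counts}

def speed_for(temp_k, mass_amu):
    """Speed at which one atom carries exactly (3/2) k_B T of kinetic energy."""
    return math.sqrt(3.0 * K_B * temp_k / (mass_amu * AMU))

# Synthetic system with 2 H per O; temperatures imposed by construction.
species, masses, velocities = [], [], []
for _ in range(4):
    for s, m, t in (("O", 15.999, 230.0), ("H", 1.008, 336.0), ("H", 1.008, 336.0)):
        species.append(s)
        masses.append(m)
        velocities.append((speed_for(t, m), 0.0, 0.0))

temps = species_temperatures(species, masses, velocities)
mean_t = (2.0 * temps["H"] + temps["O"]) / 3.0  # number-weighted mean for H2O
print(temps, round(mean_t, 1))
```

For a 2:1 H:O ratio, the number-weighted mean of 336 K and 230 K is about 301 K. That is consistent with a global thermostat holding the overall temperature near the 300 K target while the individual species disagree.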

Hybrid solution success:

  • MTS with non-conservative forces corrected every 8 steps ($M=8$) achieved conservative stability with only ~20% overhead compared to a purely non-conservative trajectory. Results were essentially indistinguishable from fully conservative simulations. Higher stride values ($M=16$) became unstable due to resonances between fast degrees of freedom and integration errors.
  • Conservative fine-tuning achieved the accuracy of from-scratch training in about 1/3 the total training time (2-4x resource reduction)
  • Validated on OC20 catalysis benchmark
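
One common way to realise the MTS scheme described above is an r-RESPA-style splitting, in which the cheap direct force drives the inner steps and the difference between conservative and direct forces is applied as a kick every $M$ steps. The sketch below applies that idea to a toy 2D harmonic oscillator with unit mass. The force functions, `eps`, the time step, and the run length are illustrative assumptions, and the paper's actual integrator settings may differ.

```python
def f_conservative(x):
    # F = -grad U for U = 0.5 * |x|^2 (toy "slow but correct" force)
    return [-x[0], -x[1]]

def f_direct(x, eps=0.05):
    # Toy "fast" force: conservative part plus a small rotational error
    return [-x[0] - eps * x[1], -x[1] + eps * x[0]]

def energy(x, v):
    return 0.5 * (v[0] ** 2 + v[1] ** 2) + 0.5 * (x[0] ** 2 + x[1] ** 2)

def verlet_step(x, v, force, dt):
    """One velocity-Verlet step (unit mass); forces are recomputed for clarity."""
    f = force(x)
    v = [vi + 0.5 * dt * fi for vi, fi in zip(v, f)]
    x = [xi + dt * vi for xi, vi in zip(x, v)]
    f = force(x)
    v = [vi + 0.5 * dt * fi for vi, fi in zip(v, f)]
    return x, v

def correction_kick(x, v, dt_outer):
    """Half-kick with the slow correction F_C - F_NC over the outer step."""
    corr = [c - d for c, d in zip(f_conservative(x), f_direct(x))]
    return [vi + 0.5 * dt_outer * ci for vi, ci in zip(v, corr)]

def mts_step(x, v, dt, stride):
    """One outer MTS step: correction kick, `stride` inner steps, correction kick."""
    v = correction_kick(x, v, stride * dt)
    for _ in range(stride):
        x, v = verlet_step(x, v, f_direct, dt)
    v = correction_kick(x, v, stride * dt)
    return x, v

dt, M, n_outer = 0.05, 8, 500
x0, v0 = [1.0, 0.0], [0.0, 0.0]
e0 = energy(x0, v0)

# Direct forces only: the rotational error pumps energy into the system.
x, v = list(x0), list(v0)
for _ in range(n_outer * M):
    x, v = verlet_step(x, v, f_direct, dt)
e_nc = energy(x, v)

# MTS: same inner force, corrected every M steps.
x, v = list(x0), list(v0)
for _ in range(n_outer):
    x, v = mts_step(x, v, dt, M)
e_mts = energy(x, v)

print(f"E0 = {e0:.3f}, direct-only = {e_nc:.1f}, MTS = {e_mts:.3f}")
```

In this toy setting the direct-only trajectory gains energy by orders of magnitude over the run, while the corrected trajectory stays close to its initial energy. This mirrors, qualitatively only, the contrast reported above between pure non-conservative dynamics and MTS with $M=8$.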

Scaling caveat: The authors note that as training datasets grow and models become more expressive, non-conservative artifacts should diminish because accurate models naturally exhibit less non-conservative behavior. However, they argue the best path forward is hybrid approaches rather than waiting for scale to solve the problem.

Recommendation: The optimal production path is hybrid architectures using direct forces for acceleration (via MTS and pre-training) while anchoring models in conservative energy surfaces. This captures computational benefits without sacrificing physical reliability.

Reproducibility Details

Data

Primary training/evaluation:

  • Bulk Liquid Water (Cheng et al., 2019): revPBE0-D3 calculations with over 250,000 force/energy targets, chosen for rigorous thermodynamic testing

Generalization tests:

  • Graphene, amorphous carbon, FCC aluminum (tested with general-purpose foundation models)

Benchmarks:

  • QM9: Geometry optimization tests
  • OC20 (Open Catalyst): Oxygen on alloy surfaces for MTS validation

All datasets are publicly available through the cited sources.

Models

Point Edge Transformer (PET) variants:

  • PET-C (Conservative): Forces via energy backpropagation
  • PET-NC (Non-Conservative): Direct force prediction head, slightly higher parameter count
  • PET-M (Hybrid): Jointly predicts both conservative and non-conservative forces, accuracy within ~10% of the best single-task models

Baseline comparisons:

| Model | Type | Training Data | Notes |
| --- | --- | --- | --- |
| ORB-v2 | Non-conservative | Alexandria/MPtrj | Rotationally unconstrained |
| EquiformerV2 | Non-conservative | Alexandria/MPtrj | Equivariant Transformer |
| MACE-MP-0 | Conservative | MPtrj | Equivariant message-passing |
| SevenNet | Conservative | MPtrj | Equivariant message-passing |
| SOAP-BPNN-C | Conservative | Bulk water | Descriptor-based baseline |
| SOAP-BPNN-NC | Non-conservative | Bulk water | Descriptor-based baseline |

Training details:

  • Loss functions: PET-C uses joint Energy + Force $L^2$ loss; PET-NC uses Force-only $L^2$ loss
  • Fine-tuning protocol: PET-NC converted to conservative via energy head fine-tuning
  • MTS configuration: Non-conservative forces with conservative corrections every 8 steps ($M=8$)

Evaluation

Metrics & Software: Molecular dynamics evaluations were performed using i-PI, while geometry optimizations used ASE (Atomic Simulation Environment). Note that primary code reproducibility is provided via an archived Zenodo snapshot; the authors did not link a live, public GitHub repository.

  1. Jacobian asymmetry ($\lambda$): Quantifies non-conservation via antisymmetric component
  2. Temperature drift: NVE ensemble stability
  3. Velocity-velocity correlation ($\hat{c}_{vv}(\omega)$): Thermostat artifact detection
  4. Radial distribution functions ($g(r)$): Structural accuracy
  5. Species-resolved temperature: Equipartition testing
  6. Inference speed: Wall-clock time per MD step
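
As an illustration of the temperature-drift metric, the following sketch estimates a drift rate as the least-squares slope of the kinetic temperature over time and converts it from K/ps to K/s. The function name and the synthetic, noise-free trajectory are assumptions. The paper's exact fitting procedure is not described here, so this is only one plausible way to obtain such a number.

```python
PS_PER_S = 1e12  # picoseconds per second

def drift_rate(times_ps, temps_k):
    """Least-squares slope of T(t), in K/ps."""
    n = len(times_ps)
    t_mean = sum(times_ps) / n
    temp_mean = sum(temps_k) / n
    num = sum((t - t_mean) * (temp - temp_mean)
              for t, temp in zip(times_ps, temps_k))
    den = sum((t - t_mean) ** 2 for t in times_ps)
    return num / den

# Synthetic NVE trace: 10 ps of linear heating at 7 K/ps starting from 300 K.
times = [0.1 * i for i in range(101)]
temps = [300.0 + 7.0 * t for t in times]

rate_k_per_ps = drift_rate(times, temps)
rate_k_per_s = rate_k_per_ps * PS_PER_S
print(f"{rate_k_per_ps:.2f} K/ps = {rate_k_per_s:.2e} K/s")
```

A slope of 7 K/ps corresponds to $7 \times 10^{12}$ K/s, i.e. the "7,000 billion K/s" scale quoted for PET-NC.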

Key results:

| Model | Speed (ms/step) | NVE Stability | Notes |
| --- | --- | --- | --- |
| PET-NC | 8.58 | Failed | ~7,000 billion K/s drift |
| PET-C | 19.4 | Stable | 2.3x slower than PET-NC |
| SevenNet | 52.8 | Stable | Conservative baseline |
| PET Hybrid (MTS) | ~10.3 | Stable | ~20% overhead vs. pure NC |
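
A back-of-envelope cost model helps relate the MTS timing to the single-model timings in the table. The sketch below averages the cost of one outer cycle of $M=8$ steps under two bracketing assumptions. In the first, the conservative evaluation also yields the direct force for that step (as a jointly predicting model such as PET-M could). In the second, it is an extra evaluation on top of all eight direct-force calls. The function and both assumptions are illustrative and are not the paper's stated accounting.

```python
def mts_cost_per_step(t_fast_ms, t_slow_ms, stride, shared_outer_eval):
    """Average cost per inner step over one outer cycle of `stride` steps."""
    fast_evals = stride - 1 if shared_outer_eval else stride
    return (fast_evals * t_fast_ms + t_slow_ms) / stride

T_NC, T_C, M = 8.58, 19.4, 8  # ms/step and stride as quoted in the text

shared = mts_cost_per_step(T_NC, T_C, M, shared_outer_eval=True)
separate = mts_cost_per_step(T_NC, T_C, M, shared_outer_eval=False)

for label, cost in (("shared outer evaluation", shared),
                    ("separate outer evaluation", separate)):
    print(f"{label}: {cost:.2f} ms/step, overhead {cost / T_NC - 1.0:.0%}")
```

The two estimates come out at roughly 9.9 and 11.0 ms/step (about 16% and 28% overhead over pure PET-NC), and the reported ~10.3 ms/step lies between them.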

Thermostat artifacts:

  • Langevin ($\tau=10$ fs) damped diffusion by ~5x; weaker coupling ($\tau=100$ fs) still reduced diffusion by ~1.5x
  • GLE thermostats also failed to control non-conservative drift
  • Equipartition violations under stochastic velocity rescaling (SVR): ORB showed H at 336 K and O at 230 K (target 300 K); PET-NC showed smaller but still significant species-resolved deviations (H at 296 K, O at 310 K)

Optimization failures:

  • Non-conservative models showed lower geometry optimization success rates across water and QM9 benchmarks, with inaccurate NC models failing catastrophically

Hardware

Compute resources:

  • Training: From-scratch baseline models were trained on 4x NVIDIA H100 GPUs for about two days.
  • Fine-Tuning: Conservative fine-tuning was performed on a single NVIDIA H100 GPU for one day.
  • This hybrid fine-tuning approach achieved a 2-4x reduction in computational resources compared to training conservative models from scratch.

Paper Information

Citation: Bigi, F., Langer, M. F., & Ceriotti, M. (2025). The dark side of the forces: assessing non-conservative force models for atomistic machine learning. Proceedings of the 42nd International Conference on Machine Learning (ICML).

Publication: ICML 2025

@inproceedings{bigi2025dark,
  title={The dark side of the forces: assessing non-conservative force models for atomistic machine learning},
  author={Bigi, Filippo and Langer, Marcel F and Ceriotti, Michele},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}
