Core Innovation: Adaptive Sparsity in SE(3) Networks

This is a methodological paper that introduces a new architecture and training curriculum for Geometric Deep Learning. It targets the main computational bottleneck in modern SE(3)-equivariant graph neural networks, the tensor product operation, and proposes a generalizable remedy based on adaptive network sparsification.

The Computational Bottleneck in DFT Hamiltonian Prediction

SE(3)-equivariant networks are accurate but unscalable for DFT Hamiltonian prediction due to two key bottlenecks:

  • Atom Scaling: The number of Tensor Product (TP) operations grows quadratically with the number of atoms ($N^2$).
  • Basis Set Scaling: Computational complexity grows with the sixth power of the angular momentum order ($L^6$). Larger basis sets (e.g., def2-TZVP) require higher orders ($L=6$), making them prohibitively slow.
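To make the two scaling rules concrete, the sketch below compares relative costs under the stated $N^2$ and $L^6$ laws. The function name `relative_tp_cost` and the example molecule sizes are illustrative assumptions, and all constant factors are ignored.

```python
def relative_tp_cost(n_atoms: int, l_max: int) -> int:
    """Unnormalized cost proxy: N^2 atom pairs times L^6 per tensor product."""
    return n_atoms**2 * l_max**6


# Illustrative sizes only: a 20-atom molecule at L=4 versus a 100-atom molecule at L=6.
small = relative_tp_cost(n_atoms=20, l_max=4)
large = relative_tp_cost(n_atoms=100, l_max=6)

basis_factor = 6**6 / 4**6    # cost growth from the basis set alone
atom_factor = 100**2 / 20**2  # cost growth from molecule size alone

print(f"basis set alone:  {basis_factor:.1f}x")   # 11.4x
print(f"atom count alone: {atom_factor:.1f}x")    # 25.0x
print(f"combined:         {large / small:.1f}x")  # 284.8x
```

Under these assumptions, moving from $L=4$ to $L=6$ alone costs roughly an order of magnitude, before any growth in atom count.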

Existing SE(3)-equivariant models cannot handle large molecules (40-100 atoms) with high-quality basis sets, limiting their practical applicability in computational chemistry.

SPHNet Architecture and the Three-Phase Sparsity Scheduler

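SPHNet introduces Adaptive Sparsity, which prunes redundant computations at two levels (atom pairs and tensor product paths) and trains the resulting gates with a dedicated scheduler: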

  1. Sparse Pair Gate: Learns which atom pairs to include in message passing, adapting the interaction graph based on importance.
  2. Sparse TP Gate: Filters which spherical harmonic triplets $(l_1, l_2, l_3)$ are computed in tensor product operations, pruning higher-order combinations that contribute less to accuracy.
  3. Three-Phase Sparsity Scheduler: A training curriculum (Random → Adaptive → Fixed) that enables stable convergence to high-performing sparse subnetworks.

Key insight: The Sparse Pair Gate learns to preserve long-range interactions (16-25 Angstrom) at higher rates than short-range ones. Short-range pairs are abundant and easier to learn, while rare long-range interactions require more samples for accurate representation, making them more critical to retain.

Benchmarks and Ablation Studies

The authors evaluated SPHNet on three datasets (MD17, QH9, and PubChemQH) with varying molecule sizes and basis set complexities. Baselines include SchNOrb, PhiSNet, QHNet, and WANet. SchNOrb and PhiSNet results are limited to MD17, as those models are designed for trajectory datasets. WANet was not open-sourced, so only partial metrics from its paper are reported.

Evaluation Metrics

  • Hamiltonian MAE ($H$): Mean absolute error between predicted and DFT-computed Hamiltonian matrices, in Hartrees ($E_h$)
  • Occupied Orbital Energy MAE ($\epsilon$): Mean absolute error of all occupied molecular orbital energies derived from the predicted Hamiltonian
  • Orbital Coefficient Similarity ($\psi$): Cosine similarity of occupied molecular orbital coefficients between predicted and reference wavefunctions
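As a rough illustration of the first and third metrics, the following sketch computes an element-wise MAE and a cosine similarity on toy values. The helper names and the numbers are made up for illustration; obtaining orbital energies and coefficients requires diagonalizing the Hamiltonian, which is left out here.

```python
import math


def hamiltonian_mae(h_pred, h_ref):
    """Mean absolute error over all matrix elements (same units as the input)."""
    diffs = [
        abs(p - r)
        for row_p, row_r in zip(h_pred, h_ref)
        for p, r in zip(row_p, row_r)
    ]
    return sum(diffs) / len(diffs)


def cosine_similarity(u, v):
    """Cosine similarity between two flattened coefficient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2x2 matrices, values made up for illustration.
h_ref = [[-1.0, 0.2], [0.2, -0.5]]
h_pred = [[-1.1, 0.2], [0.2, -0.4]]

mae = hamiltonian_mae(h_pred, h_ref)             # two entries off by 0.1, so about 0.05
psi = cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical vectors give 1.0
```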

Ablation Studies

Sparse Gates (on PubChemQH):

| Configuration | $H$ [$10^{-6} E_h$] $\downarrow$ | Memory [GB] $\downarrow$ | Speedup $\uparrow$ |
|---|---|---|---|
| Both gates | 97.31 | 5.62 | 7.09x |
| Pair Gate only | 87.70 | 6.98 | 2.73x |
| TP Gate only | 94.31 | 8.04 | 3.98x |
| Neither gate | 86.35 | 10.91 | 1.73x |

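Relative to the TP-Gate-only model, adding the Sparse Pair Gate gives a 78% speedup (3.98x to 7.09x) with a 30% memory reduction (8.04 GB to 5.62 GB). Relative to the Pair-Gate-only model, adding the Sparse TP Gate (pruning 70% of combinations) yields a 160% speedup (2.73x to 7.09x). Both gates together achieve the highest speedup, at the cost of a higher Hamiltonian MAE than the ungated model (97.31 vs 86.35).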
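The quoted percentages can be reproduced from the ablation rows if each gate's contribution is read as the change from the single-gate model to the model with both gates. That reading is an interpretation of the table, and the dictionary keys below are hypothetical labels.

```python
# Rows of the ablation table: (Hamiltonian MAE, memory in GB, speedup).
ablation = {
    "both": (97.31, 5.62, 7.09),
    "pair_only": (87.70, 6.98, 2.73),
    "tp_only": (94.31, 8.04, 3.98),
    "neither": (86.35, 10.91, 1.73),
}


def pct_gain(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Adding the Pair Gate to a model that already has the TP Gate.
pair_speedup = pct_gain(ablation["both"][2], ablation["tp_only"][2])
pair_memory_cut = (1.0 - ablation["both"][1] / ablation["tp_only"][1]) * 100.0

# Adding the TP Gate to a model that already has the Pair Gate.
tp_speedup = pct_gain(ablation["both"][2], ablation["pair_only"][2])

print(f"{pair_speedup:.0f}% {pair_memory_cut:.0f}% {tp_speedup:.0f}%")  # 78% 30% 160%
```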

Three-Phase Scheduler: Removing the random phase causes convergence to local optima (Hamiltonian MAE of $112.68 \pm 10.75$ vs $97.31 \pm 0.52$ for the full scheduler, in $10^{-6} E_h$). Removing the adaptive phase increases variance and lowers accuracy ($122.79 \pm 19.02$). Removing the fixed phase has minimal accuracy impact but reduces speedup from 7.09x to 5.45x due to dynamic graph overhead.

Sparsity Rate: The critical sparsity threshold scales with system complexity: 30% for MD17 (small molecules), 40% for QH9 (medium), and 70% for PubChemQH (large). Beyond the threshold, MAE increases sharply. Computational cost decreases approximately linearly with sparsity rate.

Transferability to Other Models

To demonstrate the speedup is architecture-agnostic, the authors applied the Sparse Pair Gate and Sparse TP Gate to the QHNet baseline on PubChemQH:

| Configuration | $H$ [$10^{-6} E_h$] $\downarrow$ | Memory [GB] $\downarrow$ | Speedup $\uparrow$ |
|---|---|---|---|
| QHNet baseline | 123.74 | 22.50 | 1.00x |
| + TP Gate | 128.16 | 12.68 | 2.04x |
| + Pair Gate | 126.27 | 10.07 | 1.66x |
| + Both gates | 128.89 | 8.46 | 3.30x |

The gates reduced QHNet’s memory by 62% and improved speed by 3.3x with modest accuracy trade-off, confirming the gates are portable modules applicable to other SE(3)-equivariant architectures.

Performance Results

QH9 (134k molecules, $\leq$ 20 atoms)

SPHNet achieves 3.3x to 4.0x speedup over QHNet across all four QH9 splits, with improved Hamiltonian MAE and orbital energy MAE. Memory drops to 0.23 GB/sample (33% of QHNet’s 0.70 GB). On the stable-iid split, Hamiltonian MAE improves from 76.31 to 45.48 ($10^{-6} E_h$).

PubChemQH (50k molecules, 40-100 atoms)

| Model | $H$ [$10^{-6} E_h$] $\downarrow$ | $\epsilon$ [$E_h$] $\downarrow$ | $\psi$ [$10^{-2}$] $\uparrow$ | Memory [GB] $\downarrow$ | Speedup $\uparrow$ |
|---|---|---|---|---|---|
| QHNet | 123.74 | 3.33 | 2.32 | 22.5 | 1.0x |
| WANet | 99.98 | 1.17 | 3.13 | 15.0 | 2.4x |
| SPHNet | 97.31 | 2.16 | 2.97 | 5.62 | 7.1x |

SPHNet achieves the best Hamiltonian MAE and efficiency, though WANet outperforms on orbital energy MAE and coefficient similarity. The higher speedup on PubChemQH (vs QH9) reflects greater computational redundancy in larger systems with higher-order basis sets ($L_{max} = 6$ for def2-TZVP vs $L_{max} = 4$ for def2-SVP).

MD17 (Small Molecule Trajectories)

SPHNet achieves accuracy comparable to QHNet and PhiSNet on four MD17 molecules (water, ethanol, malondialdehyde, uracil; 3-12 atoms). MD17 represents a simpler task where baseline models already perform well, leaving limited room for improvement. For water (3 atoms), the number of interaction combinations is inherently small, limiting the benefit of adaptive sparsification.

Scaling Limit

SPHNet can train on systems with approximately 3000 atomic orbitals on a single A6000 GPU; the QHNet baseline runs out of memory at approximately 1800 orbitals. Memory consumption scales more favorably as molecule size increases.

Key Findings

  • Adaptive sparsity scales with system complexity: The method is most effective for large systems where redundancy is high. For small molecules (e.g., water with only 3 atoms), every interaction is critical, so pruning hurts accuracy and yields negligible speedup.
  • Long-range pair preservation: The Sparse Pair Gate selects long-range pairs (16-25 Angstrom) at higher rates than short-range ones. Short-range pairs are numerous and easier to learn, while rare long-range interactions are harder to represent and thus more critical to retain.
  • Generalizable components: The sparsification techniques are portable modules, demonstrated by successful integration into QHNet with 3.3x speedup.
  • Architecture ablation: Removing one Vectorial Node Interaction block or Spherical Node Interaction block significantly hurts accuracy, confirming the importance of the progressive order-increase design. Removing one Pair Construction block has less impact, suggesting room for further speedup.

Reproducibility Details

Data

The experiments evaluated SPHNet on three datasets with different molecular sizes and basis set complexities. All datasets use DFT calculations as ground truth, with MD17 using the PBE exchange-correlation functional and QH9/PubChemQH using B3LYP.

| Dataset | Molecules | Molecule Size | Basis Set | $L_{max}$ | Functional |
|---|---|---|---|---|---|
| MD17 | 4 systems | 3-12 atoms (water, ethanol, malondialdehyde, uracil) | def2-SVP | 4 | PBE |
| QH9 | 134k | $\leq$ 20 atoms (Stable/Dynamic splits) | def2-SVP | 4 | B3LYP |
| PubChemQH | 50k | 40-100 atoms | def2-TZVP | 6 | B3LYP |

Data Availability:

  • MD17 & QH9: Publicly available
  • PubChemQH: Publicly available on Hugging Face (EperLuo/PubChemQH)

Algorithms

Loss Function:

The model learns the residual $\Delta H$:

$$ \begin{aligned} \Delta H &= H_{\text{ref}} - H_{\text{init}} \\ \mathcal{L} &= \text{MAE}(H_{\text{ref}}, H_{\text{pred}}) + \text{MSE}(H_{\text{ref}}, H_{\text{pred}}) \end{aligned} $$

where $H_{\text{init}}$ is a computationally inexpensive initial guess computed via PySCF.
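A minimal sketch of this loss on plain nested lists follows. It assumes the network outputs a residual that is added back to the initial guess, i.e. $H_{\text{pred}} = H_{\text{init}} + \Delta H_{\text{pred}}$; this reconstruction step, the function name `residual_loss`, and the toy matrices are assumptions for illustration.

```python
def residual_loss(h_ref, h_init, delta_pred):
    """MAE + MSE between the reference and the reconstructed Hamiltonian.

    Assumes H_pred = H_init + delta_pred, i.e. the network outputs the residual.
    """
    errors = [
        r - (i + d)
        for row_r, row_i, row_d in zip(h_ref, h_init, delta_pred)
        for r, i, d in zip(row_r, row_i, row_d)
    ]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae + mse


# Toy 2x2 matrices, values made up for illustration.
h_ref = [[-1.0, 0.25], [0.25, -0.5]]
h_init = [[-0.5, 0.25], [0.25, -0.5]]
perfect = [[-0.5, 0.0], [0.0, 0.0]]  # exactly H_ref - H_init
zero = [[0.0, 0.0], [0.0, 0.0]]      # predicting no correction at all

loss_perfect = residual_loss(h_ref, h_init, perfect)  # 0.0
loss_zero = residual_loss(h_ref, h_init, zero)        # 0.125 + 0.0625 = 0.1875
```

Learning the residual means that a network predicting nothing still starts from the quality of the initial guess rather than from zero.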

Hyperparameters:

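| Parameter | PubChemQH | QH9 | MD17 |
|---|---|---|---|
| Batch Size | 8 | 32 | 10 (uracil: 5) |
| Training Steps | 300k | 260k | 200k |
| Warmup Steps | 1k | 1k | 1k |
| Learning Rate | 1e-3 | 1e-3 | 5e-4 |
| Sparsity Rate | 0.7 | 0.4 | 0.1-0.3 |
| TSS Epoch $t$ | 3 | 3 | 3 |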

Sparse Pair Gate: Adapts the interaction graph. It concatenates zero-order features and inner products of atom pairs, then passes them through a linear layer $F_p$ with sigmoid activation to learn a weight $W_p^{ij}$ for every pair. Pairs are kept only if selected by the scheduler ($U_p^{TSS}$). The overhead comes primarily from the linear layer $F_p$.
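The gate computation described above might look roughly like the following sketch. It treats the per-pair inner products as precomputed inputs and uses a single weight vector in place of the linear layer $F_p$; all names, feature sizes, and numbers are hypothetical, and the top-fraction selection stands in for the scheduler's adaptive phase.

```python
import math


def pair_gate_weight(x_i, x_j, inner_ij, weights, bias):
    """sigmoid(linear(concat(x_i, x_j, inner_ij))) for one atom pair."""
    features = list(x_i) + list(x_j) + list(inner_ij)
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def keep_top_pairs(pair_weights, sparsity_rate):
    """Return the pairs with the largest gate weights, keeping a (1 - k) fraction."""
    n_keep = max(1, round(len(pair_weights) * (1.0 - sparsity_rate)))
    ranked = sorted(pair_weights, key=pair_weights.get, reverse=True)
    return set(ranked[:n_keep])


# Toy inputs: one zero-order scalar per atom and one inner product per pair.
atom_scalars = {0: [1.0], 1: [0.5], 2: [-1.0]}
pair_inner = {(0, 1): [0.2], (0, 2): [2.0], (1, 2): [-0.3]}
linear_w, linear_b = [0.5, 0.5, 1.0], 0.0

gate = {
    (i, j): pair_gate_weight(atom_scalars[i], atom_scalars[j], inner, linear_w, linear_b)
    for (i, j), inner in pair_inner.items()
}
kept = keep_top_pairs(gate, sparsity_rate=0.7)  # keeps only the pair (0, 2)
```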

Sparse TP Gate: Filters triplets $(l_1, l_2, l_3)$ inside the TP operation. Higher-order combinations are more likely to be pruned. Complexity: $\mathcal{O}(L^3)$.
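To illustrate where the $\mathcal{O}(L^3)$ count comes from, the sketch below enumerates triplets allowed by the standard triangle rule $|l_1 - l_2| \leq l_3 \leq l_1 + l_2$ and prunes them by weight. Whether SPHNet starts from exactly this set is an assumption, and the weights here are a made-up stand-in for learned values, chosen to decay with total order.

```python
def tp_triplets(l_max):
    """All (l1, l2, l3) with orders in 0..l_max obeying |l1 - l2| <= l3 <= l1 + l2."""
    return [
        (l1, l2, l3)
        for l1 in range(l_max + 1)
        for l2 in range(l_max + 1)
        for l3 in range(abs(l1 - l2), min(l1 + l2, l_max) + 1)
    ]


def prune_triplets(triplets, weights, sparsity_rate):
    """Keep the (1 - k) fraction of triplets with the largest weights."""
    n_keep = max(1, round(len(triplets) * (1.0 - sparsity_rate)))
    order = sorted(range(len(triplets)), key=lambda i: weights[i], reverse=True)
    return [triplets[i] for i in sorted(order[:n_keep])]


all_triplets = tp_triplets(3)                               # 34 triplets for l_max = 3
toy_weights = [1.0 / (1 + sum(t)) for t in all_triplets]    # made-up, not learned
kept = prune_triplets(all_triplets, toy_weights, sparsity_rate=0.7)  # 10 remain
```

With these toy weights, the surviving triplets are the low-order ones, mirroring the observation that higher-order combinations are more likely to be pruned.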

Three-Phase Sparsity Scheduler: Training curriculum designed to optimize the sparse gates effectively:

  • Phase 1 (Random): Random selection, keeping each element with probability $1-k$ (where $k$ is the sparsity rate), to ensure unbiased weight updates. Complexity: $\mathcal{O}(|U|)$.
  • Phase 2 (Adaptive): Selects the top $(1-k)$ fraction based on learned magnitude. Complexity: $\mathcal{O}(|U|\log|U|)$.
  • Phase 3 (Fixed): Freezes the connectivity mask for maximum inference speed. No overhead.
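The three phases can be sketched as a single selection function. This is a simplified reading, not the authors' implementation: the random phase is modeled as an independent keep decision per element, phase switching is left to the caller, and the function and argument names are hypothetical.

```python
import random


def select_indices(weights, sparsity_rate, phase, rng=None, frozen=None):
    """Return the set of kept indices for one training step.

    phase is one of "random", "adaptive", or "fixed".
    """
    n = len(weights)
    if phase == "random":
        # Keep each element independently with probability 1 - k: O(n).
        if rng is None:
            rng = random.Random()
        return {i for i in range(n) if rng.random() < 1.0 - sparsity_rate}
    if phase == "adaptive":
        # Keep the (1 - k) fraction with the largest weights: O(n log n) sort.
        n_keep = max(1, round(n * (1.0 - sparsity_rate)))
        ranked = sorted(range(n), key=lambda i: weights[i], reverse=True)
        return set(ranked[:n_keep])
    if phase == "fixed":
        # Reuse a mask frozen earlier; nothing is recomputed per step.
        if frozen is None:
            raise ValueError("fixed phase needs a frozen mask")
        return set(frozen)
    raise ValueError(f"unknown phase: {phase}")


# Toy weights, made up for illustration.
weights = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6, 0.4, 0.5, 0.05]
warmup = select_indices(weights, 0.7, "random", rng=random.Random(0))
adaptive = select_indices(weights, 0.7, "adaptive")           # {0, 2, 4}
fixed = select_indices(weights, 0.7, "fixed", frozen=adaptive)
```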

Weight Initialization: Learnable sparsity weights ($W$) are initialized as an all-ones vector.

Models

The model predicts the Hamiltonian matrix $H$ from atomic numbers $Z$ and coordinates $r$.

Inputs: Atomic numbers ($Z$) and 3D coordinates.

Backbone Structure:

  1. Vectorial Node Interaction (x4): Uses long-short range message passing. Extracts vectorial representations ($l=1$) without high-order TPs to save cost.
  2. Spherical Node Interaction (x2): Projects features to high-order spherical harmonics (up to $L_{max}$). The first block increases the maximum order from 0 to $L_{max}$ without the Sparse Pair Gate; the second block applies the Sparse Pair Gate to filter node pairs.
  3. Pair Construction Block (x2): Splits into Diagonal (self-interaction) and Non-Diagonal (cross-interaction) blocks. Both use the Sparse TP Gate to prune cross-order combinations $(l_1, l_2, l_3)$. The Non-Diagonal block also uses the Sparse Pair Gate, except in the first Pair Construction block, which omits it to ensure complete information flow.
  4. Expansion Block: Reconstructs the full Hamiltonian matrix from the sparse irreducible representations, exploiting symmetry ($H_{ji} = H_{ij}^T$) to halve computations.
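The symmetry step of the Expansion Block can be illustrated with a small assembly routine that takes only the blocks with $i \leq j$ and mirrors the rest. It covers the $H_{ji} = H_{ij}^T$ bookkeeping only, not the conversion from irreducible representations to matrix blocks; the function names and block layout are assumptions.

```python
def transpose(block):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*block)]


def assemble_symmetric(blocks, sizes):
    """Build the full matrix from blocks[(i, j)] with i <= j, using H_ji = H_ij^T."""
    offsets = [sum(sizes[:a]) for a in range(len(sizes))]
    n = sum(sizes)
    full = [[0.0] * n for _ in range(n)]
    for (i, j), block in blocks.items():
        targets = [(i, j, block)]
        if i != j:
            # The mirrored block is copied, so it never has to be predicted.
            targets.append((j, i, transpose(block)))
        for a, b, blk in targets:
            for r, row in enumerate(blk):
                for c, value in enumerate(row):
                    full[offsets[a] + r][offsets[b] + c] = value
    return full


# Toy system: atom 0 has 1 orbital, atom 1 has 2 orbitals (values made up).
sizes = [1, 2]
blocks = {
    (0, 0): [[-1.0]],
    (0, 1): [[0.1, 0.2]],
    (1, 1): [[-0.5, 0.3], [0.3, -0.4]],
}
full = assemble_symmetric(blocks, sizes)
# full == [[-1.0, 0.1, 0.2], [0.1, -0.5, 0.3], [0.2, 0.3, -0.4]]
```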

Hardware

  • Training: 4x NVIDIA A100 (80GB)
  • Benchmarking: Single NVIDIA RTX A6000 (46GB)

Paper Information

Citation: Luo, E., Wei, X., Huang, L., Li, Y., Yang, H., Xia, Z., Wang, Z., Liu, C., Shao, B., & Zhang, J. (2025). Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity. Proceedings of the 42nd International Conference on Machine Learning (ICML), Vancouver, Canada.

Publication: ICML 2025

@inproceedings{luo2025efficient,
  title={Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity},
  author={Luo, Erpai and Wei, Xinran and Huang, Lin and Li, Yunyang and Yang, Han and Xia, Zaishuo and Wang, Zun and Liu, Chang and Shao, Bin and Zhang, Jia},
  booktitle={Proceedings of the 42nd International Conference on Machine Learning},
  year={2025}
}

Additional Resources: