What kind of paper is this?

This is a method paper that introduces the SE(3)-Transformer, a self-attention mechanism for 3D point clouds and graphs that is equivariant under continuous 3D rotations and translations. It builds on tensor field networks (TFNs) by adding data-dependent attention weights, resolving a known expressiveness limitation of equivariant convolutions.

Why equivariant attention for point clouds?

Point cloud data appears in 3D object scans, molecular structures, and particle simulations. Two properties are essential: handling varying numbers of irregularly sampled points, and invariance to global changes in pose (rotations and translations).

Self-attention handles variable-size inputs naturally and has proven effective across many domains. Tensor field networks provide SE(3)-equivariant convolutions but suffer from a key limitation: their filter kernels are decomposed into learnable radial functions and fixed angular components (spherical harmonics). The angular dependence is completely constrained by the equivariance condition, leaving no learnable degrees of freedom in the angular direction. This has been identified in the literature as severely limiting performance.

The SE(3)-Transformer resolves this by introducing data-dependent attention weights that modulate the angular profile of the kernels while maintaining equivariance.

Architecture: invariant attention meets equivariant values

The core layer combines three components:

$$\mathbf{f}_{\text{out},i}^{\ell} = \underbrace{\mathbf{W}_V^{\ell\ell} \mathbf{f}_{\text{in},i}^{\ell}}_{\text{self-interaction}} + \sum_{k \geq 0} \sum_{j \in \mathcal{N}_i \setminus i} \underbrace{\alpha_{ij}}_{\text{attention}} \underbrace{\mathbf{W}_V^{\ell k}(\mathbf{x}_j - \mathbf{x}_i) \mathbf{f}_{\text{in},j}^k}_{\text{value message}}$$
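As a sanity check, the structure of this layer can be mimicked in a few lines of numpy for type-1 (vector) features only, using scalar multiples of the identity as the simplest possible equivariant query/key/value maps; the real model uses TFN kernels built from spherical harmonics. This is an illustrative sketch with made-up parameter names, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, f, wq, wk, wv, wself):
    """Toy SE(3)-attention layer on type-1 (vector) features.

    x: (N, 3) positions; f: (N, 3) type-1 features. wq, wk, wv, wself
    are scalars standing in for the equivariant linear maps.
    """
    N = len(x)
    out = wself * f                               # self-interaction term
    for i in range(N):
        nbrs = [j for j in range(N) if j != i]
        # invariant attention logits: dot products of equivariant q, k
        logits = np.array([(wq * f[i]) @ (wk * f[j]) for j in nbrs])
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                      # softmax over neighbours
        for a, j in zip(alpha, nbrs):
            # value message: equivariant function of relative position
            out[i] += a * wv * (x[j] - x[i])
    return out

x = rng.normal(size=(5, 3))
f = rng.normal(size=(5, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random orthogonal matrix
out = layer(x, f, 0.7, 1.3, 0.5, 0.9)
out_rot = layer(x @ R.T, f @ R.T, 0.7, 1.3, 0.5, 0.9)
# equivariance: rotating the inputs rotates the outputs identically
assert np.allclose(out_rot, out @ R.T, atol=1e-10)
```

The attention weights are unchanged by the rotation (they depend only on inner products), while the value messages and self-interaction rotate with the input, so the whole layer is equivariant.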

Invariant attention weights

The attention weights use dot-product attention between equivariant queries and keys:

$$\alpha_{ij} = \frac{\exp(\mathbf{q}_i^\top \mathbf{k}_{ij})}{\sum_{j' \in \mathcal{N}_i \setminus i} \exp(\mathbf{q}_i^\top \mathbf{k}_{j'})}$$

Both $\mathbf{q}_i$ and $\mathbf{k}_{ij}$ are constructed using TFN-type linear embeddings, making them SE(3)-equivariant. Their inner product is invariant because SO(3) representations are orthogonal: $\mathbf{q}^\top \mathbf{S}_g^\top \mathbf{S}_g \mathbf{k} = \mathbf{q}^\top \mathbf{k}$.
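The invariance argument can be checked numerically for a type-1 (3-dimensional) representation, where $\mathbf{S}_g$ is an ordinary orthogonal matrix (illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.normal(size=3)
k = rng.normal(size=3)
# random orthogonal matrix S_g via QR decomposition
S, _ = np.linalg.qr(rng.normal(size=(3, 3)))

# q^T S^T S k == q^T k because S^T S = I (orthogonality)
assert np.isclose((S @ q) @ (S @ k), q @ k)
```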

Equivariant value messages

The value messages use the same TFN kernel structure as tensor field networks: weight kernels $\mathbf{W}_V^{\ell k}(\mathbf{x})$ decomposed into learnable radial functions and Clebsch-Gordan/spherical harmonic angular components. Features are typed by irreducible representation degree $\ell$ (the independent matrix blocks into which SO(3) group actions decompose): type-0 vectors are rotation-invariant scalars, type-1 vectors transform as 3D vectors, and so on.
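For intuition on feature types: in the Cartesian basis, the degree-1 Wigner matrix is just the rotation matrix itself, so a type-1 feature rotates as a 3D vector while any type-0 quantity derived from it (such as its norm) is invariant. A minimal numpy check:

```python
import numpy as np

rng = np.random.default_rng(2)
# a proper rotation matrix (det = +1) via QR decomposition
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(R) < 0:
    R[:, 0] *= -1

v = rng.normal(size=3)     # type-1 feature: transforms as R @ v
s = np.linalg.norm(v)      # a derived type-0 quantity (scalar)

# rotating the type-1 feature leaves the type-0 quantity unchanged
assert np.isclose(np.linalg.norm(R @ v), s)
```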

Angular modulation

The attention weights $\alpha_{ij}$ multiply the value messages, creating data-dependent kernels $\alpha_{ij} \mathbf{W}_V^{\ell k}(\mathbf{x})$. This effectively modulates the angular profile of the fixed spherical harmonic components, adding learnable angular degrees of freedom while preserving equivariance. The authors describe this as one of the first examples of a nonlinear equivariant layer.

Attentive self-interaction

The paper also introduces attentive self-interaction as an alternative to the standard linear self-interaction (analogous to 1x1 convolutions). Instead of fixed learned weights across all points, the weights are generated by an MLP operating on invariant inner products of the input features:

$$w_{i,c'c}^{\ell\ell} = \text{MLP}\left(\bigoplus_{c,c'} \mathbf{f}_{\text{in},i,c'}^{\ell\top} \mathbf{f}_{\text{in},i,c}^{\ell}\right)$$
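The reason this stays equivariant is that the concatenated inner products form a Gram matrix over channels, which is unchanged under rotation; any MLP applied to it therefore produces invariant weights. A small numpy check (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # random orthogonal matrix

# 4 channels of type-1 features at one point: shape (channels, 2*l + 1)
f = rng.normal(size=(4, 3))

gram = f @ f.T                          # all pairwise inner products
gram_rot = (f @ R.T) @ (f @ R.T).T      # same after rotating every channel

# the Gram matrix is rotation-invariant, so MLP(gram) yields invariant
# weights w_{i,c'c} that can safely multiply the equivariant features
assert np.allclose(gram, gram_rot, atol=1e-10)
```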

Experiments

N-body particle simulation

Five particles carrying positive or negative charges exert attractive or repulsive forces on each other; the task is to predict positions and velocities 500 timesteps ahead. The SE(3)-Transformer achieves 0.0076 MSE on position (vs. 0.0139 for Set Transformer and 0.0151 for TFN), with equivariance error on the order of $10^{-7}$, confirming exact equivariance up to numerical precision.

ScanObjectNN (real-world 3D object classification)

2902 real-world scanned objects across 15 categories. This task is only SO(2)-invariant (gravity axis matters), so the authors provide the z-component as an additional scalar input. With only 128 input points, the SE(3)-Transformer+z achieves 85.0% accuracy, competitive with methods using 1024 points and task-specific architectures. The model learns to ignore the symmetry-breaking z-input when trained on rotation-augmented data.

QM9 molecular property regression

134k molecules with up to 29 atoms, predicting 6 quantum chemical properties. The SE(3)-Transformer achieves competitive results against other equivariant models (TFN, Cormorant), with improvements over TFN on all six targets. Across all three experiments, the SE(3)-Transformer outperforms both a non-equivariant attention baseline (Set Transformer) and equivariant models without attention (TFN).

Practical contributions

The paper includes a PyTorch spherical harmonics implementation that is 10x faster than Scipy on CPU and 100-1000x faster on GPU. For a ScanObjectNN model, this yields roughly 22x speedup of the forward pass compared to the lie-learn library, directly addressing a major bottleneck of TFN-based architectures.
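For intuition about what such an implementation computes, the first few real spherical harmonics can be written down and checked for orthonormality directly in numpy (an illustrative sketch, unrelated to the paper's optimized implementation):

```python
import numpy as np

def Y00(theta, phi):
    """Real spherical harmonic, degree 0."""
    return np.full_like(theta, 0.5 / np.sqrt(np.pi))

def Y10(theta, phi):
    """Real spherical harmonic, degree 1, order 0."""
    return np.sqrt(3.0 / (4.0 * np.pi)) * np.cos(theta)

# midpoint quadrature over the sphere, measure sin(theta) dtheta dphi
n = 400
theta = (np.arange(n) + 0.5) * np.pi / n          # (0, pi)
phi = (np.arange(2 * n) + 0.5) * np.pi / n        # (0, 2*pi)
T, P = np.meshgrid(theta, phi, indexing="ij")
dA = np.sin(T) * (np.pi / n) ** 2

def inner(f, g):
    return np.sum(f(T, P) * g(T, P) * dA)

assert np.isclose(inner(Y00, Y00), 1.0, atol=1e-3)  # unit norm
assert np.isclose(inner(Y10, Y10), 1.0, atol=1e-3)
assert np.isclose(inner(Y00, Y10), 0.0, atol=1e-3)  # orthogonal
```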

Conclusions and limitations

Adding attention to a roto-translation-equivariant model consistently led to higher accuracy and increased training stability across all three experiments. For large neighbourhoods, attention proved essential for model convergence. The equivariance constraints also improved performance compared to conventional (non-equivariant) attention in all experiments.

The authors note that the SE(3)-Transformer is inherently suited for classification and regression on molecular data and discuss applications in drug research, including early-stage suitability classification of molecules for inhibiting viral reproductive cycles.

Reproducibility

| Artifact | Type | License | Notes |
|---|---|---|---|
| se3-transformer-public | Code | MIT | Official PyTorch + DGL implementation |

The repository includes code for N-body simulations and QM9 experiments. Hyperparameters and architecture details are provided in the paper’s appendix (4 equivariant layers, representation degrees, channels per degree, learning rates, batch sizes). Hardware requirements are not explicitly stated in the paper.


Paper Information

Citation: Fuchs, F. B., Worrall, D. E., Fischer, V., & Welling, M. (2020). SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. Advances in Neural Information Processing Systems (NeurIPS 2020).

Publication: NeurIPS 2020


Citation

@inproceedings{fuchs2020se3,
  title={{SE(3)-Transformers}: 3D Roto-Translation Equivariant Attention Networks},
  author={Fuchs, Fabian B. and Worrall, Daniel E. and Fischer, Volker and Welling, Max},
  booktitle={Advances in Neural Information Processing Systems},
  year={2020}
}