Paper Information

Citation: Weiler, M., Geiger, M., Welling, M., Boomsma, W., & Cohen, T. S. (2018). 3D steerable CNNs: Learning rotationally equivariant features in volumetric data. Advances in Neural Information Processing Systems, 31. https://proceedings.neurips.cc/paper/2018/hash/488e4104520c6aab692863cc1dba45af-Abstract.html

Publication: NeurIPS 2018


What kind of paper is this?

This is a method paper that introduces a novel neural network architecture, the 3D Steerable CNN. It provides a comprehensive theoretical derivation for the architecture grounded in group representation theory and demonstrates its practical application.

What is the motivation?

The work is motivated by the prevalence of symmetry in problems from the natural sciences. Standard 3D CNNs are not inherently equivariant to 3D rotations, even though rigid-body symmetry (the group SE(3) of rotations and translations) is fundamental to many scientific datasets such as molecular or protein structures. Building this symmetry directly into the model architecture as an inductive bias is expected to yield more data-efficient, generalizable, and physically meaningful models.

Figure: Standard 3D CNNs (Panel A) produce inconsistent feature maps when inputs are rotated, requiring expensive data augmentation. 3D Steerable CNNs (Panel B) use analytically derived spherical harmonic kernels to produce geometric field outputs that transform equivariantly under rotation.

What is the novelty here?

The core novelty is the rigorous and practical construction of a CNN architecture that is equivariant to 3D rigid body motions (SE(3) group). The key contributions are:

  • Geometric Feature Representation: Features are not just scalar values but are modeled as geometric fields (collections of scalars, vectors, and higher-order tensors) defined over $\mathbb{R}^{3}$. Each type of feature transforms according to an irreducible representation (irrep) of the rotation group SO(3).
  • General Equivariant Convolution: The paper proves that the most general form of an SE(3)-equivariant linear map between these fields is a convolution with a rotation-steerable kernel.
  • Analytical Kernel Basis: The main theoretical breakthrough is the analytical derivation of a complete basis for these steerable kernels. The authors solve the kernel’s equivariance constraint, $\kappa(rx) = D^{j}(r)\kappa(x)D^{l}(r)^{-1}$, showing that the solutions are functions whose angular parts are spherical harmonics. The network’s kernels are then parameterized as a learnable linear combination of these pre-computed basis functions, making the implementation a minor modification to standard 3D convolutions (see the sketch after this list).
Figure: Spherical harmonics $Y_l^m$ organized by degree $l$ (rows) and order $m$ (columns). These functions form the angular basis for steerable kernels: $l=0$ (scalar/s-orbital), $l=1$ (vector/p-orbital), $l=2$ (rank-2 tensor/d-orbital), $l=3$ (rank-3 tensor/f-orbital). Each degree $l$ has $2l+1$ components.
  • Equivariant Nonlinearity: A novel gated nonlinearity is proposed for non-scalar features. It preserves equivariance by multiplying a feature field by a separately computed, learned scalar field (the gate).
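To make the kernel parameterization concrete, here is a minimal PyTorch-style sketch (not the authors' released code): the layer stores a fixed, precomputed basis of kernels, learns only the coefficients of their linear combination, and then applies an ordinary 3D convolution. The tensor shapes and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class SteerableConv3d(torch.nn.Module):
    """Sketch of a steerable convolution: kernel = learnable linear combination
    of a fixed, analytically derived basis, followed by a standard 3D convolution."""
    def __init__(self, basis):
        # basis: (num_basis, c_out, c_in, k, k, k), precomputed and never trained
        super().__init__()
        self.register_buffer("basis", basis)
        self.coeff = torch.nn.Parameter(torch.randn(basis.shape[0]))  # learned weights

    def forward(self, x):
        # Contract the learnable coefficients with the fixed basis kernels.
        kernel = torch.einsum("b,boixyz->oixyz", self.coeff, self.basis)
        # Everything past this point is an ordinary 3D convolution.
        return F.conv3d(x, kernel, padding=kernel.shape[-1] // 2)

# Usage with a random stand-in for the precomputed steerable basis:
# 12 basis kernels mapping 3 input channels to 5 output channels on a 5^3 grid.
layer = SteerableConv3d(torch.randn(12, 5, 3, 5, 5, 5))
out = layer(torch.randn(1, 3, 16, 16, 16))   # shape (1, 5, 16, 16, 16)
```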

What experiments were performed?

The model’s performance was evaluated on a series of tasks with inherent rotational symmetry:

  1. Tetris Classification: A toy problem to empirically validate the model’s rotational equivariance by training on aligned blocks and testing on randomly rotated ones.
  2. SHREC17 3D Model Classification: A benchmark for classifying complex 3D shapes that are arbitrarily rotated.
  3. Amino Acid Propensity Prediction: A scientific application to predict amino acid types from their 3D atomic environments.
  4. CATH Protein Structure Classification: A challenging task on a new dataset introduced by the authors, requiring classification of global protein architecture, a problem with full SE(3) invariance.

What were the outcomes and conclusions drawn?

The 3D Steerable CNN demonstrated significant advantages due to its built-in equivariance:

  • It was empirically confirmed to be rotationally equivariant, achieving 99% test accuracy on the rotated Tetris dataset, whereas a standard 3D CNN failed with 27% accuracy.
  • It achieved state-of-the-art results on the amino acid prediction task and performed competitively on the SHREC17 benchmark, all while using significantly fewer parameters than the baseline models.
  • On the CATH protein classification task, it dramatically outperformed a deep 3D CNN baseline despite having over 100 times fewer parameters. This performance gap widened as the training data was reduced, highlighting the model’s superior data efficiency.

The paper concludes that 3D Steerable CNNs provide a universal and effective framework for incorporating SE(3) symmetry into deep learning models, leading to improved accuracy and efficiency for tasks involving volumetric data, particularly in scientific domains.

Technical Implementation Details

Data Preprocessing

  • Input Format: All inputs must be voxelized. Point clouds are not used directly.
    • Proteins (CATH): $50^3$ grid, 0.2 nm voxel size. Gaussian density placed at atom centers.
    • 3D Objects (SHREC17): $64^3$ voxel grids.
    • Tetris: $36^3$ input grid.
  • Splitting Strategy: CATH used a 10-fold split (7 train, 1 val, 2 test) strictly separated by “superfamily” level to ensure no data leakage (<40% sequence identity).

Architecture Specifications

Kernel Basis Construction:

  • Constructed from spherical harmonics (angular part) multiplied by Gaussian radial shells centered at radii $m$: $\exp\left(-\frac{1}{2}(\lVert x\rVert-m)^{2}/\sigma^{2}\right)$ (see the sketch after this list)
  • Anti-aliasing: A radially dependent angular frequency cutoff ($J_{\max}$) is applied to prevent aliasing near the origin.
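A minimal NumPy/SciPy sketch of one such basis function, combining a spherical harmonic (angular part) with a Gaussian radial shell. The grid size, shell width, and the use of only the real part of SciPy's complex spherical harmonic are simplifying assumptions, and the radius-dependent cutoff $J_{\max}$ is omitted.

```python
import numpy as np
from scipy.special import sph_harm

def steerable_basis_kernel(size, l, m, shell_radius, sigma=0.6):
    """One simplified basis kernel: Y_l^m (angular) times a Gaussian radial shell
    exp(-(|x| - shell_radius)^2 / (2 sigma^2)). Grid size and sigma are
    illustrative choices, not values from the paper."""
    coords = np.arange(size) - (size - 1) / 2.0
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    r = np.sqrt(x**2 + y**2 + z**2) + 1e-9
    azimuth = np.arctan2(y, x)                       # in [-pi, pi]
    polar = np.arccos(np.clip(z / r, -1.0, 1.0))     # in [0, pi]

    # Real part of the complex spherical harmonic (a full implementation would
    # use real spherical harmonics and a frequency cutoff near r = 0).
    angular = np.real(sph_harm(m, l, azimuth, polar))
    radial = np.exp(-0.5 * (r - shell_radius) ** 2 / sigma**2)
    return angular * radial

# Example: the three l = 1 basis kernels on a 5x5x5 grid, shell at radius 1.5 voxels.
kernels = [steerable_basis_kernel(5, l=1, m=m, shell_radius=1.5) for m in (-1, 0, 1)]
```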

Normalization: Uses equivariant batch norm. Non-scalar fields are divided by the square root of their batch-averaged squared norm rather than normalized component-wise (see the formula below).

Downsampling: Standard strided convolution performs poorly; a low-pass filter (Gaussian blur) must be applied before subsampling to avoid the aliasing that breaks equivariance, as in the sketch below.
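A minimal SciPy sketch of this blur-then-subsample step; the filter width is an assumed value, and in a network it would be applied to every channel of every field.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_then_downsample(volume, stride=2, sigma=0.6):
    """Low-pass filter a 3D volume with a Gaussian, then subsample by `stride`.
    sigma is an illustrative choice, not a value reported in the paper."""
    smoothed = gaussian_filter(volume, sigma=sigma)
    return smoothed[::stride, ::stride, ::stride]

vol = np.random.rand(32, 32, 32)
small = blur_then_downsample(vol)   # shape (16, 16, 16)
```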

Hyperparameters & Training

| Parameter | Value |
| --- | --- |
| Optimizer | Adam |
| LR Scheduler | Exponential decay (0.94 per epoch) after a 40-epoch burn-in |
| Dropout (CATH) | 0.1 (capsule-wide convolutional dropout) |
| Weight Decay (CATH) | L1 & L2 regularization: $10^{-8.5}$ |
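As a sketch, the learning-rate schedule in the table can be expressed with PyTorch's `LambdaLR`: constant during the 40-epoch burn-in, then decayed by 0.94 per epoch. The base learning rate and the placeholder module are assumptions for illustration.

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder; the real model is the steerable CNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # base LR is an assumed value
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # Constant LR for the 40-epoch burn-in, then exponential decay by 0.94 per epoch.
    lr_lambda=lambda epoch: 1.0 if epoch < 40 else 0.94 ** (epoch - 40),
)
```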

Mathematical Formulations for Equivariance

Standard operations like Batch Normalization and ReLU break rotational equivariance. The paper derives equivariant alternatives:

Equivariant Batch Normalization:

Standard BN subtracts a mean, which introduces a preferred direction and breaks symmetry. Instead, use norm-based normalization that divides each feature field by the square root of its batch-averaged squared norm:

$$f_{i}(x) \mapsto f_{i}(x) \left( \frac{1}{|B|} \sum_{j \in B} \frac{1}{V} \int dx |f_{j}(x)|^{2} + \epsilon \right)^{-1/2}$$

This scales vector lengths to unit variance on average without subtracting a mean, preserving directional information and symmetry.
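A NumPy sketch of this normalization for a single field type, assuming the field components are stored along axis 1; the epsilon value is illustrative.

```python
import numpy as np

def equivariant_batch_norm(fields, eps=1e-5):
    """Norm-based normalization of one feature field.
    fields: (batch, components, D, H, W), e.g. components = 3 for an l = 1 field.
    No mean is subtracted, so no preferred direction is introduced."""
    sq_norm = (fields ** 2).sum(axis=1)   # |f_j(x)|^2 per voxel, shape (batch, D, H, W)
    mean_sq = sq_norm.mean()              # average over the batch and the volume
    return fields / np.sqrt(mean_sq + eps)

batch = np.random.randn(4, 3, 8, 8, 8)    # four l = 1 fields on an 8^3 grid
normed = equivariant_batch_norm(batch)
```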

Equivariant Nonlinearities:

ReLU cannot be applied to the components of a vector (or higher-order) field independently, because the result would depend on the choice of coordinate frame. Two approaches:

  1. Norm Nonlinearity (geometric shrinking): Acts on magnitude, preserves direction. Shrinks vectors shorter than learned bias $\beta$ to zero: $$f(x) \mapsto \text{ReLU}(|f(x)| - \beta) \frac{f(x)}{|f(x)|}$$ Note: Found to converge slowly; not used in final models.

  2. Gated Nonlinearity (used in practice): A separate scalar field $s(x)$ passes through sigmoid to create a gate $\sigma(s(x))$, which multiplies the geometric field: $$f_{\text{out}}(x) = f_{\text{in}}(x) \cdot \sigma(s(x))$$ Architecture implication: Requires extra scalar channels ($l=0$) solely for gating higher-order channels ($l>0$).
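Minimal NumPy sketches of both nonlinearities above. Array shapes are assumptions (field components first, then spatial dimensions), and the learned bias $\beta$ and the gating scalar field $s$ are passed in explicitly.

```python
import numpy as np

def norm_nonlinearity(f, beta, eps=1e-9):
    """Shrink the field's norm by ReLU(|f| - beta) while preserving its direction.
    f: (components, D, H, W); beta: learned scalar bias."""
    norm = np.sqrt((f ** 2).sum(axis=0, keepdims=True)) + eps
    return np.maximum(norm - beta, 0.0) * f / norm

def gated_nonlinearity(f, s):
    """Multiply the geometric field f by a sigmoid gate computed from a separate
    scalar field s of the same spatial shape."""
    gate = 1.0 / (1.0 + np.exp(-s))   # sigmoid(s(x))
    return f * gate                   # broadcasts over the field components

f = np.random.randn(3, 8, 8, 8)       # an l = 1 (vector) field
s = np.random.randn(8, 8, 8)          # scalar gate field from extra l = 0 channels
out = gated_nonlinearity(f, s)
```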

Voxelization Details:

For CATH protein inputs, Gaussian density is placed at each atom position with standard deviation equal to half the voxel width ($0.5 \times 0.2\text{ nm} = 0.1\text{ nm}$).
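A single-channel NumPy sketch of this voxelization. Atom positions are assumed to be given in nm relative to the box centre, and any per-atom-type channel split is omitted.

```python
import numpy as np

def voxelize_atoms(positions_nm, grid_size=50, voxel_nm=0.2):
    """Accumulate a Gaussian density (sigma = half the voxel width) at each atom
    position on a grid_size^3 grid with voxel_nm spacing."""
    sigma = 0.5 * voxel_nm                                     # 0.1 nm for 0.2 nm voxels
    coords = (np.arange(grid_size) - (grid_size - 1) / 2.0) * voxel_nm
    x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
    density = np.zeros((grid_size,) * 3)
    for px, py, pz in positions_nm:
        d2 = (x - px) ** 2 + (y - py) ** 2 + (z - pz) ** 2
        density += np.exp(-0.5 * d2 / sigma**2)
    return density

# Example: two atoms near the centre of the box.
grid = voxelize_atoms([(0.0, 0.0, 0.0), (0.15, -0.1, 0.05)])
```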

Exact Architecture Configurations

Tetris Architecture (4 layers):

| Layer | Field Types | Spatial Size |
| --- | --- | --- |
| Input | - | $36^3$ |
| Layer 1 | 4 scalars, 4 vectors ($l=1$), 4 tensors ($l=2$) | $40^3$ |
| Layer 2 | 16 scalars, 16 vectors, 16 tensors | $22^3$ (stride 2) |
| Layer 3 | 32 scalars, 16 vectors, 16 tensors | $13^3$ (stride 2) |
| Output | 8 scalars (global average pool) | - |

SHREC17 Architecture (8 layers):

| Layers | Field Types |
| --- | --- |
| 1-2 | 8 scalars, 4 vectors, 2 tensors ($l=2$) |
| 3-4 | 16 scalars, 8 vectors, 4 tensors |
| 5-7 | 32 scalars, 16 vectors, 8 tensors |
| 8 | 512 scalars |
| Output | 55 scalars (classes) |

CATH Architecture (ResNet34-inspired):

Block progression: (2,2,2,2)(4,4,4)(8,8,8,8)(16,16,16,16)

Notation: (a,b,c,d) = $a$ scalars ($l=0$), $b$ vectors ($l=1$), $c$ rank-2 tensors ($l=2$), $d$ rank-3 tensors ($l=3$).
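Since a field of degree $l$ occupies $2l+1$ channels, the channel count implied by such a multiplicity tuple can be computed directly; a small sketch:

```python
def total_channels(multiplicities):
    """Channels used by a field-type tuple (a, b, c, ...), where entry l holds the
    number of degree-l fields and each such field has 2l + 1 components."""
    return sum(n * (2 * l + 1) for l, n in enumerate(multiplicities))

print(total_channels((16, 16, 16, 16)))   # 16*1 + 16*3 + 16*5 + 16*7 = 256
```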

Quantitative Results

| Task | Metric | Steerable CNN | Baseline |
| --- | --- | --- | --- |
| Tetris (rotated test) | Accuracy | 99% | 27% (standard 3D CNN) |
| Amino Acid Propensity | Accuracy | 0.58 | 0.50 |
| SHREC17 | micro/macro mAP | 1.11 | 1.13 (SOTA) |
| CATH | Accuracy | Significantly higher | Deep 3D CNN (100× more params) |

Note: On SHREC17, the model performed comparably to SOTA rather than exceeding it, while using fewer parameters.

Historical Context (From Peer Reviews)

The NeurIPS peer reviews reveal important context about the paper’s structure and claims:

  • Evolution of Experiments: The SHREC17 experiment and the arbitrary rotation test in Tetris were added during the rebuttal phase to address reviewer concerns about the lack of standard computer vision benchmarks. This explains why SHREC17 feels somewhat disconnected from the paper’s “AI for Science” narrative.

  • Continuous vs. Discrete Rotations: The Tetris experiment validates equivariance to continuous ($SO(3)$) rotations, not just discrete 90-degree turns. This distinction is crucial and separates Steerable CNNs from earlier Group CNNs that only handled discrete rotation groups.

  • Terminology Warning: The use of terms like “fiber” and “induced representation” was critiqued by reviewers as denser than necessary and inconsistent with related work (e.g., Tensor Field Networks). If you find Section 3 difficult, know that this density is a recognized barrier; focus on the resulting kernel constraint rather than the jargon.

  • Parameter Efficiency Quantified: Reviewers highlighted that the main practical win is parameter efficiency. Standard 3D CNNs hit diminishing returns around $10^7$ parameters, whereas Steerable CNNs achieve better results with ~100× fewer parameters ($10^5$).