Abstract

Advanced neural network architectures developed for tasks like natural language processing are often transferred to spatiotemporal forecasting without a deep understanding of which components drive their performance. This can lead to suboptimal results and reinforces the view of these models as “black boxes”. In this work, we deconstruct the core mechanisms of Transformers and Recurrent Neural Networks (RNNs): attention, gating, and recurrence. We then build and test novel hybrid architectures to identify which components are most effective. A key finding is that while adding recurrence is detrimental to Transformers, augmenting RNNs with attention and neural gating consistently improves their forecasting accuracy. Our study reveals that a seldom-used architecture, the Recurrent Highway Network (RHN), emerges as the top-performing model for forecasting high-dimensional chaotic systems when enhanced with these mechanisms.

Key Contributions

  • Deconstructed Core Mechanisms: Identified and isolated the fundamental components of modern sequence models: gating, attention, and recurrence
  • Synthesized Novel Architectures: Proposed and built new hybrid models by treating these core components as interchangeable, tunable hyperparameters (see the configuration sketch after this list)
  • Conducted Extensive Benchmarking: Performed ablation studies to evaluate these standard and hybrid models across chaotic dynamical systems and real-world time-series data
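
To make the “components as hyperparameters” idea concrete, the sketch below shows one way such a configuration could be expressed in Python. The SequenceModelConfig dataclass and its field names are hypothetical and are not taken from the paper’s code.

from dataclasses import dataclass

# Hypothetical configuration object: the names below are illustrative only.
@dataclass
class SequenceModelConfig:
    cell: str = "rhn"        # core unit: "lstm", "gru", "rhn", or "transformer"
    recurrence: bool = True  # propagate a hidden state across time steps
    attention: bool = True   # attend over past hidden states
    gating: bool = True      # apply a learned (neural) gate to the cell output

def describe(cfg: SequenceModelConfig) -> str:
    """List which core mechanisms a given configuration enables."""
    parts = [cfg.cell]
    for name in ("recurrence", "attention", "gating"):
        if getattr(cfg, name):
            parts.append(name)
    return " + ".join(parts)

# The paper's best model for chaotic systems corresponds roughly to:
print(describe(SequenceModelConfig(cell="rhn", attention=True, gating=True)))
# -> "rhn + recurrence + attention + gating"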

Key Findings

  • Augmenting RNNs: Neural gating and attention mechanisms consistently improve the performance of standard RNNs (LSTMs, GRUs, RHNs) in most forecasting tasks
  • The Top Performer for Chaotic Systems: A novel hybrid model, the Recurrent Highway Network (RHN) with attention and neural gating, was the most accurate architecture for forecasting high-dimensional spatiotemporal dynamics (a code sketch follows this list)
  • Recurrence in Transformers: Adding a notion of recurrence to Transformer models was consistently detrimental to their forecasting performance
  • Robustness on Real-World Data: For the real-world time-series benchmarks, the Pre-Layer Normalization (PreLN) Transformer was the most robust and frequently the best-performing model, benefiting from the addition of neural gating
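
As a rough illustration of the kind of hybrid the RHN finding refers to, the sketch below combines a single-depth RHN cell with dot-product attention over past hidden states and a learned output gate, written in PyTorch. The class names, layer sizes, and the exact attention and gating formulations are assumptions made for illustration; this is not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RHNCell(nn.Module):
    """Single-depth Recurrent Highway Network cell with a coupled carry gate (c = 1 - t)."""
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.lin_x = nn.Linear(input_size, 2 * hidden_size)
        self.lin_s = nn.Linear(hidden_size, 2 * hidden_size)

    def forward(self, x, s):
        h_pre, t_pre = (self.lin_x(x) + self.lin_s(s)).chunk(2, dim=-1)
        h = torch.tanh(h_pre)     # candidate state
        t = torch.sigmoid(t_pre)  # transform gate
        return h * t + s * (1.0 - t)

class GatedAttentiveRHN(nn.Module):
    """Hypothetical hybrid: RHN recurrence + temporal attention + a neural output gate."""
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super().__init__()
        self.hidden_size = hidden_size
        self.cell = RHNCell(input_size, hidden_size)
        self.gate = nn.Linear(2 * hidden_size, hidden_size)  # neural (output) gate
        self.head = nn.Linear(hidden_size, output_size)

    def forward(self, x):  # x: (batch, time, input_size)
        batch, time, _ = x.shape
        s = x.new_zeros(batch, self.hidden_size)
        states = []
        for step in range(time):
            s = self.cell(x[:, step], s)
            states.append(s)
        H = torch.stack(states, dim=1)  # (batch, time, hidden)
        # Dot-product attention of the final state over the whole history.
        scores = torch.bmm(H, s.unsqueeze(-1)).squeeze(-1)  # (batch, time)
        ctx = torch.bmm(F.softmax(scores, dim=-1).unsqueeze(1), H).squeeze(1)
        # Neural gate blending the attended context with the final hidden state.
        z = torch.sigmoid(self.gate(torch.cat([s, ctx], dim=-1)))
        return self.head(z * s + (1.0 - z) * ctx)

# Usage: one-step-ahead forecast of a 10-dimensional system from 32 past steps.
model = GatedAttentiveRHN(input_size=10, hidden_size=64, output_size=10)
prediction = model(torch.randn(4, 32, 10))  # shape: (4, 10)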

Significance

This work challenges the “black box” approach of transferring neural network architectures across different domains. Our findings demonstrate that core architectural components should not be treated as fixed but rather as tunable hyperparameters. By carefully selecting these mechanisms, practitioners can achieve significant performance gains and design models better suited for the specific challenges of dynamical systems forecasting.

Citation

@article{heidenreich2024deconstructing,
  title={Deconstructing recurrence, attention, and gating: Investigating the transferability of transformers and gated recurrent neural networks in forecasting of dynamical systems},
  author={Heidenreich, Hunter S and Vlachas, Pantelis R and Koumoutsakos, Petros},
  journal={arXiv preprint arXiv:2410.02654},
  year={2024}
}