Optimizing Sequence Models for Dynamical Systems

Abstract

Advanced neural network architectures developed for tasks like natural language processing are often transferred to spatiotemporal forecasting without a deep understanding of which components drive their performance. This can lead to suboptimal results and reinforces the view of these models as “black boxes”. In this work, we deconstruct the core mechanisms of Transformers and Recurrent Neural Networks (RNNs): attention, gating, and recurrence. We then build and test novel hybrid architectures to identify which components are most effective. A key finding is that while adding recurrence is detrimental to Transformers, augmenting RNNs with attention and neural gating consistently improves their forecasting accuracy. Our study reveals that a seldom-used architecture, the Recurrent Highway Network (RHN), emerges as the top-performing model for forecasting high-dimensional chaotic systems when enhanced with these mechanisms.

Key Contributions

Systematic Ablation: Deconstructed Transformers and RNNs into core mechanisms (attention, gating, recurrence) to isolate performance drivers
Novel Hybrid Architectures: Synthesized and tested new combinations of neural primitives for spatiotemporal forecasting
RHN Advantage on Chaotic Systems: Demonstrated that attention-augmented Recurrent Highway Networks outperform standard Transformers on high-dimensional chaotic systems
Robustness Analysis: Validated models across both clean physics simulations and noisy real-world industrial datasets

Motivation

In modern ML, architectures are often transferred from one domain (like NLP) to another (like physical forecasting) without understanding the underlying mechanics. This “black box” approach leads to suboptimal compute usage and performance ceilings.

Our goal was to break these architectures down. We treated the core mechanisms of Transformers and RNNs (Gating, Attention, and Recurrence) as orthogonal basis vectors. By decoupling these components, we could synthesize and test hybrid architectures to find the best configuration for spatiotemporal forecasting.

Methodological Approach

We built a modular framework to mix and match neural primitives. We systematically evaluated:

Gating Mechanisms: Testing Additive, Learned Rate, Input-Dependent, and Coupled Input-Dependent variants
Attention: Implementing multi-headed attention with relative positional biases
Recurrence: Testing standard cells (LSTM, GRU) against deeper transition cells like Recurrent Highway Networks (RHN)

Figure: The hierarchy of neural gating mechanisms we tested (Additive, Learned Rate, Coupled Input-Dependent, and Input-Dependent), from simple additive to fully input-dependent.
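
The four gating variants can be read as four ways of mixing a block's input x with its transformed output f(x). The sketch below is a minimal illustration in plain Python, not the paper's implementation: the update rules are a common way to formalize these variants and are an assumption here, and the `gate` helper with its elementwise weights `w` and `b` is hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def additive(x, fx):
    # Plain residual connection: y = x + f(x).
    return [xi + fi for xi, fi in zip(x, fx)]

def learned_rate(x, fx, alpha, beta):
    # Learned but input-independent scales: y = alpha * x + beta * f(x).
    return [a * xi + b * fi for xi, fi, a, b in zip(x, fx, alpha, beta)]

def input_dependent(x, fx, g_skip, g_update):
    # Two separate gates, each computed from the input (assumed form).
    return [gs * xi + gu * fi for xi, fi, gs, gu in zip(x, fx, g_skip, g_update)]

def coupled_input_dependent(x, fx, g):
    # One gate shared by both paths: y = (1 - g) * x + g * f(x).
    return [(1.0 - gi) * xi + gi * fi for xi, fi, gi in zip(x, fx, g)]

def gate(x, w, b):
    # Hypothetical elementwise gate sigmoid(w * x + b), with values in (0, 1).
    return [sigmoid(wi * xi + bi) for xi, wi, bi in zip(x, w, b)]

x = [1.0, -2.0]
fx = [0.5, 0.5]                           # stand-in for a sublayer output f(x)
g = gate(x, w=[0.0, 0.0], b=[0.0, 0.0])   # zero weights give g = 0.5 everywhere
print(coupled_input_dependent(x, fx, g))  # [0.75, -0.75]
```

In this formalization each variant generalizes the one before it: the additive rule is the learned-rate rule with both scales fixed to 1, and the coupled rule is the input-dependent rule with the two gates tied to sum to 1.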

Figure: Recurrent cell types compared in our study (Elman, LSTM, GRU, and RHN). The RHN, panel (d), extends processing depth within each timestep.
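
To make "processing depth within each timestep" concrete, here is a minimal sketch of one timestep of a plain Recurrent Highway Network with a coupled carry gate, following the standard RHN formulation. It does not reproduce the attention and gating augmentations studied here, and the parameter layout (a list of dicts with keys `W_h`, `W_t`, `R_h`, `R_t`, `b_h`, `b_t`) is an assumption made for illustration.

```python
import math

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rhn_step(x, s_prev, layers):
    """Advance the hidden state by one timestep through len(layers) micro-layers."""
    s = s_prev
    for depth, p in enumerate(layers):
        pre_h = [r + b for r, b in zip(matvec(p["R_h"], s), p["b_h"])]
        pre_t = [r + b for r, b in zip(matvec(p["R_t"], s), p["b_t"])]
        if depth == 0:
            # The external input enters only at the first micro-layer.
            pre_h = [a + w for a, w in zip(pre_h, matvec(p["W_h"], x))]
            pre_t = [a + w for a, w in zip(pre_t, matvec(p["W_t"], x))]
        h = [math.tanh(a) for a in pre_h]   # candidate state
        t = [sigmoid(a) for a in pre_t]     # transform gate
        # Coupled carry gate c = 1 - t: blend candidate and previous state.
        s = [hi * ti + si * (1.0 - ti) for hi, ti, si in zip(h, t, s)]
    return s

n = 2
zero_mat = [[0.0] * n for _ in range(n)]
zero_vec = [0.0] * n
layer = {"W_h": zero_mat, "W_t": zero_mat, "R_h": zero_mat,
         "R_t": zero_mat, "b_h": zero_vec, "b_t": zero_vec}
# With all-zero parameters each micro-layer halves the state.
print(rhn_step([1.0, 1.0], [1.0, -2.0], [layer, layer]))  # [0.25, -0.5]
```

The loop over `layers` is what separates an RHN from an Elman or GRU cell: the state passes through several gated transformations before the next input arrives.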

This ablation let us attribute each performance gain to a specific mechanism rather than to the architecture as a whole.
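
The attention primitive listed above can be sketched in a few lines. The example below shows a single head, non-causal, with one scalar bias per relative offset added to the attention scores. That bias form is an assumption about how the relative positional biases enter, and the multi-head projections are omitted for brevity.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(r - m) for r in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_with_relative_bias(q, k, v, bias):
    """Single-head attention over T positions.

    q, k, v: lists of T vectors.
    bias: list of 2T - 1 scalars; bias[(i - j) + T - 1] is added to score (i, j).
    """
    T, d = len(q), len(q[0])
    out = []
    for i in range(T):
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            + bias[i - j + T - 1]
            for j in range(T)
        ]
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(T))
                    for c in range(len(v[0]))])
    return out

zeros = [[0.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [3.0, 2.0]]
# Zero queries and zero bias give uniform weights, so each output is the mean of the values.
print(attention_with_relative_bias(zeros, zeros, values, [0.0, 0.0, 0.0]))
# [[2.0, 1.0], [2.0, 1.0]]
```

Because the bias depends only on the offset i - j, the same learned values apply at every position in the sequence, which is what makes the scheme relative rather than absolute.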

Key Findings

Recurrent Highway Networks on Chaotic Systems

For high-dimensional chaotic systems like the Multiscale Lorenz-96 shown below, we found that a Recurrent Highway Network (RHN) augmented with Attention and Neural Gating was the top-performing architecture. This hybrid exceeded the forecasting accuracy of standard Transformers, suggesting that deeper recurrence (processing depth per timestep) matters for complex dynamics.

Figure: Forecasting the Multiscale Lorenz-96 system. The top row shows the “texture” of the chaotic evolution. Notice how the RHN (far right) maintains coherent wave-like structures for nearly a full Lyapunov time, holding structure longer than the Transformer variants (the plotted window spans two Lyapunov times).
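
For readers unfamiliar with this family of systems, the sketch below integrates the standard single-scale Lorenz-96 model with a fourth-order Runge-Kutta step. This is a simplification: the study uses the multiscale variant, which couples additional fast variables to each slow one and is not reproduced here. The state size, forcing value, and step size are illustrative assumptions, not the paper's settings.

```python
def lorenz96_rhs(x, forcing):
    # dx_k/dt = (x_{k+1} - x_{k-2}) * x_{k-1} - x_k + F, with periodic indexing.
    n = len(x)
    return [(x[(k + 1) % n] - x[k - 2]) * x[k - 1] - x[k] + forcing
            for k in range(n)]

def rk4_step(x, dt, forcing):
    def shifted(base, slope, scale):
        return [b + scale * s for b, s in zip(base, slope)]

    k1 = lorenz96_rhs(x, forcing)
    k2 = lorenz96_rhs(shifted(x, k1, dt / 2), forcing)
    k3 = lorenz96_rhs(shifted(x, k2, dt / 2), forcing)
    k4 = lorenz96_rhs(shifted(x, k3, dt), forcing)
    return [xi + dt / 6 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

# Illustrative settings: 8 variables, forcing F = 8, small perturbation on one variable.
state = [8.0] * 8
state[0] += 0.01
for _ in range(100):
    state = rk4_step(state, dt=0.01, forcing=8.0)
print(state)
```

The uniform state x_k = F is a fixed point of these equations, which is why the example perturbs one variable before integrating.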

Transformers: Recurrence Hurts, Gating Helps

We attempted to force recurrence into Transformers to give them “memory,” but it consistently hurt performance. However, Neural Gating significantly improved Transformer robustness. For real-world, noisy data (traffic, weather), the Pre-Layer Normalization (PreLN) Transformer with added gating proved to be the most robust model.
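
The difference between the original (post-normalization) Transformer block and the PreLN ordering comes down to where layer normalization sits relative to the skip connection. The sketch below contrasts the two and adds a coupled gate to the PreLN residual. The placement and form of the gate are assumptions for illustration, `sublayer` stands in for attention or a feed-forward network, and the learned scale and shift of layer normalization are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def post_ln_block(x, sublayer):
    # Original ordering: normalize after the residual sum.
    return layer_norm([xi + fi for xi, fi in zip(x, sublayer(x))])

def pre_ln_block(x, sublayer):
    # PreLN: normalize only the sublayer input; the skip path is left untouched.
    return [xi + fi for xi, fi in zip(x, sublayer(layer_norm(x)))]

def gated_pre_ln_block(x, sublayer, g):
    # Assumed gated variant: y = (1 - g) * x + g * f(LN(x)).
    fx = sublayer(layer_norm(x))
    return [(1.0 - gi) * xi + gi * fi for xi, fi, gi in zip(x, fx, g)]

silent = lambda h: [0.0] * len(h)   # a sublayer that contributes nothing
print(pre_ln_block([1.0, 2.0, 3.0], silent))   # [1.0, 2.0, 3.0]
```

With a silent sublayer the PreLN block is an exact identity, whereas the post-normalization block still rescales its input. That unobstructed skip path is a commonly cited reason PreLN trains more stably.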

Adding Attention to LSTMs and GRUs

We tested on the Kuramoto-Sivashinsky (K-S) equation, a model of turbulence and flame fronts. Standard LSTMs and GRUs are under-optimized for this setting: adding attention improved their valid-prediction time several-fold, with the best attention-augmented LSTM and GRU reaching roughly 4x and 6.6x their baseline valid-prediction time, respectively. The paper reports the top RNNs at 2-7x baseline on K-S. On the partially observed Multiscale Lorenz-96 system, the gain from attention plus gating is smaller, at more than 40%.
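
Valid-prediction time measures how long a forecast stays close to the true trajectory. One common definition, assumed here, is the time until the normalized root-mean-square error first exceeds a threshold, expressed in Lyapunov times. The threshold of 0.5 and the toy numbers below are illustrative assumptions rather than values taken from the paper.

```python
import math

def nrmse(pred, truth, std):
    # Error at one timestep, with each dimension scaled by the true signal's standard deviation.
    return math.sqrt(
        sum(((p - t) / s) ** 2 for p, t, s in zip(pred, truth, std)) / len(truth)
    )

def valid_prediction_time(preds, truths, std, dt, lyapunov_exponent, threshold=0.5):
    """Time until NRMSE first exceeds `threshold`, in units of Lyapunov time."""
    steps = len(preds)   # the forecast never crossed the threshold
    for n, (p, t) in enumerate(zip(preds, truths)):
        if nrmse(p, t, std) > threshold:
            steps = n
            break
    # One Lyapunov time is 1 / lyapunov_exponent, so multiply elapsed time by the exponent.
    return steps * dt * lyapunov_exponent

truths = [[0.0]] * 5
preds = [[0.0], [0.1], [0.2], [0.9], [0.1]]   # toy forecast that diverges at step 3
print(valid_prediction_time(preds, truths, std=[1.0], dt=0.5, lyapunov_exponent=2.0))  # 3.0
```

Expressing the result in Lyapunov times makes scores comparable across systems with different intrinsic rates of divergence.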

Figure: Forecasting the Kuramoto-Sivashinsky system. The error heatmaps (bottom row) show how prediction quality degrades over time (lighter means larger error). The RHN maintains structural fidelity longer than competing architectures.

Robustness on Real-World Datasets

While chaotic systems test the limits of theory, we also validated our models on seven standard real-world datasets: the four Electricity Transformer Temperature (ETT) subsets plus Traffic, Electricity, and Weather.

Unlike the clean physics simulations, these datasets contain real-world noise and irregularities. In this environment, the Pre-Layer Normalization (PreLN) Transformer proved to be the most robust architecture. While it didn’t always beat the RHN on pure chaos, its stability makes it a strong default choice for general time-series forecasting tasks where training speed and reliability are paramount.

Why This Matters

This work treats architectural components as independently tunable choices rather than fixed defaults, and that framing surfaces a concrete trade-off. Transformers train in only 25-50% of the time the RNNs require (roughly 2-4x faster), while the attention-augmented RNNs give better inference accuracy on the chaotic physical systems. Which mechanism to select depends on whether the training budget or the forecast precision is the binding constraint, and the ablation makes that an informed choice rather than a default one.

The ablation framework here, treating architectural components as independently tunable factors and measuring their marginal contribution, shaped how later evaluation work is structured. The same principle of isolating variables rather than comparing end-to-end black boxes appears in the document processing research, from benchmark construction in page stream segmentation to grounded evaluation in GutenOCR.

Related Work

The methodology here shares a design philosophy with EigenNoise, which similarly decomposes a neural mechanism (word vector initialization) into theoretically grounded components to isolate what drives performance. Both papers treat model components as testable hypotheses rather than fixed architectural choices.

For broader context on where this fits in the portfolio’s Scientific Machine Learning arc, see the Research overview.

Citation

@misc{heidenreich2024deconstructingrecurrenceattentiongating,
  title={Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems},
  author={Hunter S. Heidenreich and Pantelis R. Vlachas and Petros Koumoutsakos},
  year={2024},
  eprint={2410.02654},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2410.02654}
}
