<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Time Series Forecasting on Hunter Heidenreich | Senior AI Research Scientist</title><link>https://hunterheidenreich.com/categories/time-series-forecasting/</link><description>Recent content in Time Series Forecasting on Hunter Heidenreich | Senior AI Research Scientist</description><image><title>Hunter Heidenreich | Senior AI Research Scientist</title><url>https://hunterheidenreich.com/img/avatar.webp</url><link>https://hunterheidenreich.com/img/avatar.webp</link></image><generator>Hugo -- 0.147.7</generator><language>en-US</language><copyright>2026 Hunter Heidenreich</copyright><lastBuildDate>Sun, 31 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://hunterheidenreich.com/categories/time-series-forecasting/index.xml" rel="self" type="application/rss+xml"/><item><title>LSTNet: Long- and Short-Term Time Series Network</title><link>https://hunterheidenreich.com/notes/time-series/lstnet-multivariate-time-series/</link><pubDate>Sat, 11 Apr 2026 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/notes/time-series/lstnet-multivariate-time-series/</guid><description>LSTNet combines CNNs, recurrent-skip connections, and autoregressive models to capture both short-term and long-term patterns in multivariate time series.</description><content:encoded><![CDATA[<h2 id="a-deep-learning-framework-for-multivariate-forecasting">A Deep Learning Framework for Multivariate Forecasting</h2>
<p>This is a <strong>Method</strong> paper that introduces the Long- and Short-term Time-series Network (LSTNet), a deep learning architecture specifically designed for multivariate time series forecasting. LSTNet combines convolutional neural networks (CNNs), recurrent neural networks (RNNs) with a novel skip-connection structure, and a traditional autoregressive (AR) component into a unified framework. The architecture targets the challenge of simultaneously capturing both short-term local dependencies and long-term periodic patterns in temporal data.</p>
<h2 id="why-short-term-and-long-term-patterns-need-separate-treatment">Why Short-Term and Long-Term Patterns Need Separate Treatment</h2>
<p>Real-world multivariate time series often exhibit a mixture of repeating patterns at different time scales. Highway traffic, for example, shows daily peaks (morning vs. evening commutes) alongside weekly patterns (weekday vs. weekend behavior). Solar energy output varies with cloud movements on short time scales and with seasonal daylight changes on longer ones. Electricity consumption follows similar daily and weekly cycles.</p>
<p>Traditional autoregressive methods (<a href="https://en.wikipedia.org/wiki/Vector_autoregression">VAR</a>, <a href="https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average">ARIMA</a>) and <a href="https://en.wikipedia.org/wiki/Gaussian_process">Gaussian Process</a> models struggle to distinguish and jointly model these two kinds of recurring patterns. Standard RNNs, including LSTM and <a href="https://en.wikipedia.org/wiki/Gated_recurrent_unit">GRU</a> variants, theoretically handle long-range dependencies but in practice suffer from <a href="https://en.wikipedia.org/wiki/Vanishing_gradient_problem">gradient vanishing</a> when the period length is large (e.g., 24 time steps for daily patterns at hourly resolution, or 168 time steps for weekly patterns). The authors also identify a scale insensitivity problem: the output scale of a neural network does not adapt to the input scale, so these models can fail when the magnitude of the input signal changes in non-periodic ways, such as sudden shifts in electricity consumption due to holidays or weather events.</p>
<h2 id="combining-cnns-recurrent-skip-connections-and-autoregression">Combining CNNs, Recurrent-Skip Connections, and Autoregression</h2>
<p>The LSTNet architecture consists of four main components that work together.</p>
<h3 id="convolutional-component">Convolutional Component</h3>
<p>The first layer applies 1D convolution without pooling across the multivariate input. Each filter has width $\omega$ (in the time dimension) and height $n$ (spanning all variables), producing feature maps that capture short-term local dependency patterns among variables:</p>
<p>$$h_k = \text{RELU}(W_k * X + b_k)$$</p>
<p>where $*$ denotes convolution and the input is zero-padded so each output vector has length $T$. The output is a $d_c \times T$ matrix where $d_c$ is the number of filters.</p>
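<p>The shape bookkeeping of this layer can be made concrete with a minimal pure-Python sketch. It assumes the zero-padding is applied on the left (so each output depends only on current and past steps); the names <code>conv_layer</code>, <code>filters</code>, and <code>biases</code> are illustrative and not taken from the official code.</p>

```python
def relu(v):
    return max(0.0, v)


def conv_layer(X, filters, biases):
    """Apply d_c filters of shape (omega, n) to X of shape (T, n).

    Returns a d_c x T matrix (one row per filter), matching the text.
    """
    T, n = len(X), len(X[0])
    out = []
    for W, b in zip(filters, biases):
        omega = len(W)
        # Assumption: pad omega - 1 zero rows on the left so every row has length T.
        padded = [[0.0] * n for _ in range(omega - 1)] + X
        row = []
        for t in range(T):
            window = padded[t:t + omega]
            s = b + sum(W[j][i] * window[j][i] for j in range(omega) for i in range(n))
            row.append(relu(s))
        out.append(row)
    return out


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # T = 3 steps, n = 2 variables
filters = [
    [[1.0, 1.0], [1.0, 1.0]],                 # sums all variables over a 2-step window
    [[-1.0, -1.0], [-1.0, -1.0]],             # negative response, zeroed by RELU
]
H = conv_layer(X, filters, biases=[0.0, 0.0])
```

<p>Each filter spans all $n$ variables at once, so the layer mixes information across series while sliding only along the time axis.</p>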
<h3 id="recurrent-component">Recurrent Component</h3>
<p>The CNN output feeds into a GRU-based recurrent layer that uses RELU (rather than the standard tanh) as the hidden update activation:</p>
<p>$$\begin{aligned}
r_t &amp;= \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r) \\
u_t &amp;= \sigma(x_t W_{xu} + h_{t-1} W_{hu} + b_u) \\
c_t &amp;= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-1} W_{hc}) + b_c) \\
h_t &amp;= (1 - u_t) \odot h_{t-1} + u_t \odot c_t
\end{aligned}$$</p>
<h3 id="recurrent-skip-component">Recurrent-Skip Component</h3>
<p>The key architectural innovation is a recurrent structure with temporal skip connections. Instead of connecting to the immediately preceding hidden state $h_{t-1}$, skip links connect to the hidden state from $p$ steps ago ($h_{t-p}$), where $p$ corresponds to the period length of the data (e.g., $p = 24$ for hourly data with daily periodicity):</p>
<p>$$\begin{aligned}
r_t &amp;= \sigma(x_t W_{xr} + h_{t-p} W_{hr} + b_r) \\
u_t &amp;= \sigma(x_t W_{xu} + h_{t-p} W_{hu} + b_u) \\
c_t &amp;= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-p} W_{hc}) + b_c) \\
h_t &amp;= (1 - u_t) \odot h_{t-p} + u_t \odot c_t
\end{aligned}$$</p>
<p>This design shortens the effective path length for learning periodic dependencies, making optimization easier. A dense layer combines outputs from both recurrent components:</p>
<p>$$h_t^D = W^R h_t^R + \sum_{i=0}^{p-1} W_i^S h_{t-i}^S + b$$</p>
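<p>The effect of the skip length can be seen in a deliberately simplified sketch: a scalar GRU (one input, one hidden unit) whose recurrence reads $h_{t-p}$. The scalar state, the zero initial state, and the weight dictionary <code>w</code> are assumptions made for readability; setting $p = 1$ recovers the ordinary recurrence from the previous subsection.</p>

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def relu(v):
    return max(0.0, v)


def skip_gru(xs, p, w):
    """Scalar GRU with RELU candidate; the recurrence uses h_{t-p} (p=1 is the plain GRU)."""
    hs = []
    for t, x in enumerate(xs):
        # Assumption: hidden states before the start of the window are zero.
        h_prev = hs[t - p] if t >= p else 0.0
        r = sigmoid(x * w["xr"] + h_prev * w["hr"] + w["br"])      # reset gate
        u = sigmoid(x * w["xu"] + h_prev * w["hu"] + w["bu"])      # update gate
        c = relu(x * w["xc"] + r * (h_prev * w["hc"]) + w["bc"])   # candidate state
        hs.append((1.0 - u) * h_prev + u * c)
    return hs


# Toy weights: both gates sit at 0.5 and the candidate equals the (non-negative) input.
w = {"xr": 0.0, "hr": 0.0, "br": 0.0,
     "xu": 0.0, "hu": 0.0, "bu": 0.0,
     "xc": 1.0, "hc": 0.0, "bc": 0.0}
xs = [1.0, 0.0, 0.0, 0.0]          # a single impulse at t = 0
h_plain = skip_gru(xs, p=1, w=w)   # impulse decays at every step
h_skip = skip_gru(xs, p=2, w=w)    # impulse reappears only at t = 2
```

<p>With $p = 2$ the impulse reaches $t = 2$ after a single recurrent step, which is the shortened path the skip links are designed to provide for periodic signals.</p>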
<h3 id="temporal-attention-alternative">Temporal Attention Alternative</h3>
<p>For datasets without clear periodicity, LSTNet offers an attention-based variant (LSTNet-Attn) as an alternative to the recurrent-skip component. The attention mechanism learns to weight hidden representations across the input window adaptively. The attention weights $\alpha_t \in \mathbb{R}^q$ at time $t$ are computed as:</p>
<p>$$\alpha_t = \text{AttnScore}(H_t^R, h_{t-1}^R)$$</p>
<p>where $H_t^R = [h_{t-q}^R, \dots, h_{t-1}^R]$ stacks the RNN hidden representations column-wise and AttnScore is a similarity function (dot product, cosine, or a parameterized MLP). The weighted context vector and final output are:</p>
<p>$$\begin{aligned}
c_t &amp;= H_t^R \alpha_t \\
h_t^D &amp;= W[c_t; h_{t-1}^R] + b
\end{aligned}$$</p>
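<p>A small sketch of the attention step, using the dot-product score (one of the similarity options listed above). Normalizing the scores with a softmax is an assumption here, as are the function and variable names; the final dense projection is omitted.</p>

```python
import math


def attention_context(H, query):
    """H: list of q hidden vectors (h_{t-q} .. h_{t-1}); query: h_{t-1}.

    Returns the attention weights alpha and the weighted context vector c_t.
    """
    scores = [sum(a * b for a, b in zip(h, query)) for h in H]   # dot-product score
    m = max(scores)                                              # for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]                            # assumption: softmax
    dim = len(query)
    context = [sum(alpha[j] * H[j][k] for j in range(len(H))) for k in range(dim)]
    return alpha, context


H_window = [[1.0, 0.0], [0.0, 1.0]]   # q = 2 hidden states of dimension 2
alpha, context = attention_context(H_window, query=[0.0, 1.0])
```

<p>The hidden state most similar to the query receives the larger weight, so the context vector leans toward the steps in the window that resemble the current state.</p>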
<h3 id="autoregressive-component">Autoregressive Component</h3>
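<p>Because the convolutional and recurrent components are non-linear, the scale of the network output is insensitive to the scale of its input. To counter this, LSTNet adds a classical autoregressive model in parallel:</p>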
<p>$$h_{t,i}^L = \sum_{k=0}^{q^{ar}-1} W_k^{ar} y_{t-k,i} + b^{ar}$$</p>
<p>The final prediction integrates both the neural network and AR outputs:</p>
<p>$$\hat{Y}_t = h_t^D + h_t^L$$</p>
<p>This decomposition separates the prediction into a linear part (handling local scale changes) and a non-linear part (capturing recurring patterns).</p>
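<p>The linear path is simple enough to sketch directly. The example below assumes, as the formula suggests, that the AR weights $W_k^{ar}$ are shared across variables; the function names and the toy numbers are illustrative only.</p>

```python
def ar_component(y_hist, weights, bias):
    """y_hist: rows of past observations, oldest first, n variables per row.

    weights[k] multiplies y_{t-k}; the same weights are applied to every variable.
    """
    q_ar = len(weights)
    n = len(y_hist[0])
    latest = len(y_hist) - 1
    return [
        bias + sum(weights[k] * y_hist[latest - k][i] for k in range(q_ar))
        for i in range(n)
    ]


def combine(h_D, h_L):
    """Final prediction: neural output plus AR output, per variable."""
    return [d + l for d, l in zip(h_D, h_L)]


# Two variables with identical shape but a 10x difference in scale.
y_hist = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
h_L = ar_component(y_hist, weights=[0.5, 0.5], bias=0.0)
y_hat = combine([0.1, -1.0], h_L)   # [0.1, -1.0] stands in for the neural output h_t^D
```

<p>Because the AR path is linear, a tenfold change in input magnitude produces a tenfold change in its output, which is the behavior the non-linear path lacks.</p>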
<h3 id="objective-function">Objective Function</h3>
<p>LSTNet supports two loss functions, selected via validation performance. The default is the squared (L2) loss:</p>
<p>$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \left\| Y_t - \hat{Y}_{t-h} \right\|_F^2$$</p>
<p>Motivated by the strong performance of Linear SVR baselines, LSTNet also supports the absolute (L1) loss, which is more robust to anomalies in real time series data:</p>
<p>$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \sum_{i=0}^{n-1} \left| Y_{t,i} - \hat{Y}_{t-h,i} \right|$$</p>
<p>where $\Theta$ is the full parameter set, $\Omega_{\text{Train}}$ is the set of training time stamps, $\|\cdot\|_F$ is the Frobenius norm, and $h$ is the forecast horizon.</p>
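<p>The difference between the two objectives shows up most clearly on an outlier. The sketch below computes both losses on a toy batch in which targets and predictions are assumed to be already aligned by the horizon; the data is made up for illustration.</p>

```python
def l2_loss(Y, Y_hat):
    """Sum of squared errors over all time stamps and variables."""
    return sum((y - p) ** 2 for row, prow in zip(Y, Y_hat) for y, p in zip(row, prow))


def l1_loss(Y, Y_hat):
    """Sum of absolute errors over all time stamps and variables."""
    return sum(abs(y - p) for row, prow in zip(Y, Y_hat) for y, p in zip(row, prow))


Y = [[1.0, 2.0], [3.0, 4.0]]
Y_hat = [[1.0, 2.0], [3.0, 14.0]]   # one prediction is off by 10
squared = l2_loss(Y, Y_hat)
absolute = l1_loss(Y, Y_hat)
```

<p>A single error of 10 contributes 100 to the squared loss but only 10 to the absolute loss, which is why the L1 objective is less dominated by anomalies.</p>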
<h2 id="evaluation-on-four-benchmark-datasets">Evaluation on Four Benchmark Datasets</h2>
<h3 id="datasets">Datasets</h3>
<table>
  <thead>
      <tr>
          <th>Dataset</th>
          <th>Length</th>
          <th>Variables</th>
          <th>Sample Rate</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Traffic</td>
          <td>17,544</td>
          <td>862</td>
          <td>1 hour</td>
      </tr>
      <tr>
          <td>Solar-Energy</td>
          <td>52,560</td>
          <td>137</td>
          <td>10 minutes</td>
      </tr>
      <tr>
          <td>Electricity</td>
          <td>26,304</td>
          <td>321</td>
          <td>1 hour</td>
      </tr>
      <tr>
          <td>Exchange-Rate</td>
          <td>7,588</td>
          <td>8</td>
          <td>1 day</td>
      </tr>
  </tbody>
</table>
<p>All datasets are split 60/20/20 (train/validation/test) in chronological order. Traffic, Solar-Energy, and Electricity exhibit clear periodic patterns (daily and weekly), while Exchange-Rate shows only short-term local continuity.</p>
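<p>A chronological split keeps the time order intact, so validation and test data always lie after the training data. A minimal sketch follows; the integer rounding of the boundaries is an assumption, and the official preprocessing may place them slightly differently.</p>

```python
def chronological_split(series, train_pct=60, valid_pct=20):
    """Split a time-ordered sequence into train/validation/test without shuffling."""
    T = len(series)
    train_end = T * train_pct // 100
    valid_end = T * (train_pct + valid_pct) // 100
    return series[:train_end], series[train_end:valid_end], series[valid_end:]


steps = list(range(17544))   # one index per hourly step of the Traffic dataset
train, valid, test = chronological_split(steps)
```

<p>Integer percentages avoid floating-point rounding at the split boundaries, and every test index is later than every training index.</p>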
<h3 id="baselines">Baselines</h3>
<p>The authors compare against seven methods: AR (univariate autoregression), LRidge (VAR with L2 regularization), LSVR (VAR with SVR objective), TRMF (temporal regularized matrix factorization), GP (Gaussian Process), VAR-MLP (hybrid MLP-autoregressive), and RNN-GRU (standard GRU).</p>
<h3 id="metrics">Metrics</h3>
<p>Two evaluation metrics are used:</p>
<ul>
<li><strong>Root Relative Squared Error (RSE)</strong> (lower is better): A scaled RMSE that normalizes by the standard deviation of the test data, making comparison across datasets readable regardless of data scale:</li>
</ul>
<p>$$\text{RSE} = \frac{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \hat{Y}_{it})^2}}{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \text{mean}(Y))^2}}$$</p>
<ul>
<li><strong>Empirical Correlation Coefficient (CORR)</strong> (higher is better): The average Pearson correlation between predicted and true time series across all $n$ variables:</li>
</ul>
<p>$$\text{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_t (Y_{it} - \text{mean}(Y_i))(\hat{Y}_{it} - \text{mean}(\hat{Y}_i))}{\sqrt{\sum_t (Y_{it} - \text{mean}(Y_i))^2 \sum_t (\hat{Y}_{it} - \text{mean}(\hat{Y}_i))^2}}$$</p>
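<p>Both metrics follow directly from the formulas. The sketch below is a plain-Python transcription, assuming <code>Y</code> and <code>Y_hat</code> are lists of rows (one row per test time stamp, one column per variable). It does not guard against constant series, for which the CORR denominator is zero.</p>

```python
import math


def rse(Y, Y_hat):
    """Root relative squared error over all (variable, time) pairs."""
    flat = [y for row in Y for y in row]
    mean_all = sum(flat) / len(flat)
    num = sum((y - p) ** 2 for row, prow in zip(Y, Y_hat) for y, p in zip(row, prow))
    den = sum((y - mean_all) ** 2 for y in flat)
    return math.sqrt(num) / math.sqrt(den)


def corr(Y, Y_hat):
    """Pearson correlation per variable, averaged over the n variables."""
    n = len(Y[0])
    total = 0.0
    for i in range(n):
        y = [row[i] for row in Y]
        p = [row[i] for row in Y_hat]
        my, mp = sum(y) / len(y), sum(p) / len(p)
        cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
        var = sum((a - my) ** 2 for a in y) * sum((b - mp) ** 2 for b in p)
        total += cov / math.sqrt(var)
    return total / n


Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mean_forecast = [[11.0, 11.0] for _ in Y]              # always predict the global mean
shifted = [[2.0 * a + 1.0 for a in row] for row in Y]  # right shape, wrong scale

rse_mean = rse(Y, mean_forecast)
corr_shifted = corr(Y, shifted)
```

<p>Predicting the global mean gives an RSE of 1, so values below 1 indicate a forecast that beats that trivial baseline. CORR is invariant to affine rescaling of the predictions, which is why the two metrics are reported together.</p>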
<h3 id="main-results">Main Results</h3>
<p>The models are evaluated at horizons $h \in \{3, 6, 12, 24\}$, corresponding to 3-24 hours for Traffic and Electricity, 30-240 minutes for Solar-Energy, and 3-24 days for Exchange-Rate.</p>
<p>LSTNet-Skip achieved the best result in 17 out of 32 (dataset, metric, horizon) combinations, and LSTNet-Attn won 7 more. No other method won more than 3. At horizon 24, the best LSTNet variant improved over RNN-GRU by 9.2% RSE on Solar-Energy (LSTNet-Attn), 11.7% on Traffic (LSTNet-Skip), and 22.2% on Electricity (LSTNet-Skip). On the Exchange-Rate dataset, which lacks periodic patterns, LSTNet performed comparably to AR and LRidge, as expected.</p>
<h3 id="ablation-study">Ablation Study</h3>
<p>Removing each component individually revealed:</p>
<ul>
<li><strong>Without AR</strong>: The largest performance drops across most datasets, confirming the AR component&rsquo;s role in handling scale changes. Visualization showed that LSTNet-Skip successfully tracks sudden magnitude shifts in electricity consumption around the 1000th hour, while the model without AR fails.</li>
<li><strong>Without Skip/CNN</strong>: Significant drops on datasets with periodic patterns, though less consistent than removing AR.</li>
<li><strong>Full LSTNet</strong>: The most robust configuration across all datasets and horizons.</li>
</ul>
<p>A simulation experiment with synthetic autoregressive data confirmed that standard RNN-GRU fails to track non-periodic scale changes, while LSTNet with its AR component adapts properly.</p>
<h2 id="robust-performance-through-architectural-complementarity">Robust Performance Through Architectural Complementarity</h2>
<p>LSTNet&rsquo;s main strength is the complementarity of its components. The CNN captures short-term local patterns, the recurrent-skip layer captures long-term periodic dependencies, and the AR component provides robustness to scale changes. On datasets with strong periodicity (Traffic, Solar-Energy, Electricity), the skip connections provide large gains. On datasets without periodicity (Exchange-Rate), the AR component prevents degradation below competitive baselines.</p>
<p>The primary limitation is that the skip length $p$ in the recurrent-skip component must be manually specified or tuned. For datasets with known periodicity (e.g., hourly data with daily cycles), $p$ is straightforward to set. For datasets without clear periodicity, $p$ must be tuned as a hyperparameter, and the attention-based variant (LSTNet-Attn) offers an alternative that avoids this requirement. Future work directions include automatic period detection and incorporating variable-level attribute information into the convolutional layer.</p>
<hr>
<h2 id="reproducibility-details">Reproducibility Details</h2>
<h3 id="data">Data</h3>
<table>
  <thead>
      <tr>
          <th>Purpose</th>
          <th>Dataset</th>
          <th>Size</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Training/Evaluation</td>
          <td>Traffic</td>
          <td>17,544 x 862</td>
          <td>California DoT highway occupancy, hourly, 2015-2016</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Solar-Energy</td>
          <td>52,560 x 137</td>
          <td>Solar power from 137 PV plants in Alabama, 10-min intervals, 2006</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Electricity</td>
          <td>26,304 x 321</td>
          <td>kWh consumption for 321 clients, hourly, 2012-2014</td>
      </tr>
      <tr>
          <td>Training/Evaluation</td>
          <td>Exchange-Rate</td>
          <td>7,588 x 8</td>
          <td>Daily exchange rates for 8 countries, 1990-2016</td>
      </tr>
  </tbody>
</table>
<p>All datasets are publicly available via the <a href="https://github.com/laiguokun/LSTNet">GitHub repository</a>.</p>
<h3 id="algorithms">Algorithms</h3>
<ul>
<li>Optimizer: Adam</li>
<li>Dropout: 0.1 or 0.2 after each layer except input and output</li>
<li>Window size $q$: grid search over $\{2^0, 2^1, \ldots, 2^9\}$</li>
<li>Skip length $p$: set to 24 for Traffic/Electricity; tuned from $2^1$ to $2^6$ for Solar-Energy/Exchange-Rate</li>
<li>Objective: L2 loss (Eq. 7 in the paper) or L1 loss (Eq. 9 in the paper), selected via validation performance</li>
</ul>
<h3 id="models">Models</h3>
<ul>
<li>Hidden dimensions (Recurrent/CNN): $\{50, 100, 200\}$</li>
<li>Hidden dimensions (Recurrent-skip): $\{20, 50, 100\}$</li>
<li>AR regularization: $\{0.1, 1, 10\}$</li>
</ul>
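<p>To get a feel for the size of this search space, the ranges listed above can be enumerated as a grid. Whether the authors searched the full Cartesian product is not stated, so the exhaustive grid below is an assumption, as are the dictionary keys and the choice to treat the recurrent and CNN widths as one shared setting.</p>

```python
from itertools import product

window_sizes = [2 ** k for k in range(10)]   # 2^0 .. 2^9
hidden_rnn_cnn = [50, 100, 200]              # assumption: one value shared by both layers
hidden_skip = [20, 50, 100]
ar_regularization = [0.1, 1, 10]
dropout_rates = [0.1, 0.2]

grid = [
    {"q": q, "hid": hid, "hid_skip": hs, "ar_reg": lam, "dropout": d}
    for q, hid, hs, lam, d in product(
        window_sizes, hidden_rnn_cnn, hidden_skip, ar_regularization, dropout_rates
    )
]
n_configs = len(grid)
```

<p>Under these assumptions the grid holds 540 configurations per dataset and horizon, before the skip length $p$ and the loss choice are considered.</p>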
<h3 id="evaluation">Evaluation</h3>
<table>
  <thead>
      <tr>
          <th>Dataset (horizon)</th>
          <th>Best LSTNet RSE</th>
          <th>RNN-GRU RSE (baseline)</th>
          <th>Relative RSE reduction</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>Solar-Energy (h=24)</td>
          <td>0.4403 (Attn)</td>
          <td>0.4852</td>
          <td>9.2%</td>
      </tr>
      <tr>
          <td>Traffic (h=24)</td>
          <td>0.4973 (Skip)</td>
          <td>0.5633</td>
          <td>11.7%</td>
      </tr>
      <tr>
          <td>Electricity (h=24)</td>
          <td>0.1007 (Skip)</td>
          <td>0.1295</td>
          <td>22.2%</td>
      </tr>
  </tbody>
</table>
<h3 id="hardware">Hardware</h3>
<p>Not specified in the paper.</p>
<h3 id="artifacts">Artifacts</h3>
<table>
  <thead>
      <tr>
          <th>Artifact</th>
          <th>Type</th>
          <th>License</th>
          <th>Notes</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td><a href="https://github.com/laiguokun/LSTNet">LSTNet (laiguokun/LSTNet)</a></td>
          <td>Code</td>
          <td>MIT</td>
          <td>Official PyTorch implementation (Python 2.7, PyTorch 0.3.0)</td>
      </tr>
      <tr>
          <td><a href="https://github.com/laiguokun/multivariate-time-series-data">Multivariate Time Series Data (laiguokun/multivariate-time-series-data)</a></td>
          <td>Dataset</td>
          <td>Unknown</td>
          <td>Preprocessed benchmark datasets (Traffic, Solar-Energy, Electricity, Exchange-Rate)</td>
      </tr>
  </tbody>
</table>
<p><strong>Reproducibility status</strong>: Highly Reproducible. Code and all four benchmark datasets are publicly available. Hyperparameter search ranges are fully specified.</p>
<hr>
<h2 id="paper-information">Paper Information</h2>
<p><strong>Citation</strong>: Lai, G., Chang, W.-C., Yang, Y., &amp; Liu, H. (2018). Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. <em>The 41st International ACM SIGIR Conference on Research &amp; Development in Information Retrieval (SIGIR &rsquo;18)</em>, 95-104. <a href="https://doi.org/10.1145/3209978.3210006">https://doi.org/10.1145/3209978.3210006</a></p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@inproceedings</span>{lai2018modeling,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Lai, Guokun and Chang, Wei-Cheng and Yang, Yiming and Liu, Hanxiao}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">booktitle</span>=<span style="color:#e6db74">{The 41st International ACM SIGIR Conference on Research \&amp; Development in Information Retrieval}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">pages</span>=<span style="color:#e6db74">{95--104}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2018}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">doi</span>=<span style="color:#e6db74">{10.1145/3209978.3210006}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item><item><title>Optimizing Sequence Models for Dynamical Systems</title><link>https://hunterheidenreich.com/research/deconstructing-recurrence-attention-gating/</link><pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate><guid>https://hunterheidenreich.com/research/deconstructing-recurrence-attention-gating/</guid><description>Ablation study deconstructing sequence models. Attention-augmented Recurrent Highway Networks outperform Transformers on chaotic systems.</description><content:encoded><![CDATA[<h2 id="abstract">Abstract</h2>
<p>Advanced neural network architectures developed for tasks like natural language processing are often transferred to spatiotemporal forecasting without a deep understanding of which components drive their performance. This can lead to suboptimal results and reinforces the view of these models as &ldquo;black boxes&rdquo;. In this work, we deconstruct the core mechanisms of Transformers and Recurrent Neural Networks (RNNs) (namely attention, gating, and recurrence). We then build and test novel hybrid architectures to identify which components are most effective. A key finding is that while adding recurrence is detrimental to Transformers, augmenting RNNs with attention and neural gating consistently improves their forecasting accuracy. Our study reveals that a seldom-used architecture, the Recurrent Highway Network (RHN) enhanced with these mechanisms, emerges as the top-performing model for forecasting high-dimensional chaotic systems.</p>
<h2 id="key-contributions">Key Contributions</h2>
<ul>
<li><strong>Systematic Ablation</strong>: Deconstructed Transformers and RNNs into core mechanisms (attention, gating, recurrence) to isolate performance drivers</li>
<li><strong>Novel Hybrid Architectures</strong>: Synthesized and tested new combinations of neural primitives for spatiotemporal forecasting</li>
<li><strong>RHN advantage on chaotic systems</strong>: Demonstrated that attention-augmented Recurrent Highway Networks outperform standard Transformers on high-dimensional chaotic systems</li>
<li><strong>Robustness Analysis</strong>: Validated models across both clean physics simulations and noisy real-world industrial datasets</li>
</ul>
<h2 id="motivation">Motivation</h2>
<p>In modern ML, architectures are often transferred from one domain (like NLP) to another (like physical forecasting) without understanding the underlying mechanics. This &ldquo;black box&rdquo; approach leads to suboptimal compute usage and performance ceilings.</p>
<p>Our goal was to break these architectures down. We treated the core mechanisms of <strong>Transformers</strong> and <strong>RNNs</strong> (<strong>Gating, Attention, and Recurrence</strong>) as orthogonal basis vectors. By decoupling these components, we could synthesize and test hybrid architectures to find the best configuration for spatiotemporal forecasting.</p>
<h2 id="methodological-approach">Methodological Approach</h2>
<p>We built a modular framework to mix and match neural primitives. We systematically evaluated:</p>
<ol>
<li><strong>Gating Mechanisms:</strong> Testing Additive, Learned Rate, Input-Dependent, and Coupled Input-Dependent variants</li>
<li><strong>Attention:</strong> Implementing multi-headed attention with relative positional biases</li>
<li><strong>Recurrence:</strong> Testing standard cells (LSTM, GRU) against deeper transition cells like Recurrent Highway Networks (RHN)</li>
</ol>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/neural-gates.webp"
         alt="Neural gating mechanisms: Additive, Learned Rate, Dependent-Coupled, and Dependent variants"
         title="Neural gating mechanisms: Additive, Learned Rate, Dependent-Coupled, and Dependent variants"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">The hierarchy of neural gating mechanisms we tested, from simple additive to fully input-dependent.</figcaption>
    
</figure>
















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/rnn-cell-types.webp"
         alt="RNN cell architectures: Elman, LSTM, GRU, and RHN cells"
         title="RNN cell architectures: Elman, LSTM, GRU, and RHN cells"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Recurrent cell types compared in our study. The RHN (d) extends processing depth within each timestep.</figcaption>
    
</figure>

<p>This ablation isolated exactly <em>which</em> mathematical operation was driving the performance gain.</p>
<h2 id="key-findings">Key Findings</h2>
<h3 id="recurrent-highway-networks-on-chaotic-systems">Recurrent Highway Networks on Chaotic Systems</h3>
<p>For high-dimensional chaotic systems like the Multiscale Lorenz-96 shown below, we found that a <strong>Recurrent Highway Network (RHN)</strong> augmented with <strong>Attention and Neural Gating</strong> was the top-performing architecture. This hybrid exceeded the forecasting accuracy of standard Transformers, suggesting that deeper recurrence (processing depth per timestep) matters for complex dynamics.</p>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/multiscale-lorenz.webp"
         alt="Forecasting comparison on Multiscale Lorenz-96 system"
         title="Forecasting comparison on Multiscale Lorenz-96 system"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Forecasting the Multiscale Lorenz-96 system. The top row shows the &lsquo;texture&rsquo; of the chaotic evolution. Notice how the RHN (far right) maintains the coherent wave-like structures for nearly a full Lyapunov time, holding structure longer than the Transformer variants (the plotted window spans two Lyapunov times).</figcaption>
    
</figure>

<h3 id="transformers-recurrence-hurts-gating-helps">Transformers: Recurrence Hurts, Gating Helps</h3>
<p>We attempted to force recurrence into Transformers to give them &ldquo;memory,&rdquo; but it consistently hurt performance. However, <strong>Neural Gating</strong> significantly improved Transformer robustness. For real-world, noisy data (traffic, weather), the <strong>Pre-Layer Normalization (PreLN) Transformer</strong> with added gating proved to be the most robust model.</p>
<h3 id="adding-attention-to-lstms-and-grus">Adding Attention to LSTMs and GRUs</h3>
<p>On the Kuramoto-Sivashinsky (K-S) equation, a model of turbulence and flame fronts, standard LSTMs and GRUs turned out to be under-optimized: adding <strong>attention</strong> raised the valid-prediction time of the best LSTM and GRU to roughly 4x and 6.6x their respective baselines (the paper reports the top RNNs at 2-7x baseline on K-S). On the partially observed Multiscale Lorenz-96 system, the gain from attention plus gating is smaller, at more than 40%.</p>















<figure class="post-figure center ">
    <img src="/img/deconstructing-sequence-prediction/kuramoto-sivashinksy.webp"
         alt="Forecasting comparison on Kuramoto-Sivashinsky system"
         title="Forecasting comparison on Kuramoto-Sivashinsky system"
         
         
         loading="lazy"
         class="post-image">
    
    <figcaption class="post-caption">Forecasting the Kuramoto-Sivashinsky system. The error heatmaps (bottom row) show how prediction quality degrades over time (lighter means larger error). The RHN maintains structural fidelity longer than competing architectures.</figcaption>
    
</figure>

<h3 id="robustness-on-real-world-datasets">Robustness on Real-World Datasets</h3>
<p>While chaotic systems test the limits of theory, we also validated our models on seven standard real-world datasets: the four <strong>Electricity Transformer Temperature (ETT)</strong> subsets plus <strong>Traffic</strong>, <strong>Electricity</strong>, and <strong>Weather</strong>.</p>
<p>Unlike the clean physics simulations, these datasets contain real-world noise and irregularities. In this environment, the <strong>Pre-Layer Normalization (PreLN) Transformer</strong> proved to be the most robust architecture. While it didn&rsquo;t always beat the RHN on pure chaos, its stability makes it a strong default choice for general time-series forecasting tasks where training speed and reliability are paramount.</p>
<h2 id="why-this-matters">Why This Matters</h2>
<p>This work treats architectural components as independently tunable choices rather than fixed defaults, and that framing surfaces a concrete trade-off. Transformers train in only 25-50% of the time the RNNs require (roughly 2-4x faster), while the attention-augmented RNNs give better inference accuracy on the chaotic physical systems. Which mechanism to select depends on whether the training budget or the forecast precision is the binding constraint, and the ablation makes that an informed choice rather than a default one.</p>
<p>The ablation framework here, treating architectural components as independently tunable factors and measuring their marginal contribution, shaped how later evaluation work is structured. The same principle of isolating variables rather than comparing end-to-end black boxes appears in the document processing research, from benchmark construction in page stream segmentation to grounded evaluation in GutenOCR.</p>
<h2 id="related-work">Related Work</h2>
<p>The methodology here shares a design philosophy with <a href="/research/eigennoise-contrastive-prior/">EigenNoise</a>,
which similarly decomposes a neural mechanism (word vector initialization) into theoretically
grounded components to isolate what drives performance. Both papers treat model components as
testable hypotheses rather than fixed architectural choices.</p>
<p>For broader context on where this fits in the portfolio&rsquo;s Scientific Machine Learning arc,
see the <a href="/research/">Research</a> overview.</p>
<h2 id="citation">Citation</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-bibtex" data-lang="bibtex"><span style="display:flex;"><span><span style="color:#a6e22e">@misc</span>{heidenreich2024deconstructingrecurrenceattentiongating,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">title</span>=<span style="color:#e6db74">{Deconstructing Recurrence, Attention, and Gating: Investigating the transferability of Transformers and Gated Recurrent Neural Networks in forecasting of dynamical systems}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">author</span>=<span style="color:#e6db74">{Hunter S. Heidenreich and Pantelis R. Vlachas and Petros Koumoutsakos}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">year</span>=<span style="color:#e6db74">{2024}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">eprint</span>=<span style="color:#e6db74">{2410.02654}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">archivePrefix</span>=<span style="color:#e6db74">{arXiv}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">primaryClass</span>=<span style="color:#e6db74">{cs.LG}</span>,
</span></span><span style="display:flex;"><span>  <span style="color:#a6e22e">url</span>=<span style="color:#e6db74">{https://arxiv.org/abs/2410.02654}</span>
</span></span><span style="display:flex;"><span>}
</span></span></code></pre></div>]]></content:encoded></item></channel></rss>