LSTNet: Long- and Short-Term Time Series Network

A Deep Learning Framework for Multivariate Forecasting

This is a Method paper that introduces the Long- and Short-term Time-series Network (LSTNet), a deep learning architecture specifically designed for multivariate time series forecasting. LSTNet combines convolutional neural networks (CNNs), recurrent neural networks (RNNs) with a novel skip-connection structure, and a traditional autoregressive (AR) component into a unified framework. The architecture targets the challenge of simultaneously capturing both short-term local dependencies and long-term periodic patterns in temporal data.

Why Short-Term and Long-Term Patterns Need Separate Treatment

Real-world multivariate time series often exhibit a mixture of repeating patterns at different time scales. Highway traffic, for example, shows daily peaks (morning vs. evening commutes) alongside weekly patterns (weekday vs. weekend behavior). Solar energy output varies with cloud movements on short time scales and with seasonal daylight changes on longer ones. Electricity consumption follows similar daily and weekly cycles.

Traditional autoregressive methods (VAR, ARIMA) and Gaussian Process models struggle to distinguish and jointly model these two kinds of recurring patterns. Standard RNNs, including LSTM and GRU variants, can in theory handle long-range dependencies but in practice suffer from vanishing gradients when the period is long (e.g., 24 time steps for a daily pattern at hourly resolution, or 168 time steps for a weekly one). The authors also identify a scale insensitivity problem: the output scale of a neural network does not adapt well to the input scale, so models can fail when the magnitude of the input signal changes in non-periodic ways, such as sudden shifts in electricity consumption due to holidays or weather events.

Combining CNNs, Recurrent-Skip Connections, and Autoregression

The LSTNet architecture consists of four main components that work together.

Convolutional Component

The first layer applies 1D convolution without pooling across the multivariate input. Each filter has width $\omega$ (in the time dimension) and height $n$ (spanning all variables), producing feature maps that capture short-term local dependency patterns among variables:

$$h_k = \text{RELU}(W_k * X + b_k)$$

where $*$ denotes convolution and the input is zero-padded so each output vector has length $T$. The output is a $d_c \times T$ matrix where $d_c$ is the number of filters.
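As a rough illustration of the convolutional layer, the sketch below applies filters of width $\omega$ that span all $n$ variables and keeps the output length at $T$. It is a minimal pure-Python sketch, not the authors' implementation: the name `conv_layer`, the list-of-lists layout, and the choice to pad on the left (so each output sees only current and past steps) are assumptions made for illustration.

```python
def conv_layer(X, filters, biases):
    """Apply d_c filters of shape (omega x n) along the time axis with RELU.

    X       : list of T rows, each holding n variable values.
    filters : list of d_c kernels, each a list of omega rows of n weights.
    biases  : list of d_c scalars.
    Returns a d_c x T matrix as a list of lists.
    """
    T, n = len(X), len(X[0])
    out = []
    for W, b in zip(filters, biases):
        omega = len(W)
        # Assumption: pad omega - 1 zero rows on the left so the output has length T.
        padded = [[0.0] * n for _ in range(omega - 1)] + X
        row = []
        for t in range(T):
            s = b
            for j in range(omega):
                for i in range(n):
                    s += W[j][i] * padded[t + j][i]
            row.append(max(0.0, s))  # RELU
        out.append(row)
    return out


# Toy input: T = 3 time steps, n = 2 variables, two filters of width 2.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],      # previous step of variable 0 + current step of variable 1
    [[-1.0, -1.0], [-1.0, -1.0]],  # always negative on this input, so RELU zeroes it
]
feature_maps = conv_layer(X, filters, [0.0, 0.0])
```

Each row of `feature_maps` corresponds to one filter, so the result has shape $d_c \times T$ as described above.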

Recurrent Component

The CNN output feeds into a GRU-based recurrent layer that uses RELU (rather than the standard tanh) as the hidden update activation:

$$\begin{aligned} r_t &= \sigma(x_t W_{xr} + h_{t-1} W_{hr} + b_r) \\ u_t &= \sigma(x_t W_{xu} + h_{t-1} W_{hu} + b_u) \\ c_t &= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-1} W_{hc}) + b_c) \\ h_t &= (1 - u_t) \odot h_{t-1} + u_t \odot c_t \end{aligned}$$
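The four update equations can be traced with a small pure-Python sketch of one recurrent step. The function name `gru_relu_step`, the parameter dictionary `P`, and the helper `vec_mat` are hypothetical names chosen for this illustration; vectors are treated as row vectors to match the $x_t W$ ordering in the equations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def vec_mat(v, M):
    """Row vector v (length a) times matrix M (a x b)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def gru_relu_step(x, h_prev, P):
    """One GRU step with RELU as the candidate activation."""
    # Reset gate r_t and update gate u_t.
    r = [sigmoid(a + b + c)
         for a, b, c in zip(vec_mat(x, P["Wxr"]), vec_mat(h_prev, P["Whr"]), P["br"])]
    u = [sigmoid(a + b + c)
         for a, b, c in zip(vec_mat(x, P["Wxu"]), vec_mat(h_prev, P["Whu"]), P["bu"])]
    # Candidate state c_t: the reset gate scales the recurrent contribution.
    hc = vec_mat(h_prev, P["Whc"])
    c = [max(0.0, a + ri * hi + b)
         for a, ri, hi, b in zip(vec_mat(x, P["Wxc"]), r, hc, P["bc"])]
    # New hidden state: interpolate between the old state and the candidate.
    return [(1.0 - ui) * hp + ui * ci for ui, hp, ci in zip(u, h_prev, c)]


# Toy parameters: 1 input feature, 1 hidden unit; gates fixed at sigmoid(0) = 0.5.
P = {
    "Wxr": [[0.0]], "Whr": [[0.0]], "br": [0.0],
    "Wxu": [[0.0]], "Whu": [[0.0]], "bu": [0.0],
    "Wxc": [[1.0]], "Whc": [[0.0]], "bc": [0.0],
}
h_new = gru_relu_step([3.0], [0.0], P)
```

With these toy parameters the candidate equals the input and the update gate is 0.5, so the new state is halfway between the old state and the candidate.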

Recurrent-Skip Component

The key architectural innovation is a recurrent structure with temporal skip connections. Instead of connecting to the immediately preceding hidden state $h_{t-1}$, skip links connect to the hidden state from $p$ steps ago ($h_{t-p}$), where $p$ corresponds to the period length of the data (e.g., $p = 24$ for hourly data with daily periodicity):

$$\begin{aligned} r_t &= \sigma(x_t W_{xr} + h_{t-p} W_{hr} + b_r) \\ u_t &= \sigma(x_t W_{xu} + h_{t-p} W_{hu} + b_u) \\ c_t &= \text{RELU}(x_t W_{xc} + r_t \odot (h_{t-p} W_{hc}) + b_c) \\ h_t &= (1 - u_t) \odot h_{t-p} + u_t \odot c_t \end{aligned}$$

This design shortens the effective path length for learning periodic dependencies, making optimization easier. A dense layer combines outputs from both recurrent components:

$$h_t^D = W^R h_t^R + \sum_{i=0}^{p-1} W_i^S h_{t-i}^S + b$$
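The indexing of the skip links and of the dense combination can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: `run_skip_recurrence`, `combine`, and `mat_vec` are hypothetical names, hidden states before the start of the window are assumed to be zero, and the `step` argument stands in for the GRU-style update shown above.

```python
def run_skip_recurrence(xs, p, step, hidden_size):
    """Compute h_t = step(x_t, h_{t-p}) for every t in the window."""
    hs = []
    for t, x in enumerate(xs):
        # Assumption: states earlier than the window are zero vectors.
        h_skip = hs[t - p] if t >= p else [0.0] * hidden_size
        hs.append(step(x, h_skip))
    return hs


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def combine(h_R, skip_states, W_R, W_S, b):
    """h^D = W^R h^R + sum_i W_i^S h^S_{t-i} + b, with skip_states[i] = h^S_{t-i}."""
    out = [a + bi for a, bi in zip(mat_vec(W_R, h_R), b)]
    for W_i, h_i in zip(W_S, skip_states):
        out = [o + a for o, a in zip(out, mat_vec(W_i, h_i))]
    return out


# Toy update h_t = x_t + h_{t-p}: each state sums the inputs that share its phase.
p = 2
xs = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
hs = run_skip_recurrence(xs, p, lambda x, h: [x[0] + h[0]], hidden_size=1)

# The last p skip states h_t, h_{t-1}, ..., h_{t-p+1} feed the dense layer.
last_p = [hs[-1 - i] for i in range(p)]
h_D = combine([1.0], last_p, W_R=[[2.0]], W_S=[[[1.0]], [[1.0]]], b=[0.5])
```

With $p = 2$ the recurrence splits into two interleaved chains (even and odd time steps), which is why information from one period ago reaches the current step in a single hop.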

Temporal Attention Alternative

For datasets without clear periodicity, LSTNet offers an attention-based variant (LSTNet-Attn) as an alternative to the recurrent-skip component. The attention mechanism learns to weight hidden representations across the input window adaptively. The attention weights $\alpha_t \in \mathbb{R}^q$ at time $t$ are computed as:

$$\alpha_t = \text{AttnScore}(H_t^R, h_{t-1}^R)$$

where $H_t^R = [h_{t-q}^R, \dots, h_{t-1}^R]$ stacks the RNN hidden representations column-wise and AttnScore is a similarity function (dot product, cosine, or a parameterized MLP). The weighted context vector and final output are:

$$\begin{aligned} c_t &= H_t^R \alpha_t \\ h_t^D &= W[c_t; h_{t-1}^R] + b \end{aligned}$$
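A small sketch of the attention step, assuming a dot-product score and a softmax normalization of the weights. The normalization is an assumption of this example (the text above only names the score function), and `attention_context` is a hypothetical helper name.

```python
import math


def attention_context(H, query):
    """Weight the q hidden states in H by their similarity to the query.

    H     : list of q hidden vectors (the columns of H_t^R).
    query : the last hidden state h_{t-1}^R.
    Returns (alpha, context).
    """
    # Dot-product score for each position in the window.
    scores = [sum(a * b for a, b in zip(h, query)) for h in H]
    # Assumption: normalize scores with a numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector c_t = H alpha: a weighted sum of the hidden states.
    context = [sum(alpha[k] * H[k][j] for k in range(len(H)))
               for j in range(len(query))]
    return alpha, context


# Two identical hidden states receive equal weight.
alpha, context = attention_context([[1.0, 0.0], [1.0, 0.0]], [1.0, 0.0])
```

The final output $h_t^D$ would then be a linear map of the concatenation of `context` and the query, as in the second equation above.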

Autoregressive Component

To address the scale insensitivity of neural networks, LSTNet adds a classical autoregressive model in parallel:

$$h_{t,i}^L = \sum_{k=0}^{q^{ar}-1} W_k^{ar} y_{t-k,i} + b^{ar}$$

The final prediction integrates both the neural network and AR outputs:

$$\hat{Y}_t = h_t^D + h_t^L$$

This decomposition separates the prediction into a linear part (handling local scale changes) and a non-linear part (capturing recurring patterns).
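The linear path can be sketched in a few lines. This is a minimal illustration with hypothetical names (`ar_component`, `final_prediction`); following the equation above, the AR weights are shared across variables and applied to the most recent $q^{ar}$ observations of each series.

```python
def ar_component(Y_window, W_ar, b_ar):
    """h^L_{t,i} = sum_k W_ar[k] * y_{t-k,i} + b_ar for every variable i.

    Y_window : list of rows ordered oldest to newest, each with n variables.
    W_ar     : list of q_ar weights; W_ar[0] multiplies the most recent row.
    """
    q_ar = len(W_ar)
    n = len(Y_window[0])
    recent = Y_window[-q_ar:][::-1]  # recent[k] holds y_{t-k}
    return [sum(W_ar[k] * recent[k][i] for k in range(q_ar)) + b_ar
            for i in range(n)]


def final_prediction(h_D, h_L):
    """Sum of the neural output and the AR output, per variable."""
    return [d + l for d, l in zip(h_D, h_L)]


window = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
h_L = ar_component(window, W_ar=[0.5, 0.25], b_ar=1.0)
y_hat = final_prediction([0.1, -1.0], h_L)

# With zero bias the AR output scales linearly with the input magnitude.
scaled = [[10.0 * v for v in row] for row in window]
h_L_base = ar_component(window, [0.5, 0.25], 0.0)
h_L_scaled = ar_component(scaled, [0.5, 0.25], 0.0)
```

The last three lines show the property the AR path is meant to supply: multiplying the inputs by 10 multiplies its output by 10, regardless of what the non-linear part has learned.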

Objective Function

LSTNet supports two loss functions, selected via validation performance. The default is the squared (L2) loss:

$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \left\| Y_t - \hat{Y}_{t-h} \right\|_F^2$$

Motivated by the strong performance of Linear SVR baselines, LSTNet also supports the absolute (L1) loss, which is more robust to anomalies in real time series data:

$$\underset{\Theta}{\text{minimize}} \sum_{t \in \Omega_{\text{Train}}} \sum_{i=0}^{n-1} \left| Y_{t,i} - \hat{Y}_{t-h,i} \right|$$

where $\Theta$ is the full parameter set, $\Omega_{\text{Train}}$ is the set of training time stamps, $\|\cdot\|_F$ is the Frobenius norm, and $h$ is the forecast horizon.
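The difference between the two objectives is easy to see on a toy example. The helpers below are illustrative only (`l2_loss` and `l1_loss` are hypothetical names) and operate on already aligned targets and predictions, one row per training time stamp.

```python
def l2_loss(Y, Y_hat):
    """Sum of squared errors over all time stamps and variables."""
    return sum((y - p) ** 2
               for ry, rp in zip(Y, Y_hat) for y, p in zip(ry, rp))


def l1_loss(Y, Y_hat):
    """Sum of absolute errors over all time stamps and variables."""
    return sum(abs(y - p)
               for ry, rp in zip(Y, Y_hat) for y, p in zip(ry, rp))


Y = [[1.0, 2.0], [3.0, 4.0]]
Y_hat = [[1.0, 0.0], [0.0, 4.0]]
squared = l2_loss(Y, Y_hat)   # errors 2 and 3 contribute 4 + 9
absolute = l1_loss(Y, Y_hat)  # errors 2 and 3 contribute 2 + 3
```

Because the squared loss grows quadratically with the error, a single anomalous point weighs far more under L2 than under L1, which is the robustness argument made above.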

Evaluation on Four Benchmark Datasets

Datasets

Dataset	Length	Variables	Sample Rate
Traffic	17,544	862	1 hour
Solar-Energy	52,560	137	10 minutes
Electricity	26,304	321	1 hour
Exchange-Rate	7,588	8	1 day

All datasets are split 60/20/20 (train/validation/test) in chronological order. Traffic, Solar-Energy, and Electricity exhibit clear periodic patterns (daily and weekly), while Exchange-Rate shows only short-term local continuity.
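A chronological split simply cuts the series at fixed positions without shuffling. The sketch below uses integer arithmetic to find the cut points; the helper name `chronological_split` and the exact rounding at the boundaries are assumptions and may differ from the original code.

```python
def chronological_split(n_rows, train_pct=60, valid_pct=20):
    """Return index ranges for train, validation, and test in time order."""
    train_end = n_rows * train_pct // 100
    valid_end = n_rows * (train_pct + valid_pct) // 100
    return (range(0, train_end),
            range(train_end, valid_end),
            range(valid_end, n_rows))


# Traffic has 17,544 hourly rows.
train_idx, valid_idx, test_idx = chronological_split(17544)
```

Keeping the order intact means the validation and test periods always lie after the training period, so no future information leaks into training.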

Baselines

The authors compare against seven methods: AR (univariate autoregression), LRidge (VAR with L2 regularization), LSVR (VAR with SVR objective), TRMF (temporal regularized matrix factorization), GP (Gaussian Process), VAR-MLP (hybrid MLP-autoregressive), and RNN-GRU (standard GRU).

Metrics

Two evaluation metrics are used:

Root Relative Squared Error (RSE) (lower is better): A scaled RMSE that normalizes by the standard deviation of the test data, so that results are comparable across datasets regardless of data scale:

$$\text{RSE} = \frac{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \hat{Y}_{it})^2}}{\sqrt{\sum_{(i,t) \in \Omega_{\text{Test}}} (Y_{it} - \text{mean}(Y))^2}}$$

Empirical Correlation Coefficient (CORR) (higher is better): The average Pearson correlation between predicted and true time series across all $n$ variables:

$$\text{CORR} = \frac{1}{n} \sum_{i=1}^{n} \frac{\sum_t (Y_{it} - \text{mean}(Y_i))(\hat{Y}_{it} - \text{mean}(\hat{Y}_i))}{\sqrt{\sum_t (Y_{it} - \text{mean}(Y_i))^2 \sum_t (\hat{Y}_{it} - \text{mean}(\hat{Y}_i))^2}}$$
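Both metrics can be computed directly from their definitions. The following is a minimal pure-Python sketch with hypothetical function names `rse` and `corr`; it assumes every series has non-zero variance in both the targets and the predictions, since a constant series would make the CORR denominator zero.

```python
import math


def rse(Y, Y_hat):
    """Root relative squared error over all (time, variable) entries."""
    flat = [v for row in Y for v in row]
    mean_all = sum(flat) / len(flat)
    num = sum((y - p) ** 2
              for ry, rp in zip(Y, Y_hat) for y, p in zip(ry, rp))
    den = sum((v - mean_all) ** 2 for v in flat)
    return math.sqrt(num) / math.sqrt(den)


def corr(Y, Y_hat):
    """Average Pearson correlation across the n variables (columns)."""
    n = len(Y[0])
    total = 0.0
    for i in range(n):
        y = [row[i] for row in Y]
        p = [row[i] for row in Y_hat]
        my, mp = sum(y) / len(y), sum(p) / len(p)
        cov = sum((a - my) * (b - mp) for a, b in zip(y, p))
        vy = sum((a - my) ** 2 for a in y)
        vp = sum((b - mp) ** 2 for b in p)
        # Assumption: vy and vp are non-zero for every series.
        total += cov / math.sqrt(vy * vp)
    return total / n


Y = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
perfect = rse(Y, Y)
mean_only = rse(Y, [[3.0, 3.0]] * 3)  # predicting the global mean everywhere
doubled = corr(Y, [[2.0 * v for v in row] for row in Y])
```

Predicting the global mean gives an RSE of 1, so values below 1 indicate a model that beats that trivial predictor. CORR ignores scale, which is why the doubled predictions still correlate perfectly.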

Main Results

The models are evaluated at horizons $h \in \{3, 6, 12, 24\}$, corresponding to 3 to 24 hours for Traffic and Electricity, 30 to 240 minutes for Solar-Energy, and 3 to 24 days for Exchange-Rate.
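To make the role of the horizon concrete, the sketch below pairs each input window of $q$ steps with the value $h$ steps after the window ends. The helper `make_samples` and the exact alignment of window and target are assumptions for illustration; the original code may index them differently.

```python
def make_samples(series, q, h):
    """Build (window, target) pairs from a chronologically ordered series.

    The window covers steps t-q+1 .. t and the target is the value at t+h.
    """
    samples = []
    for t in range(q - 1, len(series) - h):
        samples.append((series[t - q + 1: t + 1], series[t + h]))
    return samples


# Ten time steps, window of 3, horizon of 2.
pairs = make_samples(list(range(10)), q=3, h=2)
```

A larger $h$ leaves a wider gap between the last observed step and the target, which is why errors generally grow with the horizon.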

LSTNet-Skip achieved the best result in 17 out of 32 (dataset, metric, horizon) combinations, and LSTNet-Attn won 7 more. No other method won more than 3. At horizon 24, the best LSTNet variant improved over RNN-GRU by 9.2% RSE on Solar-Energy (LSTNet-Attn), 11.7% on Traffic (LSTNet-Skip), and 22.2% on Electricity (LSTNet-Skip). On the Exchange-Rate dataset, which lacks periodic patterns, LSTNet performed comparably to AR and LRidge, as expected.

Ablation Study

Removing each component individually revealed:

Without AR: The largest performance drops across most datasets, confirming the AR component’s role in handling scale changes. Visualization showed that LSTNet-Skip successfully tracks sudden magnitude shifts in electricity consumption around the 1000th hour, while the model without AR fails.
Without Skip/CNN: Significant drops on datasets with periodic patterns, though less consistent than removing AR.
Full LSTNet: The most robust configuration across all datasets and horizons.

A simulation experiment with synthetic autoregressive data confirmed that standard RNN-GRU fails to track non-periodic scale changes, while LSTNet with its AR component adapts properly.

Robust Performance Through Architectural Complementarity

LSTNet’s main strength is the complementarity of its components. The CNN captures short-term local patterns, the recurrent-skip layer captures long-term periodic dependencies, and the AR component provides robustness to scale changes. On datasets with strong periodicity (Traffic, Solar-Energy, Electricity), the skip connections provide large gains. On datasets without periodicity (Exchange-Rate), the AR component prevents degradation below competitive baselines.

The primary limitation is that the skip length $p$ in the recurrent-skip component must be manually specified or tuned. For datasets with known periodicity (e.g., hourly data with daily cycles), $p$ is straightforward to set. For datasets without clear periodicity, $p$ must be tuned as a hyperparameter, and the attention-based variant (LSTNet-Attn) offers an alternative that avoids this requirement. Future work directions include automatic period detection and incorporating variable-level attribute information into the convolutional layer.

Reproducibility Details

Data

Purpose	Dataset	Size	Notes
Training/Evaluation	Traffic	17,544 x 862	California DoT highway occupancy, hourly, 2015-2016
Training/Evaluation	Solar-Energy	52,560 x 137	Solar power from 137 PV plants in Alabama, 10-min intervals, 2006
Training/Evaluation	Electricity	26,304 x 321	kWh consumption for 321 clients, hourly, 2012-2014
Training/Evaluation	Exchange-Rate	7,588 x 8	Daily exchange rates for 8 countries, 1990-2016

All datasets are publicly available via the GitHub repository.

Algorithms

Optimizer: Adam
Dropout: 0.1 or 0.2 after each layer except input and output
Window size $q$: grid search over $\{2^0, 2^1, \ldots, 2^9\}$
Skip length $p$: set to 24 for Traffic/Electricity; tuned from $2^1$ to $2^6$ for Solar-Energy/Exchange-Rate
Objective: L2 loss (Eq. 7) or L1 loss (Eq. 9), selected via validation

Models

Hidden dimensions (Recurrent/CNN): $\{50, 100, 200\}$
Hidden dimensions (Recurrent-skip): $\{20, 50, 100\}$
AR regularization: $\{0.1, 1, 10\}$
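The search ranges listed under Algorithms and Models can be written down as a small configuration. The dictionary keys and the helper `iter_configs` are hypothetical names, and enumerating the full Cartesian product is an assumption of this sketch: the source states the ranges but not how they were combined. The skip length $p$ is left out because it is set per dataset.

```python
from itertools import product

# Ranges taken from the lists above; key names are illustrative.
search_space = {
    "window_q": [2 ** k for k in range(10)],  # 2^0 .. 2^9
    "hidden_rnn_cnn": [50, 100, 200],
    "hidden_skip": [20, 50, 100],
    "dropout": [0.1, 0.2],
    "ar_regularization": [0.1, 1, 10],
    "loss": ["l2", "l1"],
}


def iter_configs(space):
    """Yield one dict per combination of values (assumed full grid)."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
```

Each configuration would be trained and scored on the validation split, with the best one then evaluated on the test split.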

Evaluation

Dataset (horizon)	Best LSTNet RSE	RNN-GRU RSE	Relative improvement
Solar-Energy (h=24)	0.4403 (Attn)	0.4852	9.2%
Traffic (h=24)	0.4973 (Skip)	0.5633	11.7%
Electricity (h=24)	0.1007 (Skip)	0.1295	22.2%
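The improvement column can be recomputed from the RSE values as the relative reduction with respect to the baseline. The helper name `relative_improvement` is hypothetical. Note that the Solar-Energy figure computes to roughly 9.25%, which the source reports as 9.2%.

```python
def relative_improvement(baseline, model):
    """Percentage reduction in RSE relative to the baseline."""
    return (baseline - model) / baseline * 100.0


rows = {
    "Solar-Energy": (0.4852, 0.4403),
    "Traffic": (0.5633, 0.4973),
    "Electricity": (0.1295, 0.1007),
}
improvements = {name: relative_improvement(base, best)
                for name, (base, best) in rows.items()}
```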

Hardware

Not specified in the paper.

Artifacts

Artifact	Type	License	Notes
LSTNet (laiguokun/LSTNet)	Code	MIT	Official PyTorch implementation (Python 2.7, PyTorch 0.3.0)
Multivariate Time Series Data (laiguokun/multivariate-time-series-data)	Dataset	Unknown	Preprocessed benchmark datasets (Traffic, Solar-Energy, Electricity, Exchange-Rate)

Reproducibility status: Highly Reproducible. Code and all four benchmark datasets are publicly available. Hyperparameter search ranges are fully specified.

Paper Information

Citation: Lai, G., Chang, W.-C., Yang, Y., & Liu, H. (2018). Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks. The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '18), 95-104. https://doi.org/10.1145/3209978.3210006

@inproceedings{lai2018modeling,
  title={Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks},
  author={Lai, Guokun and Chang, Wei-Cheng and Yang, Yiming and Liu, Hanxiao},
  booktitle={The 41st International ACM SIGIR Conference on Research \& Development in Information Retrieval},
  pages={95--104},
  year={2018},
  doi={10.1145/3209978.3210006}
}
