TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Attention as an RNN

Attention as an RNN

Leo Feng, Frederick Tung, Hossein Hajimirsadeghi, Mohamed Osama Ahmed, Yoshua Bengio, Greg Mori

2024-05-22Time Series ForecastingTime SeriesTime Series Classification
PaperPDFCode

Abstract

The advent of Transformers marked a significant breakthrough in sequence modelling, providing a highly performant architecture capable of leveraging GPU parallelism. However, Transformers are computationally expensive at inference time, limiting their applications, particularly in low-resource settings (e.g., mobile and embedded devices). Addressing this, we (1) begin by showing that attention can be viewed as a special Recurrent Neural Network (RNN) with the ability to compute its \textit{many-to-one} RNN output efficiently. We then (2) show that popular attention-based models such as Transformers can be viewed as RNN variants. However, unlike traditional RNNs (e.g., LSTMs), these models cannot be updated efficiently with new tokens, an important property in sequence modelling. Tackling this, we (3) introduce a new efficient method of computing attention's \textit{many-to-many} RNN output based on the parallel prefix scan algorithm. Building on the new attention formulation, we (4) introduce \textbf{Aaren}, an attention-based module that can not only (i) be trained in parallel (like Transformers) but also (ii) be updated efficiently with new tokens, requiring only constant memory for inferences (like traditional RNNs). Empirically, we show Aarens achieve comparable performance to Transformers on $38$ datasets spread across four popular sequential problem settings: reinforcement learning, event forecasting, time series classification, and time series forecasting tasks while being more time and memory-efficient.

Results

TaskDatasetMetricValueModel
Time Series ForecastingETTh1 (336) MultivariateMAE0.55Aaren
Time Series ForecastingETTh1 (336) MultivariateMSE0.65Aaren
Time Series AnalysisETTh1 (336) MultivariateMAE0.55Aaren
Time Series AnalysisETTh1 (336) MultivariateMSE0.65Aaren

Related Papers

The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting2025-07-17MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling2025-07-17Data Augmentation in Time Series Forecasting through Inverted Framework2025-07-15D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data2025-07-15Towards Interpretable Time Series Foundation Models2025-07-10MoFE-Time: Mixture of Frequency Domain Experts for Time-Series Forecasting Models2025-07-09Foundation models for time series forecasting: Application in conformal prediction2025-07-09Bridging the Last Mile of Prediction: Enhancing Time Series Forecasting with Conditional Guided Flow Matching2025-07-09