Flowformer: Linearizing Transformers with Conservation Flows

Haixu Wu, Jialong Wu, Jiehui Xu, Jianmin Wang, Mingsheng Long

2022-02-13
Offline RL, D4RL, Time Series, Time Series Analysis

Abstract

Transformers based on the attention mechanism have achieved impressive success in various areas. However, the attention mechanism has quadratic complexity in the number of tokens, which significantly impedes Transformers from handling long token sequences and scaling up to bigger models. Previous methods mainly use similarity decomposition and the associativity of matrix multiplication to devise linear-time attention mechanisms. They avoid degeneration of attention to a trivial distribution by reintroducing inductive biases such as locality, at the expense of model generality and expressiveness. In this paper, we linearize Transformers free from specific inductive biases, based on flow network theory. We cast attention as the information flow aggregated from the sources (values) to the sinks (results) through the learned flow capacities (attentions). Within this framework, we apply the property of flow conservation to attention and propose the Flow-Attention mechanism of linear complexity. By conserving the incoming flow of sinks to induce source competition, and the outgoing flow of sources to induce sink allocation, Flow-Attention inherently generates informative attentions without using specific inductive biases. Empowered by Flow-Attention, Flowformer yields strong performance in linear time across a wide range of areas, including long sequences, time series, vision, natural language, and reinforcement learning. The code and settings are available at this repository: https://github.com/thuml/Flowformer.
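The associativity argument mentioned in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation (see the repository above for that). It assumes a sigmoid as the non-negative feature map `phi`, leaves the attention unnormalized, and omits the competition and allocation steps. The helper names (`phi`, `matmul`, `flows`, and so on) are hypothetical and chosen only for this example. The sketch shows that regrouping the product avoids the n-by-m capacity matrix, and that the incoming flow of each sink and the outgoing flow of each source can likewise be computed without materializing it.

```python
import math
import random


def phi(x):
    """Non-negative feature map. Sigmoid is an illustrative assumption."""
    return 1.0 / (1.0 + math.exp(-x))


def feature_map(m):
    """Apply phi element-wise to a matrix stored as a list of rows."""
    return [[phi(x) for x in row] for row in m]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def quadratic_attention(q, k, v):
    """Builds the n x m capacity matrix phi(Q) phi(K)^T, then multiplies by V."""
    capacities = matmul(feature_map(q), transpose(feature_map(k)))
    return matmul(capacities, v)


def linear_attention(q, k, v):
    """Same product regrouped as phi(Q) (phi(K)^T V); cost is linear in n and m."""
    kv = matmul(transpose(feature_map(k)), v)  # d x d_v summary of all sources
    return matmul(feature_map(q), kv)


def flows(q, k):
    """Incoming flow per sink and outgoing flow per source, in linear time.

    incoming[i] = sum_j phi(q_i) . phi(k_j) = phi(q_i) . (sum_j phi(k_j))
    outgoing[j] = sum_i phi(q_i) . phi(k_j) = phi(k_j) . (sum_i phi(q_i))
    """
    fq, fk = feature_map(q), feature_map(k)
    k_sum = [sum(col) for col in zip(*fk)]
    q_sum = [sum(col) for col in zip(*fq)]
    incoming = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    outgoing = [sum(a * b for a, b in zip(row, q_sum)) for row in fk]
    return incoming, outgoing


# Small random example: n sinks (queries), m sources (keys/values).
random.seed(0)
n, m, d, d_v = 5, 7, 3, 4
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
V = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(m)]

out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
incoming, outgoing = flows(Q, K)

# Largest element-wise difference between the two groupings (floating-point noise only).
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
print("max difference between groupings:", max_gap)
```

The two groupings agree up to floating-point error, but only the second avoids a cost proportional to n times m. The `incoming` and `outgoing` vectors are the row and column sums of the capacity matrix; under the assumptions above, they are the quantities the abstract refers to when it speaks of conserving the incoming flow of sinks and the outgoing flow of sources.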

Results

Task	Dataset	Metric	Value	Model
MuJoCo Games	D4RL	Average Reward	73.5	Flowformer

Related Papers

From Novelty to Imitation: Self-Distilled Rewards for Offline Reinforcement Learning (2025-07-17)
MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)
The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting (2025-07-17)
Emergence of Functionally Differentiated Structures via Mutual Information Optimization in Recurrent Neural Networks (2025-07-17)
Step-wise Policy for Rare-tool Knowledge (SPaRK): Offline RL that Drives Diverse Tool Use in LLMs (2025-07-15)
Data Augmentation in Time Series Forecasting through Inverted Framework (2025-07-15)
D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data (2025-07-15)
Towards Interpretable Time Series Foundation Models (2025-07-10)