Feedback Memory

General · Introduced 2020 · 4 papers
Source Paper

Addressing Some Limitations of Transformers with Feedback Memory

Description

Feedback Memory is a type of attention module used in the Feedback Transformer architecture. It allows a Transformer to use the most abstract representations from the past directly as inputs for the current timestep. As a result, the model does not form its representations in parallel but sequentially, token by token. More precisely, the context inputs to the attention modules are replaced with memory vectors computed over the past:

$$\mathbf{z}^{l}_{t} = \text{Attn}\left(\mathbf{x}^{l}_{t}, \left[\mathbf{m}_{t-\tau}, \dots, \mathbf{m}_{t-1}\right]\right)$$

where the memory vector $\mathbf{m}_{t}$ is computed by summing the representations of each layer at the $t$-th time step:

$$\mathbf{m}_{t} = \sum_{l=0}^{L}\text{Softmax}\left(w^{l}\right)\mathbf{x}^{l}_{t}$$

where the $w^{l}$ are learnable scalar parameters and $l = 0$ corresponds to the token embeddings. Weighting the layers with a softmax gives the model flexibility: it can average them or select a single one. This modification of the self-attention input changes the Transformer's computation from parallel to sequential (summarized in the figure of the source paper). In particular, it allows the representation $\mathbf{x}^{l}_{t+1}$ to be formed from past representations of any layer $l'$, whereas in a standard Transformer this is only possible for $l > l'$. The change can be viewed as exposing all previous computations to all future computations, providing better representations of the input. Such capacity would allow much shallower models to capture the same level of abstraction as a deeper architecture.
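
The two equations above can be sketched in a few lines of PyTorch. This is an illustrative toy, not the authors' implementation: the single-head attention, the class name FeedbackMemoryAttention, and names such as d_model, mem_len, and memory_vector are assumptions made for this example.

```python
# Toy sketch of Feedback Memory (illustrative; not the reference implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedbackMemoryAttention(nn.Module):
    """Single-head attention that reads from a memory of past feedback vectors."""

    def __init__(self, d_model: int, num_layers: int, mem_len: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # One learnable scalar w^l per layer; l = 0 is the token embedding.
        self.layer_weights = nn.Parameter(torch.zeros(num_layers + 1))
        self.mem_len = mem_len  # tau: how many past memory vectors are kept
        self.d_model = d_model

    def memory_vector(self, layer_states: torch.Tensor) -> torch.Tensor:
        # layer_states: (num_layers + 1, d_model), the representations x_t^l of one time step.
        w = F.softmax(self.layer_weights, dim=0)             # Softmax(w^l) over layers
        return (w.unsqueeze(-1) * layer_states).sum(dim=0)   # m_t = sum_l Softmax(w^l) x_t^l

    def forward(self, x_t: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x_t: (d_model,) layer input at the current step;
        # memory: (tau, d_model), the past vectors [m_{t-tau}, ..., m_{t-1}].
        q = self.q(x_t)                                      # query from the current input
        k, v = self.k(memory), self.v(memory)                # keys/values from past memories
        attn = F.softmax(k @ q / self.d_model ** 0.5, dim=0)
        return attn @ v                                      # z_t^l = Attn(x_t^l, memory)


# Toy usage: one decoding step with 3 layers and a memory of 4 past positions.
d_model, num_layers, tau = 16, 3, 4
attn = FeedbackMemoryAttention(d_model, num_layers, tau)
memory = torch.randn(tau, d_model)                    # past memory vectors m_{t-4..t-1}
x_t = torch.randn(d_model)                            # current input to some layer l
z_t = attn(x_t, memory)                               # memory read at this layer
layer_states = torch.randn(num_layers + 1, d_model)   # x_t^0 ... x_t^L after the step
m_t = attn.memory_vector(layer_states)                # new memory vector for step t + 1
```

In the full Feedback Transformer, every layer at step $t$ reads from the same memory $[\mathbf{m}_{t-\tau}, \dots, \mathbf{m}_{t-1}]$, and $\mathbf{m}_{t}$ can only be appended once all layers have processed step $t$, which is what forces the sequential, token-by-token computation.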

Papers Using This Method

SPIRe: Boosting LLM Inference Throughput with Speculative Decoding (2025-04-08)
Iterative Feedback Network for Unsupervised Point Cloud Registration (2024-01-09)
Do You Know My Emotion? Emotion-Aware Strategy Recognition towards a Persuasive Dialogue System (2022-06-24)
Addressing Some Limitations of Transformers with Feedback Memory (2020-02-21)