TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Augmenting Self-attention with Persistent Memory

Augmenting Self-attention with Persistent Memory

Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

2019-07-02TranslationLanguage Modelling
PaperPDFCodeCode

Abstract

Transformer networks have lead to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long term dependencies and are often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely consists of attention layers. More precisely, we augment the self-attention layers with persistent memory vectors that play a similar role as the feed-forward layer. Thanks to these vectors, we can remove the feed-forward layer without degrading the performance of a transformer. Our evaluation shows the benefits brought by our model on standard character and word level language modeling benchmarks.

Results

TaskDatasetMetricValueModel
Language ModellingWikiText-103Test perplexity20.6All-attention network (36 layers)
Language ModellingWikiText-103Validation perplexity19.7All-attention network (36 layers)
Language ModellingText8Bit per Character (BPC)1.08All-attention network - 36 layers
Language ModellingText8Bit per Character (BPC)1.11All-attention network - 18 layers
Language Modellingenwik8Bit per Character (BPC)1.01All-attention network (18 layers)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21A Translation of Probabilistic Event Calculus into Markov Decision Processes2025-07-17Making Language Model a Hierarchical Classifier and Generator2025-07-17VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations2025-07-17Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities2025-07-17Assay2Mol: large language model-based drug design using BioAssay context2025-07-16Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16