
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Addressing Some Limitations of Transformers with Feedback Memory

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar

2020-02-21 | Machine Translation, Reinforcement Learning, Translation, Language Modelling

Abstract

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The representation at a given layer can only access representations from lower layers, rather than the higher level representations already available. In this work, we propose the Feedback Transformer architecture that exposes all previous representations to all future representations, meaning the lowest representation of the current timestep is formed from the highest-level abstract representation of the past. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
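The feedback idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that the layer outputs of each timestep are merged into a single memory vector by a softmax-weighted sum, and that every layer of the next timestep attends over that shared memory. The names `attend`, `feedback_forward`, and `layer_logits` are hypothetical. Learned projections and the feed-forward sublayer are omitted, with `tanh` of a residual sum standing in for them, and the first timestep (empty memory) simply receives a zero context.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Single-head dot-product attention of one query over past memory vectors."""
    if not memory:
        # Assumption: with no past steps, the attention context is all zeros.
        return [0.0] * len(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, m)) / scale for m in memory]
    weights = softmax(scores)
    return [sum(w * m[i] for w, m in zip(weights, memory))
            for i in range(len(query))]


def feedback_forward(inputs, num_layers, layer_logits):
    """Process a sequence step by step with one memory shared by all layers.

    inputs:       list of input vectors, one per timestep
    layer_logits: num_layers + 1 scores (input plus each layer) that weight
                  how much each representation contributes to the memory
    """
    layer_weights = softmax(layer_logits)
    memory, outputs = [], []
    for x in inputs:
        states = [x]  # states[0] is the input, states[l] the output of layer l
        for _ in range(num_layers):
            # Every layer, including the lowest, reads the same memory, which
            # already contains the top-layer information of past timesteps.
            context = attend(states[-1], memory)
            states.append([math.tanh(h + c) for h, c in zip(states[-1], context)])
        # Merge all representations of this timestep into one memory vector.
        merged = [sum(w * s[i] for w, s in zip(layer_weights, states))
                  for i in range(len(x))]
        memory.append(merged)
        outputs.append(states[-1])
    return outputs, memory


# Usage: five timesteps of 4-dimensional inputs, two layers, uniform weights.
sequence = [[0.1 * t + 0.01 * d for d in range(4)] for t in range(5)]
outputs, memory = feedback_forward(sequence, num_layers=2, layer_logits=[0.0, 0.0, 0.0])
```

Because the memory for step t depends on the top layer of step t, the timesteps must be computed one after another. This is the loss of parallelism that the abstract trades for the richer representation.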

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Language Modelling | Penn Treebank (Character Level) | Bit per Character (BPC) | 1.16 | Feedback Transformer |
| Language Modelling | WikiText-103 | Test perplexity | 18.2 | Feedback Transformer (8 layers) |
| Language Modelling | WikiText-103 | Validation perplexity | 17.5 | Feedback Transformer (8 layers) |
| Language Modelling | WikiText-103 | Test perplexity | 22.4 | Feedback Transformer (4 layers) |
| Language Modelling | WikiText-103 | Validation perplexity | 21.4 | Feedback Transformer (4 layers) |
| Language Modelling | enwik8 | Bit per Character (BPC) | 0.96 | Feedback Transformer |
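Both metrics reported here are transformations of the average negative log-likelihood a model assigns to held-out text, and lower is better for each. The sketch below shows the standard definitions. The function names are hypothetical, and the inputs are assumed to be the probabilities the model assigned to each correct character or token.

```python
import math


def bits_per_character(char_probs):
    """Average negative log2-probability of the correct characters."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)


def perplexity(token_probs):
    """Exponential of the average negative natural-log probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that gives every correct character probability 0.5 scores 1 BPC.
bpc = bits_per_character([0.5, 0.5, 0.5])

# A model that gives every correct token probability 0.25 has a perplexity
# of about 4, i.e. it is as uncertain as a uniform choice among 4 tokens.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```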
