TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Adaptive Attention Span in Transformers

Adaptive Attention Span in Transformers

Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

2019-05-19ACL 2019 7Language Modelling
PaperPDFCodeCodeCodeCode(official)CodeCodeCodeCode

Abstract

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 and enwiki8 by using a maximum context of 8k characters.

Results

TaskDatasetMetricValueModel
Language ModellingText8Bit per Character (BPC)1.0724L Transformer + 8K adaptive span
Language ModellingText8Bit per Character (BPC)1.1112L Transformer + 8K adaptive span
Language Modellingenwik8Bit per Character (BPC)0.98Transformer (24 layers, 8k adaptive span)
Language Modellingenwik8Bit per Character (BPC)1.02Transformer (12 layers, 8k adaptive span)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21Making Language Model a Hierarchical Classifier and Generator2025-07-17VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning2025-07-17The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations2025-07-17Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities2025-07-17Assay2Mol: large language model-based drug design using BioAssay context2025-07-16Describe Anything Model for Visual Question Answering on Text-rich Images2025-07-16InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing2025-07-16