Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


ALiBi

Attention with Linear Biases

General · Introduced 2021 · 19 papers
Source Paper

Description

ALiBi, or Attention with Linear Biases, is a positioning method that allows Transformer language models to consume, at inference time, sequences which are longer than the ones they were trained on.

ALiBi does this without using actual position embeddings. Instead, when computing the attention between a given key and query, ALiBi penalizes the attention score that the query can assign to the key according to how far apart they are: when the key and query are close, the penalty is very low, and when they are far apart, the penalty is very high.
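The distance-proportional penalty described above can be sketched in a few lines. This is a minimal NumPy illustration, not the reference implementation: the function names are mine, and the slope formula is the power-of-two geometric sequence described in the source paper (2^(-8/n), 2^(-16/n), … for n heads).

```python
import numpy as np

def alibi_slopes(num_heads: int) -> list:
    # Head-specific slopes: for num_heads a power of two, the geometric
    # sequence 2^(-8/n), 2^(-16/n), ..., so different heads penalize
    # distance at different rates.
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    # bias[h, i, j] = -slope_h * (i - j): zero on the diagonal and
    # increasingly negative the further key j lies behind query i.
    pos = np.arange(seq_len)
    dist = pos[None, :] - pos[:, None]   # j - i: negative for past keys
    dist = np.minimum(dist, 0)           # future keys get no bias here
    slopes = np.asarray(alibi_slopes(num_heads))
    return slopes[:, None, None] * dist[None, :, :]

# The bias is simply added to the pre-softmax attention scores for each
# head h (future positions are still masked out by the usual causal mask):
#   scores = q @ k.T / sqrt(d) + alibi_bias(num_heads, seq_len)[h]
```

Because the bias is a fixed function of position rather than a learned embedding, the same formula applies unchanged to sequence lengths never seen in training, which is what enables extrapolation.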

This method was motivated by the simple reasoning that nearby words matter much more than distant ones.

This method is as fast as the sinusoidal or absolute embedding methods (the fastest positioning methods there are). It outperforms those methods, as well as rotary embeddings, when evaluating sequences longer than the ones the model was trained on (a setting known as extrapolation).

Papers Using This Method

A standard transformer and attention with linear biases for molecular conformer generation (2025-06-24)
SeqPE: Transformer with Sequential Position Encoding (2025-06-16)
Context-aware Biases for Length Extrapolation (2025-03-11)
zScore: A Universal Decentralised Reputation System for the Blockchain Economy (2025-02-17)
Linear Recency Bias During Training Improves Transformers' Fit to Reading Times (2024-09-17)
Towards Inducing Document-Level Abilities in Standard Multilingual Neural Machine Translation Models (2024-08-21)
Mitigate Position Bias in Large Language Models via Scaling a Single Dimension (2024-06-04)
Can Perplexity Reflect Large Language Model's Ability in Long Text Understanding? (2024-05-09)
MEP: Multiple Kernel Learning Enhancing Relative Positional Encoding Length Extrapolation (2024-03-26)
Audiobox: Unified Audio Generation with Natural Language Prompts (2023-12-25)
ScorePerformer: Expressive Piano Performance Rendering With Fine-Grained Control (2023-11-04)
HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding (2023-10-30)
BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model (2023-09-20)
SlimPajama-DC: Understanding Data Combinations for LLM Training (2023-09-19)
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (2023-06-23)
The Impact of Positional Encoding on Length Generalization in Transformers (2023-05-31)
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech (2023-02-08)
Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis (2022-12-20)
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (2021-08-27)