Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie
The Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique for creating Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its variants (e.g., BERT and ETC) on a wide spectrum of tasks including Masked Language Modeling, GLUE, SQuAD, Neural Machine Translation, WikiHop, HotpotQA, Natural Questions, and OpenKP. We also observe empirically that RealFormer stabilizes training and leads to models with sparser attention. Source code and pre-trained checkpoints for RealFormer can be found at https://github.com/google-research/google-research/tree/master/realformer.
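To make the residual-attention idea concrete, below is a minimal NumPy sketch (not the official implementation from the repository above): each layer computes standard scaled dot-product attention scores and adds the previous layer's raw, pre-softmax scores before normalizing. Projection matrices, multiple heads, layer norm, and the feed-forward block are omitted for brevity, and all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    """Scaled dot-product attention with a residual skip connection on the
    raw (pre-softmax) attention scores, in the spirit of RealFormer.

    q, k, v: arrays of shape (seq_len, d_head).
    prev_scores: raw attention scores from the previous layer, or None.
    Returns the attention output and the raw scores to pass to the next layer.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)      # standard scaled dot-product scores
    if prev_scores is not None:
        scores = scores + prev_scores  # residual edge over attention scores
    probs = softmax(scores, axis=-1)
    return probs @ v, scores

# Toy usage: thread the raw scores through a small stack of layers.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))   # (seq_len, d_head), single head for simplicity
prev = None
for _ in range(4):             # four "layers"; real models also apply projections, FFN, etc.
    x, prev = residual_attention(x, x, x, prev_scores=prev)
print(x.shape)                 # (8, 16)
```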
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Natural Language Inference | MultiNLI | Matched Accuracy | 86.28 | RealFormer |
| Natural Language Inference | MultiNLI | Mismatched Accuracy | 86.34 | RealFormer |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.9011 | RealFormer |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.8988 | RealFormer |
| Sentiment Analysis | SST-2 (Binary Classification) | Accuracy | 94.04 | RealFormer |
| Paraphrase Identification | Quora Question Pairs | Accuracy | 91.34 | RealFormer |
| Paraphrase Identification | Quora Question Pairs | F1 | 88.28 | RealFormer |