Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

2020-06-05 · ICLR 2021

Tasks: Reading Comprehension · Question Answering · Math Word Problem Solving · Sentence Completion · Sentiment Analysis · Coreference Resolution · Natural Language Inference · Common Sense Reasoning · Natural Language Understanding · Semantic Textual Similarity · Linguistic Acceptability · Named Entity Recognition (NER) · Word Sense Disambiguation

Abstract

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8).
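The disentangled attention described in the abstract sums three terms: content-to-content, content-to-position, and position-to-content, each computed with separate projection matrices, and scales by sqrt(3d) because three score terms are added. The sketch below is an illustrative NumPy reconstruction of those score equations, not the official implementation (which adds multi-head splitting, masking, softmax, and bucketed relative distances); all names and shapes here are assumptions for exposition.

```python
import numpy as np

def disentangled_attention_scores(H, P, Wq_c, Wk_c, Wq_r, Wk_r, rel_idx):
    """Single-head DeBERTa-style disentangled attention scores (sketch).

    H:       (n, d)      content vectors for n tokens
    P:       (2k, d)     relative-position embeddings, buckets in [0, 2k)
    W*:      (d, d)      content (c) and relative-position (r) projections
    rel_idx: (n, n) int  rel_idx[i, j] = bucket index of relative distance (i, j)
    """
    d = H.shape[1]
    Qc, Kc = H @ Wq_c, H @ Wk_c          # content query/key
    Qr, Kr = P @ Wq_r, P @ Wk_r          # relative-position query/key

    # content-to-content: standard QK^T term
    c2c = Qc @ Kc.T
    # content-to-position: query content vs. key's relative position
    c2p = np.take_along_axis(Qc @ Kr.T, rel_idx, axis=1)
    # position-to-content: query's relative position vs. key content
    p2c = np.take_along_axis(Kc @ Qr.T, rel_idx, axis=1).T

    # three additive terms -> scale by sqrt(3d) instead of sqrt(d)
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```

A softmax over the last axis of these scores would then produce the attention weights; the absolute-position information is deliberately absent here, since DeBERTa only injects it later via the enhanced mask decoder.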

Results

Task | Dataset | Metric | Value | Model
Reading Comprehension | RACE | Accuracy | 86.8 | DeBERTa-large
Question Answering | COPA | Accuracy | 98.4 | DeBERTa-Ensemble
Question Answering | COPA | Accuracy | 96.8 | DeBERTa-1.5B
Question Answering | MultiRC | EM | 63.7 | DeBERTa-1.5B
Question Answering | MultiRC | F1 | 88.2 | DeBERTa-1.5B
Question Answering | BoolQ | Accuracy | 90.4 | DeBERTa-1.5B
Question Answering | SQuAD 2.0 | EM | 88 | DeBERTa-large
Question Answering | SQuAD 2.0 | F1 | 90.7 | DeBERTa-large
Question Answering | ParaMAWPS | Accuracy (%) | 74.1 | DeBERTa
Common Sense Reasoning | SWAG | Test | 90.8 | DeBERTa-large
Common Sense Reasoning | ReCoRD | EM | 94.1 | DeBERTa-1.5B
Common Sense Reasoning | ReCoRD | F1 | 94.5 | DeBERTa-1.5B
Word Sense Disambiguation | Words in Context | Accuracy | 77.5 | DeBERTa-Ensemble
Word Sense Disambiguation | Words in Context | Accuracy | 76.4 | DeBERTa-1.5B
Natural Language Inference | WNLI | Accuracy | 94.5 | DeBERTa
Natural Language Inference | CommitmentBank | Accuracy | 97.2 | DeBERTa-1.5B
Natural Language Inference | CommitmentBank | F1 | 94.9 | DeBERTa-1.5B
Natural Language Inference | MultiNLI | Matched | 91.1 | DeBERTa (large)
Natural Language Inference | MultiNLI | Mismatched | 91.1 | DeBERTa (large)
Semantic Textual Similarity | STS Benchmark | Accuracy | 92.5 | DeBERTa (large)
Sentiment Analysis | SST-2 Binary classification | Accuracy | 96.5 | DeBERTa (large)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 95.9 | DeBERTa-1.5B
Linguistic Acceptability | CoLA Dev | Accuracy | 69.5 | DeBERTa (large)
Math Word Problem Solving | ParaMAWPS | Accuracy (%) | 74.1 | DeBERTa
Mathematical Question Answering | ParaMAWPS | Accuracy (%) | 74.1 | DeBERTa
Mathematical Reasoning | ParaMAWPS | Accuracy (%) | 74.1 | DeBERTa
Sentence Completion | HellaSwag | Accuracy | 93 | DeBERTa++

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)