Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu

2019-09-23 · Findings of the Association for Computational Linguistics 2020

Tasks: Question Answering · Paraphrase Identification · Sentiment Analysis · Natural Language Inference · Natural Language Understanding · Semantic Textual Similarity · Linguistic Acceptability · Knowledge Distillation · Language Modelling

Abstract

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines for BERT distillation, with only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERT-Base.
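The Transformer distillation objective the abstract describes combines three kinds of layer-wise transfer: matching the student's attention matrices to the teacher's, matching (projected) hidden states, and soft cross-entropy on the prediction logits. The NumPy sketch below illustrates that combined objective under stated assumptions: the layer mapping `layer_map`, the projection `W_h`, and all tensor shapes are illustrative, the embedding-layer term from the paper is omitted, and the loss terms are summed without per-term weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse(a, b):
    """Mean squared error between two tensors of the same shape."""
    return float(np.mean((a - b) ** 2))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Soft-label cross-entropy for prediction-layer distillation."""
    p_teacher = softmax(teacher_logits / temperature)
    log_p_student = np.log(softmax(student_logits / temperature) + 1e-12)
    return float(-np.mean(np.sum(p_teacher * log_p_student, axis=-1)))

def transformer_distillation_loss(
    student_attn, teacher_attn,      # lists of [heads, seq, seq] attention matrices
    student_hidden, teacher_hidden,  # lists of [seq, d_student] / [seq, d_teacher]
    student_logits, teacher_logits,  # [batch, num_labels]
    W_h,                             # [d_student, d_teacher] learned projection
    layer_map,                       # student layer m -> teacher layer g(m)
):
    """Sum of attention, hidden-state, and prediction-layer distillation terms
    (simplified: no embedding-layer term, no per-layer weighting)."""
    loss = 0.0
    for m, n in layer_map.items():
        loss += mse(student_attn[m], teacher_attn[n])            # attention transfer
        loss += mse(student_hidden[m] @ W_h, teacher_hidden[n])  # hidden-state transfer
    loss += soft_cross_entropy(student_logits, teacher_logits)   # prediction transfer
    return loss

# Toy example: a 1-layer student distilled from layer 2 of a 3-layer teacher.
seq, d_s, d_t = 4, 8, 16
s_attn = [rng.standard_normal((2, seq, seq))]
t_attn = [rng.standard_normal((2, seq, seq)) for _ in range(3)]
s_hid = [rng.standard_normal((seq, d_s))]
t_hid = [rng.standard_normal((seq, d_t)) for _ in range(3)]
W_h = rng.standard_normal((d_s, d_t))
s_logits = rng.standard_normal((1, 2))
t_logits = rng.standard_normal((1, 2))
loss = transformer_distillation_loss(
    s_attn, t_attn, s_hid, t_hid, s_logits, t_logits, W_h, layer_map={0: 2}
)
```

In the two-stage framework, the same objective is applied twice: once on a general corpus to produce a task-agnostic student, and again on task-specific data (with a fine-tuned teacher) during downstream learning.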

Results

Task | Dataset | Metric | Value | Model
Question Answering | SQuAD1.1 dev | EM | 79.7 | TinyBERT-6 (67M)
Question Answering | SQuAD1.1 dev | F1 | 87.5 | TinyBERT-6 (67M)
Question Answering | SQuAD2.0 dev | EM | 69.9 | TinyBERT-6 (67M)
Question Answering | SQuAD2.0 dev | F1 | 73.4 | TinyBERT-6 (67M)
Natural Language Inference | MultiNLI Dev | Matched | 84.5 | TinyBERT-6 (67M)
Natural Language Inference | MultiNLI Dev | Mismatched | 84.5 | TinyBERT-6 (67M)
Natural Language Inference | MultiNLI | Matched | 84.6 | TinyBERT-6 (67M)
Natural Language Inference | MultiNLI | Mismatched | 83.2 | TinyBERT-6 (67M)
Natural Language Inference | MultiNLI | Matched | 82.5 | TinyBERT-4 (14.5M)
Natural Language Inference | MultiNLI | Mismatched | 81.8 | TinyBERT-4 (14.5M)
Semantic Textual Similarity | MRPC Dev | Accuracy | 86.3 | TinyBERT-6 (67M)
Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.799 | TinyBERT-4 (14.5M)
Semantic Textual Similarity | Quora Question Pairs | F1 | 71.3 | TinyBERT
Sentiment Analysis | SST-2 Binary classification | Accuracy | 93.1 | TinyBERT-6 (67M)
Sentiment Analysis | SST-2 Binary classification | Accuracy | 92.6 | TinyBERT-4 (14.5M)
Paraphrase Identification | Quora Question Pairs | F1 | 71.3 | TinyBERT
Linguistic Acceptability | CoLA Dev | Accuracy | 54 | TinyBERT-6 (67M)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)