Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the plentiful knowledge encoded in a large teacher BERT can be effectively transferred to a small student TinyBERT. We then introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERTBASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT with 4 layers is also significantly better than 4-layer state-of-the-art baselines for BERT distillation, using only about 28% of their parameters and about 31% of their inference time. Moreover, TinyBERT with 6 layers performs on par with its teacher BERTBASE.
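The abstract describes the Transformer distillation objective only at a high level. As a rough illustration, the sketch below shows the kind of layer-wise losses such a method combines: matching teacher and student attention matrices, matching hidden states through a learned projection (the student uses a smaller hidden size than the teacher), and a soft-label loss on the output logits. All tensor shapes, the `proj` module, the temperature parameter, and the layer mapping are assumptions made for illustration, not the authors' exact implementation.

```python
# Illustrative sketch (PyTorch) of layer-wise Transformer distillation losses.
# Shapes, module names, and the layer mapping are assumptions, not the paper's code.
import torch.nn as nn
import torch.nn.functional as F


def attention_loss(student_attn, teacher_attn):
    """MSE between student and teacher attention score matrices.

    Both tensors are assumed to be (batch, heads, seq_len, seq_len);
    the mean reduction of mse_loss also averages over attention heads.
    """
    return F.mse_loss(student_attn, teacher_attn)


def hidden_loss(student_hidden, teacher_hidden, proj):
    """MSE between projected student hidden states and teacher hidden states.

    `proj` is a learnable nn.Linear(d_student, d_teacher) bridging the
    student's smaller hidden size to the teacher's.
    """
    return F.mse_loss(proj(student_hidden), teacher_hidden)


def prediction_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-label loss on the output logits (KL form of soft cross-entropy)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


# Usage sketch: a 4-layer student distilled from a 12-layer teacher, with an
# assumed mapping in which every third teacher layer supervises one student layer.
d_student, d_teacher = 312, 768          # assumed hidden sizes
proj = nn.Linear(d_student, d_teacher)   # projection for hidden-state matching
```

In such a setup, the total distillation loss for a student layer would sum the attention and hidden-state terms over the mapped teacher layers, with the prediction-layer loss applied only during task-specific distillation.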
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | SQuAD1.1 dev | EM | 79.7 | TinyBERT-6 67M |
| Question Answering | SQuAD1.1 dev | F1 | 87.5 | TinyBERT-6 67M |
| Question Answering | SQuAD2.0 dev | EM | 69.9 | TinyBERT-6 67M |
| Question Answering | SQuAD2.0 dev | F1 | 73.4 | TinyBERT-6 67M |
| Natural Language Inference | MultiNLI Dev | Matched | 84.5 | TinyBERT-6 67M |
| Natural Language Inference | MultiNLI Dev | Mismatched | 84.5 | TinyBERT-6 67M |
| Natural Language Inference | MultiNLI | Matched | 84.6 | TinyBERT-6 67M |
| Natural Language Inference | MultiNLI | Mismatched | 83.2 | TinyBERT-6 67M |
| Natural Language Inference | MultiNLI | Matched | 82.5 | TinyBERT-4 14.5M |
| Natural Language Inference | MultiNLI | Mismatched | 81.8 | TinyBERT-4 14.5M |
| Semantic Textual Similarity | MRPC Dev | Accuracy | 86.3 | TinyBERT-6 67M |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.799 | TinyBERT-4 14.5M |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 93.1 | TinyBERT-6 67M |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 92.6 | TinyBERT-4 14.5M |
| Paraphrase Identification | Quora Question Pairs | F1 | 71.3 | TinyBERT |
| Linguistic Acceptability | CoLA Dev | Matthews Correlation | 54 | TinyBERT-6 67M |