Ofir Zafrir, Guy Boudoukh, Peter Izsak, Moshe Wasserblat
Recently, pre-trained Transformer-based language models such as BERT and GPT have shown great improvements on many Natural Language Processing (NLP) tasks. However, these models contain a large number of parameters. The emergence of even larger and more accurate models such as GPT-2 and Megatron suggests a trend toward ever-larger pre-trained Transformer models. However, deploying these large models in production environments is a complex task requiring large amounts of compute, memory, and power resources. In this work we show how to perform quantization-aware training during the fine-tuning phase of BERT in order to compress BERT by $4\times$ (storing weights as 8-bit integers instead of 32-bit floats) with minimal accuracy loss. Furthermore, the produced quantized model can accelerate inference speed when it is optimized for hardware that supports 8-bit integer operations.
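The core mechanism behind quantization-aware training is fake quantization: the forward pass simulates 8-bit symmetric linear quantization so the model learns to tolerate rounding error, while the backward pass uses a straight-through estimator so full-precision weights keep receiving gradients. The sketch below is a minimal illustration of that idea, not the paper's implementation; the function name `fake_quantize` and the simple per-tensor max scaling are assumptions made for the example (for activations, the paper tracks statistics such as a moving maximum during training).

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Quantize-dequantize a tensor with symmetric linear quantization.

    Illustrative sketch only: per-tensor max scaling is an assumption,
    not necessarily the scheme used in the paper for every tensor.
    """
    qmax = 2 ** (num_bits - 1) - 1                       # 127 for 8 bits
    scale = qmax / x.detach().abs().max().clamp(min=1e-8)
    # Forward pass sees the 8-bit rounding error.
    x_q = torch.clamp(torch.round(x * scale), -qmax, qmax) / scale
    # Straight-through estimator: gradients flow as if quantization
    # were the identity, so the full-precision weights keep training.
    return x + (x_q - x).detach()

# Toy usage: quantized weights in the forward pass, FP32 gradient updates.
w = torch.randn(4, 4, requires_grad=True)
loss = fake_quantize(w).sum()
loss.backward()                                          # grad reaches w
```

After fine-tuning this way, the learned weights can be stored and executed as true 8-bit integers, which is where the $4\times$ compression and the inference speedup on 8-bit-capable hardware come from.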
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Natural Language Inference | QNLI | Accuracy | 93 | Q8BERT (Zafrir et al., 2019) |
| Natural Language Inference | RTE | Accuracy | 84.8 | Q8BERT (Zafrir et al., 2019) |
| Natural Language Inference | MultiNLI | Matched Accuracy | 85.6 | Q8BERT (Zafrir et al., 2019) |
| Semantic Textual Similarity | MRPC | Accuracy | 89.7 | Q8BERT (Zafrir et al., 2019) |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.911 | Q8BERT (Zafrir et al., 2019) |
| Sentiment Analysis | SST-2 | Accuracy | 94.7 | Q8BERT (Zafrir et al., 2019) |
| Linguistic Acceptability | CoLA | Accuracy | 65 | Q8BERT (Zafrir et al., 2019) |