Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Prune Once for All: Sparse Pre-Trained Language Models

Ofir Zafrir, Ariel Larey, Guy Boudoukh, Haihao Shen, Moshe Wasserblat

Published: 2021-11-10 · Tasks: Question Answering, Sentiment Analysis, Quantization, Natural Language Inference, Transfer Learning
Resources: Paper · PDF · Code (official)

Abstract

Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to improve the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method on three known architectures to create sparse pre-trained BERT-Base, BERT-Large, and DistilBERT models. We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
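The core idea described above — prune the pre-trained model once, then keep the resulting zero pattern fixed while fine-tuning on downstream tasks — can be sketched with plain NumPy. This is an illustrative sketch, not the authors' implementation: the helper name, the random weights, and the single SGD step are invented for the example.

```python
import numpy as np

def magnitude_prune_mask(weights, sparsity):
    """Return a boolean mask that keeps the largest-magnitude weights."""
    k = int(sparsity * weights.size)  # number of weights to zero out
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold

rng = np.random.default_rng(0)
w = rng.normal(size=(768, 768))          # stand-in for one encoder weight matrix
mask = magnitude_prune_mask(w, 0.90)     # prune once, at 90% sparsity
w = w * mask

# During downstream fine-tuning, the mask is re-applied after each update
# so pruned weights stay zero and the sparsity pattern is preserved:
grad = rng.normal(size=w.shape)          # placeholder gradient
w = (w - 0.01 * grad) * mask
```

In the paper the pruning is done during pre-training with distillation from a dense teacher; the sketch only shows the mask mechanics that let the sparsity pattern survive fine-tuning.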

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | SQuAD1.1 dev | EM | 83.35 | BERT-Large-uncased-PruneOFA (90% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | F1 | 90.2 | BERT-Large-uncased-PruneOFA (90% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | EM | 83.22 | BERT-Large-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | F1 | 90.02 | BERT-Large-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | EM | 81.1 | BERT-Base-uncased-PruneOFA (85% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | F1 | 88.42 | BERT-Base-uncased-PruneOFA (85% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | EM | 80.84 | BERT-Base-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | F1 | 88.24 | BERT-Base-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | EM | 79.83 | BERT-Base-uncased-PruneOFA (90% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | F1 | 87.25 | BERT-Base-uncased-PruneOFA (90% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | EM | 78.1 | DistilBERT-uncased-PruneOFA (85% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | F1 | 85.82 | DistilBERT-uncased-PruneOFA (85% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | EM | 77.03 | DistilBERT-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | F1 | 85.13 | DistilBERT-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | EM | 76.91 | DistilBERT-uncased-PruneOFA (90% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | F1 | 84.82 | DistilBERT-uncased-PruneOFA (90% unstruct sparse) |
| Question Answering | SQuAD1.1 dev | EM | 75.62 | DistilBERT-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Question Answering | SQuAD1.1 dev | F1 | 83.87 | DistilBERT-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Matched | 83.74 | BERT-Large-uncased-PruneOFA (90% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 84.2 | BERT-Large-uncased-PruneOFA (90% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Matched | 83.47 | BERT-Large-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 84.08 | BERT-Large-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Matched | 82.71 | BERT-Base-uncased-PruneOFA (85% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 83.67 | BERT-Base-uncased-PruneOFA (85% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Matched | 81.45 | BERT-Base-uncased-PruneOFA (90% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 82.43 | BERT-Base-uncased-PruneOFA (90% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Matched | 81.4 | BERT-Base-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 82.51 | BERT-Base-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Matched | 81.35 | DistilBERT-uncased-PruneOFA (85% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 82.03 | DistilBERT-uncased-PruneOFA (85% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Matched | 80.68 | DistilBERT-uncased-PruneOFA (90% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 81.47 | DistilBERT-uncased-PruneOFA (90% unstruct sparse) |
| Natural Language Inference | MultiNLI Dev | Matched | 80.66 | DistilBERT-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 81.14 | DistilBERT-uncased-PruneOFA (85% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Matched | 78.8 | DistilBERT-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
| Natural Language Inference | MultiNLI Dev | Mismatched | 80.4 | DistilBERT-uncased-PruneOFA (90% unstruct sparse, QAT Int8) |
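The 40X encoder compression quoted in the abstract follows from the sparsity and bit-width figures alone. A back-of-envelope check (a sketch that ignores any overhead for storing the sparsity pattern itself):

```python
# 90% of weights pruned, remaining weights quantized from FP32 to Int8
sparsity = 0.90
dense_bits, quant_bits = 32, 8

# Fraction of the dense-FP32 footprint that survives, then its inverse
ratio = 1 / ((1 - sparsity) * (quant_bits / dense_bits))
print(round(ratio))  # → 40
```

Each factor contributes independently: pruning shrinks storage 10X (only 10% of weights remain) and quantization a further 4X (8 of 32 bits), giving 10 × 4 = 40X.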
