Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

Published: 2019-10-02 (NeurIPS 2019)

Tasks: Question Answering, Hate Speech Detection, Only Connect Walls Dataset Task 1 (Grouping), Sentiment Analysis, Natural Language Inference, Transfer Learning, Semantic Textual Similarity, Linguistic Acceptability, Knowledge Distillation, Language Modelling

Abstract

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
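The abstract describes a triple loss combining a masked-language-modeling (cross-entropy) term, a distillation term, and a cosine-distance term between student and teacher representations. The following is a minimal pure-Python sketch of that idea, not the paper's implementation: the weights `alpha`, `beta`, `gamma` and the temperature value are illustrative placeholders, and real training would operate on tensors over a full vocabulary rather than short lists.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mlm_loss(student_logits, target_index):
    """Masked-language-modeling term: cross-entropy against the true token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation term: KL divergence between the softened teacher
    distribution and the softened student distribution."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))

def cosine_loss(student_hidden, teacher_hidden):
    """Cosine-distance term: 1 - cosine similarity between the student's
    and teacher's hidden-state vectors."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    ns = math.sqrt(sum(a * a for a in student_hidden))
    nt = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (ns * nt)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three training signals (weights are illustrative)."""
    return (alpha * mlm_loss(student_logits, target_index)
            + beta * distillation_loss(student_logits, teacher_logits)
            + gamma * cosine_loss(student_hidden, teacher_hidden))
```

Two sanity properties fall out of the definitions: the distillation term vanishes when student and teacher logits agree, and the cosine term vanishes when the hidden states point in the same direction, so the loss rewards the student for matching both the teacher's output distribution and its internal geometry.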

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Question Answering | SQuAD1.1 dev | EM | 77.7 | DistilBERT |
| Question Answering | SQuAD1.1 dev | F1 | 85.8 | DistilBERT 66M |
| Question Answering | MultiTQ | Hits@1 | 8.3 | DistilBERT |
| Question Answering | MultiTQ | Hits@10 | 48.4 | DistilBERT |
| Natural Language Inference | WNLI | Accuracy | 44.4 | DistilBERT 66M |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.907 | DistilBERT 66M |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 91.3 | DistilBERT 66M |
| Sentiment Analysis | IMDb | Accuracy | 92.82 | DistilBERT 66M |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)