Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, Jeff Dean

2015-03-09 | Knowledge Distillation

Abstract

A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
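The core idea in the abstract, averaging an ensemble's predictions and then training a single model to reproduce them, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the logits are made-up numbers, the helper names (`softmax`, `ensemble_average`, `cross_entropy`) are hypothetical, and the `temperature` parameter reflects the softened softmax used in the full paper rather than anything stated in the abstract itself. It shows only how a distillation loss for one example might be computed, not a training loop.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_average(per_model_probs):
    """Average class probabilities across all ensemble members."""
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy of the predicted distribution against soft targets."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))


# Hypothetical logits from three ensemble members for one 3-class example.
ensemble_logits = [
    [2.0, 1.0, 0.1],
    [1.8, 1.2, 0.0],
    [2.2, 0.7, 0.3],
]
T = 2.0  # assumed temperature; not given in the abstract

# Soft targets: the ensemble's averaged (softened) predictions.
soft_targets = ensemble_average([softmax(l, T) for l in ensemble_logits])

# Hypothetical logits from the single, smaller model being trained.
student_logits = [1.5, 0.9, 0.2]

# The single model is trained to minimise this loss over the training set.
distill_loss = cross_entropy(soft_targets, softmax(student_logits, T))
print("soft targets:", soft_targets)
print("distillation loss:", distill_loss)
```

In practice the loss would be summed over a dataset and minimised with gradient descent in a framework of choice; the pure-Python version above is only meant to make the data flow explicit.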

Results

Task	Dataset	Metric	Value	Model
Knowledge Distillation	CIFAR-100	Top-1 Accuracy (%)	73.33	resnet8x4 (T: resnet32x4 S: resnet8x4)
Knowledge Distillation	CIFAR-100	Top-1 Accuracy (%)	72.98	vgg8 (T: vgg13 S: vgg8)
Knowledge Distillation	CIFAR-100	Top-1 Accuracy (%)	70.67	resnet110 (T: resnet110 S: resnet20)
Knowledge Distillation	ImageNet	Top-1 Accuracy (%)	77.14	ADLIK-MO (T: ResNet101 S: ResNet50)
Knowledge Distillation	ImageNet	Top-1 Accuracy (%)	70.66	KD (T: ResNet-34 S: ResNet-18)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
KAT-V1: Kwai-AutoThink Technical Report (2025-07-11)
Towards Collaborative Fairness in Federated Learning Under Imbalanced Covariate Shift (2025-07-11)
SFedKD: Sequential Federated Learning with Discrepancy-Aware Multi-Teacher Knowledge Distillation (2025-07-11)