Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Improving Knowledge Distillation via Regularizing Feature Norm and Direction

Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, Shu Kong

2023-05-26 · Knowledge Distillation · Domain Adaptation

Paper · PDF · Code (official)

Abstract

Knowledge distillation (KD) exploits a large, well-trained model (i.e., the teacher) to train a small student model on the same dataset for the same task. Treating teacher features as knowledge, prevailing KD methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features. While it is natural to believe that better alignment of student features to the teacher's better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy. In this work, we propose to align student features with the class-means of teacher features, where the class-mean naturally serves as a strong classifier. To this end, we explore baseline techniques such as a cosine-distance-based loss that encourages similarity between student features and their corresponding teacher class-means. Moreover, we train the student to produce large-norm features, inspired by other lines of work (e.g., model pruning and domain adaptation), which find large-norm features to be more significant. Finally, we propose a rather simple loss term (dubbed ND loss) that simultaneously (1) encourages the student to produce large-norm features, and (2) aligns the direction of student features with teacher class-means. Experiments on standard benchmarks demonstrate that our explored techniques help existing KD methods achieve better performance, i.e., higher classification accuracy on the ImageNet and CIFAR-100 datasets and higher detection precision on the COCO dataset. Importantly, our proposed ND loss helps the most, leading to state-of-the-art performance on these benchmarks. The source code is available at https://github.com/WangYZ1608/Knowledge-Distillation-via-ND.
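The two ingredients named in the abstract — encouraging large-norm student features and aligning their direction with the teacher's class-means — can be sketched as a single hinge on the student feature's projection onto the class-mean direction. The sketch below is an illustrative NumPy reimplementation, not the authors' code; the function name `nd_loss` and the exact hinge form are assumptions for demonstration (the paper's ND loss is defined on penultimate-layer features and combined with a standard KD objective).

```python
import numpy as np

def nd_loss(student_feats, teacher_class_means, labels):
    """Illustrative sketch of an ND-style loss (not the official code).

    For each sample, project the student feature onto the unit direction
    of its teacher class-mean, and penalize the projection for falling
    short of the class-mean's norm. A small projection means the student
    feature is either short (small norm) or misaligned (wrong direction),
    so the hinge jointly encourages large-norm, well-aligned features.
    """
    losses = []
    for f, y in zip(student_feats, labels):
        c = teacher_class_means[y]
        c_dir = c / np.linalg.norm(c)   # unit class-mean direction
        proj = float(f @ c_dir)         # signed length of f along c_dir
        losses.append(max(0.0, float(np.linalg.norm(c)) - proj))
    return float(np.mean(losses))
```

A student feature that is at least as long as its class-mean and points the same way incurs zero loss, which is why the single hinge captures both the norm and the direction objective at once.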

Results

Task | Dataset | Metric | Value | Model
Knowledge Distillation | CIFAR-100 | Top-1 Accuracy (%) | 77.93 | ReviewKD++ (T: resnet-32x4, S: shufflenet-v2)
Knowledge Distillation | CIFAR-100 | Top-1 Accuracy (%) | 77.68 | ReviewKD++ (T: resnet-32x4, S: shufflenet-v1)
Knowledge Distillation | CIFAR-100 | Top-1 Accuracy (%) | 76.28 | DKD++ (T: resnet-32x4, S: resnet-8x4)
Knowledge Distillation | CIFAR-100 | Top-1 Accuracy (%) | 75.66 | ReviewKD++ (T: WRN-40-2, S: WRN-40-1)
Knowledge Distillation | CIFAR-100 | Top-1 Accuracy (%) | 72.53 | KD++ (T: resnet56, S: resnet20)
Knowledge Distillation | CIFAR-100 | Top-1 Accuracy (%) | 70.82 | DKD++ (T: resnet50, S: mobilenet-v2)
Knowledge Distillation | COCO 2017 val | AP@0.5 | 61.8 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet50))
Knowledge Distillation | COCO 2017 val | AP@0.75 | 44.94 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet50))
Knowledge Distillation | COCO 2017 val | mAP | 41.03 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet50))
Knowledge Distillation | COCO 2017 val | AP@0.5 | 57.96 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet18))
Knowledge Distillation | COCO 2017 val | AP@0.75 | 40.15 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet18))
Knowledge Distillation | COCO 2017 val | mAP | 37.43 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (resnet18))
Knowledge Distillation | COCO 2017 val | AP@0.5 | 55.18 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (mobilenet-v2))
Knowledge Distillation | COCO 2017 val | AP@0.75 | 37.21 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (mobilenet-v2))
Knowledge Distillation | COCO 2017 val | mAP | 34.51 | ReviewKD++ (T: Faster R-CNN (resnet101), S: Faster R-CNN (mobilenet-v2))
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 83.6 | KD++ (T: RegNetY-16GF, S: ViT-B)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 79.15 | KD++ (T: resnet152, S: resnet101)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 77.48 | KD++ (T: resnet152, S: resnet50)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 75.53 | KD++ (T: resnet152, S: resnet34)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 72.96 | ReviewKD++ (T: resnet50, S: mobilenet-v1)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 72.54 | KD++ (T: resnet152, S: resnet18)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 72.54 | KD++ (T: resnet101, S: resnet18)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 72.53 | KD++ (T: resnet50, S: resnet18)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 72.07 | KD++ (T: resnet34, S: resnet18)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 71.84 | KD++ (T: ViT-B, S: resnet18)
Knowledge Distillation | ImageNet | Top-1 Accuracy (%) | 71.46 | KD++ (T: ViT-S, S: resnet18)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
A Privacy-Preserving Semantic-Segmentation Method Using Domain-Adaptation Technique (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
Domain Borders Are There to Be Crossed With Federated Few-Shot Adaptation (2025-07-14)
KAT-V1: Kwai-AutoThink Technical Report (2025-07-11)