TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/Knowledge Distillation Based on Transformed Teacher Matching

Knowledge Distillation Based on Transformed Teacher Matching

Kaixiang Zheng, En-hui Yang

2024-02-17Knowledge Distillation
PaperPDFCode(official)

Abstract

As a technique to bridge logit matching and probability distribution matching, temperature scaling plays a pivotal role in knowledge distillation (KD). Conventionally, temperature scaling is applied to both teacher's logits and student's logits in KD. Motivated by some recent works, in this paper, we drop instead temperature scaling on the student side, and systematically study the resulting variant of KD, dubbed transformed teacher matching (TTM). By reinterpreting temperature scaling as a power transform of probability distribution, we show that in comparison with the original KD, TTM has an inherent R\'enyi entropy term in its objective function, which serves as an extra regularization term. Extensive experiment results demonstrate that thanks to this inherent regularization, TTM leads to trained students with better generalization than the original KD. To further enhance student's capability to match teacher's power transformed probability distribution, we introduce a sample-adaptive weighting coefficient into TTM, yielding a novel distillation approach dubbed weighted TTM (WTTM). It is shown, by comprehensive experiments, that although WTTM is simple, it is effective, improves upon TTM, and achieves state-of-the-art accuracy performance. Our source code is available at https://github.com/zkxufo/TTM.

Results

TaskDatasetMetricValueModel
Knowledge DistillationImageNetTop-1 accuracy %77.03WTTM (T: DeiT III-Small S:DeiT-Tiny)
Knowledge DistillationImageNetTop-1 accuracy %73.09WTTM (T:resnet50, S:mobilenet-v1)
Knowledge DistillationImageNetTop-1 accuracy %72.19WTTM (T: ResNet-34 S:ResNet-18)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment2025-07-21Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces2025-07-17DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition2025-07-16HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training2025-07-15Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning2025-07-14KAT-V1: Kwai-AutoThink Technical Report2025-07-11Towards Collaborative Fairness in Federated Learning Under Imbalanced Covariate Shift2025-07-11SFedKD: Sequential Federated Learning with Discrepancy-Aware Multi-Teacher Knowledge Distillation2025-07-11