Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weight Averaging Improves Knowledge Distillation under Domain Shift

Valeriy Berezovskiy, Nikita Morozov

2023-09-20 · Domain Generalization · Knowledge Distillation

Paper · PDF · Code (official)

Abstract

Knowledge distillation (KD) is a powerful model compression technique broadly used in practical deep learning applications. It focuses on training a small student network to mimic a larger teacher network. While it is widely known that KD can improve student generalization in the i.i.d. setting, its performance under domain shift, i.e. the performance of student networks on data from domains unseen during training, has received little attention in the literature. In this paper we take a step towards bridging the research fields of knowledge distillation and domain generalization. We show that weight averaging techniques proposed in the domain generalization literature, such as SWAD and SMA, also improve the performance of knowledge distillation under domain shift. In addition, we propose a simple weight averaging strategy that does not require evaluation on validation data during training, and show that it performs on par with SWAD and SMA when applied to KD. We name our final distillation approach Weight-Averaged Knowledge Distillation (WAKD).
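The two ingredients the abstract combines can be sketched in a few lines: a standard soft-label KD loss, plus a uniform running average of the student's weights that needs no validation data (in the spirit of SMA). This is a minimal illustrative sketch in numpy, not the paper's exact WAKD recipe; the function and class names are my own.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax at a given temperature.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Soft-label distillation loss: KL(teacher || student) at temperature T,
    # scaled by T^2 as is conventional so gradients stay comparable across T.
    p_t = softmax(teacher_logits, temperature)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits, temperature) + 1e-12)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

class WeightAverager:
    # Uniform running average of (flattened) student weights.
    # Unlike SWAD, no validation-based start/stop criterion is needed:
    # every checkpoint contributes with equal weight.
    def __init__(self, weights):
        self.avg = np.array(weights, dtype=float)
        self.n = 1

    def update(self, weights):
        # Incremental mean: avg <- avg + (w - avg) / n
        self.n += 1
        self.avg += (np.asarray(weights, dtype=float) - self.avg) / self.n
```

In a real training loop one would minimize `kd_loss` (possibly mixed with cross-entropy on hard labels), call `update` on the student's parameters every few steps, and evaluate the averaged weights on the held-out domains at the end.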

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Domain Adaptation | PACS | Average Accuracy | 87.6 | WAKD (DeiT-Ti) |
| Domain Adaptation | PACS | Average Accuracy | 86.6 | WAKD (ResNet-18) |
| Domain Adaptation | Office-Home | Average Accuracy | 70.5 | WAKD (DeiT-Ti) |
| Domain Adaptation | Office-Home | Average Accuracy | 66.7 | WAKD (ResNet-18) |
| Domain Generalization | PACS | Average Accuracy | 87.6 | WAKD (DeiT-Ti) |
| Domain Generalization | PACS | Average Accuracy | 86.6 | WAKD (ResNet-18) |
| Domain Generalization | Office-Home | Average Accuracy | 70.5 | WAKD (DeiT-Ti) |
| Domain Generalization | Office-Home | Average Accuracy | 66.7 | WAKD (ResNet-18) |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)