Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Weight Averaging Improves Knowledge Distillation under Domain Shift

Valeriy Berezovskiy, Nikita Morozov

2023-09-20 · Domain Generalization · Knowledge Distillation

Paper · PDF · Code (official)

Abstract

Knowledge distillation (KD) is a powerful model compression technique broadly used in practical deep learning applications. It focuses on training a small student network to mimic a larger teacher network. While it is widely known that KD can improve student generalization in the i.i.d. setting, its performance under domain shift, i.e. the performance of student networks on data from domains unseen during training, has received little attention in the literature. In this paper we take a step towards bridging the research fields of knowledge distillation and domain generalization. We show that weight averaging techniques proposed in the domain generalization literature, such as SWAD and SMA, also improve the performance of knowledge distillation under domain shift. In addition, we propose a simple weight averaging strategy that does not require evaluation on validation data during training, and show that it performs on par with SWAD and SMA when applied to KD. We name our final distillation approach Weight-Averaged Knowledge Distillation (WAKD).
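The two ingredients the abstract combines can be sketched in a few lines: a standard soft-label KD loss, plus a uniform running average of the student's weights that needs no validation data (in the spirit of SMA). This is a minimal illustrative sketch in numpy, not the paper's exact WAKD recipe; the function and class names are my own.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax at a given temperature.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    # Soft-label distillation loss: KL(teacher || student) at temperature T,
    # scaled by T^2 as is conventional so gradients stay comparable across T.
    p_t = softmax(teacher_logits, temperature)
    log_p_t = np.log(p_t + 1e-12)
    log_p_s = np.log(softmax(student_logits, temperature) + 1e-12)
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=-1)
    return float(kl.mean() * temperature ** 2)

class WeightAverager:
    # Uniform running average of (flattened) student weights.
    # Unlike SWAD, no validation-based start/stop criterion is needed:
    # every checkpoint contributes with equal weight.
    def __init__(self, weights):
        self.avg = np.array(weights, dtype=float)
        self.n = 1

    def update(self, weights):
        # Incremental mean: avg <- avg + (w - avg) / n
        self.n += 1
        self.avg += (np.asarray(weights, dtype=float) - self.avg) / self.n
```

In a real training loop one would minimize `kd_loss` (possibly mixed with cross-entropy on hard labels), call `update` on the student's parameters every few steps, and evaluate the averaged weights on the held-out domains at the end.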

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Domain Adaptation | PACS | Average Accuracy | 87.6 | WAKD (DeiT-Ti) |
| Domain Adaptation | PACS | Average Accuracy | 86.6 | WAKD (ResNet-18) |
| Domain Adaptation | Office-Home | Average Accuracy | 70.5 | WAKD (DeiT-Ti) |
| Domain Adaptation | Office-Home | Average Accuracy | 66.7 | WAKD (ResNet-18) |
| Domain Generalization | PACS | Average Accuracy | 87.6 | WAKD (DeiT-Ti) |
| Domain Generalization | PACS | Average Accuracy | 86.6 | WAKD (ResNet-18) |
| Domain Generalization | Office-Home | Average Accuracy | 70.5 | WAKD (DeiT-Ti) |
| Domain Generalization | Office-Home | Average Accuracy | 66.7 | WAKD (ResNet-18) |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Simulate, Refocus and Ensemble: An Attention-Refocusing Scheme for Domain Generalization (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
MoTM: Towards a Foundation Model for Time Series Imputation based on Continuous Modeling (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)