Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Sequence-Level Knowledge Distillation

Yoon Kim, Alexander M. Rush

Published: 2016-06-25 · EMNLP 2016
Tasks: Machine Translation, NMT, Translation, Knowledge Distillation
Links: Paper · PDF · Code (official) · additional community code implementations

Abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
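
The two objectives described in the abstract can be sketched as follows. This is a minimal PyTorch-style illustration, not the authors' released code; the tensor shapes are assumed to be (batch, time, vocab), and `teacher.beam_search` is a hypothetical decoding helper standing in for whatever beam-search routine the teacher model provides.

```python
# Minimal sketch (not the paper's implementation) of word-level and
# sequence-level knowledge distillation for NMT.
import torch
import torch.nn.functional as F


def word_level_kd_loss(student_logits, teacher_logits):
    """Word-level KD: train the student's per-token distribution
    p(y_t | y_<t, x) toward the teacher's soft distribution q(y_t | y_<t, x).
    Logits are assumed to have shape (batch, time, vocab)."""
    q = F.softmax(teacher_logits, dim=-1)          # teacher soft targets
    log_p = F.log_softmax(student_logits, dim=-1)  # student log-probs
    # Cross-entropy against the soft targets, averaged over batch and time.
    return -(q * log_p).sum(dim=-1).mean()


def sequence_level_kd_targets(teacher, src_batch, beam_size=5):
    """Sequence-level KD: decode the teacher (e.g. with beam search) and use
    its output sequences as hard training targets for the student.
    `teacher.beam_search` is a hypothetical helper, not a real API."""
    with torch.no_grad():
        return [teacher.beam_search(src, beam_size) for src in src_batch]
```

Under sequence-level KD the student is then trained with ordinary cross-entropy on the teacher-decoded sentences, which is what lets it be run with plain greedy decoding at inference time with little loss in BLEU.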

Results

Task                | Dataset                | Metric     | Value | Model
Machine Translation | IWSLT2015 Thai-English | BLEU score | 14.2  | Seq-KD + Seq-Inter + Word-KD
Machine Translation | WMT2014 English-German | BLEU score | 18.5  | Seq-KD + Seq-Inter + Word-KD

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
KAT-V1: Kwai-AutoThink Technical Report (2025-07-11)