Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Sequence-Level Knowledge Distillation

Yoon Kim, Alexander M. Rush

Published: 2016-06-25 · EMNLP 2016
Tasks: Machine Translation, NMT, Translation, Knowledge Distillation
Links: Paper · PDF · Code (official) · additional community code implementations

Abstract

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and also introduce two novel sequence-level versions of knowledge distillation that further improve performance, and somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
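
The two objectives described in the abstract can be sketched as follows. This is a minimal PyTorch-style illustration, not the authors' released code; the tensor shapes are assumed to be (batch, time, vocab), and `teacher.beam_search` is a hypothetical decoding helper standing in for whatever beam-search routine the teacher model provides.

```python
# Minimal sketch (not the paper's implementation) of word-level and
# sequence-level knowledge distillation for NMT.
import torch
import torch.nn.functional as F


def word_level_kd_loss(student_logits, teacher_logits):
    """Word-level KD: train the student's per-token distribution
    p(y_t | y_<t, x) toward the teacher's soft distribution q(y_t | y_<t, x).
    Logits are assumed to have shape (batch, time, vocab)."""
    q = F.softmax(teacher_logits, dim=-1)          # teacher soft targets
    log_p = F.log_softmax(student_logits, dim=-1)  # student log-probs
    # Cross-entropy against the soft targets, averaged over batch and time.
    return -(q * log_p).sum(dim=-1).mean()


def sequence_level_kd_targets(teacher, src_batch, beam_size=5):
    """Sequence-level KD: decode the teacher (e.g. with beam search) and use
    its output sequences as hard training targets for the student.
    `teacher.beam_search` is a hypothetical helper, not a real API."""
    with torch.no_grad():
        return [teacher.beam_search(src, beam_size) for src in src_batch]
```

Under sequence-level KD the student is then trained with ordinary cross-entropy on the teacher-decoded sentences, which is what lets it be run with plain greedy decoding at inference time with little loss in BLEU.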

Results

Task                | Dataset                | Metric     | Value | Model
Machine Translation | IWSLT2015 Thai-English | BLEU score | 14.2  | Seq-KD + Seq-Inter + Word-KD
Machine Translation | WMT2014 English-German | BLEU score | 18.5  | Seq-KD + Seq-Inter + Word-KD

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)
Function-to-Style Guidance of LLMs for Code Translation (2025-07-15)
HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training (2025-07-15)
Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning (2025-07-14)
KAT-V1: Kwai-AutoThink Technical Report (2025-07-11)