
Autoregressive Knowledge Distillation through Imitation Learning

Alexander Lin, Jeremy Wohlwend, Howard Chen, Tao Lei

2020-09-15 · EMNLP 2020
Tasks: Machine Translation, Text Generation, Imitation Learning, Translation, Knowledge Distillation

Abstract

The performance of autoregressive models on natural language generation tasks has dramatically improved due to the adoption of deep, self-attentive architectures. However, these gains have come at the cost of slower inference, making state-of-the-art models cumbersome to deploy in real-world, time-sensitive settings. We develop a compression technique for autoregressive models that is driven by an imitation learning perspective on knowledge distillation. The algorithm is designed to address the exposure bias problem. On prototypical language generation tasks such as translation and summarization, our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation. Student models trained with our method attain BLEU/ROUGE scores 1.4 to 4.8 points higher than those trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
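The core idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration of one distillation step in this imitation-learning style: with some probability the training trajectory is the ground-truth target, otherwise it is a rollout generated by the student itself (which is what counters exposure bias), and in either case the teacher supplies the per-token target distribution. The `student.generate` call, the model call signatures, and the fixed mixing probability `beta` are assumptions for illustration, not the authors' API or code.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, src, tgt, beta=0.5, max_len=64):
    """One imitation-learning-flavored distillation step (hedged sketch).

    With probability `beta`, train on the ground-truth target sequence;
    otherwise, roll out the student so it is supervised on its own
    prefixes, mitigating exposure bias. Either way, the teacher's
    next-token distribution is the per-position target.
    """
    if torch.rand(1).item() < beta:
        seq = tgt  # oracle (ground-truth) trajectory
    else:
        with torch.no_grad():
            # Assumed interface: returns a sampled/decoded token sequence.
            seq = student.generate(src, max_len=max_len)

    # Teacher labels every position of the chosen trajectory.
    with torch.no_grad():
        teacher_logits = teacher(src, seq)   # (batch, len, vocab), assumed
    student_logits = student(src, seq)

    # Token-level distillation loss: KL divergence between the teacher's
    # and the student's next-token distributions along the trajectory.
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```

In practice one would anneal the mixing probability toward pure student rollouts over training, DAgger-style; the constant `beta` here is a simplification.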

Results

Task                | Dataset                  | Metric     | Value | Model
Machine Translation | IWSLT2014 German-English | BLEU score | 35.4  | ImitKD + Full

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner (2025-07-17)
Supervised Fine Tuning on Curated Data is Reinforcement Learning (and can be improved) (2025-07-17)
A Translation of Probabilistic Event Calculus into Markov Decision Processes (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
Mitigating Object Hallucinations via Sentence-Level Early Intervention (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)