Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding

Seongbin Kim, Gyuwan Kim, Seongjin Shin, Sangmin Lee

2020-10-25 · Speech Recognition · Automatic Speech Recognition (ASR) · Data Augmentation · Spoken Language Understanding · Knowledge Distillation

Paper · PDF · Code (official)

Abstract

End-to-end approaches open a new way for more accurate and efficient spoken language understanding (SLU) systems by alleviating the drawbacks of traditional pipeline systems. Previous works exploit textual information for an SLU model via pre-training with automatic speech recognition or fine-tuning with knowledge distillation. To utilize textual information more effectively, this work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of the two modalities during pre-training and fine-tuning, sequentially. We use vq-wav2vec BERT as a speech encoder because it captures general and rich features. Furthermore, we improve the performance, especially in a low-resource scenario, with data augmentation methods by randomly masking spans of discrete audio tokens and contextualized hidden representations. Consequently, we push the state-of-the-art on the Fluent Speech Commands dataset, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting. Through the ablation studies, we empirically verify that all used methods are crucial to the final performance, providing the best practice for spoken language understanding. Code is available at https://github.com/clovaai/textual-kd-slu.
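The abstract describes two distillation stages (matching utterance-level representations during pre-training, then matching predicted logits during fine-tuning) plus a span-masking augmentation over discrete audio tokens. The following PyTorch sketch illustrates one plausible form of these objectives; the pooling choice, loss types, temperature, and masking parameters are assumptions made for illustration, not the authors' released implementation (see the repository linked above for the official code).

# Hypothetical sketch of two-stage textual knowledge distillation losses.
# All hyperparameters and design choices below are illustrative assumptions.
import torch
import torch.nn.functional as F

def stage1_representation_loss(speech_hidden, text_cls):
    """Stage 1 (pre-training): match utterance-level representations.

    speech_hidden: (batch, time, dim) outputs of the speech encoder
                   (e.g., vq-wav2vec features fed into a BERT-style Transformer).
    text_cls:      (batch, dim) utterance embedding from a text BERT teacher.
    """
    speech_utt = speech_hidden.mean(dim=1)  # mean pooling is an assumption
    return F.mse_loss(speech_utt, text_cls)

def stage2_logit_loss(student_logits, teacher_logits, temperature=2.0):
    """Stage 2 (fine-tuning): match predicted logits of the two modalities
    using a temperature-scaled KL divergence (standard distillation form)."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def mask_spans(tokens, mask_id, span_len=10, mask_prob=0.05):
    """Augmentation assumption: randomly replace contiguous spans of discrete
    audio tokens with a mask id; the same idea applies to hidden states."""
    tokens = tokens.clone()
    batch, length = tokens.shape
    for b in range(batch):
        n_starts = max(1, int(length * mask_prob))
        starts = torch.randint(0, max(1, length - span_len), (n_starts,))
        for s in starts.tolist():
            tokens[b, s:s + span_len] = mask_id
    return tokens

In this reading, the stage-1 loss would be added to the speech-encoder pre-training objective against a frozen text BERT teacher, and the stage-2 loss would be combined with the downstream classification cross-entropy during fine-tuning.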

Results

Task                          | Dataset                | Metric       | Value | Model
Dialogue                      | Fluent Speech Commands | Accuracy (%) | 99.7  | textual-kd-slu
Spoken Language Understanding | Fluent Speech Commands | Accuracy (%) | 99.7  | textual-kd-slu
Dialogue Understanding        | Fluent Speech Commands | Accuracy (%) | 99.7  | textual-kd-slu

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
Similarity-Guided Diffusion for Contrastive Sequential Recommendation (2025-07-16)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)