Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Integration of Pre-trained Networks with Continuous Token Interface for End-to-End Spoken Language Understanding

Seunghyun Seo, Donghyun Kwak, Bowon Lee

2021-04-15 · Intent Classification · Slot Filling · Spoken Language Understanding · Multi-Task Learning · Intent Classification and Slot Filling · Knowledge Distillation · Language Modelling

Abstract

Most End-to-End (E2E) Spoken Language Understanding (SLU) networks leverage pre-trained ASR networks but still lack the capability to understand the semantics of utterances, which is crucial for the SLU task. To solve this, recent studies use pre-trained NLU networks. However, fully utilizing both pre-trained networks is not trivial; many solutions have been proposed, such as knowledge distillation, cross-modal shared embeddings, and network integration with an interface. We propose a simple and robust integration method for the E2E SLU network with a novel interface, the Continuous Token Interface (CTI): the junctional representation between the ASR and NLU networks when both are pre-trained with the same vocabulary. Because the only difference between the ASR output and the NLU input is the noise level, we feed the ASR network's output directly to the NLU network. Thus, we can train our SLU network in an E2E manner without additional modules such as Gumbel-Softmax. We evaluate our model on SLURP, a challenging SLU dataset, and achieve state-of-the-art scores on both intent classification and slot filling. We also verify that the NLU network, pre-trained with a masked language model objective, can utilize the noisy textual representation produced by CTI. Moreover, we show that our model can be trained with multi-task learning on heterogeneous data even after integration with CTI.
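The core idea of CTI can be sketched in a few lines: since the ASR and NLU networks share one vocabulary, the ASR posterior over tokens can be mixed with the shared embedding table and fed straight into the NLU network, keeping the pipeline differentiable without a Gumbel-Softmax step. The following is a minimal toy illustration of that data flow, not the authors' implementation; all names, shapes, and weights are illustrative assumptions.

```python
# Toy sketch of a Continuous Token Interface (CTI) pipeline.
# Assumption: shared ASR/NLU vocabulary, random weights, numpy only.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 8    # shared ASR/NLU vocabulary (toy size)
EMBED_DIM = 4
SEQ_LEN = 5
NUM_INTENTS = 3

# Shared token-embedding table: the "junctional representation".
embedding = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def asr_head(audio_features):
    """Toy ASR head: per-frame logits over the shared vocabulary."""
    w = rng.normal(size=(audio_features.shape[-1], VOCAB_SIZE))
    return audio_features @ w          # (seq, vocab)

def cti(asr_logits):
    """Continuous Token Interface: instead of hard argmax decoding
    (which would need Gumbel-Softmax to stay differentiable), mix the
    shared embedding table with the ASR posterior."""
    probs = softmax(asr_logits)        # (seq, vocab)
    return probs @ embedding           # (seq, embed): continuous tokens

def nlu_head(token_embeddings):
    """Toy NLU intent classifier over mean-pooled token embeddings."""
    w = rng.normal(size=(EMBED_DIM, NUM_INTENTS))
    pooled = token_embeddings.mean(axis=0)
    return softmax(pooled @ w)         # (num_intents,)

audio = rng.normal(size=(SEQ_LEN, 16))  # stand-in for acoustic features
intent_probs = nlu_head(cti(asr_head(audio)))
print(intent_probs.shape)
```

Because every step is a differentiable matrix operation, gradients from the NLU loss flow back through the CTI into the ASR head, which is the property the E2E training in the paper relies on.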

Results

Task                          | Dataset               | Metric       | Value | Model
Dialogue                      | Fluent Speech Commands | Accuracy (%) | 99.7  | Wav2Vec2.0-Classifier
Spoken Language Understanding | Fluent Speech Commands | Accuracy (%) | 99.7  | Wav2Vec2.0-Classifier
Dialogue Understanding        | Fluent Speech Commands | Accuracy (%) | 99.7  | Wav2Vec2.0-Classifier

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
Uncertainty-Aware Cross-Modal Knowledge Distillation with Prototype Learning for Multimodal Brain-Computer Interfaces (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
DVFL-Net: A Lightweight Distilled Video Focal Modulation Network for Spatio-Temporal Action Recognition (2025-07-16)