Speech Model Pre-training for End-to-End Spoken Language Understanding

Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, Yoshua Bengio

2019-04-07 | Speech-to-Text | Spoken Language Understanding

Abstract

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.
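The step from learned features to an intent can be illustrated with a small sketch. It is not the authors' implementation. It assumes the pre-trained phoneme and word layers emit one feature vector per time step, and that a pooling classifier collapses these with an element-wise maximum before a linear layer scores each intent. The function names, the toy weights, the choice of max-pooling, and the intent labels are all hypothetical.

```python
# Illustrative sketch only: names, shapes, weights, and the use of
# max-pooling are assumptions, not details taken from the paper.

def max_pool_over_time(frames):
    """Collapse a variable-length sequence of feature vectors into one
    fixed-size vector by taking the element-wise maximum across time."""
    if not frames:
        raise ValueError("need at least one frame")
    return [max(column) for column in zip(*frames)]


def linear(vector, weights, bias):
    """Apply a dense layer: one output score per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def classify_intent(frames, weights, bias, intents):
    """Pool the frame features, score every intent, return the best label."""
    pooled = max_pool_over_time(frames)
    logits = linear(pooled, weights, bias)
    best = max(range(len(logits)), key=logits.__getitem__)
    return intents[best]


# Toy usage: three time steps of 2-dimensional features (stand-ins for the
# output of the pre-trained layers) and two hypothetical intent labels.
frames = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.4]]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
intents = ["activate lights", "deactivate lights"]

print(classify_intent(frames, weights, bias, intents))
```

Pooling over time is what lets a single classifier handle utterances of different lengths: the output size depends only on the feature dimension, not on the number of frames.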

Results

Task	Dataset	Metric	Value	Model
Dialogue	Fluent Speech Commands	Accuracy (%)	98.8	Pooling classifier pre-trained using force-aligned phoneme and word labels on LibriSpeech
Spoken Language Understanding	Fluent Speech Commands	Accuracy (%)	98.8	Pooling classifier pre-trained using force-aligned phoneme and word labels on LibriSpeech
Dialogue Understanding	Fluent Speech Commands	Accuracy (%)	98.8	Pooling classifier pre-trained using force-aligned phoneme and word labels on LibriSpeech
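
For reference, an accuracy percentage like the one in the table can be computed as the share of test utterances whose predicted intent matches the reference. The sketch below assumes an exact-match criterion and represents each intent as a tuple of hypothetical slot values. The page itself does not state how the reported figure was calculated.

```python
# Illustrative sketch only: the exact-match criterion and the tuple
# representation of an intent are assumptions, not taken from the source.

def accuracy_percent(predicted, reference):
    """Return the percentage of utterances whose predicted intent equals
    the reference intent exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("need at least one utterance")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


# Toy usage with hypothetical (action, object) intents.
reference = [("activate", "lights"), ("increase", "volume")]
predicted = [("activate", "lights"), ("decrease", "volume")]

print(accuracy_percent(predicted, reference))
```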
