Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Sequential End-to-End Intent and Slot Label Classification and Localization

Yiran Cao, Nihal Potdar, Anderson R. Avila

2021-06-08 · Speech Recognition · Automatic Speech Recognition (ASR) · Spoken Language Understanding · Temporal Localization · Classification

Paper · PDF

Abstract

Human-computer interaction (HCI) is significantly impacted by delayed responses from a spoken dialogue system. Hence, end-to-end (e2e) spoken language understanding (SLU) solutions have recently been proposed to decrease latency. Such approaches allow for the extraction of semantic information directly from the speech signal, thus bypassing the need for a transcript from an automatic speech recognition (ASR) system. In this paper, we propose a compact e2e SLU architecture for streaming scenarios, where chunks of the speech signal are processed continuously to predict intent and slot values. Our model is based on a 3D convolutional neural network (3D-CNN) and a unidirectional long short-term memory (LSTM). We compare the performance of two alignment-free losses: the connectionist temporal classification (CTC) method and its adapted version, namely connectionist temporal localization (CTL). The latter performs not only classification but also localization of sequential audio events. The proposed solution is evaluated on the Fluent Speech Commands dataset, and the results show our model's ability to process the incoming speech signal, reaching accuracy as high as 98.97% for CTC and 98.78% for CTL on single-label classification, and as high as 95.69% for CTC and 95.28% for CTL on two-label prediction.
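The alignment-free decoding step behind a CTC-style model can be illustrated with greedy (best-path) decoding: take the per-frame argmax labels emitted by the network, collapse consecutive repeats, and drop blank tokens. This is a minimal sketch, not the authors' implementation; the label ids, the blank index, and the example frame sequence are assumptions for illustration.

```python
# Greedy (best-path) CTC decoding sketch.
# BLANK is the CTC blank token; index 0 is an assumption here.
BLANK = 0

def ctc_greedy_decode(frame_labels):
    """Collapse consecutive duplicate labels, then remove blanks."""
    collapsed = []
    prev = None
    for label in frame_labels:
        if label != prev:          # keep only label changes
            collapsed.append(label)
        prev = label
    return [l for l in collapsed if l != BLANK]

# Hypothetical per-frame argmax outputs over a streaming chunk,
# e.g. 3 = an intent label, 5 = a slot label (illustrative ids).
frames = [0, 0, 3, 3, 3, 0, 0, 5, 5, 0]
print(ctc_greedy_decode(frames))  # [3, 5]
```

In a streaming setup like the one described above, this decoding can be applied incrementally as each chunk's frame posteriors arrive; CTL additionally ties each emitted label to the time at which it fires, which is what enables localization of sequential audio events.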

Results

Task                          | Dataset                | Metric       | Value | Model
Dialogue                      | Fluent Speech Commands | Accuracy (%) | 99.3  | 3D-CNN+LSTM+CE
Spoken Language Understanding | Fluent Speech Commands | Accuracy (%) | 99.3  | 3D-CNN+LSTM+CE
Dialogue Understanding        | Fluent Speech Commands | Accuracy (%) | 99.3  | 3D-CNN+LSTM+CE

Related Papers

- Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
- NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
- Adversarial attacks to image classification systems using evolutionary algorithms (2025-07-17)
- Efficient Calisthenics Skills Classification through Foreground Instance Selection and Depth Estimation (2025-07-16)
- Safeguarding Federated Learning-based Road Condition Classification (2025-07-16)
- WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
- AI-Enhanced Pediatric Pneumonia Detection: A CNN-Based Approach Using Data Augmentation and Generative Adversarial Networks (GANs) (2025-07-13)
- VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis (2025-07-08)