Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

2021-11-19
Speech Recognition, Speaker Identification, Automatic Speech Recognition (ASR), Sentiment Analysis, Named Entity Recognition (NER), Spoken Language Understanding

Abstract

Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 9.1 | W2V2-L-LL60K (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxCeleb (Test) | 10.8 | W2V2-L-LL60K (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 9.1 | W2V2-L-LL60K (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxPopuli (Test) | 9.3 | W2V2-L-LL60K (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 13.2 | W2V2-B-LS960 (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxCeleb (Test) | 15.8 | W2V2-B-LS960 (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 12 | W2V2-B-LS960 (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxPopuli (Test) | 12.2 | W2V2-B-LS960 (+ TED-LIUM 3 LM) |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 11.8 | W2V2-L-LL60K (+ in-domain LM) |
| Speech Recognition | SLUE | VoxCeleb (Test) | 13.8 | W2V2-L-LL60K (+ in-domain LM) |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 12 | W2V2-L-LL60K (+ in-domain LM) |
| Speech Recognition | SLUE | VoxPopuli (Test) | 12.5 | W2V2-L-LL60K (+ in-domain LM) |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 11 | W2V2-L-LL60K |
| Speech Recognition | SLUE | VoxCeleb (Test) | 13.5 | W2V2-L-LL60K |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 14 | W2V2-L-LL60K |
| Speech Recognition | SLUE | VoxPopuli (Test) | 12.1 | W2V2-L-LL60K |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 15.2 | W2V2-B-LS960 (+ in-domain LM) |
| Speech Recognition | SLUE | VoxCeleb (Test) | 18.2 | W2V2-B-LS960 (+ in-domain LM) |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 14.6 | W2V2-B-LS960 (+ in-domain LM) |
| Speech Recognition | SLUE | VoxPopuli (Test) | 15.2 | W2V2-B-LS960 (+ in-domain LM) |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 17.2 | W2V2-B-LS960 |
| Speech Recognition | SLUE | VoxCeleb (Test) | 20.5 | W2V2-B-LS960 |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 17.2 | W2V2-B-LS960 |
| Speech Recognition | SLUE | VoxPopuli (Test) | 17.9 | W2V2-B-LS960 |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 19.6 | HuBERT-B-LS960 |
| Speech Recognition | SLUE | VoxCeleb (Test) | 21.2 | HuBERT-B-LS960 |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 18.6 | HuBERT-B-LS960 |
| Speech Recognition | SLUE | VoxPopuli (Test) | 19.1 | HuBERT-B-LS960 |
| Speech Recognition | SLUE | VoxCeleb (Dev) | 29.9 | W2V2-B-VP100K |
| Speech Recognition | SLUE | VoxCeleb (Test) | 33.4 | W2V2-B-VP100K |
| Speech Recognition | SLUE | VoxPopuli (Dev) | 21.6 | W2V2-B-VP100K |
| Speech Recognition | SLUE | VoxPopuli (Test) | 22.4 | W2V2-B-VP100K |
| Sentiment Analysis | SLUE | F1 (%) | 63.3 | W2V2-L-LL60K (pipeline approach, uses LM) |
| Sentiment Analysis | SLUE | Recall (%) | 60.4 | W2V2-L-LL60K (pipeline approach, uses LM) |
| Sentiment Analysis | SLUE | F1 (%) | 63.3 | W2V2-L-LL60K (pipeline approach) |
| Sentiment Analysis | SLUE | Recall (%) | 60.2 | W2V2-L-LL60K (pipeline approach) |
| Sentiment Analysis | SLUE | F1 (%) | 62.9 | W2V2-B-LS960 (pipeline approach, uses LM) |
| Sentiment Analysis | SLUE | Recall (%) | 60 | W2V2-B-LS960 (pipeline approach, uses LM) |
| Sentiment Analysis | SLUE | F1 (%) | 61.8 | W2V2-B-LS960 (pipeline approach) |
| Sentiment Analysis | SLUE | Recall (%) | 59 | W2V2-B-LS960 (pipeline approach) |
| Sentiment Analysis | SLUE | F1 (%) | 48.5 | W2V2-L-LL60K (e2e approach) |
| Sentiment Analysis | SLUE | Recall (%) | 49.2 | W2V2-L-LL60K (e2e approach) |
| Sentiment Analysis | SLUE | F1 (%) | 48 | HuBERT-B-LS960 (e2e approach) |
| Sentiment Analysis | SLUE | Recall (%) | 47.5 | HuBERT-B-LS960 (e2e approach) |
| Sentiment Analysis | SLUE | F1 (%) | 46.6 | W2V2-B-LS960 (e2e approach) |
| Sentiment Analysis | SLUE | Recall (%) | 46 | W2V2-B-LS960 (e2e approach) |
| Sentiment Analysis | SLUE | F1 (%) | 38.4 | W2V2-B-VP100K (e2e approach) |
| Sentiment Analysis | SLUE | Recall (%) | 38.7 | W2V2-B-VP100K (e2e approach) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 69.6 | W2V2-L-LL60K (pipeline approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 82.2 | W2V2-L-LL60K (pipeline approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 68 | W2V2-B-LS960 (pipeline approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 79.8 | W2V2-B-LS960 (pipeline approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 64.8 | W2V2-L-LL60K (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 73.3 | W2V2-L-LL60K (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 63.4 | W2V2-B-LS960 (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 71.7 | W2V2-B-LS960 (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 61.9 | HuBERT-B-LS960 (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 70.3 | HuBERT-B-LS960 (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 61.8 | W2V2-B-VP100K (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 69.8 | W2V2-B-VP100K (e2e approach, uses LM) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 57.8 | W2V2-L-LL60K (pipeline approach) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 78.8 | W2V2-L-LL60K (pipeline approach) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 50.9 | W2V2-L-LL60K (e2e approach) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 64.7 | W2V2-L-LL60K (e2e approach) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 50.2 | W2V2-B-LS960 (e2e approach) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 64 | W2V2-B-LS960 (e2e approach) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 49.8 | HuBERT-B-LS960 (e2e approach) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 62.9 | HuBERT-B-LS960 (e2e approach) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 49.5 | W2V2-B-LS960 (pipeline approach) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 74.2 | W2V2-B-LS960 (pipeline approach) |
| Named Entity Recognition (NER) | SLUE | F1 (%) | 47.9 | W2V2-B-VP100K (e2e approach) |
| Named Entity Recognition (NER) | SLUE | label-F1 (%) | 60.8 | W2V2-B-VP100K (e2e approach) |
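
The Speech Recognition rows above name only the evaluation split in the Metric column. Assuming those values are word error rates in percent (the usual ASR metric; the page itself does not label them), the sketch below shows how such a score can be computed for a single utterance. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not part of the SLUE toolkit, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length, in percent.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Corpus-level scores are normally obtained by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance percentages.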

Related Papers

Task-Specific Audio Coding for Machines: Machine-Learned Latent Features Are Codes for That Machine (2025-07-17)
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech (2025-07-17)
AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles (2025-07-15)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)
WhisperKit: On-device Real-time ASR with Billion-Scale Transformers (2025-07-14)
SentiDrop: A Multi Modal Machine Learning model for Predicting Dropout in Distance Learning (2025-07-14)
GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation (2025-07-10)