SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

2021-11-19

Tasks: Speech Recognition, Speaker Identification, Automatic Speech Recognition (ASR), Sentiment Analysis, Named Entity Recognition (NER), Spoken Language Understanding


Abstract

Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in higher-level spoken language understanding tasks, including using end-to-end models, but there are fewer annotated datasets for such tasks. At the same time, recent work shows the possibility of pre-training generic representations and then fine-tuning for several tasks using relatively little labeled data. We propose to create a suite of benchmark tasks for Spoken Language Understanding Evaluation (SLUE) consisting of limited-size labeled training sets and corresponding evaluation sets. This resource would allow the research community to track progress, evaluate pre-trained representations for higher-level tasks, and study open questions such as the utility of pipeline versus end-to-end approaches. We present the first phase of the SLUE benchmark suite, consisting of named entity recognition, sentiment analysis, and ASR on the corresponding datasets. We focus on naturally produced (not read or synthesized) speech, and freely available datasets. We provide new transcriptions and annotations on subsets of the VoxCeleb and VoxPopuli datasets, evaluation metrics and results for baseline models, and an open-source toolkit to reproduce the baselines and evaluate new models.

Results

Task	Dataset	Metric	Value	Model
Speech Recognition	SLUE	VoxCeleb (Dev)	9.1	W2V2-L-LL60K (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxCeleb (Test)	10.8	W2V2-L-LL60K (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxPopuli (Dev)	9.1	W2V2-L-LL60K (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxPopuli (Test)	9.3	W2V2-L-LL60K (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxCeleb (Dev)	13.2	W2V2-B-LS960 (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxCeleb (Test)	15.8	W2V2-B-LS960 (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxPopuli (Dev)	12	W2V2-B-LS960 (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxPopuli (Test)	12.2	W2V2-B-LS960 (+ TED-LIUM 3 LM)
Speech Recognition	SLUE	VoxCeleb (Dev)	11.8	W2V2-L-LL60K (+ in-domain LM)
Speech Recognition	SLUE	VoxCeleb (Test)	13.8	W2V2-L-LL60K (+ in-domain LM)
Speech Recognition	SLUE	VoxPopuli (Dev)	12	W2V2-L-LL60K (+ in-domain LM)
Speech Recognition	SLUE	VoxPopuli (Test)	12.5	W2V2-L-LL60K (+ in-domain LM)
Speech Recognition	SLUE	VoxCeleb (Dev)	11	W2V2-L-LL60K
Speech Recognition	SLUE	VoxCeleb (Test)	13.5	W2V2-L-LL60K
Speech Recognition	SLUE	VoxPopuli (Dev)	14	W2V2-L-LL60K
Speech Recognition	SLUE	VoxPopuli (Test)	12.1	W2V2-L-LL60K
Speech Recognition	SLUE	VoxCeleb (Dev)	15.2	W2V2-B-LS960 (+ in-domain LM)
Speech Recognition	SLUE	VoxCeleb (Test)	18.2	W2V2-B-LS960 (+ in-domain LM)
Speech Recognition	SLUE	VoxPopuli (Dev)	14.6	W2V2-B-LS960 (+ in-domain LM)
Speech Recognition	SLUE	VoxPopuli (Test)	15.2	W2V2-B-LS960 (+ in-domain LM)
Speech Recognition	SLUE	VoxCeleb (Dev)	17.2	W2V2-B-LS960
Speech Recognition	SLUE	VoxCeleb (Test)	20.5	W2V2-B-LS960
Speech Recognition	SLUE	VoxPopuli (Dev)	17.2	W2V2-B-LS960
Speech Recognition	SLUE	VoxPopuli (Test)	17.9	W2V2-B-LS960
Speech Recognition	SLUE	VoxCeleb (Dev)	19.6	HuBERT-B-LS960
Speech Recognition	SLUE	VoxCeleb (Test)	21.2	HuBERT-B-LS960
Speech Recognition	SLUE	VoxPopuli (Dev)	18.6	HuBERT-B-LS960
Speech Recognition	SLUE	VoxPopuli (Test)	19.1	HuBERT-B-LS960
Speech Recognition	SLUE	VoxCeleb (Dev)	29.9	W2V2-B-VP100K
Speech Recognition	SLUE	VoxCeleb (Test)	33.4	W2V2-B-VP100K
Speech Recognition	SLUE	VoxPopuli (Dev)	21.6	W2V2-B-VP100K
Speech Recognition	SLUE	VoxPopuli (Test)	22.4	W2V2-B-VP100K
Sentiment Analysis	SLUE	F1 (%)	63.3	W2V2-L-LL60K (pipeline approach, uses LM)
Sentiment Analysis	SLUE	Recall (%)	60.4	W2V2-L-LL60K (pipeline approach, uses LM)
Sentiment Analysis	SLUE	F1 (%)	63.3	W2V2-L-LL60K (pipeline approach)
Sentiment Analysis	SLUE	Recall (%)	60.2	W2V2-L-LL60K (pipeline approach)
Sentiment Analysis	SLUE	F1 (%)	62.9	W2V2-B-LS960 (pipeline approach, uses LM)
Sentiment Analysis	SLUE	Recall (%)	60	W2V2-B-LS960 (pipeline approach, uses LM)
Sentiment Analysis	SLUE	F1 (%)	61.8	W2V2-B-LS960 (pipeline approach)
Sentiment Analysis	SLUE	Recall (%)	59	W2V2-B-LS960 (pipeline approach)
Sentiment Analysis	SLUE	F1 (%)	48.5	W2V2-L-LL60K (e2e approach)
Sentiment Analysis	SLUE	Recall (%)	49.2	W2V2-L-LL60K (e2e approach)
Sentiment Analysis	SLUE	F1 (%)	48	HuBERT-B-LS960 (e2e approach)
Sentiment Analysis	SLUE	Recall (%)	47.5	HuBERT-B-LS960 (e2e approach)
Sentiment Analysis	SLUE	F1 (%)	46.6	W2V2-B-LS960 (e2e approach)
Sentiment Analysis	SLUE	Recall (%)	46	W2V2-B-LS960 (e2e approach)
Sentiment Analysis	SLUE	F1 (%)	38.4	W2V2-B-VP100K (e2e approach)
Sentiment Analysis	SLUE	Recall (%)	38.7	W2V2-B-VP100K (e2e approach)
Named Entity Recognition (NER)	SLUE	F1 (%)	69.6	W2V2-L-LL60K (pipeline approach, uses LM)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	82.2	W2V2-L-LL60K (pipeline approach, uses LM)
Named Entity Recognition (NER)	SLUE	F1 (%)	68	W2V2-B-LS960 (pipeline approach, uses LM)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	79.8	W2V2-B-LS960 (pipeline approach, uses LM)
Named Entity Recognition (NER)	SLUE	F1 (%)	64.8	W2V2-L-LL60K (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	73.3	W2V2-L-LL60K (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	F1 (%)	63.4	W2V2-B-LS960 (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	71.7	W2V2-B-LS960 (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	F1 (%)	61.9	HuBERT-B-LS960 (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	70.3	HuBERT-B-LS960 (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	F1 (%)	61.8	W2V2-B-VP100K (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	69.8	W2V2-B-VP100K (e2e approach, uses LM)
Named Entity Recognition (NER)	SLUE	F1 (%)	57.8	W2V2-L-LL60K (pipeline approach)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	78.8	W2V2-L-LL60K (pipeline approach)
Named Entity Recognition (NER)	SLUE	F1 (%)	50.9	W2V2-L-LL60K (e2e approach)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	64.7	W2V2-L-LL60K (e2e approach)
Named Entity Recognition (NER)	SLUE	F1 (%)	50.2	W2V2-B-LS960 (e2e approach)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	64	W2V2-B-LS960 (e2e approach)
Named Entity Recognition (NER)	SLUE	F1 (%)	49.8	HuBERT-B-LS960 (e2e approach)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	62.9	HuBERT-B-LS960 (e2e approach)
Named Entity Recognition (NER)	SLUE	F1 (%)	49.5	W2V2-B-LS960 (pipeline approach)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	74.2	W2V2-B-LS960 (pipeline approach)
Named Entity Recognition (NER)	SLUE	F1 (%)	47.9	W2V2-B-VP100K (e2e approach)
Named Entity Recognition (NER)	SLUE	label-F1 (%)	60.8	W2V2-B-VP100K (e2e approach)
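
The speech recognition rows above give one score per evaluation split without naming the metric. Assuming these scores are word error rates (WER, in %), the usual metric for ASR benchmarks, the following minimal sketch shows how such a number can be computed from a reference transcript and a system hypothesis. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions; they are not taken from the SLUE toolkit, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length.

    Hypothetical helper for illustration only; tokenization is plain
    whitespace splitting, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.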

