VoxLingua107: a Dataset for Spoken Language Recognition

Jörgen Valk, Tanel Alumäe

2020-11-25
Tasks: Action Detection, Language Identification, Spoken language identification, Activity Detection, Speaker Diarization

Abstract

This paper investigates the use of automatically collected web audio data for the task of spoken language recognition. We generate semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech. Post-filtering is used to remove segments from the database that are likely not in the given language, increasing the proportion of correctly labeled segments to 98%, based on crowd-sourced verification. The size of the resulting training set (VoxLingua107) is 6628 hours (62 hours per language on average), and it is accompanied by an evaluation set of 1609 verified utterances. We use the data to build language recognition models for several spoken language identification tasks. Experiments show that using the automatically retrieved training data gives results competitive with those obtained using hand-labeled proprietary datasets. The dataset is publicly available.
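The abstract does not say how the semi-random search phrases are built. One plausible reading is sketched below: sample short runs of consecutive words from a language's Wikipedia text and use each run as a search query. The function name `sample_search_phrases`, the phrase length, and the toy text are assumptions made for illustration, not the authors' method.

```python
import random


def sample_search_phrases(text, n_phrases=5, phrase_len=3, seed=0):
    """Sample contiguous word n-grams from `text` (hypothetical helper).

    Assumption: a "semi-random search phrase" is a short run of
    consecutive words drawn at a random position in the source text.
    """
    words = text.split()
    if len(words) < phrase_len:
        raise ValueError("text is shorter than one phrase")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    phrases = []
    for _ in range(n_phrases):
        start = rng.randrange(len(words) - phrase_len + 1)
        phrases.append(" ".join(words[start:start + phrase_len]))
    return phrases


# Toy stand-in for text taken from one language's Wikipedia.
sample_text = "Tallinn on Eesti pealinn ja suurim linn mis asub Soome lahe lõunakaldal"
phrases = sample_search_phrases(sample_text, n_phrases=4, phrase_len=3)
for phrase in phrases:
    print(phrase)
```

Drawing phrases from text written in the target language is what would bias the retrieved videos toward that language. The later diarization and post-filtering stages are not sketched here.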

Results

The same results are listed under three tasks: Dialogue, Spoken Language Understanding, and Dialogue Understanding.

| Dataset | Metric | Value | Model |
|---|---|---|---|
| VOXLINGUA107 | 0..5sec | 12.3 | Noisy |
| VOXLINGUA107 | 5..20sec | 6.1 | Noisy |
| VOXLINGUA107 | Average | 7.1 | Noisy |
| VOXLINGUA107 | 0..5sec | 13.4 | Cleaned |
| VOXLINGUA107 | 5..20sec | 6.6 | Cleaned |
| VOXLINGUA107 | Average | 7.6 | Cleaned |
| LRE07 | 10 sec | 2.61 | CNN-LDE |
| LRE07 | 3 sec | 8.25 | CNN-LDE |
| LRE07 | 30 sec | 1.16 | CNN-LDE |
| LRE07 | Average | 4 | CNN-LDE |
| LRE07 | 10 sec | 2.49 | CNN-SAP |
| LRE07 | 3 sec | 8.59 | CNN-SAP |
| LRE07 | 30 sec | 1.09 | CNN-SAP |
| LRE07 | Average | 4.06 | CNN-SAP |
| LRE07 | 10 sec | 3.14 | Resnet34 (cleaned data) |
| LRE07 | 3 sec | 9.39 | Resnet34 (cleaned data) |
| LRE07 | 30 sec | 1.9 | Resnet34 (cleaned data) |
| LRE07 | Average | 4.81 | Resnet34 (cleaned data) |
| LRE07 | 10 sec | 3.33 | Resnet34 (noisy data) |
| LRE07 | 3 sec | 10.58 | Resnet34 (noisy data) |
| LRE07 | 30 sec | 1.72 | Resnet34 (noisy data) |
| LRE07 | Average | 5.21 | Resnet34 (noisy data) |
| LRE07 | 10 sec | 4.54 | Fusion of models |
| LRE07 | 3 sec | 15.29 | Fusion of models |
| LRE07 | 30 sec | 1.3 | Fusion of models |
| LRE07 | Average | 7.04 | Fusion of models |
| LRE07 | 10 sec | 5.9 | GMM-MMI |
| LRE07 | 3 sec | 17.28 | GMM-MMI |
| LRE07 | 30 sec | 2.1 | GMM-MMI |
| LRE07 | Average | 8.42 | GMM-MMI |
| LRE07 | 10 sec | 6.28 | Phonotactic |
| LRE07 | 3 sec | 18.59 | Phonotactic |
| LRE07 | 30 sec | 1.34 | Phonotactic |
| LRE07 | Average | 8.73 | Phonotactic |
| LRE07 | 10 sec | 7.84 | Kaldi i-vector DNN |
| LRE07 | 3 sec | 19.67 | Kaldi i-vector DNN |
| LRE07 | 30 sec | 3.31 | Kaldi i-vector DNN |
| LRE07 | Average | 10.27 | Kaldi i-vector DNN |
| LRE07 | 10 sec | 11.93 | Kaldi i-vector |
| LRE07 | 3 sec | 26.04 | Kaldi i-vector |
| LRE07 | 30 sec | 4.52 | Kaldi i-vector |
| LRE07 | Average | 14.17 | Kaldi i-vector |
| KALAKA-3 | EC | 0.022 | Model on the automatically filtered (cleaned) data |
| KALAKA-3 | EO | 0.058 | Model on the automatically filtered (cleaned) data |
| KALAKA-3 | PC | 0.041 | Model on the automatically filtered (cleaned) data |
| KALAKA-3 | PO | 0.056 | Model on the automatically filtered (cleaned) data |
| KALAKA-3 | EC | 0.033 | Model on the noisy data |
| KALAKA-3 | EO | 0.059 | Model on the noisy data |
| KALAKA-3 | PC | 0.055 | Model on the noisy data |
| KALAKA-3 | PO | 0.083 | Model on the noisy data |
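
The LRE07 "Average" values appear to be the arithmetic mean of the 10 sec, 3 sec, and 30 sec values for each model. This is an observation from the listed numbers, not something the page states. The short check below copies the values from the results above and reports how far each listed average is from the computed mean. The variable and function names are illustrative.

```python
# (10 sec, 3 sec, 30 sec) values and the listed Average for each LRE07 model,
# copied from the results above.
lre07 = {
    "CNN-LDE": ((2.61, 8.25, 1.16), 4.0),
    "CNN-SAP": ((2.49, 8.59, 1.09), 4.06),
    "Resnet34 (cleaned data)": ((3.14, 9.39, 1.9), 4.81),
    "Resnet34 (noisy data)": ((3.33, 10.58, 1.72), 5.21),
    "Fusion of models": ((4.54, 15.29, 1.3), 7.04),
    "GMM-MMI": ((5.9, 17.28, 2.1), 8.42),
    "Phonotactic": ((6.28, 18.59, 1.34), 8.73),
    "Kaldi i-vector DNN": ((7.84, 19.67, 3.31), 10.27),
    "Kaldi i-vector": ((11.93, 26.04, 4.52), 14.17),
}


def mean_gap(values, listed):
    """Absolute difference between the mean of `values` and the listed average."""
    return abs(sum(values) / len(values) - listed)


gaps = {model: mean_gap(values, listed) for model, (values, listed) in lre07.items()}
for model, gap in gaps.items():
    print(f"{model}: |mean - listed average| = {gap:.4f}")
```

Every gap is below 0.01, which is consistent with averages that were computed from unrounded per-duration values and then rounded for display.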
