Datasets

199 machine learning datasets

MediaSpeech

MediaSpeech is a media speech dataset built to test the performance of automatic speech recognition (ASR) systems. The dataset consists of short speech segments automatically extracted from media videos available on YouTube and manually transcribed, with some pre- and post-processing. It contains 10 hours of speech for each language provided. This release contains audio datasets in French, Arabic, Turkish and Spanish, and is part of a larger private dataset.

5 papers | 4 benchmarks | Speech

CSRC (Children Speech Recognition Challenge)

CSRC is a collection of data for Children Speech Recognition. The data for this challenge is divided into 3 datasets, referred to as A (Adult speech training set), C1 (Children speech training set) and C2 (Children conversation training set). Combined, the three datasets amount to 400 hours of Mandarin speech data.

5 papers | 0 benchmarks | Speech

OLR 2021

The OLR 2021 dataset contains the data for the Oriental Language Recognition (OLR) 2021 Challenge, which intends to improve the performance of language recognition systems and speech recognition systems within multilingual scenarios.

5 papers | 0 benchmarks | Speech

DeToxy (DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances)

DeToxy is a publicly available toxicity-annotated dataset for the English language. DeToxy is sourced from various openly available speech databases and consists of over 2 million utterances. The dataset is intended to serve as a benchmark for the relatively new and unexplored Spoken Language Processing task of detecting toxicity in spoken utterances, and to boost further research in this space.

5 papers | 0 benchmarks | Speech

PodcastFillers

The PodcastFillers dataset consists of 199 full-length podcast episodes in English with manually annotated filler words and automatically generated transcripts. The podcast audio recordings, sourced from SoundCloud, are CC-licensed, gender-balanced, and total 145 hours of audio from over 350 speakers. The annotations are provided under a non-commercial license and consist of 85,803 manually annotated audio events including approximately 35,000 filler words (“uh” and “um”) and 50,000 non-filler events such as breaths, music, laughter, repeated words, and noise. The annotated events are also provided as pre-processed 1-second audio clips. The dataset also includes automatically generated speech transcripts from a speech-to-text system. A detailed description is provided in Dataset.

5 papers | 1 benchmark | Speech

Talking With Hands 16.2M

This is a 16.2-million frame (50-hour) multimodal dataset of two-person face-to-face spontaneous conversations. This dataset features synchronized body and finger motion as well as audio data. It represents the largest motion capture and audio dataset of natural conversations to date. The statistical analysis verifies strong intraperson and interperson covariance of arm, hand, and speech features, potentially enabling new directions on data-driven social behavior analysis, prediction, and synthesis.

5 papers | 0 benchmarks | 3D, Speech

RealMAN (A Real-Recorded and Annotated Microphone Array Dataset for Dynamic Speech Enhancement and Localization)

The Audio Signal and Information Processing Lab at Westlake University, in collaboration with AISHELL, has released the Real-recorded and annotated Microphone Array speech&Noise (RealMAN) dataset, which provides annotated multi-channel speech and noise recordings for dynamic speech enhancement and localization.

5 papers | 7 benchmarks | Audio, Speech

KazakhTTS

KazakhTTS is an open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry.

4 papers | 0 benchmarks | Speech, Texts

Kosp2e

Kosp2e (read as `kospi') is a corpus that allows Korean speech to be translated into English text in an end-to-end manner.

4 papers | 0 benchmarks | Speech

REAL-M

REAL-M is a crowd-sourced speech-separation corpus of real-life mixtures. The mixtures are recorded in different acoustic environments using a wide variety of recording devices such as laptops and smartphones, and so more closely reflect potential application scenarios.

4 papers | 0 benchmarks | Speech

RTASC (ROBIN Technical Acquisition Speech Corpus)

The ROBIN Technical Acquisition Speech Corpus (ROBINTASC) was developed within the ROBIN project. Its main purpose was to improve the behaviour of a conversational agent, allowing human-machine interaction in the context of purchasing technical equipment. It contains over 6 hours of read speech in Romanian. We provide text files, associated speech files (WAV, 44.1 kHz, 16-bit, single channel), and annotated text files in CoNLL-U format.

4 papers | 0 benchmarks | Speech, Tabular, Texts

DISRPT2019 (DISRPT2019 shared task on Discourse Unit Segmentation and Connective Detection)

The DISRPT 2019 workshop introduces the first iteration of a cross-formalism shared task on discourse unit segmentation. Since all major discourse parsing frameworks imply a segmentation of texts into segments, learning segmentations for and from diverse resources is a promising area for converging methods and insights. We provide training, development and test datasets from all available languages and treebanks in the RST, SDRT and PDTB formalisms, using a uniform format. Because different corpora, languages and frameworks use different guidelines for segmentation, the shared task is meant to promote design of flexible methods for dealing with various guidelines, and help to push forward the discussion of standards for discourse units. For datasets which have treebanks, we will evaluate in two different scenarios: with and without gold syntax, or otherwise using provided automatic parses for comparison.

4 papers | 0 benchmarks | Speech, Texts

ExVo2022 (ICML ExVo 2022 Workshop & Competition Data)

Baseline code for the three tracks of the ExVo 2022 competition.

4 papers | 0 benchmarks | Speech

ReMASC

We introduce a new database of voice recordings with the goal of supporting research on vulnerabilities and protection of voice-controlled systems. In contrast to prior efforts, the proposed database contains genuine and replayed recordings of voice commands obtained in realistic usage scenarios and using state-of-the-art voice assistant development kits. Specifically, the database contains recordings from four systems (each with a different microphone array) in a variety of environmental conditions with different forms of background noise and relative positions between speaker and device. To the best of our knowledge, this is the first database that has been specifically designed for the protection of voice-controlled systems (VCS) against various forms of replay attacks.

4 papers | 0 benchmarks | Speech

EdAcc (Edinburgh International Accents of English Corpus)

The Edinburgh International Accents of English Corpus (EdAcc) is a new automatic speech recognition (ASR) dataset composed of 40 hours of English dyadic conversations between speakers with a diverse set of accents. EdAcc includes a wide range of first- and second-language varieties of English and a linguistic background profile of each speaker.

4 papers | 0 benchmarks | Speech

SpeechInstruct

SpeechInstruct is a large-scale cross-modal speech instruction dataset. It contains 37,969 quadruplets composed of speech instructions, text instructions, text responses, and speech responses.

4 papers | 0 benchmarks | Speech, Texts

VietMed (VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain)

We introduce a Vietnamese speech recognition dataset in the medical domain comprising 16 hours of labeled medical speech, 1000 hours of unlabeled medical speech and 1200 hours of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world’s largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.

4 papers | 3 benchmarks | Audio, Medical, Speech, Texts

LibriVoxDeEn

LibriVoxDeEn is a corpus of sentence-aligned triples of German audio, German text, and English translation, based on German audiobooks. The speech translation data consist of 110 hours of audio material aligned to over 50k parallel sentences. An even larger dataset comprising 547 hours of German speech aligned to German text is available for speech recognition. The audio data is read speech and thus low in disfluencies.

3 papers | 0 benchmarks | Speech

TaL Corpus (The Tongue and Lips Corpus)

The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of lips. This corpus contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.

3 papers | 0 benchmarks | Audio, Speech, Texts, Videos

EMOVIE

EMOVIE is a Mandarin emotional speech dataset comprising 9,724 samples, each with an audio file and a human-labeled emotion annotation.

3 papers | 0 benchmarks | Speech
