Auto-KWS is a dataset for customized keyword spotting, the task of detecting spoken keywords. The dataset closely resembles real-world scenarios: each recorder is assigned a unique wake-up word and can freely choose their recording environment and a familiar dialect.
The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet-loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of the overall quality and the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, it contains more than 97,000 human ratings for each of the dimensions and the overall MOS.
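Subjective ratings of this kind are typically aggregated into a mean opinion score (MOS) per file, i.e. the arithmetic mean of the individual ratings. A minimal sketch of that aggregation; the field names and rating values below are illustrative, not the NISQA Corpus's actual schema:

```python
import statistics

# Hypothetical per-file ratings on the usual 1-5 absolute category rating
# scale; the dimension names mirror the corpus description, but the values
# and dict layout are made up for illustration.
ratings = {
    "overall": [4, 5, 3, 4, 4],
    "noisiness": [3, 3, 4, 3, 3],
    "coloration": [4, 4, 5, 4, 4],
    "discontinuity": [5, 5, 4, 5, 5],
    "loudness": [4, 3, 4, 4, 4],
}

# MOS is simply the mean of the subjective ratings, per dimension.
mos = {dim: statistics.mean(vals) for dim, vals in ratings.items()}
print(mos["overall"])  # → 4
```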
CSI is a conversational dataset for speaker identification built from the CSI television crime show. The authors collected transcripts of 39 episodes and video/audio of 4 episodes. Each episode involves on average more than 30 speakers. Utterances last on average 3 to 4 seconds, and there are around 45 to 50 distinct scenes/conversations per episode.
This dataset is a new variant of the voice cloning toolkit (VCTK) dataset: device-recorded VCTK (DR-VCTK), where the high-quality speech signals recorded in a semi-anechoic chamber using professional audio devices are played back and re-recorded in office environments using relatively inexpensive consumer devices.
Voice Navigation is a large-scale dataset of Chinese speech for slot filling, containing more than 830,000 samples.
This Sanskrit speech corpus has more than 78 hours of audio data and contains recordings of 45,953 sentences with a sampling rate of 22 kHz. The content is mainly readings of texts spanning various Śāstras of Saṃskṛtam literature, and also includes contemporary stories, radio programs, extempore discourse, etc.
CrowdSpeech is a publicly available large-scale dataset of crowdsourced audio transcriptions. It contains annotations for more than 20 hours of English speech from more than 1,000 crowd workers.
The Varied Emotion in Syntactically Uniform Speech (VESUS) repository is a lexically controlled database collected by the NSA lab. Here, actors read a semantically neutral script of words, phrases, and sentences with different emotional inflections. VESUS contains 252 distinct phrases, each read by 10 actors in 5 emotional states (neutral, angry, happy, sad, fearful).
The dataset is a private dataset collected for automatic analysis of psychological distress. It contains self-reported distress labels provided by human volunteers. The dataset consists of 30-min interview recordings of participants.
The ICASSP 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in the area of acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report good performance on synthetic datasets where the train and test samples come from the same underlying distribution. However, the AEC performance often degrades significantly on real recordings. Also, most of the conventional objective metrics such as echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ) do not correlate well with subjective speech quality tests in the presence of background noise and reverberation found in realistic environments. In this challenge, we open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 2,500 real audio devices and human speakers in real environments.
The INTERSPEECH 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in the area of acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report reasonable performance on synthetic datasets where the train and test samples come from the same underlying distribution. However, the AEC performance often degrades significantly on real recordings. Also, most of the conventional objective metrics such as echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ) do not correlate well with subjective speech quality tests in the presence of background noise and reverberation found in realistic environments. In this challenge, we open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 5,000 real audio devices and human speakers in real environments.
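ERLE, mentioned in both challenge descriptions, has a standard definition: the ratio, in dB, of the power of the microphone signal containing the echo to the power of the residual signal after cancellation. A minimal sketch, assuming the two signals are available as NumPy arrays (the toy signals below are synthetic, not challenge data):

```python
import numpy as np

def erle_db(mic: np.ndarray, residual: np.ndarray) -> float:
    """Echo return loss enhancement in dB: how much echo power the
    canceller removed. Higher is better."""
    return 10.0 * np.log10(np.mean(mic ** 2) / np.mean(residual ** 2))

# Toy single-talk example: an (idealised) canceller attenuates the echo by a
# factor of 10 in amplitude, i.e. 100x in power, so ERLE should be 20 dB.
rng = np.random.default_rng(0)
mic = rng.standard_normal(16000)  # microphone signal (far-end echo only)
residual = 0.1 * mic              # residual after cancellation
print(round(erle_db(mic, residual), 1))  # → 20.0
```

Note that, as the descriptions point out, a high ERLE on synthetic signals like these does not guarantee good subjective quality on real recordings.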
We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking process. A simple automatic examination was also implemented to further improve the quality of the labels. Evaluation results from two state-of-the-art SMAD systems are also provided as a benchmark for future reference.
CLIPS (Corpora e Lessici dell'Italiano Parlato e Scritto, i.e. Corpora and Lexicons of Spoken and Written Italian) is one of the eight projects (Project no. 2) of Cluster C18 "COMPUTATIONAL LINGUISTICS: MONOLINGUAL AND MULTILINGUAL RESEARCH" (Law 488), funded by the Italian Ministry of Education, University and Research (MIUR).
EmoSpeech contains keywords with diverse emotions and background sounds, presented to explore new challenges in audio analysis.
Dubbed series have gained considerable popularity in recent years, with strong support from major media service providers. This popularity is fueled by studies showing that dubbed versions of TV shows are more popular than their subtitled equivalents.
Data collection was conducted by asking some adults from social media and some students from an elementary school to participate in our experiment. Table 1 shows the number of samples gathered for recognizing each color. Because Persian uses two words for black, there are more black samples than for other colors. In addition, because color recognition is a RAN task, sequences of data were gathered; Table 2 shows the number of sequence samples per color. For the meaningless words, 12 recordings were gathered on average for each word (there are 40 meaningless words in this task).
Here we release the dataset (Multi_Channel_Grid, abbreviated as MC_Grid) used in our paper "LiMuSE: Lightweight Multi-Modal Speaker Extraction".
WHAMR_ext is an extension of the WHAMR corpus with larger RT60 values (between 1 s and 3 s).
The EVI dataset is a challenging, multilingual spoken-dialogue dataset with 5,506 dialogues in English, Polish, and French. The dataset can be used to develop and benchmark conversational systems for user authentication tasks, i.e. speaker enrolment (E), speaker verification (V), speaker identification (I).
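The verification task (V) of the kind EVI targets is commonly scored by comparing a fixed-size speaker embedding of the probe utterance against the enrolled embedding with cosine similarity and a decision threshold. A minimal sketch of that decision rule; the embeddings and threshold below are illustrative, not part of the dataset:

```python
import numpy as np

def verify(enrolled: np.ndarray, probe: np.ndarray, threshold: float = 0.7) -> bool:
    """Accept the probe speaker if the cosine similarity between the probe
    embedding and the enrolled embedding exceeds the threshold."""
    cos = np.dot(enrolled, probe) / (np.linalg.norm(enrolled) * np.linalg.norm(probe))
    return bool(cos > threshold)

# Toy 3-dim embeddings (real systems use hundreds of dimensions): a probe
# close to the enrolment passes, a distant one is rejected.
enrolled = np.array([0.9, 0.1, 0.4])
same = np.array([0.85, 0.15, 0.45])
other = np.array([-0.2, 0.9, 0.1])
print(verify(enrolled, same), verify(enrolled, other))  # → True False
```

Enrolment (E) then amounts to storing the embedding, and identification (I) to taking the argmax of the same similarity over a set of enrolled speakers.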
Taiwanese Across Taiwan (TAT) is a large-scale database of native Taiwanese article/reading speech collected across Taiwan. The corpus contains native Taiwanese speech in various accents and is annotated twice for use in speech recognition research. It contains recordings from 100 native speakers, each about 30 minutes in length, making a total of 100 hours of speech data.