Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset spanning 15 categories and covering a wide range of topics from fiction to religion. It consists of over 55,000 sentence-level samples.
This dataset is designed for training simple machine learning models for educational and research purposes in the speech recognition domain, mainly keyword spotting.
The MediBeng dataset contains synthetic code-switched Bengali-English dialogues for training automatic speech recognition (ASR), text-to-speech (TTS), and machine translation models in clinical settings. The dataset is available under the CC-BY-4.0 license.
The Watch Your Mouth dataset is a custom silent speech dataset consisting of depth-only recordings of users silently mouthing full English sentences, captured using consumer-grade depth cameras such as the iPhone TrueDepth sensor. Sentences were carefully curated to cover diverse visemic and phonetic patterns, supporting the development of models that generalize across varied speech content. Each sentence-level utterance provides a temporally aligned depth sequence and the corresponding ground-truth text. For more details, see the paper Watch Your Mouth: Silent Speech Recognition with Depth Sensing.
The TED VCR Video Retrieval Dataset is a multimodal collection derived from publicly available TED Talks. It contains thousands of talks filtered to retain only those with meaningful topic labels, producing a long-tail, multi-label taxonomy. For each talk the dataset provides automatic speech recognition transcripts, slide- and scene-level OCR text, and frame-level visual captions: three textual channels used in VCR retrieval experiments. The data are split into 80% train, 10% validation, and 10% test while preserving the original topic distribution, leaving 542 talks as a held-out test set. Two ready-to-download archives accompany the release: 4.2 GB of trimmed MP4 videos with metadata and 1.8 GB of pre-computed CLIP and Whisper embeddings, both shared under the non-commercial CC BY-NC-ND 4.0 license.
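The release does not spell out how the topic-preserving split was made; below is a minimal sketch of one plausible approach, stratifying on each talk's primary topic (the `talks` list and its `topics` field are assumptions for illustration, not the release's actual script).

```python
import random
from collections import defaultdict

def stratified_split(talks, seed=0, fracs=(0.8, 0.1, 0.1)):
    """Approximate topic-preserving 80/10/10 split: group talks by their
    primary topic, shuffle within each group, and slice proportionally.
    A sketch of one plausible approach, not the release's actual script."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for talk in talks:
        groups[talk["topics"][0]].append(talk)  # key on the primary topic
    train, val, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        a = int(fracs[0] * len(group))
        b = int((fracs[0] + fracs[1]) * len(group))
        train += group[:a]; val += group[a:b]; test += group[b:]
    return train, val, test
```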
Uses the same clean speech as VoiceBank+DEMAND but adds more noise types. It features much lower SNRs ([−10, −5, 0, 5, 10, 15, 20] dB) than VoiceBank+DEMAND and includes babble and speech-shaped noise in the train, validation, and test sets.
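An SNR level fixes the gain applied to the noise before mixing: SNR_dB = 10·log10(P_clean / P_noise). A minimal mixing sketch under that standard definition (the array names are placeholders; this is not the corpus' official mixing script):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then mix."""
    if len(noise) < len(clean):                       # loop short noise clips
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # snr_db = 10 * log10(p_clean / (gain**2 * p_noise))  =>  solve for gain
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(48_000)   # placeholder: 1 s of "speech" at 48 kHz
noise = rng.standard_normal(16_000)   # placeholder noise clip
for snr in [-10, -5, 0, 5, 10, 15, 20]:
    noisy = mix_at_snr(clean, noise, snr)
```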
The Persian Consonant Vowel Combination (PCVC) dataset is a phoneme-based speech dataset and the first free Persian speech dataset for Persian speech researchers. It contains 23 Persian consonants and 6 vowels; the sound samples are all possible consonant-vowel combinations (138 samples per speaker), each 30,000 data samples long. All speech samples have a sample rate of 48,000 Hz, i.e., 48,000 samples per second, so each clip lasts 0.625 s. In each sample, the sound starts with the consonant, followed by the vowel, and ends with silence; the length of the silence depends on the length of the consonant-vowel combination. For example, if the combination ends at the 20,000th data sample, the remaining 10,000 samples (up to the 30,000-sample clip length) are silence.
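Given that fixed layout, the silent tail of a clip can be located with a simple short-frame energy heuristic. A sketch under the stated clip length and sample rate (the frame size and detection threshold are illustrative choices, not corpus metadata):

```python
import numpy as np

SAMPLE_RATE = 48_000   # Hz, per the corpus description
CLIP_LEN = 30_000      # samples per clip (0.625 s)

def speech_end(clip: np.ndarray, frame: int = 480, thresh: float = 1e-4) -> int:
    """Estimate the sample index where speech ends and the silent tail
    begins, from short-frame energy. A heuristic, not a corpus annotation."""
    n_frames = len(clip) // frame
    energy = (clip[: n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
    active = np.nonzero(energy > thresh)[0]
    return int((active[-1] + 1) * frame) if active.size else 0

# Synthetic stand-in for a PCVC clip: 20,000 samples of "speech", then silence.
rng = np.random.default_rng(0)
clip = np.zeros(CLIP_LEN)
clip[:20_000] = 0.1 * rng.standard_normal(20_000)
print(speech_end(clip))  # ~20,000: the remaining samples are the silent tail
```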
The Arabic Speech Corpus (1.5 GB) is a Modern Standard Arabic (MSA) speech corpus for speech synthesis. It contains phonetic and orthographic transcriptions of more than 3.7 hours of MSA speech, aligned with the recordings at the phoneme level; the annotations include word stress marks on individual phonemes. The corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton, and was recorded in south Levantine Arabic (Damascene accent) in a professional studio. Speech synthesized from this corpus has produced a high-quality, natural voice.
The CMU Wilderness Multilingual Speech Dataset covers over 700 languages, providing audio, aligned text, and word pronunciations. On average, each language provides around 20 hours of sentence-length transcribed audio.
Deeply Vocal Characterizer is a human nonverbal vocalization dataset. This sample consists of about 0.6 hours (56.7 hours in the full set) of audio (16 kHz, 16-bit, mono) across 16 human nonverbal vocalization classes, including throat clearing, coughing, laughing, and panting. The audio content was crowdsourced from the general public in South Korea.
The Deeply Korean read speech corpus contains recordings of pairs of Korean speakers reading scripts with 3 distinct text sentiments (negative, neutral, positive), each spoken with 3 distinct voice sentiments (negative, neutral, positive). The recordings took place in 3 types of spaces with differing levels of reverberation: an anechoic chamber, a studio apartment, and a dance studio. To examine the effect of microphone distance and recording device, every session was recorded at 3 distinct distances with 2 types of smartphone, an iPhone X and a Galaxy S7.
Deeply Parent-Child Vocal Interaction contains recordings of 24 parent-child pairs (48 speakers in total) interacting: reading fairy tales, singing children's songs, conversing, and more. The recordings took place in 3 types of spaces with differing levels of reverberation: an anechoic chamber, a studio apartment, and a dance studio. To examine the effect of microphone distance and recording device, every session was recorded at 3 distinct distances with 2 types of smartphone, an iPhone X and a Galaxy S7.
BABEL is a multilingual corpus of conversational telephone speech from IARPA, which includes Asian and African languages.
FluencyBank is a shared database for the study of fluency development. Participants include typically-developing monolingual and bilingual children, children and adults who stutter (C/AWS) or who clutter (C/AWC), and second language learners.
Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. It contains 200 hours of speech data from 600 speakers, and the transcription accuracy of each sentence is above 98%.
This MCCS dataset is the first large-scale Mandarin Chinese Cued Speech dataset. It covers 23 major categories of scenarios (e.g., communication, transportation, and shopping) and 72 subcategories (e.g., meeting, dating, and introduction). It was recorded by four skilled native Mandarin Chinese Cued Speech cuers using mobile-phone cameras, at 30 fps and 1280x720 resolution. We provide the raw Cued Speech videos, a text file with 1,000 sentences, and annotations of two kinds: continuous video annotations made with ELAN and discrete audio annotations made with Praat.
We designed an emotional speech database that can be used for emotion recognition as well as for recognition and synthesis of speech with various emotions. The database was built by compiling tweets acquired from Twitter and selecting emotion-dependent tweets with phonetic and prosodic balance in mind. We classified the gathered tweets into four emotions: joy, anger, sadness, and neutral, then selected 50 sentences per emotion with an entropy-based algorithm. Comparing the selected sentence sets with randomly selected sets in terms of phonetic and prosodic balance and sentence length confirmed that the algorithm's sets were more balanced. We then recorded emotional speech based on the selected sentences and evaluated it from the viewpoints of emotion recognition and emotional speech recognition.
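The summary names but does not specify the entropy-based algorithm; one common greedy variant adds, at each step, the sentence that most increases the entropy of the pooled unit distribution. An illustrative sketch under that assumption, using characters as a stand-in for phonetic units:

```python
import math
from collections import Counter

def greedy_entropy_select(sentences: list[str], k: int) -> list[str]:
    """Greedily pick k sentences that maximize the entropy of the pooled
    character distribution. A guess at the flavor of 'entropy-based
    algorithm', not the paper's actual procedure."""
    pool: Counter = Counter()
    chosen: list[str] = []
    remaining = list(sentences)
    for _ in range(min(k, len(remaining))):
        def entropy_with(s: str) -> float:
            counts = pool + Counter(s)
            total = sum(counts.values())
            return -sum(n / total * math.log2(n / total) for n in counts.values())
        best = max(remaining, key=entropy_with)  # biggest entropy gain
        remaining.remove(best)
        pool += Counter(best)
        chosen.append(best)
    return chosen

print(greedy_entropy_select(["a happy day", "so sad", "angry words", "plain"], 2))
```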
The Storytelling Video Dataset is a high-quality, human-reviewed multimodal dataset featuring over 700 full-body video recordings of native Russian speakers. Each video is 10+ minutes long and includes synchronized speech, facial expressions, gestures, and emotional variation, making it well suited to multimodal research and development.