199 machine learning datasets
199 dataset results
GigaST is a large-scale pseudo speech translation (ST) corpus. The corpus was created by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set was translated by human. ST models trained with an addition of the corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set.
Europarl-ASR (EN) is a 1300-hour English-language speech and text corpus of parliamentary debates for (streaming) Automatic Speech Recognition training and benchmarking, speech data filtering and speech data verbatimization, based on European Parliament speeches and their official transcripts (1996-2020). Includes dev-test sets for streaming ASR benchmarking, made up of 18 hours of manually revised speeches. The availability of manual non-verbatim and verbatim transcripts for dev-test speeches makes this corpus also useful for the assessment of automatic filtering and verbatimization techniques. The corpus is released under an open licence at https://www.mllp.upv.es/europarl-asr/
Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.
EasyCall is a new dysarthric speech command dataset in Italian. The dataset consists of 21386 audio recordings from 24 healthy and 31 dysarthric speakers, whose individual degree of speech impairment was assessed by neurologists through the Therapy Outcome Measure. The corpus aims at providing a resource for the development of ASR-based assistive technologies for patients with dysarthria. In particular, it may be exploited to develop a voice-controlled contact application for commercial smartphones, aiming at improving dysarthric patients' ability to communicate with their family and caregivers. Before recording the dataset, participants were administered a survey to evaluate which commands are more likely to be employed by dysarthric individuals in a voice-controlled contact application. In addition, the dataset includes a list of non-commands (i.e., words near/inside commands or phonetically close to commands) that can be leveraged to build a more robust command recognition system.
VoicePrivacy 2020 is a dataset for developing anonymization solutions for speech technology. It is built from subsets of existing datasets such as: LibriSpeech, LibriTTS, VoxCeleb1, VoxCeleb2 and VCTK.
The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions.
The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these sys
A large-scale (105K conversations) media dialog dataset collected from news interview transcripts.
Libri-adhoc40 is a synchronized speech corpus which collects the replayed Librispeech data from loudspeakers by ad-hoc microphone arrays of 40 strongly synchronized distributed nodes in a real office environment. Besides, to provide the evaluation target for speech frontend processing and other applications, the authors also recorded the replayed speech in an anechoic chamber.
FMFCC-A is a large publicly-available Mandarin dataset for synthetic speech detection, which contains 40,000 synthesized Mandarin utterances that generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterance collected from 58 speakers. The FMFCCA dataset is divided into the training, development and evaluation sets, which are used for the research of detection of synthesised Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations.
ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese code-switching corpus collected in Hong Kong. ASCEND includes 23 bilinguals that are fluent in both Chinese and English and consists of 10.62 hours clean speech corpus.
Common Phone is a gender-balanced, multilingual corpus recorded from more than 76.000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation.
SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M includes diverse annotations types.
The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality songs, forming the largest recorded singing dataset; (2) 20 professional singers across nine languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompa
This dataset contains speech recordings along with speaker physical parameters (height, weight, shoulder size, age ) as well as regional information and linguistic information.
ClovaCall is a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people. The raw dataset of ClovaCall includes approximately 112,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain.
Libri-Adapt aims to support unsupervised domain adaptation research on speech recognition models.
A large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. The dataset has four intents, corresponding to four common offline voice assistant uses: SetTimer, SetAlarm, SimpleMath, and UnitConversion. The semantic label for each utterance is a dictionary with the intent and a number of slots.