Datasets

199 machine learning datasets

199 dataset results

GigaST

GigaST is a large-scale pseudo speech translation (ST) corpus. The corpus was created by translating the text in GigaSpeech, an English ASR corpus, into German and Chinese. The training set is translated by a strong machine translation system and the test set was translated by human. ST models trained with an addition of the corpus obtain new state-of-the-art results on the MuST-C English-German benchmark test set.

8 papers0 benchmarksSpeech

Europarl-ASR

Europarl-ASR (EN) is a 1300-hour English-language speech and text corpus of parliamentary debates for (streaming) Automatic Speech Recognition training and benchmarking, speech data filtering and speech data verbatimization, based on European Parliament speeches and their official transcripts (1996-2020). Includes dev-test sets for streaming ASR benchmarking, made up of 18 hours of manually revised speeches. The availability of manual non-verbatim and verbatim transcripts for dev-test speeches makes this corpus also useful for the assessment of automatic filtering and verbatimization techniques. The corpus is released under an open licence at https://www.mllp.upv.es/europarl-asr/

8 papers0 benchmarksSpeech

Emilia Dataset (An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation)

Recent advancements in speech generation models have been significantly driven by the use of large-scale training data. However, producing highly spontaneous, human-like speech remains a challenge due to the scarcity of large, diverse, and spontaneous speech datasets. In response, we introduce Emilia, the first large-scale, multilingual, and diverse speech generation dataset. Emilia starts with over 101k hours of speech across six languages, covering a wide range of speaking styles to enable more natural and spontaneous speech generation. To facilitate the scale-up of Emilia, we also present Emilia-Pipe, the first open-source preprocessing pipeline designed to efficiently transform raw, in-the-wild speech data into high-quality training data with speech annotations. Experimental results demonstrate the effectiveness of both Emilia and Emilia-Pipe. Demos are available at: https://emilia-dataset.github.io/Emilia-Demo-Page/.

8 papers0 benchmarksSpeech

EasyCall

EasyCall is a new dysarthric speech command dataset in Italian. The dataset consists of 21386 audio recordings from 24 healthy and 31 dysarthric speakers, whose individual degree of speech impairment was assessed by neurologists through the Therapy Outcome Measure. The corpus aims at providing a resource for the development of ASR-based assistive technologies for patients with dysarthria. In particular, it may be exploited to develop a voice-controlled contact application for commercial smartphones, aiming at improving dysarthric patients' ability to communicate with their family and caregivers. Before recording the dataset, participants were administered a survey to evaluate which commands are more likely to be employed by dysarthric individuals in a voice-controlled contact application. In addition, the dataset includes a list of non-commands (i.e., words near/inside commands or phonetically close to commands) that can be leveraged to build a more robust command recognition system.

7 papers0 benchmarksSpeech

VoicePrivacy 2020

VoicePrivacy 2020 is a dataset for developing anonymization solutions for speech technology. It is built from subsets of existing datasets such as: LibriSpeech, LibriTTS, VoxCeleb1, VoxCeleb2 and VCTK.

7 papers0 benchmarksSpeech

The People’s Speech

The People's Speech is a free-to-download 30,000-hour and growing supervised conversational English speech recognition dataset licensed for academic and commercial usage under CC-BY-SA (with a CC-BY subset). The data is collected via searching the Internet for appropriately licensed audio data with existing transcriptions.

7 papers0 benchmarksSpeech

SingFake (SingFake: Singing Voice Deepfake Detection)

The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these sys

7 papers0 benchmarksAudio, Music, Speech

Interview

A large-scale (105K conversations) media dialog dataset collected from news interview transcripts.

6 papers0 benchmarksSpeech

Libri-adhoc40

Libri-adhoc40 is a synchronized speech corpus which collects the replayed Librispeech data from loudspeakers by ad-hoc microphone arrays of 40 strongly synchronized distributed nodes in a real office environment. Besides, to provide the evaluation target for speech frontend processing and other applications, the authors also recorded the replayed speech in an anechoic chamber.

6 papers0 benchmarksSpeech

FMFCC-A

FMFCC-A is a large publicly-available Mandarin dataset for synthetic speech detection, which contains 40,000 synthesized Mandarin utterances that generated by 11 Mandarin TTS systems and two Mandarin VC systems, and 10,000 genuine Mandarin utterance collected from 58 speakers. The FMFCCA dataset is divided into the training, development and evaluation sets, which are used for the research of detection of synthesised Mandarin speech under various previously unknown speech synthesis systems or audio post-processing operations.

6 papers0 benchmarksSpeech

ASCEND

ASCEND (A Spontaneous Chinese-English Dataset) introduces a high-quality resource of spontaneous multi-turn conversational dialogue Chinese code-switching corpus collected in Hong Kong. ASCEND includes 23 bilinguals that are fluent in both Chinese and English and consists of 10.62 hours clean speech corpus.

6 papers0 benchmarksAudio, Speech

Common Phone

Common Phone is a gender-balanced, multilingual corpus recorded from more than 76.000 contributors via Mozilla's Common Voice project. It comprises around 116 hours of speech enriched with automatically generated phonetic segmentation.

6 papers0 benchmarksSpeech

SpeechMatrix

SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations mined from real speech of European Parliament recordings. It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.

6 papers0 benchmarksSpeech

Open Images V7

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. A subset of 1.9M includes diverse annotations types.

6 papers0 benchmarksImages, Speech, Texts

GTSinger (GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks)

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality songs, forming the largest recorded singing dataset; (2) 20 professional singers across nine languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompa

6 papers0 benchmarksAudio, Music, Speech, Texts

NISP (NITK-IISc Multilingual Multi-accent Speaker Profiling)

This dataset contains speech recordings along with speaker physical parameters (height, weight, shoulder size, age ) as well as regional information and linguistic information.

5 papers0 benchmarksSpeech

ClovaCall

ClovaCall is a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people. The raw dataset of ClovaCall includes approximately 112,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain.

5 papers0 benchmarksSpeech

Libri-Adapt

Libri-Adapt aims to support unsupervised domain adaptation research on speech recognition models.

5 papers0 benchmarksSpeech

VoxClamantis

A large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.

5 papers0 benchmarksSpeech

Timers and Such

Timers and Such is an open source dataset of spoken English commands for common voice control use cases involving numbers. The dataset has four intents, corresponding to four common offline voice assistant uses: SetTimer, SetAlarm, SimpleMath, and UnitConversion. The semantic label for each utterance is a dictionary with the intent and a number of slots.

5 papers3 benchmarksSpeech

PreviousPage 4 of 10Next