199 machine learning datasets
A modification of the ShEMO dataset created with the help of an Automatic Speech Recognition (ASR) system.
Greek Parliament Proceedings is a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 up to 2020. It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files.
LibriS2S is a Speech-to-Speech Translation (S2ST) dataset built on top of existing resources. It provides English-German speech and text quadruplets totaling just over 50 hours for each language.
A dataset for studying voice and 3D face structure. It contains about 1.4K identities with their 3D face models and voice data. The 3D face models are fitted from VGGFace images using BFM 3D models, and the voice data are processed from VoxCeleb.
Quechua Collao corpus for automatic emotion recognition in speech. Audio recordings are provided, alongside CSV files with labels from 4 annotators for valence, arousal, and dominance, on a 1-to-5 scale.
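Since the exact CSV schema is not specified, here is a minimal sketch of how the four annotators' 1-to-5 ratings might be collapsed into a single gold label per dimension; all field names and values below are illustrative, not taken from the corpus:

```python
import statistics

# Hypothetical ratings for one audio clip: four annotators, each scoring
# valence/arousal/dominance on the corpus's 1-to-5 scale.
ratings = {
    "valence":   [4, 3, 4, 5],
    "arousal":   [2, 2, 3, 2],
    "dominance": [3, 3, 4, 3],
}

# Collapse the four annotations into one mean label per dimension,
# a common (but not the only) aggregation choice for such corpora.
gold = {dim: statistics.mean(vals) for dim, vals in ratings.items()}
print(gold)
```

Other aggregation schemes (median, or dropping outlier annotators) are equally plausible; the mean is shown only as the simplest option.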
This dataset contains named entities annotations for European Parliament recordings in Dutch, French, German and Spanish. The entity annotation scheme follows OntoNotes v5. The original unannotated dataset is VoxPopuli.
InaGVAD is a Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) dataset designed to represent the acoustic diversity of French TV and radio programs. A detailed description of InaGVAD, together with a benchmark of 6 freely available VAD systems and 3 SGS systems, is provided in a paper presented at LREC-COLING 2024.
This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset.
This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset.
This is the throat microphone (laryngophone) variant of the VibraVox dataset.
This is the forehead accelerometer variant of the VibraVox dataset.
This is the temple vibration pickup variant of the VibraVox dataset.
This is the reference headset microphone variant of the VibraVox dataset.
100 samples each of synthetic speech generated by 9 modern TTS systems, all conditioned on the same subset of speaker-text pairs.
VedantaNY-10M is a curated dataset of over 750 hours of transcripts from public discourses on the Indian philosophy of Advaita Vedanta. Sourced from 612 YouTube lectures by Swami Sarvapriyananda of the Vedanta Society of New York (VSNY), the dataset contains ~10 million tokens. These lectures offer a comprehensive exposition of Advaita Vedanta, making the dataset an invaluable resource for philosophy and linguistics research.
CAS-VSR-S101 is a new large-scale, in-the-wild Mandarin dataset with 101.1 hours of data. The videos are sourced from broadcast news and conversational programs in Chinese, covering a highly diverse set of topics, speakers, and filming conditions. Utterance lengths are naturally distributed between 0.01s and 10.57s, and image quality and resolution vary. News accounts for 82.4% of the programs. 70.4% of the utterances depict news anchors, hosts, and correspondents, while 29.6% are those of interviewees and guests. Male and female appearances are relatively balanced, at a ratio of approximately 1.5 : 1. The data is divided into train, validation, and test sets by TV channel to minimize speaker overlap, at a ratio of roughly 8 : 1 : 1.5 in terms of duration; the validation and test sets are composed of programs broadcast on provincial TV channels. The dataset is available for academic use under a license.
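The stated 8 : 1 : 1.5 duration ratio can be turned into rough split sizes. This is only a back-of-the-envelope estimate derived from the figures above, not the dataset's official split durations:

```python
# Approximate split durations implied by the 8 : 1 : 1.5 ratio
# applied to the 101.1 total hours (actual splits may differ slightly).
total_hours = 101.1
ratio = {"train": 8, "validation": 1, "test": 1.5}

parts = sum(ratio.values())  # 10.5 ratio units in total
hours = {split: round(total_hours * r / parts, 1) for split, r in ratio.items()}
print(hours)
```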
Many research articles have explored the impact of surgical interventions on voice and speech evaluations, but advances are limited by the lack of publicly accessible datasets. To address this, a comprehensive corpus of 107 Spanish Castilian speakers was recorded, including control speakers and patients who underwent upper airway surgeries such as tonsillectomy, functional endoscopic sinus surgery, and septoplasty. The dataset contains 3,800 audio files, averaging 35.51 ± 5.91 recordings per patient. This resource enables systematic investigation of the effects of upper respiratory tract surgery on voice and speech. Previous studies using this corpus have shown no relevant changes in key acoustic parameters for sustained vowel phonation, consistent with initial hypotheses. However, the analysis of speech recordings, particularly nasalised segments, remains open for further research. Additionally, this dataset facilitates the study of the impact of upper airway surgery on speaker recognition.
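As a quick consistency check on the figures above, the reported mean of 35.51 recordings per patient follows directly from 3,800 files over the 107 speakers (assuming the mean is taken over all speakers, controls included):

```python
# Consistency check: 3,800 recordings across 107 speakers should
# reproduce the reported mean of 35.51 recordings per speaker.
total_recordings = 3800
num_speakers = 107

mean_per_speaker = total_recordings / num_speakers
print(round(mean_per_speaker, 2))  # matches the reported 35.51
```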
The United-Syn-Med dataset is a specialized medical speech dataset designed to evaluate and improve Automatic Speech Recognition (ASR) systems within the healthcare domain. It comprises English medical speech recordings, with a particular focus on medical terminology and clinical conversations. The dataset is well-suited for various ASR tasks, including speech recognition, transcription, and classification, facilitating the development of models tailored for medical contexts.
Speech recognition dataset for the Oromo language. Key features of Sagalee: 100 hours of read speech; 283 gender-balanced speakers; coverage of multiple Oromo dialects; open source for research.