199 machine learning datasets
A modification of the ShEMO dataset created with the help of an Automatic Speech Recognition (ASR) system.
Greek Parliament Proceedings is a curated dataset of the Greek Parliament Proceedings that extends chronologically from 1989 up to 2020. It consists of more than 1 million speeches with extensive metadata, extracted from 5,355 parliamentary record files.
LibriS2S is a Speech-to-Speech Translation (S2ST) dataset built on top of existing resources. It provides English-German speech and text quadruplets totaling just over 50 hours for each language.
A dataset for studying voice and 3D face structure. It contains about 1.4K identities with their 3D face models and voice data. The 3D face models are fitted from VGGFace images using BFM 3D models, and the voice data are processed from VoxCeleb.
Quechua Collao corpus for automatic emotion recognition in speech. Audio recordings are provided, alongside CSV files with labels from 4 annotators for valence, arousal, and dominance, on a 1-to-5 scale.
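Since the exact CSV schema is not specified, here is a minimal sketch of how the four annotators' 1-to-5 ratings might be collapsed into a single gold label per dimension; all field names and values below are illustrative, not taken from the corpus:

```python
import statistics

# Hypothetical ratings for one audio clip: four annotators, each scoring
# valence/arousal/dominance on the corpus's 1-to-5 scale.
ratings = {
    "valence":   [4, 3, 4, 5],
    "arousal":   [2, 2, 3, 2],
    "dominance": [3, 3, 4, 3],
}

# Collapse the four annotations into one mean label per dimension,
# a common (but not the only) aggregation choice for such corpora.
gold = {dim: statistics.mean(vals) for dim, vals in ratings.items()}
print(gold)
```

Other aggregation schemes (median, or dropping outlier annotators) are equally plausible; the mean is shown only as the simplest option.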
This dataset contains named entities annotations for European Parliament recordings in Dutch, French, German and Spanish. The entity annotation scheme follows OntoNotes v5. The original unannotated dataset is VoxPopuli.
InaGVAD is a Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) dataset designed to represent the acoustic diversity of French TV and radio programs. A detailed description of InaGVAD, together with a benchmark of 6 freely available VAD systems and 3 SGS systems, is provided in a paper presented at LREC-COLING 2024.
This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset.
This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset.
This is the throat microphone (laryngophone) variant of the VibraVox dataset.
This is the forehead accelerometer variant of the VibraVox dataset.
This is the temple vibration pickup variant of the VibraVox dataset.
This is the reference headset microphone variant of the VibraVox dataset.
100 samples each of synthetic speech generated by 9 modern TTS systems, all conditioned on the same subset of speaker-text pairs.
VedantaNY-10M is a curated dataset of over 750 hours of transcripts from public discourses on the Indian philosophy of Advaita Vedanta. Sourced from 612 YouTube lectures by Swami Sarvapriyananda of the Vedanta Society of New York (VSNY), the dataset contains ~10 million tokens. These lectures offer a comprehensive exposition of Advaita Vedanta, making the dataset an invaluable resource for philosophy and linguistics research.
CAS-VSR-S101 is a new large-scale, in-the-wild Mandarin dataset with 101.1 hours of data. The videos are sourced from broadcast news and conversational programs in Chinese, covering a highly diverse set of topics, speakers, and filming conditions. Utterance lengths are naturally distributed between 0.01s and 10.57s, and image quality and resolution vary. News accounts for 82.4% of the programs. 70.4% of the utterances depict news anchors, hosts, and correspondents, while 29.6% are those of interviewees and guests. Male and female appearances are relatively balanced, at a ratio of approximately 1.5 : 1. The data is divided into train, validation, and test sets by TV channel to minimize speaker overlap, at a ratio of roughly 8 : 1 : 1.5 in terms of duration; the validation and test sets are composed of programs broadcast on provincial TV channels. The dataset is available for academic use under a license.
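The stated 8 : 1 : 1.5 duration ratio can be turned into rough split sizes. This is only a back-of-the-envelope estimate derived from the figures above, not the dataset's official split durations:

```python
# Approximate split durations implied by the 8 : 1 : 1.5 ratio
# applied to the 101.1 total hours (actual splits may differ slightly).
total_hours = 101.1
ratio = {"train": 8, "validation": 1, "test": 1.5}

parts = sum(ratio.values())  # 10.5 ratio units in total
hours = {split: round(total_hours * r / parts, 1) for split, r in ratio.items()}
print(hours)
```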
Many research articles have explored the impact of surgical interventions on voice and speech evaluations, but advances are limited by the lack of publicly accessible datasets. To address this, a comprehensive corpus of 107 Spanish Castilian speakers was recorded, including control speakers and patients who underwent upper airway surgeries such as tonsillectomy, functional endoscopic sinus surgery, and septoplasty. The dataset contains 3,800 audio files, averaging 35.51 ± 5.91 recordings per patient. This resource enables systematic investigation of the effects of upper respiratory tract surgery on voice and speech. Previous studies using this corpus have shown no relevant changes in key acoustic parameters for sustained vowel phonation, consistent with initial hypotheses. However, the analysis of speech recordings, particularly nasalised segments, remains open for further research. Additionally, this dataset facilitates the study of the impact of upper airway surgery on speaker recognition.
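As a quick consistency check on the figures above, the reported mean of 35.51 recordings per patient follows directly from 3,800 files over the 107 speakers (assuming the mean is taken over all speakers, controls included):

```python
# Consistency check: 3,800 recordings across 107 speakers should
# reproduce the reported mean of 35.51 recordings per speaker.
total_recordings = 3800
num_speakers = 107

mean_per_speaker = total_recordings / num_speakers
print(round(mean_per_speaker, 2))  # matches the reported 35.51
```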
The United-Syn-Med dataset is a specialized medical speech dataset designed to evaluate and improve Automatic Speech Recognition (ASR) systems within the healthcare domain. It comprises English medical speech recordings, with a particular focus on medical terminology and clinical conversations. The dataset is well-suited for various ASR tasks, including speech recognition, transcription, and classification, facilitating the development of models tailored for medical contexts.
Speech recognition dataset for the Oromo language. Key features of Sagalee: 100 hours of read speech; 283 gender-balanced speakers; coverage of multiple Oromo dialects; open source for research.