We construct a large-scale conducting motion dataset, named ConductorMotion100, by applying pose estimation to conductor-view videos of concert performance recordings collected from online video platforms. This construction removes the need for expensive motion-capture equipment and makes full use of massive online video resources. As a result, ConductorMotion100 reaches an unprecedented length of 100 hours.
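As a rough illustration of this kind of pipeline, the sketch below (a minimal Python example, not the authors' actual code) reads a video frame by frame and stacks per-frame 2D keypoints into a motion sequence; `pose_estimator` is a stand-in for any off-the-shelf 2D pose model.

```python
import cv2
import numpy as np

def extract_motion(video_path, pose_estimator):
    """Run 2D pose estimation frame by frame and stack the keypoints
    into a (num_frames, num_joints, 2) motion sequence."""
    cap = cv2.VideoCapture(video_path)
    keypoints = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # pose_estimator is a hypothetical callable returning an
        # array of (x, y) joint coordinates for the conductor.
        keypoints.append(pose_estimator(frame))
    cap.release()
    return np.stack(keypoints)
```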
OpenSpeaks Voice: Odia is a large speech dataset in the Odia language of India that is stewarded by Subhashish Panigrahi and hosted at the O Foundation. It currently hosts over 70,000 audio files under a CC0 1.0 Universal Public Domain release. Of these, 66,000, hosted on Wikimedia Commons, contain pronunciations of words and phrases, and the remaining 4,400, hosted on Mozilla Common Voice, contain pronunciations of sentences. The files on Wikimedia Commons were also released in 2023 as four physical media in the form of DVD-ROMs, titled OpenSpeaks Voice: Odia Volume I, OpenSpeaks Voice: Odia Volume II, OpenSpeaks Voice: Balesoria-Odia Volume I, and OpenSpeaks Voice: Balesoria-Odia Volume II. The dataset was built with Free/Libre and Open Source Software, primarily web-based platforms such as Lingua Libre and Common Voice. Other tools used for this project include Kathabhidhana, developed by Panigrahi by forking the Voice Recorder for Tamil Wiktionary by Shrinivasan T, and Spell4Wiki.
3D-Speaker is a large-scale speech corpus designed to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is recorded simultaneously by multiple Devices at different Distances, and some of whom speak multiple Dialects. These controlled combinations of multi-dimensional audio data yield a matrix of diversely entangled speech representations, motivating intriguing methods to disentangle them.
This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Each input consists of two spike-encoded digits sampled uniformly at random from SHD and concatenated in time, with the target being the sum of the two digits (irrespective of language). The train/test split remains the same as SHD's, with the test set consisting of 16k such samples built from the SHD test set.
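A minimal sketch of how such a sample could be assembled, assuming each SHD digit is available as a (spike_times, unit_ids, label) tuple with labels 0-19 covering the English and German digits (this layout is an assumption, not the dataset's actual packaging):

```python
import numpy as np

def make_sum_sample(shd_digits, rng):
    """Draw two SHD digits uniformly at random, concatenate their
    spike trains in time, and label the pair with the digit sum."""
    i, j = rng.integers(len(shd_digits), size=2)
    times_a, units_a, label_a = shd_digits[i]
    times_b, units_b, label_b = shd_digits[j]
    # Shift the second digit's spike times so they follow the first.
    times = np.concatenate([times_a, times_b + times_a.max()])
    units = np.concatenate([units_a, units_b])
    # SHD labels 0-9 are English, 10-19 German; taking mod 10 makes
    # the target the digit sum irrespective of language.
    target = (label_a % 10) + (label_b % 10)
    return times, units, target
```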
The ARTE database currently contains 13 acoustic environments that were recorded with a purpose-built 62-channel microphone array at various locations around Sydney, Australia, and were decoded into the higher-order Ambisonics (HOA) format.
A Quechua Collao corpus for automatic emotion recognition in speech. Audio recordings are provided, alongside CSV files with labels from 4 annotators for valence, arousal, and dominance on a 1-to-5 scale.
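For instance, the annotations could be aggregated into a single valence-arousal-dominance target per clip by averaging the four raters; the file name and column names below are assumptions about the CSV layout, not its documented schema.

```python
import pandas as pd

# Assumed layout: one row per (audio clip, annotator) with 1-5 ratings.
labels = pd.read_csv("annotations.csv")

# Average the four annotators into one VAD target per clip.
vad = (labels
       .groupby("audio_id")[["valence", "arousal", "dominance"]]
       .mean())
print(vad.head())
```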
Synthetic Speech Attribution Dataset.
In this dataset, an upper-torso humanoid robot with a 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake, and Tap.
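The category/object/behavior structure suggests a simple way to index the recordings; the path scheme below is purely hypothetical and only illustrates how that organization could be traversed in code.

```python
BEHAVIORS = ["look", "crush", "grasp", "hold", "lift",
             "drop", "poke", "push", "shake", "tap"]

def recording_path(category, obj, behavior, trial, root="data"):
    """Hypothetical file layout: one recording per
    (category, object, behavior, trial) combination."""
    return f"{root}/{category}/{obj}/{behavior}/trial_{trial}.wav"
```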
The primary data of the SaGA corpus consist of 25 dialogues between interlocutors (50 in total) who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogues, with data only from the direction giver, are available, including audio (.wav) and video (.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9,881 isolated words and 1,764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a "bus ride" through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower. The SaGA Corpus was curated for CLARIN as part of the Curation Project "Editing and Integration of Multimodal Resources in CLARIN-D" by the CLARIN-D Working Group 6.
This dataset contains two types of audio recordings. The first set consists of MEMS microphone responses to acoustic activities (e.g., 19 participants reading provided text in front of the Google Home smart assistant). The second set consists of MEMS microphone responses to photo-acoustic activities (a laser modulated with the audio recordings of the 19 participants and fired at the MEMS microphone of the Google Home smart assistant). A total of 19 students (10 male and 9 female) were enrolled for data collection. All participants were asked to read the following 5 sentences into the microphone: "Hey Google, open the garage door", "Hey Google, close the garage door", "Hey Google, turn the light on", "Hey Google, turn the light off", and "Hey Google, what is the weather today?". Each audio sample was injected into the microphone through a laser, and the response of the microphone was recorded. This method produced a total dataset of 95 acoustic- and 95 laser-induced audio recordings.
A Brazilian Portuguese TTS dataset featuring a female voice recorded in high quality in a controlled environment, with neutral emotion and more than 20 hours of recordings. Our dataset aims to facilitate transfer learning for researchers and developers working on TTS applications: a highly professional neutral female voice can serve as a good warm-up stage for learning language-specific structures, pronunciation, and other non-individual characteristics of speech, leaving further training procedures to learn only the specific adaptations needed (e.g., timbre, emotion, and prosody). This can help enable the accommodation of a more diverse range of female voices in Brazilian Portuguese. By doing so, we also hope to contribute to the development of accessible and high-quality TTS systems for use cases such as virtual assistants, audiobooks, language learning tools, and accessibility solutions.
A database containing high-sampling-rate recordings of a single speaker reading sentences in Brazilian Portuguese with a neutral voice, along with the corresponding text corpus. Intended for speech synthesis and automatic speech recognition applications, the dataset contains text extracted from a popular Brazilian news TV program, totalling roughly 20 hours of audio spoken by a trained individual in a controlled environment. The text was normalized during the recording process, and special textual occurrences (e.g., acronyms, numbers, foreign names) were replaced by readable Portuguese renderings of their pronunciation. There are no noticeable accidental sounds, and background noise has been kept to a minimum in all audio samples.
Manually labelled dataset of bird recordings from the species of interest inhabiting the wetlands of the "Aiguamolls de l'Empordà" natural park in Girona, Spain. The dataset includes 5,795 annotated audio clips generated from a source of 1,098 recordings retrieved from the Xeno-Canto portal, adding up to a total of 201.6 minutes (12,096 seconds) of vocalizations of different lengths, along with their corresponding annotations.
We present a multilingual test set for conducting speech intelligibility tests in the form of diagnostic rhyme tests. The materials currently contain audio recordings in 5 languages, and further extensions are in progress. For Mandarin Chinese, we provide recordings for a consonant contrast test as well as a tonal contrast test. Further information on the audio data, the test procedure, and software to set up a full survey that can be deployed on crowdsourcing platforms is provided in our paper [arXiv preprint] and GitHub repository. We welcome contributions to this open-source project.
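For context, diagnostic rhyme tests are two-alternative forced-choice tasks and are conventionally scored with a correction for guessing; the sketch below shows generic DRT scoring, not this test set's own survey software.

```python
def drt_score(responses):
    """Chance-corrected DRT intelligibility score.

    responses: list of booleans, True when the listener picked the
    word that was actually spoken in the two-alternative choice.
    """
    right = sum(responses)
    wrong = len(responses) - right
    # Standard 2AFC correction: chance performance scores 0,
    # perfect performance scores 100.
    return 100.0 * (right - wrong) / len(responses)
```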
Provided in the linked paper.
The full version of ReefSet used in Williams et al. (2024). This dataset contains strongly labeled audio clips from coral reef habitats, taken across 16 unique datasets from 11 countries. It can be used to evaluate the transfer learning performance of audio embedding models.
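A common way to run such a test is a linear probe on frozen embeddings; the sketch below is one plausible protocol (a scikit-learn probe with binary labels), not necessarily the evaluation used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_probe_auc(X_train, y_train, X_test, y_test):
    """Fit a linear probe on frozen audio embeddings and report
    test ROC-AUC. X_* are (n_clips, dim) embedding arrays from a
    pretrained model; y_* are the clips' strong labels."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)
```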