We construct a large-scale conducting motion dataset, named ConductorMotion100, by applying pose estimation to conductor-view videos of concert performance recordings collected from online video platforms. This construction removes the need for expensive motion-capture equipment and makes full use of massive online video resources. As a result, ConductorMotion100 reaches an unprecedented length of 100 hours.
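As a rough illustration of this kind of pipeline, the sketch below (a minimal Python example, not the authors' actual code) reads a video frame by frame and stacks per-frame 2D keypoints into a motion sequence; `pose_estimator` is a stand-in for any off-the-shelf 2D pose model.

```python
import cv2
import numpy as np

def extract_motion(video_path, pose_estimator):
    """Run 2D pose estimation frame by frame and stack the keypoints
    into a (num_frames, num_joints, 2) motion sequence."""
    cap = cv2.VideoCapture(video_path)
    keypoints = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # pose_estimator is a hypothetical callable returning an
        # array of (x, y) joint coordinates for the conductor.
        keypoints.append(pose_estimator(frame))
    cap.release()
    return np.stack(keypoints)
```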
OpenSpeaks Voice: Odia is a large speech dataset in the Odia language of India that is stewarded by Subhashish Panigrahi and hosted at the O Foundation. It currently hosts over 70,000 audio files under a CC0 1.0 Universal Public Domain release. Of these, 66,000, hosted on Wikimedia Commons, contain pronunciations of words and phrases, and the remaining 4,400, hosted on Mozilla Common Voice, contain pronunciations of sentences. The files on Wikimedia Commons were also released in 2023 as four physical media in the form of DVD-ROMs, titled OpenSpeaks Voice: Odia Volume I, OpenSpeaks Voice: Odia Volume II, OpenSpeaks Voice: Balesoria-Odia Volume I, and OpenSpeaks Voice: Balesoria-Odia Volume II. The dataset was built with Free/Libre and Open Source Software, primarily web-based platforms such as Lingua Libre and Common Voice. Other tools used for this project include Kathabhidhana, developed by Panigrahi by forking the Voice Recorder for Tamil Wiktionary by Shrinivasan T, and Spell4Wiki.
3D-Speaker is a large-scale speech corpus designed to facilitate research on speech representation disentanglement. 3D-Speaker contains over 10,000 speakers, each of whom is recorded simultaneously by multiple Devices at different Distances, and some of whom speak multiple Dialects. These controlled combinations of multi-dimensional audio data yield a matrix of diversely entangled speech representations, motivating intriguing methods to disentangle them.
This dataset is based on the Spiking Heidelberg Digits (SHD) dataset. Each input consists of two spike-encoded digits sampled uniformly at random from SHD and concatenated in time, with the target being the sum of the two digits (irrespective of language). The train/test split remains the same as SHD's, with the test set consisting of 16k such samples built from the SHD test set.
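A minimal sketch of how such a sample could be assembled, assuming each SHD digit is available as a (spike_times, unit_ids, label) tuple with labels 0-19 covering the English and German digits (this layout is an assumption, not the dataset's actual packaging):

```python
import numpy as np

def make_sum_sample(shd_digits, rng):
    """Draw two SHD digits uniformly at random, concatenate their
    spike trains in time, and label the pair with the digit sum."""
    i, j = rng.integers(len(shd_digits), size=2)
    times_a, units_a, label_a = shd_digits[i]
    times_b, units_b, label_b = shd_digits[j]
    # Shift the second digit's spike times so they follow the first.
    times = np.concatenate([times_a, times_b + times_a.max()])
    units = np.concatenate([units_a, units_b])
    # SHD labels 0-9 are English, 10-19 German; taking mod 10 makes
    # the target the digit sum irrespective of language.
    target = (label_a % 10) + (label_b % 10)
    return times, units, target
```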
The ARTE database currently contains 13 acoustic environments that were recorded with a purpose-built 62-channel microphone array at various locations around Sydney, Australia, and were decoded into the higher-order Ambisonics (HOA) format.
A Quechua Collao corpus for automatic emotion recognition in speech. Audio recordings are provided, alongside CSV files with labels from 4 annotators for valence, arousal, and dominance on a 1-to-5 scale.
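For instance, the annotations could be aggregated into a single valence-arousal-dominance target per clip by averaging the four raters; the file name and column names below are assumptions about the CSV layout, not its documented schema.

```python
import pandas as pd

# Assumed layout: one row per (audio clip, annotator) with 1-5 ratings.
labels = pd.read_csv("annotations.csv")

# Average the four annotators into one VAD target per clip.
vad = (labels
       .groupby("audio_id")[["valence", "arousal", "dominance"]]
       .mean())
print(vad.head())
```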
Synthetic Speech Attribution Dataset.
In this dataset, an upper-torso humanoid robot with a 7-DOF arm explored 100 different objects belonging to 20 different categories using 10 behaviors: Look, Crush, Grasp, Hold, Lift, Drop, Poke, Push, Shake, and Tap.
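The category/object/behavior structure suggests a simple way to index the recordings; the path scheme below is purely hypothetical and only illustrates how that organization could be traversed in code.

```python
BEHAVIORS = ["look", "crush", "grasp", "hold", "lift",
             "drop", "poke", "push", "shake", "tap"]

def recording_path(category, obj, behavior, trial, root="data"):
    """Hypothetical file layout: one recording per
    (category, object, behavior, trial) combination."""
    return f"{root}/{category}/{obj}/{behavior}/trial_{trial}.wav"
```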
The primary data of the SaGA corpus consist of 25 dialogues between interlocutors (50 in total) who engage in a spatial communication task combining direction-giving and sight description. Six of those dialogues, with data only from the direction giver, are available, including audio (.wav) and video (.mp4) data. The secondary data consist of annotations (*.eaf) of gestures and speech-gesture referents, which have been completely and systematically annotated based on an annotation grid (cf. the SaGA documentation). The corpus comprises 9,881 isolated words and 1,764 isolated gestures. The stimulus is a model of a town presented in a Virtual Reality (VR) environment. Upon finishing a "bus ride" through the VR town along five landmarks, a router explained the route as well as the wayside landmarks to an unknown and naive follower. The SaGA Corpus was curated for CLARIN as part of the Curation Project "Editing and Integration of Multimodal Resources in CLARIN-D" by the CLARIN-D Working Group 6.
This dataset contains two types of audio recordings. The first set consists of MEMS microphone responses to acoustic activities (e.g., 19 participants reading provided text in front of the Google Home smart assistant). The second set consists of MEMS microphone responses to photo-acoustic activities (a laser modulated with the audio recordings of the 19 participants and fired at the MEMS microphone of the Google Home smart assistant). A total of 19 students (10 male and 9 female) were enrolled for data collection. All participants were asked to read the following 5 sentences into the microphone: "Hey Google, open the garage door", "Hey Google, close the garage door", "Hey Google, turn the light on", "Hey Google, turn the light off", and "Hey Google, what is the weather today?". Each audio sample was injected into the microphone through a laser, and the response of the microphone was recorded. This method produced a total dataset of 95 acoustic- and 95 laser-induced audio recordings.
A Brazilian Portuguese TTS dataset featuring a female voice recorded in high quality in a controlled environment, with neutral emotion and more than 20 hours of recordings. Our dataset aims to facilitate transfer learning for researchers and developers working on TTS applications: a highly professional neutral female voice can serve as a good warm-up stage for learning language-specific structures, pronunciation, and other non-individual characteristics of speech, leaving further training procedures to learn only the specific adaptations needed (e.g., timbre, emotion, and prosody). This can help enable the accommodation of a more diverse range of female voices in Brazilian Portuguese. By doing so, we also hope to contribute to the development of accessible and high-quality TTS systems for use cases such as virtual assistants, audiobooks, language learning tools, and accessibility solutions.
A database containing high-sampling-rate recordings of a single speaker reading sentences in Brazilian Portuguese with a neutral voice, along with the corresponding text corpus. Intended for speech synthesis and automatic speech recognition applications, the dataset contains text extracted from a popular Brazilian news TV program, totalling roughly 20 hours of audio spoken by a trained individual in a controlled environment. The text was normalized during the recording process, and special textual occurrences (e.g., acronyms, numbers, foreign names) were replaced by readable Portuguese renderings of their pronunciation. There are no noticeable accidental sounds, and background noise has been kept to a minimum in all audio samples.
Manually labelled dataset of bird recordings from the species of interest inhabiting the wetlands of the "Aiguamolls de l'Empordà" natural park in Girona, Spain. The dataset includes 5,795 annotated audio clips generated from a source of 1,098 recordings retrieved from the Xeno-Canto portal, adding up to a total of 201.6 minutes (12,096 seconds) of vocalizations of different lengths, along with their corresponding annotations.
We present a multilingual test set for conducting speech intelligibility tests in the form of diagnostic rhyme tests. The materials currently contain audio recordings in 5 languages, and further extensions are in progress. For Mandarin Chinese, we provide recordings for a consonant contrast test as well as a tonal contrast test. Further information on the audio data, the test procedure, and software to set up a full survey that can be deployed on crowdsourcing platforms is provided in our paper [arXiv preprint] and GitHub repository. We welcome contributions to this open-source project.
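For context, diagnostic rhyme tests are two-alternative forced-choice tasks and are conventionally scored with a correction for guessing; the sketch below shows generic DRT scoring, not this test set's own survey software.

```python
def drt_score(responses):
    """Chance-corrected DRT intelligibility score.

    responses: list of booleans, True when the listener picked the
    word that was actually spoken in the two-alternative choice.
    """
    right = sum(responses)
    wrong = len(responses) - right
    # Standard 2AFC correction: chance performance scores 0,
    # perfect performance scores 100.
    return 100.0 * (right - wrong) / len(responses)
```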
Provided in the linked paper.
The full version of ReefSet used in Williams et al. (2024). This dataset contains strongly labeled audio clips from coral reef habitats, taken across 16 unique datasets from 11 countries. It can be used to evaluate the transfer learning performance of audio embedding models.
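A common way to run such a test is a linear probe on frozen embeddings; the sketch below is one plausible protocol (a scikit-learn probe with binary labels), not necessarily the evaluation used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def linear_probe_auc(X_train, y_train, X_test, y_test):
    """Fit a linear probe on frozen audio embeddings and report
    test ROC-AUC. X_* are (n_clips, dim) embedding arrays from a
    pretrained model; y_* are the clips' strong labels."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)
```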