The Multimodal Dyadic Behavior (MMDB) dataset is a unique collection of multimodal (video, audio, and physiological) recordings of the social and communicative behavior of toddlers. The MMDB contains 160 sessions of 3-5 minute semi-structured play interaction between a trained adult examiner and a child between the ages of 15 and 30 months. The MMDB dataset supports a novel problem domain for activity recognition, which consists of the decoding of dyadic social interactions between adults and children in a developmental context.
Fraxtil is an audio dataset in which, given a raw audio track, the goal is to produce a choreography step chart similar to those used in the Dance Dance Revolution video game. It contains 90 songs choreographed by a single author, with 450 charts in total.
The dataset consists of features associated with 402 five-second sound samples, ranging from easily identifiable everyday sounds to intentionally obscured artificial ones. As the largest available audio dataset to include an analysis of causal attribution, it aims to lower the barrier to the study of aural phenomenology. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings.
The ODSQA dataset is a spoken dataset for question answering in Chinese. It contains more than three thousand questions from 20 different speakers.
In The Groove (ITG) is an audio dataset in which, given a raw audio track, the goal is to produce a choreography step chart, similar to those used in the Dance Dance Revolution video game. It contains 133 songs choreographed by three different authors, with 652 charts in total.
Expanded Groove MIDI dataset (E-GMD) is an automatic drum transcription (ADT) dataset that contains 444 hours of audio from 43 drum kits, making it an order of magnitude larger than similar datasets, and the first with human-performed velocity annotations.
RWCP-SSD-Onomatopoeia is a dataset consisting of 155,568 onomatopoeic words paired with audio samples for environmental sound synthesis.
The Tongue and Lips (TaL) corpus is a multi-speaker corpus of ultrasound images of the tongue and video images of the lips. It contains synchronised imaging data of extraoral (lips) and intraoral (tongue) articulators from 82 native speakers of English.
Voice conversion (VC) is a technique for transforming the speaker identity of a source speech waveform into a different one while preserving the linguistic information of the source. The Voice Conversion Challenge (VCC) 2016 was launched at Interspeech 2016. Its objective was to better understand different VC techniques built on a freely available common dataset towards a common goal, and to share views about unsolved problems and challenges facing current VC techniques. The VCC 2016 focused on the most basic VC task: the construction of VC models that automatically transform the voice identity of a source speaker into that of a target speaker, using a parallel clean training database in which source and target speakers read out the same set of utterances in a professional recording studio. Seventeen research groups participated in the 2016 challenge. The challenge was successful and established a new standard evaluation methodology.
A special corpus of Indian languages covering 13 major languages of India. It comprises 10,000+ spoken sentences/utterances each in the native language and in English, recorded by both male and female native speakers. Speech waveform files are available in .wav format along with the corresponding text. We hope that these recordings will be useful for researchers and speech technologists working on synthesis and recognition. Zip archives of the entire database are available on request.
LA-2A compressor data to accompany the paper "SignalTrain: Profiling Audio Compressors with Deep Neural Networks" (https://arxiv.org/abs/1905.11928).
PACS (Physical Audiovisual CommonSense) is the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains a total of 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. The dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem.
Due to the highly variable sample size of the original BirdClef2020 dataset and the issues that it presents with reproducibility, we propose a pruned version of the set in which samples longer than 180 s are removed, along with classes that have fewer than 50 samples. This processing brings it further in line with other complex audio datasets and allows for experimentation on more consumer-friendly hardware.
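A minimal sketch of the pruning rule described above, assuming the recordings are indexed by a pandas metadata table; the column names (`duration_s`, `species`) are placeholders, not the dataset's actual schema:

```python
import pandas as pd

def prune_birdclef(metadata: pd.DataFrame) -> pd.DataFrame:
    """Apply the two pruning rules: drop long samples, then rare classes."""
    # Remove recordings longer than 180 seconds.
    pruned = metadata[metadata["duration_s"] <= 180.0]
    # Remove classes with fewer than 50 remaining samples.
    counts = pruned["species"].value_counts()
    keep = counts[counts >= 50].index
    return pruned[pruned["species"].isin(keep)].reset_index(drop=True)
```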
The SmartSpeaker benchmark tests the performance of reacting to music player commands in English as well as in French. It is difficult because many commands contain artists or music tracks with uncommon names, like “play music by [a boogie wit da hoodie]” or “I’d like to listen to [Kinokoteikoku]”.
COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on the singing style and expression of famous solo singers.
The dataset contains several speakers; the five largest are listed individually, and the rest are summarized as "other". All audio files have a sampling rate of 44.1 kHz. For each speaker there is, in addition to the full dataset, a clean variant of even higher quality. Various statistics are also provided. The dataset can also be used for automatic speech recognition (ASR) if the audio files are converted to 16 kHz, as sketched below.
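As a rough illustration of that ASR preparation step, the 44.1 kHz recordings can be resampled to 16 kHz with standard audio tooling; the file paths here are placeholders:

```python
import librosa
import soundfile as sf

# librosa resamples on load when an explicit target rate is given.
audio, sr = librosa.load("speaker1/clip.wav", sr=16000)
sf.write("speaker1/clip_16k.wav", audio, sr)
```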
New refined labels for the MusicNet dataset, obtained by the EM process described in the paper: Ben Maman and Amit Bermano, "Unaligned Supervision for Automatic Music Transcription in The Wild".
This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk-show programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset was created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by fine-tuning the Wav2Vec2.0 large-scale multilingual pretrained model to build end-to-end (E2E) baseline speech recognition models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.
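A minimal sketch of that cross-lingual transfer setup, using the Hugging Face Transformers API; the checkpoint name is the public XLSR-53 release, while the vocabulary size is an assumed stand-in for the target language's character set, not the paper's actual configuration:

```python
from transformers import Wav2Vec2ForCTC

# Load the multilingual pretrained encoder and attach a randomly
# initialized CTC head sized to the target-language character vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=40,               # assumed size of the target character set
    ctc_loss_reduction="mean",
)
# The model can then be fine-tuned on the labelled read-speech portion.
```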
A Russian dataset of emotional speech dialogues, assembled from ~3.5 hours of live speech by actors who each voiced pre-assigned emotions in dialogue for ~3 minutes. Each sample contains the name of the part from the original studio source, a speech file (16,000 or 44,100 Hz) of a human voice, one of 7 labelled emotions, and a speech-to-text transcription of the utterance.
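A hypothetical record type mirroring the sample fields listed above; the field names are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DialogueSample:
    source_part: str   # name of the part from the original studio source
    audio_path: str    # path to the speech file (16,000 or 44,100 Hz WAV)
    emotion: str       # one of the 7 labelled emotions
    transcript: str    # speech-to-text transcription of the utterance
```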
The MISP2021 challenge dataset is a collection of audio-visual conversational data recorded in a home-TV scenario using distant multi-microphones. The dataset captures interactions between several individuals engaged in conversations in Chinese while watching TV and interacting with a smart speaker/TV in a living room. It is extensive, comprising 141 hours of audio and video data collected using far/middle/near microphones and far/middle cameras in 34 real home-TV rooms. Notably, this corpus is the first distant multi-microphone conversational Chinese audio-visual dataset of its kind. It is also the first large-vocabulary continuous Chinese lip-reading dataset specifically designed for the adverse home-TV scenario.