486 machine learning datasets
gtzan_music_speech is a dataset for music/speech discrimination. It consists of 120 tracks, each 30 seconds long, with 60 samples per class (music/speech). The tracks are all 22,050 Hz mono 16-bit audio files in .wav format.
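A minimal loading sketch, assuming the common archive layout with music_wav/ and speech_wav/ folders (the directory names are an assumption and may differ in your copy):

```python
# Hypothetical loading sketch for gtzan_music_speech; the folder names
# (music_wav/, speech_wav/) are assumptions about the archive layout.
from pathlib import Path

import numpy as np
from scipy.io import wavfile

def load_clips(root):
    """Yield (waveform, label) pairs; label is 0 for music, 1 for speech."""
    for label, folder in enumerate(["music_wav", "speech_wav"]):
        for path in sorted(Path(root, folder).glob("*.wav")):
            rate, data = wavfile.read(path)          # 16-bit PCM -> int16 array
            assert rate == 22050 and data.ndim == 1  # 22,050 Hz mono, per the spec
            yield data.astype(np.float32) / 32768.0, label  # scale to [-1, 1)
```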
The AllMusic Mood Subset (AMS) is a dataset for mood classification from songs. It was created by matching a 67k-track subset of the Million Song Dataset (MSD) with expert annotations of 188 different moods collected from AllMusic.
An audio dataset containing about 1,500 audio clips recorded by multiple professional players.
Medley2K is a dataset that consists of 2,000 medleys and 7,712 labeled transitions.
The HumBug Zooniverse dataset is a dataset of mosquito audio recordings. Labelled by over a thousand contributors, it contains 195,434 two-second labels, of which approximately 10 percent signify mosquito events.
The SWC is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. The corpus has several distinctive characteristics.
The Parkinson Speech Dataset is an audio dataset consisting of recordings from 20 Parkinson's disease (PD) patients and 20 healthy subjects. Twenty-six sound recordings of multiple types were taken from each subject. The goal is to classify which subjects have Parkinson's disease.
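Because each subject contributes 26 recordings, evaluation should split by subject rather than by recording, so the classifier is always tested on an unseen speaker. A sketch with scikit-learn's LeaveOneGroupOut, using placeholder arrays in place of real acoustic features:

```python
# Subject-wise cross-validation sketch; X, y, and groups are placeholders
# standing in for real acoustic features and labels extracted from the corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_rec = 40, 26                        # 20 PD + 20 healthy, 26 recordings each
X = rng.normal(size=(n_subjects * n_rec, 13))     # placeholder feature vectors
groups = np.repeat(np.arange(n_subjects), n_rec)  # subject id per recording
y = np.repeat(np.arange(n_subjects) < 20, n_rec).astype(int)  # 1 = PD (placeholder order)

# Leave-one-subject-out keeps all 26 recordings of a subject in the same fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())
```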
The RWCP Sound Scene Database includes non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses for a microphone array, speech data recorded with the same array, and recordings of background noises. It is intended for use when simulating sound scenes. It was developed by the Real Acoustic Environments Working Group of the Real World Computing Partnership (RWCP). The data was recorded from 1998 to 2000.
NAR is a dataset of audio recordings made with the humanoid robot Nao in real-world conditions for sound recognition benchmarking. All the recordings were collected using the robot's microphone and thus have the following characteristics:
- recorded with low-quality sensors (300 Hz – 18 kHz bandpass)
- suffering from typical fan noise from the robot's internal hardware
- recorded in multiple real domestic environments (no special acoustic characteristics, reverberations, presence of multiple sound sources at unknown locations)
ARVSU contains a large body of image variations of visual scenes, with an annotated utterance and a corresponding addressee for each scenario.
This dataset is composed of paired videos of people dancing in three different music styles: ballet, Michael Jackson, and salsa. It contains multimodal data (visual data, temporal graphs, and audio) carefully selected from publicly available videos of dancers performing movements representative of each music style, together with audio data from the respective styles.
Kinect-WSJ is a multichannel, multi-speaker, reverberated, noisy dataset that extends the single-channel, non-reverberated, noiseless WSJ0-2mix dataset to the strong reverberation and noise conditions and the Kinect-like microphone array geometry used in CHiME-5.
Fongbe data collected by Fréjus A. A. LALEYE.
The POTUS Corpus is a database of weekly addresses for the study of stance in politics and virtual agents.
A dataset for multimodal skill assessment, focused on assessing a piano player's skill level. Annotations include the player's skill level and the song's difficulty level. Bounding-box annotations around the pianists' hands are also provided.
The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of the overall quality and of the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, the corpus contains more than 97,000 human ratings for each of these dimensions and for the overall MOS.
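The per-file MOS values are averages of the individual ratings; a minimal aggregation sketch (the CSV name and column names here are hypothetical, not the corpus's actual file layout):

```python
# Hypothetical aggregation of raw ratings into per-file scores;
# "ratings.csv" and its columns are illustrative, not the NISQA layout.
import pandas as pd

ratings = pd.read_csv("ratings.csv")  # columns: file, rater, mos, noi, col, dis, loud
per_file = ratings.groupby("file")[["mos", "noi", "col", "dis", "loud"]].mean()
print(per_file.head())
```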
BIRD (Big Impulse Response Dataset) is an open dataset of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the image method; at the time of release it was the largest multichannel open RIR dataset available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources.
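A minimal augmentation sketch under stated assumptions: a dry mono source and a two-channel RIR, convolved per channel to simulate the two-microphone capture (both arrays below are toy placeholders, not BIRD data):

```python
# Reverberant data augmentation sketch: convolve a dry source with a
# two-channel room impulse response. `dry` and `rir` are placeholder arrays;
# in practice the RIR would be one of the simulated responses in BIRD.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
dry = rng.normal(size=16000)  # 1 s of placeholder source signal at 16 kHz
rir = rng.normal(size=(2, 4000)) * np.exp(-np.linspace(0, 8, 4000))  # toy decaying 2-mic RIR

# One convolution per microphone channel yields the simulated 2-channel capture.
wet = np.stack([fftconvolve(dry, rir[ch]) for ch in range(2)])
print(wet.shape)  # (2, 19999)
```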
This is a dataset of 22.5 hours of audio synthesized with the open-source learnfm clone of the DX7 FM synthesizer, based upon 31K presets from Bobby Blue. These represent "natural" synthesis sounds, i.e. presets devised by humans.
VAST Absorption is a dataset of spatial binaural features annotated with acoustic properties such as the 3D source position and the walls’ absorption coefficients.
Boombox is a multimodal dataset for visual reconstruction from acoustic vibrations. It was collected by dropping objects into a box and capturing the resulting images and vibrations, and it is used to train ML systems that predict images from vibrations.