A set of 1,000 songs was selected from the Free Music Archive (FMA); the annotated excerpts are available in the same package (song IDs 1 to 1000). Some redundancies were identified, reducing the dataset to 744 songs, split between a development set (619 songs) and an evaluation set (125 songs). The extracted 45-second excerpts are all re-encoded to the same sampling frequency, i.e., 44100 Hz.
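The re-encoding step above brings every excerpt to a common 44100 Hz rate. A minimal sketch of such a resampling step, using linear interpolation with NumPy (the dataset's actual re-encoding tool is not specified, so this is illustrative only):

```python
import numpy as np

def resample_linear(signal: np.ndarray, sr_in: int, sr_out: int = 44100) -> np.ndarray:
    """Resample a mono signal from sr_in to sr_out via linear interpolation."""
    n_out = int(round(len(signal) * sr_out / sr_in))
    t_in = np.arange(len(signal)) / sr_in    # original sample times (seconds)
    t_out = np.arange(n_out) / sr_out        # target sample times (seconds)
    return np.interp(t_out, t_in, signal)

# A hypothetical 45-second excerpt originally sampled at 22050 Hz:
excerpt = np.zeros(45 * 22050)
resampled = resample_linear(excerpt, sr_in=22050, sr_out=44100)
```

In practice a band-limited resampler (e.g. polyphase filtering) would be preferred over plain interpolation to avoid aliasing; the sketch only shows the rate conversion itself.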
ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.
The TAU-NIGENS Spatial Sound Events 2020 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array format (MIC) and a first-order Ambisonics format (FOA). The sound events are spatialized as either stationary sound sources in the room or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) relative to the recording point, and a temporal onset and offset time.
Open Broadcast Media Audio from TV (OpenBMAT) is an open, annotated dataset for the task of music detection that contains over 27 hours of TV broadcast audio from 4 countries, distributed over 1647 one-minute-long excerpts. It is designed to encompass several essential features for any music detection dataset and is the first one to include annotations about the loudness of music in relation to other simultaneous non-music sounds. OpenBMAT has been cross-annotated by 3 annotators, obtaining high inter-annotator agreement percentages, which validates the annotation methodology and ensures the annotations' reliability.
This repository contains the SINGA:PURA dataset, a strongly-labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via a number of recording units deployed across Singapore as a part of a wireless acoustic sensor network. These recordings were made as part of a project to identify and mitigate noise sources in Singapore, but also possess a wider applicability to sound event detection, classification, and localization. The taxonomy we used for the labels in this dataset has been designed to be compatible with other existing datasets for urban sound tagging while also able to capture sound events unique to the Singaporean context. Please refer to our conference paper published in APSIPA 2021 (which is found in this repository as the file "APSIPA.pdf") or download the readme ("Readme.md") for more details regarding the data collection, annotation, and processing methodologies for the creation of the dataset.
This is a subset of Kinetics-400, introduced in Look, Listen and Learn by Relja Arandjelovic and Andrew Zisserman.
We introduce AVCAffe, the first audio-visual dataset consisting of cognitive load and affect attributes. We record AVCAffe by simulating remote work scenarios over a video-conferencing platform, where subjects collaborate to complete a number of cognitively engaging tasks. AVCAffe is the largest originally collected (not collected from the Internet) affective dataset in the English language. We recruit 106 participants from 18 different countries of origin, spanning an age range of 18 to 57 years old, with a balanced male-female ratio. AVCAffe comprises a total of 108 hours of video, equivalent to more than 58,000 clips, along with task-based self-reported ground-truth labels for arousal, valence, and cognitive load attributes such as mental demand, temporal demand, effort, and a few others. We believe AVCAffe would be a challenging benchmark for the deep learning research community given the inherent difficulty of classifying affect and cognitive load in particular.
Music4All-Onion is a large-scale, multi-modal music dataset that expands the Music4All dataset by including 26 additional audio, video, and metadata features for 109,269 music pieces, and provides a set of 252,984,396 listening records of 119,140 users, extracted from the online music platform Last.fm.
MedleyVox is an evaluation dataset for the separation of multiple singing voices. The separation problem is categorised into i) duet, ii) unison, iii) main vs. rest, and iv) N-singing separation.
Lyra is a dataset of 1570 traditional and folk Greek music pieces that includes audio and video (timestamps and links to YouTube videos), along with annotations that describe aspects of particular interest for this dataset, including instrumentation, geographic information and labels of genre and subgenre, among others.
Dusha is a dataset for speech emotion recognition (SER) tasks. The corpus contains approximately 350 hours of data: more than 300,000 audio recordings of Russian speech with their transcripts. It is annotated using a crowd-sourcing platform and includes two subsets: acted and real-life.
VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.
MVSep is a synthetic dataset for the vocal separation task, created by combining random vocal and instrumental samples publicly available on the internet. The sourced samples were separated into two sets (vocal-only and instrumental-only) and then randomly mixed together. The mixtures may not always sound like a real melody, but they allow for testing audio separation methods. The dataset consists of 100 tracks, each with a duration of exactly one minute and a sample rate of 44.1 kHz.
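The mixing procedure described above (random pairing of a vocal-only and an instrumental-only stem into a one-minute 44.1 kHz track) could be sketched as follows; the peak normalization is an assumption added here to avoid clipping, not a documented detail of the dataset:

```python
import numpy as np

SR = 44100       # 44.1 kHz sample rate, as in the dataset
DURATION = 60    # one-minute tracks

def make_mixture(vocal: np.ndarray, instrumental: np.ndarray) -> np.ndarray:
    """Sum a vocal and an instrumental stem, then peak-normalize the mixture."""
    n = SR * DURATION
    mix = vocal[:n] + instrumental[:n]   # truncate both stems to one minute, then sum
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix

# Stand-in stems (random noise) in place of real audio:
rng = np.random.default_rng(0)
vocal = rng.standard_normal(SR * DURATION)
instrumental = rng.standard_normal(SR * DURATION)
mix = make_mixture(vocal, instrumental)
```

The ground-truth stems themselves serve as separation targets, which is what makes such synthetic mixtures usable for evaluating source-separation methods even when they do not sound musical.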
One of the founding fathers of marine mammal bioacoustics, William Watkins, carried out pioneering work with William Schevill at the Woods Hole Oceanographic Institution for more than four decades, laying the groundwork for our field today. One of the lasting achievements of his career was the Watkins Marine Mammal Sound Database, a resource that contains approximately 2000 unique recordings of more than 60 species of marine mammals (Table 1). Recordings were made by Watkins and Schevill as well as many others, including G. C. Ray, D. Wartzok, D. and M. Caldwell, K. Norris, and T. Poulter. Most of these have been digitized, along with approximately 15,000 annotated digital sound clips.
The speech accent archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph, and their readings are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers.
The YouTube8M-MusicTextClips dataset consists of over 4k high-quality human text descriptions of music found in video clips from the YouTube8M dataset.
Spatial LibriSpeech is a spatial audio dataset with over 650 hours of 19-channel audio, first-order Ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics, and geometry.
The BiGe corpus comprises 54,360 shots of interest extracted from TED and TEDx talks. All shots are tracked with full 3D landmarks.
JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.
A dataset containing the results of a MUSHRA listening test conducted with expert listeners from 2 international laboratories. ODAQ contains 240 audio samples and corresponding quality scores. Each audio sample is rated by 26 listeners. The audio samples are stereo audio signals sampled at 44.1 or 48 kHz and are processed by a total of 6 method classes, each operating at different quality levels. The processing method classes are designed to generate quality degradations possibly encountered during audio coding and source separation, and the quality levels for each method class span the entire quality range. The diversity of the processing methods, the large span of quality levels, the high sampling frequency, and the pool of international listeners make ODAQ particularly suited for further research into subjective and objective audio quality. The dataset is released with permissive licenses, and the software used to conduct the listening test is also made publicly available.