Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

Emomusic (Emotion in Music Database)

1,000 songs were selected from the Free Music Archive (FMA); the annotated excerpts are provided in the same package under song IDs 1 to 1000. After removing redundancies, the dataset was reduced to 744 songs, split into a development set (619 songs) and an evaluation set (125 songs). The extracted 45-second excerpts are all re-encoded to the same sampling frequency, i.e. 44,100 Hz.

2 papers · 2 benchmarks · Audio, Music
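
The fixed excerpt length and common sampling rate make loading straightforward. Below is a minimal sketch of loading one 45-second excerpt at 44.1 kHz with librosa; the file path and directory layout are assumptions for illustration, not part of the dataset's documentation.

```python
# Minimal loading sketch for an Emomusic-style 45 s excerpt (paths assumed).
import librosa

def load_excerpt(path: str):
    # Resample to the dataset's common rate and cap the duration at 45 s.
    audio, sr = librosa.load(path, sr=44100, mono=True, duration=45.0)
    return audio, sr

audio, sr = load_excerpt("emomusic/clips/0001.mp3")  # hypothetical path
print(audio.shape, sr)  # roughly 45 * 44100 samples at 44100 Hz
```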

ARCA23K

ARCA23K is a dataset of labelled sound events created to investigate real-world label noise. It contains 23,727 audio clips originating from Freesound, and each clip belongs to one of 70 classes taken from the AudioSet ontology. The dataset was created using an entirely automated process with no manual verification of the data. For this reason, many clips are expected to be labelled incorrectly.

2 papers · 0 benchmarks · Audio
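
Because a large fraction of clips may be mislabelled, training recipes on this data usually include some form of noise tolerance. The sketch below shows one generic option, label smoothing over the 70 AudioSet-derived classes, in PyTorch; it illustrates the idea only and is not the dataset authors' training setup, and the placeholder backbone and feature shapes are assumptions.

```python
# Generic noise-tolerant training setup: label smoothing over 70 classes.
# The tiny linear "backbone" and feature shape are placeholders, not the
# ARCA23K authors' model.
import torch
import torch.nn as nn

num_classes = 70
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 101, num_classes))
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # softens one-hot targets

features = torch.randn(8, 64, 101)            # fake log-mel batch
labels = torch.randint(0, num_classes, (8,))  # possibly noisy labels
loss = criterion(model(features), labels)
loss.backward()
```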

TAU-NIGENS Spatial Sound Events 2020

The TAU-NIGENS Spatial Sound Events 2020 dataset contains multiple spatial sound-scene recordings, consisting of sound events of distinct categories integrated into a variety of acoustical spaces, and from multiple source directions and distances as seen from the recording position. The spatialization of all sound events is based on filtering through real spatial room impulse responses (RIRs), captured in multiple rooms of various shapes, sizes, and acoustical absorption properties. Furthermore, each scene recording is delivered in two spatial recording formats: a microphone array format (MIC) and a first-order Ambisonics format (FOA). The sound events are spatialized as either stationary sound sources in the room or moving sound sources, in which case time-variant RIRs are used. Each sound event in the sound scene is associated with a trajectory of its direction-of-arrival (DoA) relative to the recording point, and a temporal onset and offset time. The isolated sound event recordings used for the synthesis of the sound scenes are taken from the NIGENS general sound events database.

2 papers · 0 benchmarks · Audio
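
Each event comes with a class label, onset/offset times, and a direction-of-arrival trajectory. The sketch below parses such annotations from a per-clip CSV; the column names are assumed for illustration, and the dataset's own metadata format should be consulted.

```python
# Sketch: read sound-event annotations (class, onset/offset, DoA angles).
# The CSV column names below are assumptions, not the official format.
import csv

def read_events(csv_path: str):
    events = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            events.append({
                "class": row["sound_event"],
                "onset_s": float(row["onset"]),
                "offset_s": float(row["offset"]),
                "azimuth_deg": float(row["azimuth"]),
                "elevation_deg": float(row["elevation"]),
            })
    return events
```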

OpenBMAT (Open Broadcast Media Audio from TV)

Open Broadcast Media Audio from TV (OpenBMAT) is an open, annotated dataset for the task of music detection that contains over 27 hours of TV broadcast audio from 4 countries, distributed over 1,647 one-minute-long excerpts. It is designed to encompass several essential features for any music detection dataset and is the first one to include annotations about the loudness of music in relation to other simultaneous non-music sounds. OpenBMAT has been cross-annotated by 3 annotators, obtaining high inter-annotator agreement percentages, which validates the annotation methodology and ensures the annotations' reliability.

2 papers · 0 benchmarks · Audio
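
The high inter-annotator agreement mentioned above can be quantified in a simple way as pairwise percent agreement over frame-level labels. A minimal sketch, assuming the three annotators' labels are available as aligned per-frame arrays:

```python
# Pairwise percent agreement across three annotators' frame-level labels.
# Label values and frame alignment are assumptions for illustration.
from itertools import combinations
import numpy as np

def pairwise_agreement(annotations):
    scores = [np.mean(a == b) for a, b in combinations(annotations, 2)]
    return float(np.mean(scores))

a1 = np.array(["music", "music", "no-music", "music"])
a2 = np.array(["music", "no-music", "no-music", "music"])
a3 = np.array(["music", "music", "no-music", "no-music"])
print(pairwise_agreement([a1, a2, a3]))  # fraction of frames where pairs agree
```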

SINGA:PURA (SINGApore: Polyphonic URban Audio)

This repository contains the SINGA:PURA dataset, a strongly-labelled polyphonic urban sound dataset with spatiotemporal context. The data were collected via a number of recording units deployed across Singapore as part of a wireless acoustic sensor network. The recordings were made as part of a project to identify and mitigate noise sources in Singapore, but they also have wider applicability to sound event detection, classification, and localization. The taxonomy used for the labels is designed to be compatible with other existing datasets for urban sound tagging while also being able to capture sound events unique to the Singaporean context. Please refer to our conference paper published at APSIPA 2021 (included in this repository as the file "APSIPA.pdf") or the readme ("Readme.md") for more details on the data collection, annotation, and processing methodologies used to create the dataset.

2 papers · 0 benchmarks · Audio
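
Strong labels of this kind (per-event class, onset, and offset) are commonly rasterised into multi-hot frame targets before training a sound event detector. A generic sketch, with the frame hop and example classes chosen arbitrarily for illustration:

```python
# Convert strong labels (onset, offset, class) to multi-hot frame targets.
# The 0.1 s hop and example classes are arbitrary choices, not dataset specs.
import numpy as np

def strong_labels_to_frames(events, classes, clip_len_s=60.0, hop_s=0.1):
    n_frames = int(round(clip_len_s / hop_s))
    targets = np.zeros((n_frames, len(classes)), dtype=np.float32)
    index = {c: i for i, c in enumerate(classes)}
    for onset, offset, label in events:
        start = int(round(onset / hop_s))
        stop = min(n_frames, int(round(offset / hop_s)))
        targets[start:stop, index[label]] = 1.0
    return targets

frames = strong_labels_to_frames([(1.2, 3.4, "car_horn")], ["car_horn", "siren"])
```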

Kinetics-Sound

Kinetics-Sound is a subset of Kinetics-400, introduced in "Look, Listen and Learn" by Relja Arandjelovic and Andrew Zisserman.

2 papers · 0 benchmarks · Audio, Videos

AVCAffe (A Large Scale Audio-Visual Dataset of Cognitive Load and Affect for Remote Work)

We introduce AVCAffe, the first Audio-Visual dataset consisting of Cognitive load and Affect attributes. We record AVCAffe by simulating remote work scenarios over a video-conferencing platform, where subjects collaborate to complete a number of cognitively engaging tasks. AVCAffe is the largest originally collected (i.e., not scraped from the Internet) affective dataset in the English language. We recruit 106 participants from 18 different countries of origin, spanning an age range of 18 to 57 years, with a balanced male-female ratio. AVCAffe comprises a total of 108 hours of video, equivalent to more than 58,000 clips, along with task-based self-reported ground-truth labels for arousal, valence, and cognitive load attributes such as mental demand, temporal demand, and effort, among others. We believe AVCAffe is a challenging benchmark for the deep learning research community, given the inherent difficulty of classifying affect and, in particular, cognitive load.

2 papers · 0 benchmarks · Audio, Videos

Music4All-Onion

Music4All-Onion is a large-scale, multi-modal music dataset that expands the Music4All dataset with 26 additional audio, video, and metadata features for 109,269 music pieces, and provides a set of 252,984,396 listening records from 119,140 users, extracted from the online music platform Last.fm.

2 papers · 0 benchmarks · Audio
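
For recommendation experiments, listening records of this size are naturally represented as a sparse user-item matrix. A small sketch with pandas and SciPy; the column names and toy records are assumptions, not the dataset's actual schema.

```python
# Build a sparse user-item play-count matrix from listening records.
# Column names ("user_id", "track_id") are assumed, not the real schema.
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

records = pd.DataFrame({
    "user_id":  ["u1", "u1", "u2"],
    "track_id": ["t1", "t2", "t1"],
})
users = records["user_id"].astype("category")
tracks = records["track_id"].astype("category")
matrix = coo_matrix(
    (np.ones(len(records)), (users.cat.codes.to_numpy(), tracks.cat.codes.to_numpy())),
    shape=(users.cat.categories.size, tracks.cat.categories.size),
)
print(matrix.toarray())  # play counts per (user, track)
```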

MedleyVox

MedleyVox is an evaluation dataset for multiple singing voice separation. The separation problem is categorised into i) duet, ii) unison, iii) main vs. rest, and iv) N-singing separation.

2 papers · 0 benchmarks · Audio
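
Separation quality on a benchmark like this is commonly reported with scale-invariant SDR (SI-SDR). Below is a plain NumPy version of that metric as a generic illustration; it is not necessarily the dataset's official evaluation code.

```python
# Scale-invariant SDR between an estimated voice and its reference (generic).
import numpy as np

def si_sdr(estimate, reference):
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference          # projection onto the reference
    noise = estimate - target           # residual treated as distortion
    return float(10 * np.log10(np.sum(target**2) / np.sum(noise**2)))
```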

Lyra Dataset (A Dataset for Greek Traditional and Folk Music)

Lyra is a dataset of 1570 traditional and folk Greek music pieces that includes audio and video (timestamps and links to YouTube videos), along with annotations that describe aspects of particular interest for this dataset, including instrumentation, geographic information and labels of genre and subgenre, among others.

2 papers · 0 benchmarks · Audio, Music, Videos

Dusha (Dusha Crowd, Dusha Podcast)

Dusha is a dataset for speech emotion recognition (SER) tasks. The corpus contains approximately 350 hours of data: more than 300,000 audio recordings of Russian speech and their transcripts. It is annotated using a crowd-sourcing platform and includes two subsets: acted and real-life.

2 papers · 0 benchmarks · Audio, Texts

VTC (Videos, Titles and Comments)

VTC is a large-scale multimodal dataset containing video-caption pairs (~300k) alongside comments that can be used for multimodal representation learning.

2 papers · 0 benchmarks · Audio, Images, Texts, Videos

MVSep

MVSep is a synthetic dataset for the vocal separation task, created by combining random vocal and instrumental samples that are publicly available on the internet. The sourced samples were separated into two sets (vocal-only and instrumental-only) and then randomly mixed together. The mixtures may not always sound like a real melody, but they allow for testing audio separation methods. The Synth MVSep dataset consists of 100 tracks, each with a duration of exactly one minute and a sample rate of 44.1 kHz.

2 papers · 0 benchmarks · Audio
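
The mixing procedure described above can be reproduced in a few lines: pick a random vocal and a random instrumental sample, sum them, and normalise. The sketch below assumes mono WAV stems; the file lists, peak normalisation, and output path are hypothetical choices, not the dataset creators' exact pipeline.

```python
# Sketch: build a one-minute synthetic vocal + instrumental mixture.
# Assumes mono stems; paths and normalisation strategy are hypothetical.
import random
import numpy as np
import soundfile as sf

def make_mixture(vocal_paths, inst_paths, out_path, sr=44100, seconds=60):
    vocal, _ = sf.read(random.choice(vocal_paths), dtype="float32")
    inst, _ = sf.read(random.choice(inst_paths), dtype="float32")
    n = sr * seconds
    fit = lambda x: np.pad(x[:n], (0, max(0, n - len(x[:n]))))  # trim or pad to 60 s
    mix = fit(vocal) + fit(inst)
    mix /= max(1.0, float(np.abs(mix).max()))  # avoid clipping
    sf.write(out_path, mix, sr)
```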

Watkins Marine Mammal Sounds (Watkins Marine Mammal Sound Database)

One of the founding fathers of marine mammal bioacoustics, William Watkins, carried out pioneering work with William Schevill at the Woods Hole Oceanographic Institution for more than four decades, laying the groundwork for the field today. One of the lasting achievements of his career was the Watkins Marine Mammal Sound Database, a resource that contains approximately 2,000 unique recordings of more than 60 species of marine mammals. Recordings were made by Watkins and Schevill as well as many others, including G. C. Ray, D. Wartzok, D. and M. Caldwell, K. Norris, and T. Poulter. Most of these have been digitized, along with approximately 15,000 annotated digital sound clips.

2 papers · 2 benchmarks · Audio

Speech Accent Archive (The Speech Accent Archive)

The Speech Accent Archive uniformly presents a large set of speech samples from a variety of language backgrounds. Native and non-native speakers of English read the same paragraph, and their readings are carefully transcribed. The archive is used by people who wish to compare and analyze the accents of different English speakers.

2 papers · 2 benchmarks · Audio

YouTube8M-MusicTextClips

The YouTube8M-MusicTextClips dataset consists of over 4k high-quality human text descriptions of music found in video clips from the YouTube8M dataset.

2 papers · 0 benchmarks · Audio, Music, Texts, Videos

Spatial LibriSpeech

Spatial LibriSpeech is a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics, and room geometry.

2 papers · 0 benchmarks · Audio

BiGe (Bielefeld Gesture Corpus)

The BiGe corpus comprises 54,360 shots of interest extracted from TED and TEDx talks. All shots are tracked with full 3D landmarks.

2 papers · 0 benchmarks · Audio, Point cloud, Texts

Jam-ALT (JamALT: A Formatting-Aware Lyrics Transcription Benchmark)

JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.

2 papers · 7 benchmarks · Audio, Music, Speech, Texts
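
Lyrics transcription output is usually scored against a reference with word error rate. The sketch below computes a plain WER with the generic jiwer library; it is a baseline illustration only, not Jam-ALT's own formatting-aware evaluation, and the example strings are made up.

```python
# Plain word error rate between reference and predicted lyrics (jiwer is a
# generic ASR-metric library, not Jam-ALT's formatting-aware tooling).
import jiwer

reference  = "walking down the river in the morning light"
hypothesis = "walking down the river in the morning night"
print(jiwer.wer(reference, hypothesis))  # fraction of word-level errors
```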

ODAQ: Open Dataset of Audio Quality

A dataset containing the results of a MUSHRA listening test conducted with expert listeners from 2 international laboratories. ODAQ contains 240 audio samples and corresponding quality scores. Each audio sample is rated by 26 listeners. The audio samples are stereo audio signals sampled at 44.1 or 48 kHz and are processed by a total of 6 method classes, each operating at different quality levels. The processing method classes are designed to generate quality degradations possibly encountered during audio coding and source separation, and the quality levels for each method class span the entire quality range. The diversity of the processing methods, the large span of quality levels, the high sampling frequency, and the pool of international listeners make ODAQ particularly suited for further research into subjective and objective audio quality. The dataset is released with permissive licenses, and the software used to conduct the listening test is also made publicly available.

2 papers · 1 benchmark · Audio
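
With 26 ratings per sample, the most basic analysis is a mean score per processing method with a confidence interval. A short sketch with pandas; the column names and toy numbers are assumptions, not ODAQ's actual schema or scores.

```python
# Mean listening-test score and 95% CI per method (toy data, assumed schema).
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "method": ["A", "A", "A", "B", "B", "B"],
    "score":  [78.0, 82.0, 75.0, 41.0, 46.0, 39.0],
})
summary = scores.groupby("method")["score"].agg(["mean", "std", "count"])
summary["ci95"] = 1.96 * summary["std"] / np.sqrt(summary["count"])
print(summary)
```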
Page 14 of 25