Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

RemFX (RemFX Evaluation Datasets)

Audio samples processed with sound effects, used to evaluate effect-removal models. The applied effects are drawn from the set (Distortion, Delay, Dynamic Range Compressor, Phasor, Reverb) and are randomly sampled without replacement for each example; the targets are the original, unprocessed audio.

3 papers · 0 benchmarks · Audio
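
A minimal sketch of how a per-example effect chain could be drawn without replacement, as the description states; the effect names and chain-length logic are illustrative assumptions, not the dataset's actual generation code.

```python
import random

# Effect set listed in the dataset description (names here are informal stand-ins).
EFFECTS = ["distortion", "delay", "compressor", "phasor", "reverb"]

def sample_effect_chain(rng: random.Random, max_effects: int = 5) -> list:
    """Pick how many effects this example gets, then sample them without replacement."""
    n = rng.randint(0, max_effects)
    return rng.sample(EFFECTS, n)

rng = random.Random(0)
for i in range(3):
    print(f"example {i}: {sample_effect_chain(rng)}")
```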

InfantMarmosetsVox

InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of individual marmosets and their call-types: a total of 350 precisely labelled 10-minute recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labelled by an experienced researcher. A PyTorch DataLoader is included with the dataset.

3 papers · 0 benchmarks · Audio
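
Since the dataset ships with a PyTorch DataLoader, loading it call-by-call might look roughly like the sketch below; the annotation CSV columns and the file layout are hypothetical stand-ins, not the dataset's actual interface.

```python
import csv
from pathlib import Path

import torchaudio
from torch.utils.data import Dataset, DataLoader

class MarmosetCalls(Dataset):
    """One item per annotated vocalization (segment, call type, caller identity)."""

    def __init__(self, annotations_csv: str, audio_dir: str):
        # Assumed (hypothetical) CSV columns: file, start_s, end_s, call_type, caller_id.
        with open(annotations_csv, newline="") as f:
            self.rows = list(csv.DictReader(f))
        self.audio_dir = Path(audio_dir)

    def __len__(self) -> int:
        return len(self.rows)

    def __getitem__(self, idx: int):
        row = self.rows[idx]
        wav, sr = torchaudio.load(self.audio_dir / row["file"])  # 44.1 kHz recordings
        start, end = int(float(row["start_s"]) * sr), int(float(row["end_s"]) * sr)
        return wav[:, start:end], row["call_type"], int(row["caller_id"])

# Example usage (paths are hypothetical):
# loader = DataLoader(MarmosetCalls("labels.csv", "audio/"), batch_size=1, shuffle=True)
```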

Shot2Story20K

A short video clip may contain the progression of multiple events and an interesting storyline. To understand the story behind it, a viewer needs to capture both the event in every shot and how the shots relate to one another.

3 papers · 16 benchmarks · Audio, Texts, Videos

ENST Drums (ENST-Drums: an extensive audio-visual database for drum signals processing)

An extensive audio-visual database for drum signal processing. Reference: Olivier Gillet and Gaël Richard, "ENST-Drums: an extensive audio-visual database for drum signals processing", GET / ENST, CNRS LTCI, 37 rue Dareau, 75014 Paris, France.

3 papers · 0 benchmarks · Audio

SONICS (Synthetic Or Not - Identifying Counterfeit Songs)

SONICS is a large-scale dataset of 97,164 songs (48,090 real songs from YouTube and 49,074 fake songs from Suno and Udio) designed for synthetic song detection (SSD), also known as fake song detection (FSD). It addresses several limitations of existing datasets, such as the lack of end-to-end fake songs, limited musical and lyrical diversity, and a shortage of long-duration songs. The average song length in SONICS is 176 seconds, which enables models to capture long-context relationships. SONICS provides open access to the generated fake songs and is split into 66,709 songs for training, 26,015 for testing, and 4,440 for validation. The inclusion of song lyrics also paves the way for future research in this field.

3 papers · 0 benchmarks · Audio

MM-OR

Operating rooms (ORs) are complex, high-stakes environments requiring a precise understanding of interactions among medical staff, tools, and equipment to enhance surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for holistic OR understanding.

3 papers · 7 benchmarks · 3D, Audio, Graphs, Images, Medical, Point cloud, RGB-D, Speech, Texts, Time series, Videos

DCASE 2017

The DCASE 2017 rare sound events dataset contains isolated sound events for three classes: 148 crying babies (mean duration 2.25 s), 139 glass-breaking events (mean duration 1.16 s), and 187 gunshots (mean duration 1.32 s). As with the DCASE 2016 data, silences are not excluded from the active event markings in the annotations. While this dataset contains many samples per class, there are only three classes.

2 papers · 0 benchmarks · Audio

MuseScore

The MuseScore dataset is a collection of 344,166 audio and MIDI pairs downloaded from the MuseScore website. The audio is usually synthesized by the MuseScore synthesizer. The clips cover diverse musical genres and are about two minutes long on average.

2 papers · 0 benchmarks · Audio
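
A minimal sketch of sanity-checking one audio/MIDI pair from such a collection; the file names are hypothetical, and soundfile / pretty_midi are used only for illustration rather than being tools the dataset authors necessarily used.

```python
import pretty_midi
import soundfile as sf

# Hypothetical paths to one audio/MIDI pair.
audio_path = "pair_000001.wav"
midi_path = "pair_000001.mid"

audio_seconds = sf.info(audio_path).duration       # audio length from the file header
midi = pretty_midi.PrettyMIDI(midi_path)
midi_seconds = midi.get_end_time()                  # time of the last MIDI event

print(f"audio: {audio_seconds:.1f}s, MIDI: {midi_seconds:.1f}s, "
      f"instruments: {len(midi.instruments)}")

# A synthesized pair should have roughly matching lengths (about two minutes on average).
assert abs(audio_seconds - midi_seconds) < 5.0, "audio and MIDI lengths diverge"
```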

LibriCount

LibriCount is a synthetic dataset for speaker count estimation. It simulates a cocktail-party environment of 0 to 10 speakers, mixed at 0 dB SNR from random utterances of different speakers from the LibriSpeech test-clean set. All recordings are 5 seconds long, and all speakers are active for most of the recording.

2 papers · 0 benchmarks · Audio
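
A minimal numpy sketch of mixing several utterances at equal level (i.e. 0 dB relative SNR) into a fixed-length clip, as described above; the sample rate and normalization scheme are illustrative assumptions.

```python
import numpy as np

SR = 16_000            # assumed sample rate
CLIP_LEN = 5 * SR      # 5-second clips, as in the dataset description

def mix_speakers(utterances):
    """Scale each utterance to unit RMS (equal power, 0 dB relative SNR) and sum."""
    mix = np.zeros(CLIP_LEN, dtype=np.float64)
    for utt in utterances:
        utt = np.pad(utt[:CLIP_LEN], (0, max(0, CLIP_LEN - len(utt))))
        rms = np.sqrt(np.mean(utt ** 2)) + 1e-8
        mix += utt / rms                          # every active speaker at the same level
    peak = np.max(np.abs(mix))
    return mix / peak if peak > 0 else mix        # normalize; a 0-speaker clip stays silent

# Example with 3 random signals standing in for real LibriSpeech utterances.
rng = np.random.default_rng(0)
clip = mix_speakers([rng.standard_normal(CLIP_LEN) for _ in range(3)])
print(clip.shape, float(np.abs(clip).max()))
```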

Stanford-ECM

Stanford-ECM is an egocentric multimodal dataset comprising about 27 hours of egocentric video augmented with heart rate and acceleration data. Individual videos range from 3 minutes to about 51 minutes in length. A mobile phone was used to collect egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30 Hz. The phone was equipped with a wide-angle lens, enlarging the horizontal field of view from 45 degrees to about 64 degrees. A wrist-worn heart rate sensor captured the heart rate every 5 seconds. The phone and heart rate monitor were time-synchronized over Bluetooth, and all data was stored on the phone. Piecewise cubic polynomial interpolation was used to fill any gaps in the heart rate data, and the data was finally aligned at 30 Hz to millisecond precision.

2 papers · 0 benchmarks · Audio, Videos
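
A minimal scipy sketch of the gap-filling step described above: sparse heart-rate samples (nominally every 5 seconds, with gaps) interpolated with a piecewise cubic polynomial and resampled onto a 30 Hz timeline. The sample values and the specific spline variant are assumptions for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical heart-rate samples: timestamps in seconds (note the gap) and values in BPM.
t_hr = np.array([0.0, 5.0, 10.0, 25.0, 30.0, 35.0])   # a 15 s gap between 10 s and 25 s
hr = np.array([72.0, 74.0, 76.0, 81.0, 80.0, 78.0])

# Piecewise cubic interpolation over the measured samples fills the gap smoothly.
spline = CubicSpline(t_hr, hr)

# Resample onto a 30 Hz timeline so heart rate lines up with the video frames.
fps = 30
t_frames = np.arange(t_hr[0], t_hr[-1], 1.0 / fps)
hr_per_frame = spline(t_frames)

print(len(t_frames), "frames;", f"interpolated HR at t=17.5 s ~ {spline(17.5):.1f} bpm")
```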

URBAN-SED

URBAN-SED is a dataset of 10,000 soundscapes with sound event annotations generated using the Scaper library. The dataset totals almost 30 hours and includes close to 50,000 annotated sound events. Every soundscape is 10 seconds long and has a background of Brownian noise resembling the typical "hum" often heard in urban environments. Every soundscape contains between 1 and 9 sound events from the following classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren and street_music. The source material for the sound events are the clips from the UrbanSound8K dataset. URBAN-SED comes pre-sorted into three sets: train, validate and test. There are 6,000 soundscapes in the training set, generated using clips from folds 1-6 of UrbanSound8K; 2,000 soundscapes in the validation set, generated using clips from folds 7-8; and 2,000 soundscapes in the test set, generated using clips from folds 9-10.

2 papers · 0 benchmarks · Audio
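
A minimal numpy sketch of the general recipe described above: a Brownian-noise background with 1-9 events placed at random onsets. This is only an illustration of the idea, not the Scaper API that was actually used to build the dataset.

```python
import numpy as np

SR = 44_100
DURATION = 10 * SR                      # 10-second soundscapes
rng = np.random.default_rng(0)

def brownian_noise(n):
    """Integrated white noise: the low-frequency 'hum'-like urban background."""
    noise = np.cumsum(rng.standard_normal(n))
    return noise / np.max(np.abs(noise))

def add_event(scape, event, snr_db):
    """Add one event clip at a random onset, scaled relative to the background level."""
    onset = rng.integers(0, len(scape) - len(event))
    bg_rms = np.sqrt(np.mean(scape ** 2))
    ev_rms = np.sqrt(np.mean(event ** 2)) + 1e-8
    scape[onset:onset + len(event)] += event * (bg_rms / ev_rms) * 10 ** (snr_db / 20)

soundscape = 0.1 * brownian_noise(DURATION)
n_events = rng.integers(1, 10)          # between 1 and 9 events per soundscape
for _ in range(n_events):
    stand_in_clip = rng.standard_normal(SR)   # stand-in for a 1 s UrbanSound8K clip
    add_event(soundscape, stand_in_clip, snr_db=6.0)
print(n_events, "events,", soundscape.shape)
```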

BirdCLEF 2019

BirdCLEF 2019 is a bird soundscape dataset. It contains around 350 hours of soundscapes recorded with 30 field recorders between January and June 2017 in Ithaca, NY, USA, and manually annotated. There are around 50,000 recordings in the dataset in total, covering 659 classes. The dataset also contains species tags.

2 papers · 0 benchmarks · Audio

BirdCLEF 2018

BirdCLEF 2018 is a bird soundscape dataset based on contributions from the Xeno-canto network. The training set contains 36,496 recordings covering 1,500 species of Central and South America (the largest bioacoustic dataset in the literature). There are about 68 hours of recordings in total, with 1,500 classes and species tags.

2 papers · 0 benchmarks · Audio

AV Digits Database

AV Digits Database is an audiovisual database containing normal, whispered and silent speech. 53 participants were recorded from three different views (frontal, 45 degrees, and profile) pronouncing digits and phrases in three speech modes.

2 papers · 0 benchmarks · Audio, Images, Speech

Flickr Audio Caption Corpus

The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery.

2 papers · 0 benchmarks · Audio, Speech

PhysioNet Challenge 2016

The 2016 PhysioNet/CinC Challenge aims to encourage the development of algorithms to classify heart sound recordings collected from a variety of clinical or nonclinical environments (such as in-home visits). The aim is to identify, from a single short recording (10-60 s) taken at a single precordial location, whether the subject should be referred for an expert diagnosis.

2 papers · 0 benchmarks · Audio, Medical

VOICe

VOICe is a dataset for the development and evaluation of domain adaptation methods for sound event detection. It consists of mixtures of three different sound events ("baby crying", "glass breaking", and "gunshot") superimposed on three different categories of acoustic scenes: vehicle, outdoors, and indoors. The mixtures are also offered without any background noise.

2 papers · 0 benchmarks · Audio

JVS-MuSiC

JVS-MuSiC is a Japanese multispeaker singing-voice corpus intended for analyzing and synthesizing a variety of voices. The corpus consists of 100 singers' recordings of the same song, "Katatsumuri", a Japanese children's song, plus one additional song that differs for each singer.

2 papers · 0 benchmarks · Audio, Speech

OpenSLR (Open Speech and Language Resources)

OpenSLR is a repository of open speech and language resources, including large-scale transcribed audio corpora and related software. It serves as a central platform for researchers and practitioners to access and share datasets used in speech recognition (ASR), text-to-speech (TTS), and linguistic research.

2 papers · 0 benchmarks · Audio, Texts

Fingerprint Dataset (Neural Audio Fingerprint Dataset)

This dataset includes all music sources, background noises, impulse response (IR) samples, and conversational speech used in "Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning", ICASSP 2021 (https://arxiv.org/abs/2010.11910).

2 papers · 0 benchmarks · Audio, Music, Speech
Page 13 of 25