486 machine learning datasets
486 dataset results
Audio samples processed with sound effects, to evaluate effect removal models. The audio effects applied are from the set (Distortion, Delay, Dynamic Range Compressor, Phasor, Reverb) and randomly sampled without replacement for each example; the targets are the original audio.
InfantMarmosetsVox is a dataset for multi-class call-type and caller identification. It contains audio recordings of different individual marmosets and their call-types. The dataset contains a total of 350 files of precisely labelled 10-minute audio recordings across all caller classes. The audio was recorded from five pairs of infant marmoset twins, each recorded individually in two separate sound-proofed recording rooms at a sampling rate of 44.1 kHz. The start and end time, call-type, and marmoset identity of each vocalization are provided, labeled by an experienced researcher. A PyTorch Dataloader is included in this dataset.
A short clip of video may contain progression of multiple events and an interesting story line. A human needs to capture both the event in every shot and associate them together to understand the story behind it.
ENST-Drums: an extensive audio-visual database for drum signals processing Olivier Gillet and Gaël Richard GET / ENST, CNRS LTCI, 37 rue Dareau, 75014 Paris, France
SONICS is a large-scale dataset comprising 97,164 songs — 48,090 real songs from YouTube and 49,074 fake songs from Suno & Udio — designed for synthetic song detection (SSD), also known as fake song detection (FSD). It addresses several limitations of existing datasets, such as the lack of end-to-end fake songs, limited diversity in music-lyrics, and insufficient long-duration songs. The average length of the songs in SONICS is 176 seconds, which enables the capture of long-context relationships. Moreover, SONICS provides open access to generated fake songs and is divided into 66,709 songs for training, 26,015 songs for testing, and 4,440 songs for validation. Additionally, the inclusion of song lyrics in SONICS dataset paves the way for future research in this field.
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment for enhancing surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale, realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments, demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establi
The DCASE 2017 rare sound events dataset contains isolated sound events for three classes: 148 crying babies (mean duration 2.25s), 139 glasses breaking (mean duration 1.16s), and 187 gun shots (mean duration 1.32s). As with the DCASE 2016 data, silences are not excluded from active event markings in the annotations. While this data set contains many samples per class, there are only three classes
The MuseScore dataset is a collection of 344,166 audio and MIDI pairs downloaded from MuseScore website. The audio is usually synthesized by the MuseScore synthesizer. The audio clips have diverse musical genres and are about two mins long on average.
LibriCount is a synthetic dataset for speaker count estimation. The dataset contains a simulated cocktail party environment of 0 to 10 speakers, mixed with 0dB SNR from random utterances of different speakers from the LibriSpeech CleanTest dataset. All recordings are of 5s durations, and all speakers are active for the most part of the recording.
Stanford-ECM is an egocentric multimodal dataset which comprises about 27 hours of egocentric video augmented with heart rate and acceleration data. The lengths of the individual videos cover a diverse range from 3 minutes to about 51 minutes in length. A mobile phone was used to collect egocentric video at 720x1280 resolution and 30 fps, as well as triaxial acceleration at 30Hz. The mobile phone was equipped with a wide-angle lens, so that the horizontal field of view was enlarged from 45 degrees to about 64 degrees. A wrist-worn heart rate sensor was used to capture the heart rate every 5 seconds. The phone and heart rate monitor was time-synchronized through Bluetooth, and all data was stored in the phone’s storage. Piecewise cubic polynomial interpolation was used to fill in any gaps in heart rate data. Finally, data was aligned to the millisecond level at 30 Hz.
URBAN-SED is a dataset of 10,000 soundscapes with sound event annotations generated using the scraper library. The dataset includes 10,000 soundscapes, totals almost 30 hours and includes close to 50,000 annotated sound events. Every soundscape is 10 seconds long and has a background of Brownian noise resembling the typical “hum” often heard in urban environments. Every soundscape contains between 1-9 sound evnts from the following classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren and street_music. The source material for the sound events are the clips from the UrbanSound8K dataset. URBAN-SED comes pre-sorted into three sets: train, validate and test. There are 6000 soundscapes in the training set, generated using clips from folds 1-6 in UrbanSound8K, 2000 soundscapes in the validation set, generated using clips from fold 7-8 in UrbanSound8K, and 2000 soundscapes in the test set, generated using clips from folds 9-10 in
BirdClef 2019 is a bird soundscape dataset. It contains around 350 hours of manually annotated soundscapes using 30 field recorders between January and June of 2017 in Ithaca, NY, USA. There are around 50,000 recordings in the dataset in total, with 659 classes. The dataset also contains species tags.
BirdClef 2018 is a bird soundscape dataset based on the contributions of the Xeno-canto network. The training set contains 36,496 recordings covering 1500 species of central and south America (the largest bioacoustic dataset in the literature). There are about 68 hours of recordings in total, with 1,500 classes and species tags.
AV Digits Database is an audiovisual database which contains normal, whispered and silent speech. 53 participants were recorded from 3 different views (frontal, 45 and profile) pronouncing digits and phrases in three speech modes.
The Flickr 8k Audio Caption Corpus contains 40,000 spoken captions of 8,000 natural images. It was collected in 2015 to investigate multimodal learning schemes for unsupervised speech pattern discovery. For a description of the corpus, see:
Introduction The 2016 PhysioNet/CinC Challenge aims to encourage the development of algorithms to classify heart sound recordings collected from a variety of clinical or nonclinical (such as in-home visits) environments. The aim is to identify, from a single short recording (10-60s) from a single precordial location, whether the subject of the recording should be referred on for an expert diagnosis.
VOICe is a dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures with three different sound events ("baby crying", "glass breaking", and "gunshot"), which are over-imposed over three different categories of acoustic scenes: vehicle, outdoors, and indoors. Moreover, the mixtures are also offered without any background noise.
JVS-MuSiC is a Japanese multispeaker singing-voice corpus called "JVS-MuSiC" with the aim to analyze and synthesize a variety of voices. The corpus consists of 100 singers' recordings of the same song, Katatsumuri, which is a Japanese children's song. It also includes another song that is different for each singer.
OpenSLR is a repository of open speech and language resources, including large-scale transcribed audio corpora and related software. It serves as a central platform for researchers and practitioners to access and share datasets used in speech recognition (ASR), text-to-speech (TTS), and linguistic research.
This dataset includes all music sources, background noises and impulse-reponses (IR) samples and conversation speech that have been used in the work "Neural Audio Fingerprint for High-specific Audio Retrieval based on Contrastive Learning" ICASSP 2021 (https://arxiv.org/abs/2010.11910).