486 machine learning datasets
The Easy Communications (EasyCom) dataset is a world-first dataset designed to help mitigate the cocktail party effect from an augmented-reality (AR)-motivated, multi-sensor egocentric world view. The dataset contains egocentric multi-channel microphone array audio from AR glasses, wide field-of-view RGB video, speech source pose, headset microphone audio, annotated voice activity, speech transcriptions, head and face bounding boxes, and source identification labels. We have created and are releasing this dataset to facilitate research in multi-modal AR solutions to the cocktail party problem.
MIR-1K (Multimedia Information Retrieval lab, 1000 song clips) is a dataset designed for singing voice separation. It contains 1,000 song clips in which the music accompaniment and the singing voice are recorded in the left and right channels, respectively.
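A minimal sketch of how this channel convention can be used in practice, assuming an MIR-1K clip on disk (the file name is hypothetical) and the soundfile library:

```python
# Minimal sketch: split an MIR-1K-style stereo clip into accompaniment, vocals,
# and a mono mixture. soundfile returns audio as (frames, channels).
import soundfile as sf

audio, sr = sf.read("mir1k_clip.wav")      # column 0 = accompaniment, column 1 = singing voice
accompaniment = audio[:, 0]
vocals = audio[:, 1]
mixture = 0.5 * (accompaniment + vocals)   # simple mono mix for separation experiments

sf.write("mixture.wav", mixture, sr)
```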
CAL500 (Computer Audition Lab 500) is a dataset aimed at the evaluation of music information retrieval systems. It consists of 502 songs picked from western popular music. The audio is represented as a time series of the first 13 Mel-frequency cepstral coefficients (and their first and second derivatives), extracted by sliding a 12 ms half-overlapping short-time window over the waveform of each song. Each song has been annotated by at least 3 people with 135 musically relevant concepts spanning six semantic categories.
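For illustration, a hedged sketch of this kind of feature extraction using librosa; the window length and other analysis parameters below are stand-ins rather than the exact settings used for the official CAL500 features:

```python
# Sketch: 13 MFCCs plus first and second derivatives over a half-overlapping short window.
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=None)       # hypothetical file name
win = int(0.012 * sr)                           # ~12 ms analysis window
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=win, hop_length=win // 2, n_mels=40)
delta = librosa.feature.delta(mfcc)             # first derivative
delta2 = librosa.feature.delta(mfcc, order=2)   # second derivative

features = np.vstack([mfcc, delta, delta2])     # 39-dimensional feature time series
```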
The Respiratory Sound database was originally compiled to support the scientific challenge organized at the International Conference on Biomedical and Health Informatics (ICBHI) 2017.
The Sony-TAu Realistic Spatial Soundscapes 2023 (STARSS23) dataset contains multichannel recordings of sound scenes in various rooms and environments, together with temporal and spatial annotations of prominent events belonging to a set of target classes. The dataset was collected in two different countries: in Tampere, Finland, by the Audio Research Group (ARG) of Tampere University (TAU), and in Tokyo, Japan, by SONY, using a similar setup and annotation procedure. The dataset is delivered in two 4-channel spatial recording formats: a microphone array format (MIC) and a first-order Ambisonics format (FOA). These recordings serve as the development dataset for the Sound Event Localization and Detection Task of the DCASE 2023 Challenge.
The dataset consists of 1,106 action samples from seven actions, with quality scores as measured by expert human judges.
The iKala dataset is a singing voice separation dataset that comprises 252 30-second excerpts sampled from 206 iKala songs (plus 100 hidden excerpts reserved for the MIREX data mining contest). The music accompaniment and the singing voice are recorded in the left and right channels, respectively. Human-labeled pitch contours and timestamped lyrics are also provided.
The DiCOVA Challenge dataset is derived from the Coswara dataset, a crowd-sourced dataset of sound recordings from COVID-19 positive and non-COVID-19 individuals. The Coswara data were collected using a web application, launched in April 2020 and accessible over the internet by anyone around the globe. Volunteering subjects were advised to record their respiratory sounds in a quiet environment.
AVSBench is a pixel-level audio-visual segmentation benchmark that provides ground-truth labels for sounding objects. The dataset is divided into three subsets: AVSBench-object (a single-source subset and a multi-source subset) and AVSBench-semantic (a semantic-labels subset). Accordingly, three settings are studied, one for each subset.
DIRHA-English is a multi-microphone database composed of 1-minute real and simulated sequences. The overall corpus comprises different types of sequences, including: 1) phonetically rich sentences; 2) WSJ 5k utterances; 3) WSJ 20k utterances; 4) conversational speech (also including keywords and commands). The sequences are available for both UK and US English at 48 kHz. The DIRHA-English dataset offers the possibility to work with a very large number of microphone channels, to use microphone arrays with different characteristics, and to consider different speech recognition tasks (e.g., phone-loop, keyword spotting, ASR with small and very large language models).
The FSDnoisy18k dataset is an open dataset containing 42.5 hours of audio across 20 sound event classes, including a small amount of manually-labeled data and a larger quantity of real-world noisy data. The audio content is taken from Freesound, and the dataset was curated using the Freesound Annotator. The noisy set of FSDnoisy18k consists of 15,813 audio clips (38.8h), and the test set consists of 947 audio clips (1.4h) with correct labels. The dataset features two main types of label noise: in-vocabulary (IV) and out-of-vocabulary (OOV). IV applies when, given an observed label that is incorrect or incomplete, the true or missing label is part of the target class set. Analogously, OOV means that the true or missing label is not covered by those 20 classes.
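As a small illustration of these two noise types (not part of the dataset tools; the class names below are placeholders for the 20-class vocabulary):

```python
# Classify a labeling error as in-vocabulary (IV) or out-of-vocabulary (OOV)
# with respect to the set of target classes.
TARGET_CLASSES = {"class_a", "class_b", "class_c"}  # placeholders for the 20 target classes

def noise_type(true_label: str) -> str:
    """Return 'IV' if the true (or missing) label belongs to the target class set, else 'OOV'."""
    return "IV" if true_label in TARGET_CLASSES else "OOV"

print(noise_type("class_b"))   # IV: the correct label exists in the vocabulary
print(noise_type("class_z"))   # OOV: the correct label lies outside the target classes
```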
The Sony-TAu Realistic Spatial Soundscapes 2022 (STARSS22) dataset consists of recordings of real scenes captured with a high-channel-count spherical microphone array (SMA). The recordings were conducted by two different teams at two different sites: Tampere University in Tampere, Finland, and Sony facilities in Tokyo, Japan. Recordings at both sites share the same capturing and annotation process and a similar organization. They are organized in sessions corresponding to distinct rooms, human participants, and sound-making props, with a few exceptions.
DESED is a dataset designed for recognizing sound event classes in domestic environments. It is intended for sound event detection (SED: recognizing events together with their time boundaries), but it can also be used for sound event tagging (SET: indicating the presence of an event in an audio file). The dataset covers 10 event classes to recognize in 10-second audio files: Alarm/bell/ringing, Blender, Cat, Dog, Dishes, Electric shaver/toothbrush, Frying, Running water, Speech, and Vacuum cleaner.
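To make the SED/SET distinction concrete, here is a short sketch with hand-written strong labels for a hypothetical 10-second clip, collapsed to the clip-level tags that SET needs:

```python
# Strong labels: (onset_s, offset_s, class) within a 10-second clip (illustrative values).
strong_labels = [
    (0.5, 2.3, "Speech"),
    (1.0, 4.0, "Dog"),
    (6.2, 9.8, "Vacuum cleaner"),
]

# Sound event tagging (SET) only needs the set of classes present in the clip.
weak_tags = sorted({cls for _, _, cls in strong_labels})
print(weak_tags)  # ['Dog', 'Speech', 'Vacuum cleaner']
```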
The Multimodal Corpus of Sentiment Intensity (CMU-MOSI) dataset is a collection of 2,199 opinion video clips. Each opinion video is annotated with sentiment in the range [-3, 3]. The dataset is rigorously annotated with labels for subjectivity, sentiment intensity, per-frame and per-opinion annotated visual features, and per-millisecond annotated audio features.
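A common evaluation convention (not part of the dataset itself) maps the continuous sentiment score to binary or 7-class labels; a minimal sketch:

```python
def to_binary(score: float) -> int:
    """Positive sentiment -> 1, negative -> 0 (scores of exactly 0 are often excluded)."""
    return 1 if score > 0 else 0

def to_seven_class(score: float) -> int:
    """Round and clip the score to the integers -3..3."""
    return int(max(-3, min(3, round(score))))

print(to_binary(1.8), to_seven_class(1.8))    # 1 2
print(to_binary(-0.4), to_seven_class(-0.4))  # 0 0
```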
The Free Universal Sound Separation (FUSS) dataset is a database of arbitrary sound mixtures and source-level references, for use in experiments on arbitrary sound separation. FUSS is based on the FSD50K corpus.
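A minimal sketch of how mixtures relate to their source-level references in FUSS-style data, assuming a hypothetical file layout and the soundfile library; each mixture is (approximately) the sum of its reference sources:

```python
import numpy as np
import soundfile as sf

# Hypothetical layout: one directory of equal-length reference sources per mixture.
paths = ["sources/source0.wav", "sources/source1.wav", "sources/source2.wav"]
signals, rates = zip(*(sf.read(p) for p in paths))

mixture = np.sum(signals, axis=0)            # the mixture is the sum of its sources
sf.write("mixture.wav", mixture, rates[0])
```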
The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming. The dataset contains 1,150 MIDI files and over 22,000 measures of drumming.
The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones, and ambient lighting conditions.
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human-labeled transcripts of a total of 10 sessions, each between 15 and 45 minutes long. The corpus was created to advance research in noise-robust and distant speech processing and is intended to serve as a public research and benchmarking dataset.
SONAR is a new multilingual and multimodal fixed-size sentence embedding space with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
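For intuition about what xsim-style multilingual similarity search measures, here is a generic sketch over stand-in embeddings (it does not use the SONAR encoders themselves): each source sentence should retrieve its aligned translation as the nearest neighbour.

```python
import numpy as np

def xsim_error_rate(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Fraction of source sentences whose nearest target (by cosine similarity) is not the aligned one."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = np.argmax(src @ tgt.T, axis=1)
    return float(np.mean(nearest != np.arange(len(src_emb))))

# Toy check with random stand-in embeddings: a lightly perturbed copy retrieves itself.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 256))
print(xsim_error_rate(emb, emb + 0.01 * rng.normal(size=emb.shape)))  # ~0.0
```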
MeetingBank is a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets.