Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

MUSDB18-HQ

MUSDB18-HQ is a high-quality version of the MUSDB18 music tracks dataset. It consists of the same 150 songs, but instead of MP4 files (compressed with the Advanced Audio Coding encoder at 256 kbps, with bandwidth limited to 16 kHz), the songs are provided as raw WAV files. Source: https://sigsep.github.io/datasets/musdb.html

15 papers · 10 benchmarks · Audio
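To illustrate what "raw WAV" means at the container level, the sketch below writes and reads back one second of silent stereo 16-bit PCM with Python's stdlib `wave` module. The 44.1 kHz / 16-bit parameters are assumptions chosen for illustration; the dataset documentation defines the actual specs.

```python
import io
import struct
import wave

# Build a 1-second stereo 16-bit PCM WAV in memory. The parameters
# (44.1 kHz, 16-bit) are illustrative assumptions, not taken from
# the MUSDB18-HQ docs.
RATE = 44100
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)                   # stereo
    w.setsampwidth(2)                   # 16-bit samples (2 bytes)
    w.setframerate(RATE)
    frame = struct.pack("<2h", 0, 0)    # one silent stereo frame
    w.writeframes(frame * RATE)         # 1 second of audio

# Read the header back: uncompressed PCM keeps these parameters
# directly in the file, unlike an AAC-compressed MP4 stream.
buf.seek(0)
with wave.open(buf, "rb") as r:
    params = r.getparams()

print(params.nchannels, params.framerate, params.nframes)  # 2 44100 44100
```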

GTZAN

The GTZAN audio dataset contains 1,000 tracks of 30-second length. There are 10 genres (blues, classical, country, disco, hip hop, jazz, metal, pop, reggae, rock), each containing 100 tracks, all of which are 22,050 Hz mono 16-bit audio files in .wav format.

15 papers · 3 benchmarks · Audio
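The GTZAN specs above (1,000 tracks, 30 s each, 22,050 Hz mono, 16-bit) pin down the raw audio footprint. A back-of-the-envelope calculation, ignoring WAV header overhead:

```python
# Storage estimate from the stated GTZAN specs:
# 1,000 tracks x 30 s, 22,050 Hz mono, 16-bit (2 bytes per sample).
TRACKS, SECONDS = 1000, 30
RATE, BYTES_PER_SAMPLE, CHANNELS = 22050, 2, 1

bytes_per_track = SECONDS * RATE * BYTES_PER_SAMPLE * CHANNELS
total_gb = TRACKS * bytes_per_track / 1e9

print(bytes_per_track, round(total_gb, 2))  # 1323000 1.32
```

So each clip is about 1.3 MB of PCM, and the whole corpus is roughly 1.3 GB before any file-format overhead.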

QuerYD

A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.

15 papers · 6 benchmarks · Audio, Texts, Videos

ToyADMOS

The ToyADMOS dataset is a machine operating sounds dataset of approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds, collected with four microphones at a 48 kHz sampling rate, prepared by Yuma Koizumi and members of NTT Media Intelligence Laboratories. The ToyADMOS dataset is designed for anomaly detection in machine operating sounds (ADMOS) research. It covers three ADMOS tasks: product inspection (toy car), fault diagnosis for a fixed machine (toy conveyor), and fault diagnosis for a moving machine (toy train).

15 papers · 0 benchmarks · Audio

CPED (Chinese Personalized and Emotional Dialogue)

We construct a dataset named CPED from 40 Chinese TV shows. CPED consists of multi-source knowledge related to empathy and personal characteristics. This knowledge covers 13 emotions, gender, Big Five personality traits, 19 dialogue acts, and other annotations.

15 papers · 16 benchmarks · Audio, Texts, Videos

AVSD (Audio-Visual Scene-Aware Dialog)

The Audio Visual Scene-Aware Dialog (AVSD) dataset, or DSTC7 Track 3, is an audio-visual dataset for dialogue understanding. The goal of the dataset and track was to design systems that generate responses in a dialog about a video, given the dialog history and the audio-visual content of the video.

14 papers · 1 benchmark · Audio, Texts, Videos

TAU Urban Acoustic Scenes 2019

The TAU Urban Acoustic Scenes 2019 development dataset consists of 10-second audio segments from 10 acoustic scenes: airport, indoor shopping mall, metro station, pedestrian street, public square, street with a medium level of traffic, travelling by tram, travelling by bus, travelling by underground metro, and urban park. Each acoustic scene has 1,440 segments (240 minutes of audio), and the dataset contains 40 hours of audio in total.

14 papers · 2 benchmarks · Audio
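The stated totals are internally consistent, which is easy to check from the per-scene figures (1,440 segments of 10 s per scene, 10 scenes):

```python
# Sanity-check the TAU Urban Acoustic Scenes 2019 totals.
SCENES = 10
SEGMENTS_PER_SCENE = 1440
SEGMENT_SECONDS = 10

minutes_per_scene = SEGMENTS_PER_SCENE * SEGMENT_SECONDS / 60
total_hours = SCENES * minutes_per_scene / 60

print(minutes_per_scene, total_hours)  # 240.0 40.0
```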

DailyTalk

DailyTalk is a high-quality conversational speech dataset designed for text-to-speech. We sampled, modified, and recorded 2,541 dialogues from the open-domain dialogue dataset DailyDialog, each long enough to represent the context of its dialogue.

14 papers · 0 benchmarks · Audio

LAV-DF (Localized Audio Visual DeepFake Dataset)

Localized Audio Visual DeepFake Dataset (LAV-DF).

14 papers · 4 benchmarks · Audio, Videos

TUT Acoustic Scenes 2017

The TUT Acoustic Scenes 2017 dataset is a collection of recordings from various acoustic scenes, all from distinct locations. For each recording location, a 3-5 minute audio recording was captured and split into 10-second segments, which serve as the unit samples for this task. All audio clips are recorded at a 44.1 kHz sampling rate with 24-bit resolution.

13 papers · 1 benchmark · Audio
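Splitting each 3-5 minute recording into 10-second unit samples can be sketched as below. Whether a trailing partial segment is kept or dropped is an assumption here (this sketch drops it); the official DCASE materials define the exact segmentation.

```python
def num_segments(recording_seconds: int, segment_seconds: int = 10) -> int:
    """Count non-overlapping fixed-length segments in a recording.

    Assumes a trailing partial segment is dropped -- an illustrative
    choice, not necessarily the dataset's actual rule.
    """
    return recording_seconds // segment_seconds

# A 3-minute recording yields 18 unit samples, a 5-minute one 30.
print(num_segments(3 * 60), num_segments(5 * 60))  # 18 30
```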

L3DAS22

L3DAS22: Machine Learning for 3D Audio Signal Processing. This dataset supports the L3DAS22 IEEE ICASSP Grand Challenge. The challenge is accompanied by a Python API that facilitates dataset download and preprocessing, training and evaluation of the baseline models, and results submission.

13 papers · 0 benchmarks · Audio

UnAV-100

Existing audio-visual event localization (AVE) work handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic, as natural videos often contain numerous audio-visual events of different categories. To better adapt to real-life applications, we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events covering 100 event categories. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. We believe our UnAV-100, with its realistic complexity, can promote the exploration of comprehensive audio-visual video understanding.

13 papers · 2 benchmarks · Audio, Videos

ASAP (Aligned Scores and Performances)

ASAP is a dataset of 222 digital musical scores aligned with 1068 performances (more than 92 hours) of Western classical piano music.

13 papers · 2 benchmarks · Audio, MIDI, Music

FSDKaggle2018

FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology. FSDKaggle2018 has been used for the DCASE Challenge 2018 Task 2. All audio samples are gathered from Freesound and are provided as uncompressed 16-bit PCM, 44.1 kHz mono audio files. The 41 categories of the AudioSet Ontology are: "Acoustic_guitar", "Applause", "Bark", "Bass_drum", "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet", "Computer_keyboard", "Cough", "Cowbell", "Double_bass", "Drawer_open_or_close", "Electric_piano", "Fart", "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong", "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling", "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe", "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak", "Tambourine", "Tearing", "Telephone", "Trumpet", "Violin_or_fiddle", "Writing".

12 papers · 2 benchmarks · Audio
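A classifier trained on FSDKaggle2018 needs the 41 labels above mapped to integer ids. A minimal sketch (the id assignment here is simply list order, an illustrative convention, not an official one):

```python
# The 41 FSDKaggle2018 label names from the dataset description,
# mapped to integer ids by list position.
LABELS = [
    "Acoustic_guitar", "Applause", "Bark", "Bass_drum",
    "Burping_or_eructation", "Bus", "Cello", "Chime", "Clarinet",
    "Computer_keyboard", "Cough", "Cowbell", "Double_bass",
    "Drawer_open_or_close", "Electric_piano", "Fart",
    "Finger_snapping", "Fireworks", "Flute", "Glockenspiel", "Gong",
    "Gunshot_or_gunfire", "Harmonica", "Hi-hat", "Keys_jangling",
    "Knock", "Laughter", "Meow", "Microwave_oven", "Oboe",
    "Saxophone", "Scissors", "Shatter", "Snare_drum", "Squeak",
    "Tambourine", "Tearing", "Telephone", "Trumpet",
    "Violin_or_fiddle", "Writing",
]
label_to_id = {name: i for i, name in enumerate(LABELS)}

print(len(LABELS), label_to_id["Cello"])  # 41 6
```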

TED Gesture Dataset

Co-speech gestures are everywhere. People make gestures when they chat with others, give a public speech, talk on a phone, and even think aloud. Despite this ubiquity, there are not many datasets available. The main reason is that it is expensive to recruit actors/actresses and track precise body motions. There are a few datasets available (e.g., MSP AVATAR [17] and Personality Dyads Corpus [18]), but their sizes are limited to less than 3 hours, and they lack diversity in speech content and speakers. The gestures could also be unnatural, owing to inconvenient body-tracking suits and acting in a lab environment.

12 papers · 2 benchmarks · Audio, Texts, Videos

EPIC-SOUNDS

EPIC-SOUNDS is a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos from EPIC-KITCHENS-100. EPIC-SOUNDS includes 78.4k categorised and 39.2k non-categorised segments of audible events and actions, distributed across 44 classes.

12 papers · 3 benchmarks · Audio
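An annotation that captures a temporal extent plus a class label, as EPIC-SOUNDS does, boils down to a (start, end, label) record. A minimal sketch (the field names and the example labels here are assumptions for illustration, not the dataset's actual schema):

```python
from dataclasses import dataclass

@dataclass
class SoundSegment:
    """One audible event: a temporal extent plus a class label.

    Field names and example labels are illustrative, not the
    dataset's real schema.
    """
    start_s: float
    end_s: float
    label: str  # one of the 44 classes

segments = [
    SoundSegment(0.0, 1.5, "cut"),
    SoundSegment(2.0, 2.8, "sizzle"),
]

# Total labelled audio time across the segments.
labelled_seconds = sum(s.end_s - s.start_s for s in segments)
print(round(labelled_seconds, 2))  # 2.3
```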

Neptune (Neptune Long Video Understanding Benchmark)

Neptune is a dataset consisting of challenging question-answer-decoy (QAD) sets for long videos (up to 15 minutes). The goal of this dataset is to test video-language models for a broad range of long video reasoning abilities, which are provided as "question type" labels for each question, for example "video summarization", "temporal ordering", "state changes" and "creator intent" amongst others.

12 papers · 0 benchmarks · Audio, Texts, Videos

VoxForge

VoxForge is an open speech dataset that was set up to collect transcribed speech for use with free and open-source speech recognition engines (on Linux, Windows, and Mac). Source: http://www.voxforge.org/home

11 papers · 5 benchmarks · Audio, Speech, Texts

DCASE 2013

DCASE 2013 is a dataset for sound event detection. It consists of audio-only recordings where individual sound events are prominent in an acoustic scene.

11 papers · 0 benchmarks · Audio

MSD (Million Song Dataset)

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

11 papers · 0 benchmarks · Audio, Images
Page 6 of 25