486 machine learning datasets
gtzan_music_speech is a dataset for music/speech discrimination. It consists of 120 tracks, each 30 seconds long, with 60 samples per class (music/speech). The tracks are all 22,050 Hz mono 16-bit audio files in .wav format.
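A minimal loading sketch, assuming the common archive layout with music_wav/ and speech_wav/ folders (the directory names are an assumption and may differ in your copy):

```python
# Hypothetical loading sketch for gtzan_music_speech; the folder names
# (music_wav/, speech_wav/) are assumptions about the archive layout.
from pathlib import Path

import numpy as np
from scipy.io import wavfile

def load_clips(root):
    """Yield (waveform, label) pairs; label is 0 for music, 1 for speech."""
    for label, folder in enumerate(["music_wav", "speech_wav"]):
        for path in sorted(Path(root, folder).glob("*.wav")):
            rate, data = wavfile.read(path)          # 16-bit PCM -> int16 array
            assert rate == 22050 and data.ndim == 1  # 22,050 Hz mono, per the spec
            yield data.astype(np.float32) / 32768.0, label  # scale to [-1, 1)
```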
The AllMusic Mood Subset (AMS) is a dataset for mood classification from songs. It was created by matching a 67k-track subset of the Million Song Dataset (MSD) with expert annotations of 188 different moods collected from AllMusic.
An audio dataset containing about 1,500 audio clips recorded by multiple professional players.
Medley2K is a dataset that consists of 2,000 medleys and 7,712 labeled transitions.
The HumBug Zooniverse dataset is a dataset of mosquito audio recordings. Labelled by over a thousand contributors, it contains 195,434 two-second labels, of which approximately 10 percent signify mosquito events.
The SWC is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedia. The corpus has several distinctive characteristics.
The Parkinson Speech Dataset is an audio dataset consisting of recordings from 20 Parkinson's disease (PD) patients and 20 healthy subjects. Twenty-six sound recordings of multiple types were taken from each subject. The goal is to classify which subjects have Parkinson's disease.
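Because each subject contributes 26 recordings, evaluation should split by subject rather than by recording, so the classifier is always tested on an unseen speaker. A sketch with scikit-learn's LeaveOneGroupOut, using placeholder arrays in place of real acoustic features:

```python
# Subject-wise cross-validation sketch; X, y, and groups are placeholders
# standing in for real acoustic features and labels extracted from the corpus.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_rec = 40, 26                        # 20 PD + 20 healthy, 26 recordings each
X = rng.normal(size=(n_subjects * n_rec, 13))     # placeholder feature vectors
groups = np.repeat(np.arange(n_subjects), n_rec)  # subject id per recording
y = np.repeat(np.arange(n_subjects) < 20, n_rec).astype(int)  # 1 = PD (placeholder order)

# Leave-one-subject-out keeps all 26 recordings of a subject in the same fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(scores.mean())
```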
The RWCP Sound Scene Database includes non-speech sounds recorded in an anechoic room, reconstructed signals in various rooms, impulse responses for a microphone array, speech data recorded with the same array, and recordings of background noises. It is intended for use when simulating sound scenes. It was developed by the Real Acoustic Environments Working Group of the Real World Computing Partnership (RWCP). The data was recorded from 1998 to 2000.
NAR is a dataset of audio recordings made with the humanoid robot Nao in real-world conditions for sound recognition benchmarking. All the recordings were collected using the robot's microphone and thus have the following characteristics:
- recorded with low-quality sensors (300 Hz – 18 kHz bandpass)
- suffering from typical fan noise from the robot's internal hardware
- recorded in multiple real domestic environments (no special acoustic characteristics, reverberations, presence of multiple sound sources at unknown locations)
ARVSU contains a large body of image variations of visual scenes, with an annotated utterance and a corresponding addressee for each scenario.
This dataset is composed of paired videos of people dancing in three different music styles: ballet, Michael Jackson, and salsa. It contains multimodal data (visual data, temporal graphs, and audio) carefully selected from publicly available videos of dancers performing movements representative of each music style, together with audio data from the respective styles.
Kinect-WSJ is a multichannel, multi-speaker, reverberated, noisy dataset that extends the single-channel, non-reverberated, noiseless WSJ0-2mix dataset to the strong reverberation and noise conditions and the Kinect-like microphone array geometry used in CHiME-5.
Fongbe data collected by Fréjus A. A. LALEYE.
The POTUS Corpus is a database of weekly addresses for the study of stance in politics and virtual agents.
A dataset for multimodal skill assessment, focused on assessing a piano player's skill level. Annotations include the player's skill level and the song's difficulty level. Bounding-box annotations around the pianists' hands are also provided.
The NISQA Corpus includes more than 14,000 speech samples with simulated (e.g. codecs, packet loss, background noise) and live (e.g. mobile phone, Zoom, Skype, WhatsApp) conditions. Each file is labelled with subjective ratings of the overall quality and of the quality dimensions Noisiness, Coloration, Discontinuity, and Loudness. In total, the corpus contains more than 97,000 human ratings for each of these dimensions and for the overall MOS.
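The per-file MOS values are averages of the individual ratings; a minimal aggregation sketch (the CSV name and column names here are hypothetical, not the corpus's actual file layout):

```python
# Hypothetical aggregation of raw ratings into per-file scores;
# "ratings.csv" and its columns are illustrative, not the NISQA layout.
import pandas as pd

ratings = pd.read_csv("ratings.csv")  # columns: file, rater, mos, noi, col, dis, loud
per_file = ratings.groupby("file")[["mos", "noi", "col", "dis", "loud"]].mean()
print(per_file.head())
```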
BIRD (Big Impulse Response Dataset) is an open dataset of 100,000 multichannel room impulse responses (RIRs) generated from simulations using the image method; at the time of release it was the largest multichannel open RIR dataset available. These RIRs can be used to perform efficient online data augmentation for scenarios that involve two microphones and multiple sound sources.
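A minimal augmentation sketch under stated assumptions: a dry mono source and a two-channel RIR, convolved per channel to simulate the two-microphone capture (both arrays below are toy placeholders, not BIRD data):

```python
# Reverberant data augmentation sketch: convolve a dry source with a
# two-channel room impulse response. `dry` and `rir` are placeholder arrays;
# in practice the RIR would be one of the simulated responses in BIRD.
import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)
dry = rng.normal(size=16000)  # 1 s of placeholder source signal at 16 kHz
rir = rng.normal(size=(2, 4000)) * np.exp(-np.linspace(0, 8, 4000))  # toy decaying 2-mic RIR

# One convolution per microphone channel yields the simulated 2-channel capture.
wet = np.stack([fftconvolve(dry, rir[ch]) for ch in range(2)])
print(wet.shape)  # (2, 19999)
```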
This is a dataset of 22.5 hours of audio synthesized with the open-source learnfm clone of the DX7 FM synthesizer, based upon 31K presets from Bobby Blue. These represent "natural" synthesis sounds, i.e. presets devised by humans.
VAST Absorption is a dataset of spatial binaural features annotated with acoustic properties such as the 3D source position and the walls’ absorption coefficients.
Boombox is a multimodal dataset for visual reconstruction from acoustic vibrations. It was collected by dropping objects into a box and capturing the resulting images and vibrations, and it is used to train ML systems that predict images from vibrations.