Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

TUT Sound Events 2017

The TUT Sound Events 2017 dataset contains 24 audio recordings made in a street environment, annotated with 6 classes: brakes squeaking, car, children, large vehicle, people speaking, and people walking.

8 papers · 0 benchmarks · Audio

SoundingEarth

SoundingEarth consists of co-located aerial imagery and audio samples all around the world.

8 papers · 9 benchmarks · Audio, Images

SoundDescs

We introduce a new audio dataset called SoundDescs that can be used for tasks such as text-to-audio retrieval and audio captioning. The dataset contains 32,979 pairs of audio files and text descriptions, spanning 23 categories including nature, clocks, and fire.

8 papers · 2 benchmarks · Audio, Texts

ADIMA

ADIMA is a novel, linguistically diverse, ethically sourced, expert-annotated and well-balanced multilingual profanity-detection audio dataset comprising 11,775 audio samples in 10 Indic languages, spanning 65 hours and spoken by 6,446 unique users.

8 papers · 0 benchmarks · Audio, Speech

Snips-SmartLights

The SmartLights benchmark from Snips tests the capability of controlling lights in different rooms. It consists of 1,660 requests split into five partitions for 5-fold evaluation. Sample commands: “please change the [bedroom] lights to [red]” or “i’d like the [living room] lights to be at [twelve] percent”.

8 papers · 3 benchmarks · Audio
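The 5-fold setup described above can be sketched with a simple shuffle-and-stride split; `make_folds` is a hypothetical helper for illustration, not part of the Snips release, and integer IDs stand in for the actual requests.

```python
import random

def make_folds(items, k=5, seed=0):
    """Shuffle items and split them into k near-equal partitions;
    each evaluation round holds out one fold and trains on the rest."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    # Stride slicing assigns every k-th shuffled item to the same fold.
    return [shuffled[i::k] for i in range(k)]

# 1,660 request IDs, as in Snips-SmartLights.
folds = make_folds(range(1660), k=5)
print([len(f) for f in folds])  # [332, 332, 332, 332, 332]
```

Since 1,660 divides evenly by 5, each partition holds exactly 332 requests.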

TUT-SED Synthetic 2016

TUT-SED Synthetic 2016 consists of mixture signals artificially generated from isolated sound-event samples. This approach yields more accurate onset and offset annotations than datasets recorded in real acoustic environments, where the annotations are always somewhat subjective. Mixture signals are created by randomly selecting and mixing isolated sound events from 16 sound-event classes; the resulting mixtures contain events with varying polyphony. Altogether, 994 sound-event samples were purchased from Sound Ideas. Of the 100 mixtures created, 60% were assigned for training, 20% for testing and 20% for validation. The total amount of audio material in the dataset is 566 minutes. Different instances of the sound events are used to synthesize the training, validation and test partitions. Mixtures were created by randomly selecting an event instance and, from it, a random segment of 3-15 seconds in length. Between events, random-length silent re…

7 papers · 0 benchmarks · Audio
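The mixing procedure above (random event instances, random 3-15 s segments, exact onset/offset annotations, overlapping events for polyphony) can be sketched as follows. This is a minimal sketch only: `make_mixture`, the class names, and the durations are hypothetical, and a real pipeline would sum the actual waveforms rather than just record annotations.

```python
import random

def make_mixture(events, total_len=60.0, n_events=5, min_len=3.0,
                 max_len=15.0, seed=0):
    """Pick isolated sound-event samples at random, cut a random
    3-15 s segment from each, and place the segments at random onsets
    inside a mixture, keeping exact onset/offset annotations.
    `events` maps class name -> list of sample durations in seconds."""
    rng = random.Random(seed)
    classes = list(events)
    annotations = []
    for _ in range(n_events):
        cls = rng.choice(classes)
        dur = rng.choice(events[cls])
        seg = min(dur, rng.uniform(min_len, max_len))
        onset = rng.uniform(0.0, total_len - seg)
        # Overlapping segments are allowed, giving varying polyphony.
        annotations.append((cls, round(onset, 2), round(onset + seg, 2)))
    return sorted(annotations, key=lambda a: a[1])

# Hypothetical stand-ins for isolated sample durations.
ann = make_mixture({"car": [10.0, 8.0], "dog_bark": [4.0], "siren": [12.0]})
for cls, onset, offset in ann:
    print(f"{onset:6.2f}-{offset:6.2f}  {cls}")
```

Because onsets are drawn independently, two events can overlap in time, which is where the annotated polyphony comes from.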

NES-MDB (Nintendo Entertainment System Music Database)

The Nintendo Entertainment System Music Database (NES-MDB) is a dataset intended for building automatic music composition systems for the NES audio synthesizer. It consists of 5278 songs from the soundtracks of 397 NES games. The dataset represents 296 unique composers, and the songs contain more than two million notes combined. It has file format options for MIDI, score and NLM (NES Language Modeling).

7 papers · 0 benchmarks · Audio

MSSD (Music Streaming Sessions Dataset)

The Spotify Music Streaming Sessions Dataset (MSSD) consists of 160 million streaming sessions with associated user interactions, audio features and metadata describing the tracks streamed during the sessions, and snapshots of the playlists listened to during the sessions.

7 papers · 6 benchmarks · Audio

CocoChorales

CocoChorales is a dataset consisting of over 1,400 hours of audio mixtures containing four-part chorales performed by 13 instruments, all synthesized with realistic-sounding generative models. CocoChorales contains mixes, sources, and MIDI data, as well as annotations for note expression (e.g., per-note volume and vibrato) and synthesis parameters (e.g., multi-f0).

7 papers · 0 benchmarks · Audio

CochlScene

CochlScene is a dataset for acoustic scene classification. The dataset consists of 76k samples collected from 831 participants in 13 acoustic scenes.

7 papers · 1 benchmark · Audio

SingFake (SingFake: Singing Voice Deepfake Detection)

The rise of singing voice synthesis presents critical challenges to artists and industry stakeholders over unauthorized voice usage. Unlike synthesized speech, synthesized singing voices are typically released in songs containing strong background music that may hide synthesis artifacts. Additionally, singing voices present different acoustic and linguistic characteristics from speech utterances. These unique properties make singing voice deepfake detection a relevant but significantly different problem from synthetic speech detection. In this work, we propose the singing voice deepfake detection task. We first present SingFake, the first curated in-the-wild dataset consisting of 28.93 hours of bonafide and 29.40 hours of deepfake song clips in five languages from 40 singers. We provide a train/val/test split where the test sets include various scenarios. We then use SingFake to evaluate four state-of-the-art speech countermeasure systems trained on speech utterances. We find these sys…

7 papers · 0 benchmarks · Audio, Music, Speech

SSC (Spiking Speech Commands v0.2)

The SSC dataset is a spiking version of the Speech Commands dataset released by Google. SSC was generated using Lauscher, an artificial cochlea model. The dataset consists of utterances recorded from a large number of speakers under controlled conditions; spikes were generated in 700 input channels, and it covers 35 word categories.

7 papers · 2 benchmarks · Audio

CHB-MIT (CHB-MIT Scalp EEG)

The CHB-MIT dataset consists of EEG recordings from pediatric subjects with intractable seizures. Subjects were monitored for up to several days following withdrawal of anti-seizure medication in order to characterize their seizures and assess their candidacy for surgical intervention. The dataset covers 23 patients across 24 cases (one patient has 2 recordings, 1.5 years apart) and comprises 969 hours of scalp EEG recordings with 173 seizures of various types (clonic, atonic, tonic). The diversity of patients (male and female, 10-22 years old) and of seizure types makes the dataset well suited for assessing automatic seizure detection methods in realistic settings.

6 papers · 1 benchmark · Audio, EEG, Medical

CCMixter

CCMixter is a singing voice separation dataset consisting of 50 full-length stereo tracks from ccMixter featuring many different musical genres. For each song there are three WAV files available: the background music, the voice signal, and their sum.

6 papers · 0 benchmarks · Audio

MEDIA

The MEDIA French corpus is dedicated to semantic extraction from speech in the context of human/machine dialogues. The corpus has manual transcriptions and conceptual annotations of dialogues from 250 speakers. It is split into three parts: (1) a training set (720 dialogues, 12K sentences), (2) a development set (79 dialogues, 1.3K sentences), and (3) a test set (200 dialogues, 3K sentences).

6 papers · 0 benchmarks · Audio, Texts

ADVANCE (AuDio Visual Aerial sceNe reCognition datasEt)

The AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) is a brand-new multimodal learning dataset that aims to explore the contribution of both audio and conventional visual messages to scene recognition. The dataset contains 5,075 pairs of geotagged aerial images and sounds, classified into 13 scene classes: airport, sports land, beach, bridge, farmland, forest, grassland, harbor, lake, orchard, residential area, shrub land, and train station.

6 papers · 0 benchmarks · Audio, Images

Moviescope

Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots and metadata. Moviescope is based on the IMDB 5000 dataset of 5,043 movie records, augmented by crawling the video trailer associated with each movie from YouTube and text plots from Wikipedia.

6 papers · 0 benchmarks · Audio, Texts, Videos

Spotify Podcast

A set of approximately 100K podcast episodes comprising raw audio files and accompanying ASR transcripts. This represents over 47,000 hours of transcribed audio, an order of magnitude more than previous speech-to-text corpora.

6 papers · 0 benchmarks · Audio

L3DAS21

L3DAS21 is a dataset for 3D audio signal processing. It consists of a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results-submission stage.

6 papers · 4 benchmarks · Audio

ASCEND

ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality corpus of spontaneous, multi-turn, conversational Chinese-English code-switching dialogue collected in Hong Kong. It features 23 bilingual speakers fluent in both Chinese and English and comprises 10.62 hours of clean speech.

6 papers · 0 benchmarks · Audio, Speech
Page 8 of 25