Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets (filtered to the Audio modality)

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3D meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • MIDI: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • CAD: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

486 dataset results

MTASS

MTASS is an open-source dataset in which mixtures contain three types of audio signals.

4 papers · 0 benchmarks · Audio

OLGA

The OLGA dataset contains artist similarities from AllMusic, together with content features from AcousticBrainz. With 17,673 artists, this is the largest academic artist similarity dataset that includes content-based features to date.

4 papers · 0 benchmarks · Audio

M5Product

The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

4 papers · 0 benchmarks · Audio, Images, Tables, Texts, Videos

Beatles

This dataset includes the beat and downbeat annotations for Beatles albums. The annotations are provided by M. E. P. Davies et al. [1].

4 papers · 2 benchmarks · Audio

ARAUS (Affective Responses to Augmented Urban Soundscapes)

Choosing optimal maskers for existing soundscapes to effect a desired perceptual change via soundscape augmentation is non-trivial due to the extensive variety of maskers and a dearth of benchmark datasets with which to compare and develop soundscape augmentation models. To address this problem, we make publicly available the ARAUS (Affective Responses to Augmented Urban Soundscapes) dataset, which comprises a five-fold cross-validation set and an independent test set totaling 25,440 unique subjective perceptual responses to augmented soundscapes presented as audio-visual stimuli. Each augmented soundscape is made by digitally adding "maskers" (bird, water, wind, traffic, construction, or silence) to urban soundscape recordings at fixed soundscape-to-masker ratios. Responses were then collected by asking participants to rate how pleasant, annoying, eventful, uneventful, vibrant, monotonous, chaotic, calm, and appropriate each augmented soundscape was, in accordance with ISO 12913-2:2018. A minimal sketch of the fixed-ratio mixing follows this entry.

4 papers · 0 benchmarks · Audio, Tabular, Videos
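The sketch below illustrates the fixed-ratio mixing that ARAUS describes, assuming mono NumPy arrays for the soundscape and masker. The function name, the SMR sign convention, and the peak-normalization step are illustrative assumptions, not the dataset's released code.

```python
import numpy as np

def mix_at_smr(soundscape: np.ndarray, masker: np.ndarray, smr_db: float) -> np.ndarray:
    """Scale `masker` so its RMS sits `smr_db` dB below the soundscape RMS, then add."""
    n = min(len(soundscape), len(masker))
    s, m = soundscape[:n], masker[:n]
    rms = lambda x: np.sqrt(np.mean(x ** 2) + 1e-12)
    # Desired masker RMS = soundscape RMS / 10^(SMR/20), so the gain is:
    gain = rms(s) / (rms(m) * 10 ** (smr_db / 20))
    mixed = s + gain * m
    return mixed / max(1.0, np.max(np.abs(mixed)))  # peak-normalize only if it would clip

# Example (hypothetical arrays): add a bird masker 6 dB below the soundscape level
# augmented = mix_at_smr(soundscape, bird_masker, smr_db=6.0)
```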

ComMU

ComMU has 11,144 MIDI samples consisting of short note sequences created by professional composers, each paired with 12 corresponding metadata attributes. The dataset is designed for a new task, combinatorial music generation, which generates diverse and high-quality music from metadata alone using an auto-regressive language model.

4 papers · 0 benchmarks · Audio, MIDI, Music

jazznet

jazznet is a dataset of piano patterns for music audio machine learning research. The dataset comprises chords, arpeggios, scales, and chord progressions in all keys of an 88-key piano and in all inversions, for a total of 162,520 labeled piano patterns, resulting in 95 GB of data and more than 26k hours of audio. The data is also accompanied by Python scripts to enable the easy generation of new piano patterns beyond those present in the dataset. The data is broken down into small, medium, and large subsets, comprising 21,516, 30,328, and 52,360 patterns, respectively (with all the chords, arpeggios, and scales present in all subsets). A sketch of how such patterns can be enumerated follows this entry.

4 papers · 0 benchmarks · Audio
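The following is a hedged sketch of how jazznet-style patterns can be enumerated, using a single pattern family (the major triad) across every root of an 88-key piano (MIDI 21–108) and all of its inversions. The interval set and helper names are assumptions for illustration, not jazznet's own generation scripts.

```python
MAJOR_TRIAD = (0, 4, 7)  # root, major third, perfect fifth (in semitones)

def triad_inversions(root_midi: int) -> list[list[int]]:
    """Return the root position and the two inversions of a major triad on `root_midi`."""
    notes = [root_midi + i for i in MAJOR_TRIAD]
    inversions = []
    for k in range(3):
        # Move the lowest k notes up an octave to form the k-th inversion.
        inversions.append(notes[k:] + [n + 12 for n in notes[:k]])
    return inversions

# Enumerate every major-triad pattern whose highest note still fits on an 88-key piano.
patterns = [
    inv
    for root in range(21, 109)          # A0 (21) through C8 (108)
    for inv in triad_inversions(root)
    if max(inv) <= 108
]
print(len(patterns))  # count of major-triad patterns across all keys and inversions
```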

DISCO-10M

DISCO-10M is a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude.

4 papers · 0 benchmarks · Audio

ITALIC

ITALIC: An ITALian Intent Classification Dataset

4 papers · 0 benchmarks · Audio, Texts

Multi-Label Classification Dataset Repository

For each dataset we provide a short description as well as some characterization metrics: the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average imbalance ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is defined as cardinality divided by the number of labels. Diversity represents the percentage of labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance across all labels; the greater the avgIR, the more imbalanced the dataset. Finally, rDep measures the proportion of pairs of labels that are dependent at 99% confidence. A sketch of these metrics computed on a toy label matrix follows this entry. A broader description of all the characterization metrics and the partition methods used are described in

4 papers · 0 benchmarks · Audio, Biology, Images, Medical, Music, Texts, Videos
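Below is a toy sketch of the characterization metrics defined above, computed from a binary label matrix Y with m rows (instances) and q columns (labels). The chi-square-based rDep is omitted for brevity, and the variable names are illustrative rather than the repository's.

```python
import numpy as np

def characterize(Y: np.ndarray, d: int) -> dict:
    """Compute Card, Dens, Div, avgIR, and complexity for a binary label matrix Y (m x q)."""
    m, q = Y.shape
    label_counts = Y.sum(axis=0)                        # positive instances per label
    card = Y.sum() / m                                  # cardinality: avg labels per instance
    dens = card / q                                     # density: cardinality / number of labels
    labelsets = {tuple(row) for row in Y.astype(int)}   # distinct label combinations observed
    div = len(labelsets) / (2 ** q)                     # diversity: observed / possible labelsets
    # Imbalance ratio per label: most frequent label count / this label's count
    ir = label_counts.max() / np.maximum(label_counts, 1)
    avg_ir = ir.mean()
    complexity = m * q * d                              # m x q x d, as in [Read 2010]
    return dict(m=m, q=q, Card=card, Dens=dens, Div=div, avgIR=avg_ir, complexity=complexity)

# Example with a toy 4-instance, 3-label matrix and an assumed d = 10 attributes
Y = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
print(characterize(Y, d=10))
```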

VietMed (VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain)

We introduced a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech, and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms, and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.

4 papers · 3 benchmarks · Audio, Medical, Speech, Texts

Human-Animal-Cartoon

The Human-Animal-Cartoon (HAC) dataset consists of seven actions (‘sleeping’, ‘watching tv’, ‘eating’, ‘drinking’, ‘swimming’, ‘running’, and ‘opening door’) performed by humans, animals, and cartoon figures, forming three different domains. 3,381 video clips were collected from the internet, with around 1,000 for each domain, and three modalities are provided in the dataset: video, audio, and optical flow.

4 papers · 0 benchmarks · Audio, Videos

FakeMusicCaps

The FakeMusicCaps dataset contains a total of 27,605 ten-second music tracks, corresponding to almost 77 hours of audio, generated using 5 different Text-To-Music (TTM) models. It is designed as a starting dataset for training and/or evaluating models for the detection and attribution of synthetic music generated via TTM models.

4 papers · 0 benchmarks · Audio

HateMM

Hate speech has become one of the most significant issues in modern society, with implications in both the online and offline worlds. However, most of the work has primarily focused on text media, with relatively little work on images and even less on videos. Thus, early-stage automated video moderation techniques are needed to handle the videos that are being uploaded to keep the platform safe and healthy. Therefore, we curated approximately 43 hours of videos from BitChute and manually annotated them as hate or non-hate, along with the frame spans that could explain the labeling decision.

4 papers · 2 benchmarks · Audio, Videos

Jamendo Corpus

The Jamendo Corpus is a voice detection dataset consisting of 93 songs with Creative Commons licenses from the Jamendo free music sharing website. Segments of each song are annotated as “voice” (sung or spoken) or “no-voice”. The songs constitute a total of about 6 hours of music. The files are all from different artists and represent various genres of mainstream commercial music. The Jamendo audio files are coded in stereo Vorbis OGG at 44.1 kHz with a 112 kbit/s bitrate. The original split contains 61, 16, and 16 songs in the training, validation, and test sets, respectively. A sketch of expanding the segment annotations into frame-level labels follows this entry.

3 papers · 0 benchmarks · Audio, Texts
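The sketch below shows one common way such segment-level annotations are consumed: expanding (label, start, end) voice segments into frame-level targets at a fixed hop. The tuple format and the 10 ms hop are assumptions for illustration, not the corpus's official annotation format.

```python
import numpy as np

def segments_to_frames(segments, duration_s, hop_s=0.01):
    """segments: list of (label, start_s, end_s) tuples; returns a 0/1 target per hop frame."""
    n_frames = int(round(duration_s / hop_s))
    frames = np.zeros(n_frames, dtype=np.int8)
    for label, start, end in segments:
        if label == "voice":
            frames[round(start / hop_s):round(end / hop_s)] = 1  # mark voiced frames
    return frames

# Example: a 10 s excerpt with vocals from 2.5 s to 6.0 s -> 350 voiced frames at a 10 ms hop
print(segments_to_frames([("voice", 2.5, 6.0)], duration_s=10.0).sum())
```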

LITIS Rouen

The LITIS-Rouen dataset is a dataset for audio scenes. It consists of 3026 examples of 19 scene categories. Each class is specific to a location such as a train station or an open market. The audio recordings have a duration of 30 seconds and a sampling rate of 22050 Hz. The dataset has a total duration of 1500 minutes.

3 papers · 0 benchmarks · Audio

BirdVox-full-night

The BirdVox-full-night dataset contains 6 audio recordings, each about ten hours in duration. These recordings come from ROBIN autonomous recording units placed near Ithaca, NY, USA during fall 2015. They were captured on the night of September 23rd, 2015, by six different sensors, originally numbered 1, 2, 3, 5, 7, and 10. Andrew Farnsworth used the Raven software to pinpoint every avian flight call in time and frequency. He found 35,402 flight calls in total. He estimates that about 25 different species of passerines (thrushes, warblers, and sparrows) are present in this recording. Species are not labeled in BirdVox-full-night, but it is possible to tell apart thrushes from warblers and sparrows by looking at the center frequencies of their calls. The annotation process took 102 hours.

3 papers · 0 benchmarks · Audio

DCASE 2014

DCASE2014 is an audio classification benchmark.

3 papers · 0 benchmarks · Audio, Videos

SAVEE (Surrey Audio-Visual Expressed Emotion)

The Surrey Audio-Visual Expressed Emotion (SAVEE) dataset was recorded as a pre-requisite for the development of an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and phonetically balanced for each emotion. The data were recorded in a visual media lab with high-quality audio-visual equipment, then processed and labeled. To check the quality of performance, the recordings were evaluated by 10 subjects under audio, visual, and audio-visual conditions. Classification systems were built using standard features and classifiers for each of the audio, visual, and audio-visual modalities, and speaker-independent recognition rates of 61%, 65%, and 84% were achieved, respectively.

3 papers · 6 benchmarks · Audio

FSDD (Free Spoken Digit Dataset)

Free Spoken Digit Dataset (FSDD) is a simple audio/speech dataset consisting of recordings of spoken digits in wav files at 8kHz. The recordings are trimmed so that they have near minimal silence at the beginnings and ends. It contains data from 6 speakers, 3,000 recordings (50 of each digit per speaker), and English pronunciations.

3 papers · 0 benchmarks · Audio
Page 11 of 25