Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

Sagalee

Sagalee is a speech recognition dataset for the Oromo language. Key features: 100 hours of read speech; 283 gender-balanced speakers; coverage of different Oromo dialects; open-sourced for research.

1 paper · 2 benchmarks · Audio, Speech

XMIDI

XMIDI is a comprehensive, large-scale symbolic music dataset that includes accurate emotion and genre labels, consisting of 108,023 MIDI files. The average duration of the music pieces is approximately 176 seconds, yielding a total dataset length of around 5,278 hours.

1 paper · 0 benchmarks · Audio, Music
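The XMIDI duration figures can be cross-checked with simple arithmetic; a quick sketch (numbers taken directly from the blurb above, with the 176 s average presumably rounded):

```python
# Cross-check XMIDI's reported totals using the figures from the blurb.
n_files = 108_023          # MIDI files in XMIDI
avg_seconds = 176          # approximate average piece duration (rounded)
total_hours = n_files * avg_seconds / 3600
print(f"{total_hours:,.0f} hours")  # ~5,281, consistent with the stated ~5,278
```

The small gap between ~5,281 and the stated ~5,278 hours is what you would expect from rounding the average duration to a whole second.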

ShiftySpeech

ShiftySpeech: A Large-Scale Synthetic Speech Dataset with Distribution Shifts

1 paper · 0 benchmarks · Audio

BIRDeep (BIRDeep_AudioAnnotations)

The BIRDeep Audio Annotations dataset is a collection of bird vocalizations from Doñana National Park, Spain. It was created as part of the BIRDeep project, which aims to optimize the detection and classification of bird species in audio recordings using deep learning techniques. The dataset is intended for use in training and evaluating models for bird vocalization detection and identification.

1 paper · 0 benchmarks · Audio, Biology, Environment, Images

taste-music-dataset (Taste Music Dataset)

This dataset is a patched version of The Taste & Affect Music Database by D. Guedes et al. It is a set of captions that describe 100 musical pieces and associate gustatory keywords with them on the basis of Guedes et al.'s findings.

1 paper · 0 benchmarks · Audio, Music, Texts

MediBeng (Synthetic Code-Switched Bengali-English Speech Conversations for Healthcare Applications)

The MediBeng dataset contains synthetic code-switched Bengali-English dialogues for training models in speech recognition (ASR), text-to-speech (TTS), and machine translation in clinical settings. The dataset is available under the CC-BY-4.0 license.

1 paper · 1 benchmark · Audio, Medical, Speech, Texts

BERSt (Basic Emotion Random phrase Shouts)


1 paper · 4 benchmarks · Audio

JamendoMaxCaps


1 paper · 0 benchmarks · Audio, Music, Texts

BAH (Behavioural Ambivalence/Hesitancy)

Recognizing complex emotions linked to ambivalence and hesitancy (A/H) can play a critical role in the personalization and effectiveness of digital behaviour change interventions. These subtle and conflicting emotions are manifested by a discord between multiple modalities, such as facial and vocal expressions and body language. Although experts can be trained to identify A/H, integrating them into digital interventions is costly and less effective. Automatic learning systems provide a cost-effective alternative that can adapt to individual users and operate seamlessly in real-time, resource-limited environments. However, there are currently no datasets available for the design of ML models to recognize A/H.

1 paper · 0 benchmarks · Audio, Texts, Videos

UniTalk

We present UniTalk, a novel dataset specifically designed for the task of active speaker detection, emphasizing challenging scenarios to enhance model generalization. Unlike previously established benchmarks such as AVA, which predominantly features old movies and thus exhibits significant domain gaps, UniTalk focuses explicitly on diverse and difficult real-world conditions. These include underrepresented languages, noisy backgrounds, and crowded scenes, such as multiple visible speakers talking concurrently or in overlapping turns. It contains over 44.5 hours of video with frame-level active speaker annotations across 48,693 speaking identities, and spans a broad range of video types that reflect real-world conditions. Through rigorous evaluation, we show that state-of-the-art models, while achieving nearly perfect scores on AVA, fail to reach saturation on UniTalk, suggesting that the ASD task remains far from solved under realistic conditions. Nevertheless, models trained on UniTalk […]

1 paper · 0 benchmarks · Audio, Videos

Laboratory effect perception during virtual stages auralization


1 paper · 0 benchmarks · Audio

DnR-nonverbal

DnR-nonverbal is a dataset for cinematic audio source separation (CASS) based on the Divide and Remaster (DnR) dataset.

1 paper · 0 benchmarks · Audio

ArVoice (ArVoice: A Multi-Speaker Dataset for Arabic Speech Synthesis)

We introduce ArVoice, a multi-speaker Modern Standard Arabic (MSA) speech corpus with diacritized transcriptions. It is intended for multi-speaker speech synthesis and can also be useful for other tasks such as speech-based diacritic restoration, voice conversion, and deepfake detection.

1 paper · 0 benchmarks · Audio, Texts

Avalinguo Audio Dataset


1 paper · 0 benchmarks · Audio


DINOS (Diverse INdustrial Operation Sounds)

DINOS (Diverse INdustrial Operation Sounds) is a large-scale, open-access dataset consisting of over 74,000 audio samples totaling more than 1,093 hours, collected from a wide range of industrial acoustic scenarios. It covers diverse manufacturing processes, materials, and operating conditions to comprehensively represent industrial sound characteristics. The dataset includes recordings from CNC cutting operations, additive manufacturing (AM) processes, and designed anomaly scenarios. For cutting, data were collected from two CNC machines: a Haas VF-2 and a Yornew VMC-300. The VF-2 recordings capture inactive, machining, and warm-up states, while the VMC-300 machines aluminum (Al-6060) under varying spindle speeds and feed rates to induce chatter, a self-excited vibration that excites the system's natural frequencies, degrading surface finish and tool life. Additional, unlabeled machining sounds were acquired from an APEC SK2540 CNC system. For AM processes, DINOS includes data from Renisha […]

1 paper · 0 benchmarks · Audio

AcousticRooms

AcousticRooms is a large-scale synthetic room impulse response (RIR) dataset designed for cross-room RIR prediction tasks. It includes over 300,000 single-channel RIRs simulated across 260 rooms spanning 10 categories, such as apartment, auditorium, office, and cafe. Each room features high-quality 3D spatial geometry and randomized material properties drawn from a diverse library of 332 acoustic materials across 11 categories. For more details, please check https://github.com/facebookresearch/AcousticRooms

1 paper · 0 benchmarks · Audio, Images, Point cloud
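For readers new to RIR datasets: applying a single-channel impulse response to a dry signal is just a convolution. A minimal, hypothetical sketch with NumPy (`apply_rir` is illustrative, not part of the AcousticRooms tooling):

```python
import numpy as np

def apply_rir(dry: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve a dry (anechoic) signal with a room impulse response,
    then peak-normalize to avoid clipping."""
    wet = np.convolve(dry, rir)
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet

# Toy check: a unit impulse through a two-tap "room" reproduces the RIR
# itself -- the direct path followed by one scaled echo.
wet = apply_rir(np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.5]))
print(wet)
```

For real recordings you would load the RIR and dry audio at matching sample rates (e.g. with `scipy.io.wavfile`) before convolving.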

WASABI

The WASABI Song Corpus is a large corpus of songs enriched with metadata extracted from music databases on the Web, resulting from the processing of song lyrics and from audio analysis. More specifically, given that lyrics encode an important part of the semantics of a song, the authors focus on describing the methods they proposed to extract relevant information from the lyrics, such as structure segmentation, topics, the explicitness of the lyrics content, the salient passages of a song, and the emotions conveyed. The corpus contains 1.73M songs with lyrics (1.41M unique lyrics) annotated at different levels with the output of the above-mentioned methods. These labels and methods can be exploited by music search engines and music professionals (e.g. journalists, radio presenters) to better handle large collections of lyrics, enabling intelligent browsing, categorization, and segmentation-based recommendation of songs.

0 papers · 0 benchmarks · Audio, Images

MedleyDB 2.0

MedleyDB 2.0 is a superset of MedleyDB, a dataset of annotated, royalty-free multitrack recordings. The second iteration of the dataset adds 74 new multitrack recordings, for 194 songs in total.

0 papers · 0 benchmarks · Audio

Mixing Secrets

Mixing Secrets is an instrument recognition dataset containing 258 multi-track recordings sourced from the Mixing Secrets for The Small Studio website. The dataset was labelled to be consistent with the MedleyDB format.

0 papers · 0 benchmarks · Audio

Page 22 of 25