Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

PerezGaldos (Single Spanish Speaker Dataset)

1 paper · 0 benchmarks · Audio

SoccerNet-Echoes (SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset)

SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset.

1 paper · 0 benchmarks · Audio, Texts, Videos

BIOSED-ACPD

BIOSED-ACPD: BIOacoustic Sound Event Detection - Adaptive Change Point Detection dataset

1 paper · 0 benchmarks · Audio

MeerKAT: Meerkat Kalahari Audio Transcripts

A large-scale reference dataset for bioacoustics. MeerKAT is a 1068-hour dataset of recordings from audio-recording collars worn by free-ranging meerkats (Suricata suricatta) at the Kalahari Research Centre, South Africa. Of these, 184 hours are labeled with twelve time-resolved vocalization-type ground-truth classes at millisecond resolution. The labeled 184-hour subset exhibits realistic sparsity conditions for a bioacoustic dataset (96% background noise or other signals, 4% vocalizations), dispersed across 66,398 ten-second samples spanning 251,562 labeled events with significant spectral and temporal variability, making it the first large-scale reference point with real-world conditions for benchmarking pretraining and fine-tuning approaches in bioacoustics deep learning.

1 paper · 2 benchmarks · Audio
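The headline figures above are internally consistent, which a one-line arithmetic check confirms: 66,398 ten-second samples amount to roughly 184 hours of labeled audio.

```python
# Sanity-check the MeerKAT subset figures: 66,398 ten-second samples
# should account for roughly 184 labeled hours.
samples = 66398
seconds = samples * 10
hours = seconds / 3600
print(round(hours, 1))  # → 184.4
```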

inaGVAD (InaGVAD: A Challenging French TV and Radio Corpus Annotated for Voice Activity Detection and Speaker Gender Segmentation)

InaGVAD is a Voice Activity Detection (VAD) and Speaker Gender Segmentation (SGS) dataset designed to represent the acoustic diversity of French TV and radio programs. A detailed description of InaGVAD, together with a benchmark of six freely available VAD systems and three SGS systems, is provided in a paper presented at LREC-COLING 2024.

1 paper · 0 benchmarks · Audio, Music, Speech

MINT (a Multi-modal Image and Narrative Text Dubbing Dataset)

Foley audio, critical for an immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements of real-world foley audio dubbing tasks. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobook dubbing and image/silent-video dubbing. Besides, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audi

1 paper · 0 benchmarks · Audio, Images, Texts, Videos

VietMed-Sum

In doctor-patient conversations, identifying medically relevant information is crucial, posing the need for conversation summarization. In this work, we propose the first deployable real-time speech summarization system for real-world applications in industry, which generates a local summary after every N speech utterances within a conversation and a global summary at the end of the conversation. Our system can enhance user experience from a business standpoint, while also reducing computational costs from a technical perspective. Secondly, we present VietMed-Sum which, to our knowledge, is the first speech summarization dataset for medical conversations. Thirdly, we are the first to utilize an LLM and human annotators collaboratively to create gold-standard and synthetic summaries for medical conversation summarization.

1 paper · 0 benchmarks · Audio, Medical, Texts
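The local/global summary scheme described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation; `summarize` is a stand-in for whatever summarization model is used.

```python
# Hypothetical sketch of the local/global summary scheme: a local
# summary after every n utterances, then one global summary at the end.
def summarize(utterances):
    return " | ".join(utterances)  # placeholder for a real summarization model

def stream_summaries(utterances, n=3):
    """Yield ("local", summary) after every n utterances, then one
    ("global", summary) over the whole conversation."""
    for i in range(0, len(utterances), n):
        yield ("local", summarize(utterances[i:i + n]))
    yield ("global", summarize(utterances))

out = list(stream_summaries([f"utt{i}" for i in range(7)], n=3))
```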

MuseChat Dataset (MuseChat: A Conversational Music Recommendation System for Videos (CVPR 2024 Highlight Paper))

Music recommendation for videos attracts growing interest in multi-modal research. However, existing systems focus primarily on content compatibility, often ignoring users’ preferences. Their inability to interact with users for further refinement or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat, a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video along with optional information, including previously suggested music and the user’s preferences, as inputs and retrieves appropriate music matching the context. The reasoning module, equipped with the power of a large language model (Vicuna-7B) and extended to multi-modal inputs, is able to provide a reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat, we build

1 paper · 0 benchmarks · Audio, Texts, Videos

YourMT3 Dataset

We redistribute a suite of datasets as part of the YourMT3 project. The license for redistribution is attached.

1 paper · 0 benchmarks · Audio, MIDI

VibraVox (rigid in-ear microphone)

This is the in-ear rigid earpiece-embedded microphone variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (soft in-ear microphone)

This is the in-ear comply foam-embedded microphone variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (throat microphone)

This is the throat microphone (laryngophone) variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (forehead accelerometer)

This is the forehead accelerometer variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (temple vibration pickup)

This is the temple vibration pickup variant of the VibraVox dataset.

1 paper · 8 benchmarks · Audio, Speech, Texts

VibraVox (headset microphone)

This is the reference headset microphone variant of the VibraVox dataset.

1 paper · 4 benchmarks · Audio, Speech, Texts

IS3 (Interactive-Synthetic Sound Source) Dataset

We introduce a new synthetic test set named IS3 for interactive sound source localization. By leveraging diffusion models, we generate images containing multiple sounding objects. Any combination of sounding objects can appear in the same scene. Additionally, this dataset offers unusual scenes and unique combinations that are rarely found in nature, such as ‘a donkey playing a saxophone’ or ‘a sea lion on the snow’. The dataset provides both segmentation maps and bounding-box information with class categories. IS3 includes 3,240 images, resulting in 6,480 unique audio-visual instances (2 objects per image) across 118 categories. The dataset can be used for the following tasks: 1) Sound Source Localization, 2) Audio-Visual Segmentation, 3) Semantic Segmentation.

1 paper · 0 benchmarks · Audio, Images

SF20K (Short-Films 20K)

Short-Films 20K (SF20K) is the largest publicly available movie dataset. SF20K is composed of 20,143 amateur films and offers long-form video understanding tasks in the form of multiple-choice and open-ended question answering.

1 paper · 0 benchmarks · Audio, Texts, Videos

TinyChirp

The TinyChirp dataset for model training, validation, and testing.

1 paper · 0 benchmarks · Audio

WHU - Audio ENF (WHU - Audio Electric Network Frequency)

The WHU dataset is an audio-only dataset intended for testing ENF detection. It is divided into two parts: recordings containing ENF traces (H1) and recordings without them (H0). The recordings with ENF traces are coupled with the corresponding ENF reference (H1_ref). The dataset consists of 60 real-world audio recordings captured around the Wuhan University campus, featuring a diverse range of environments and conditions. The recordings were made at a sampling rate of 44.1 kHz with 16-bit quantization and a mono channel. Among the 60 recordings, 50 were found to have captured verified ENF signals, confirmed by comparing the recording times with a reference database. The remaining 10 recordings were made in open exterior environments with strong noise and interference, and were often affected by Doppler effects due to the user walking while recording. The final dataset has 130 audio recordings in H1 and 40 in H0, obtained by randomly cropping them to durations varying from 5

1 paper · 0 benchmarks · Audio
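A minimal sketch of the kind of ENF presence test this dataset supports: compare spectral power in a narrow band around the nominal mains frequency (50 Hz in China) against the surrounding spectrum. This is an illustrative toy detector on synthetic signals, not the method used with WHU; real ENF detectors track the frequency over time.

```python
import numpy as np

def enf_band_snr(audio, sr, enf=50.0, half_bw=0.05):
    """Score ENF presence: power in a narrow band around the nominal
    mains frequency relative to the surrounding spectrum."""
    spec = np.abs(np.fft.rfft(audio * np.hanning(len(audio)))) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    in_band = (freqs > enf - half_bw) & (freqs < enf + half_bw)
    near = (freqs > enf - 2.0) & (freqs < enf + 2.0) & ~in_band
    return spec[in_band].mean() / spec[near].mean()

# Synthetic demo: a faint 50 Hz hum in noise scores far higher than noise alone.
rng = np.random.default_rng(0)
sr = 1000
t = np.arange(0, 10, 1 / sr)
hum = 0.05 * np.sin(2 * np.pi * 50 * t) + 0.1 * rng.standard_normal(t.size)
noise = 0.1 * rng.standard_normal(t.size)
```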

ENF moving video (Electric Network Frequency Moving Video Dataset)

The ENF moving video dataset, a subset of the dataset used in Temporal Localization of Non-Static Digital Videos Using the Electrical Network Frequency, consists of video recordings without an audio channel, coupled with the corresponding power ENF signal reference in WAV format at a rate of 1 kHz. The dataset comprises 8 video clips recorded in Europe at 29.97 frames per second, each approximately 11-12 minutes long, using a GoPro Hero 4 Black and an NK AC3061-4KN camera. In terms of content, videos 1-3 are entirely stationary; videos 4-5 are predominantly stationary with some movement; and videos 6-8 are non-stationary, meaning the camera is fixed but there are moving objects in most frames. All videos depict natural, everyday indoor scenes (i.e., not plain backgrounds).

1 paper · 0 benchmarks · Audio, Videos
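Video-based ENF extraction typically relies on the light flicker at twice the mains frequency (100 Hz for the 50 Hz European mains in these recordings), which is undersampled at 29.97 fps and therefore appears aliased. A small helper (an illustrative assumption, not part of the dataset's tooling) shows where in the video's frequency axis that component lands:

```python
def aliased_frequency(f_signal, f_sample):
    """Apparent frequency of f_signal after sampling at f_sample."""
    f = f_signal % f_sample
    return min(f, f_sample - f)

# 100 Hz light flicker (2 x 50 Hz European mains) in 29.97 fps video
# aliases down to roughly 10.09 Hz.
print(aliased_frequency(100.0, 29.97))
```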
Page 20 of 25