Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

199 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3D meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • MIDI: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • CAD: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

199 dataset results

VocBench

VocBench is a framework that benchmarks the performance of state-of-the-art neural vocoders. It uses a systematic study to evaluate different neural vocoders in a shared environment, enabling a fair comparison between them.

2 papers · 0 benchmarks · Speech

CI-AVSR

Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR) is a dataset for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, the dataset is augmented using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one.
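Noise augmentation of this kind is usually done by mixing a background-noise clip into the clean recording at a chosen signal-to-noise ratio. The sketch below is a generic illustration, not the CI-AVSR authors' pipeline; the function name and SNR handling are ours:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix background noise into a speech clip at a target SNR in dB."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise
```

Running each clean command through several noise types and SNR levels is one plausible way to arrive at a dataset roughly ten times the size of the original collection.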

2 papers · 0 benchmarks · Speech

NPSC (Norwegian Parliamentary Speech Corpus)

The Norwegian Parliamentary Speech Corpus (NPSC) is a speech corpus made by the Norwegian Language Bank at the National Library of Norway in 2019-2021. The NPSC consists of recordings of speech from Stortinget, the Norwegian parliament, and corresponding orthographic transcriptions in Norwegian Bokmål and Norwegian Nynorsk. All transcriptions were done manually by trained linguists or philologists and were subsequently proofread to ensure consistency and accuracy. Entire days of parliamentary meetings are transcribed in the dataset.

2 papers · 0 benchmarks · Speech, Texts

ESB (End-to-End Speech Benchmark)

ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains, acoustic conditions, speaker styles, and transcription requirements.

2 papers · 0 benchmarks · Speech

Jam-ALT (A Formatting-Aware Lyrics Transcription Benchmark)

JamALT is a revision of the JamendoLyrics dataset (80 songs in 4 languages), adapted for use as an automatic lyrics transcription (ALT) benchmark.

2 papers · 7 benchmarks · Audio, Music, Speech, Texts

GOTCHA

We release the dataset for non-commercial research. Submit requests at https://forms.gle/6WPEGNWbYoEe6bte8.

2 papers · 0 benchmarks · Images, Speech, Videos

nEMO

nEMO is a simulated dataset of emotional speech in the Polish language. The corpus contains over 3 hours of samples recorded with the participation of nine actors portraying six emotional states: anger, fear, happiness, sadness, surprise, and a neutral state. The text material was carefully selected to represent the phonetics of the Polish language. The corpus is available for free under the Creative Commons license (CC BY-NC-SA 4.0).

2 papers · 0 benchmarks · Audio, Speech

EARS-Reverb

The EARS-Reverb dataset uses real recorded room impulse responses (RIRs) from multiple public datasets (ACE-Challenge, AIR, ARNI, BRUDEX, dEchorate, DetmoldSRIR, and Palimpsest). All RIRs are fullband; for multi-channel recordings, one channel is selected at random. Reverberant speech is generated by convolving the clean speech with the RIR. To avoid a time delay between the reverberant and clean signals caused by the direct path of the RIR, the beginning of the RIR is cut off up to the index with the highest amplitude. Only RIRs with an RT60 reverberation time of at most 2 s are used. Finally, the loudness of the reverberant speech is normalized to that of the clean speech using loudness, K-weighted, relative to full scale (LKFS).
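The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names are ours, and simple RMS matching stands in for a full ITU-R BS.1770 LKFS loudness measurement.

```python
import numpy as np

def trim_rir(rir):
    """Remove the initial propagation delay: cut everything before the
    direct-path peak (the index with the highest absolute amplitude)."""
    return rir[np.argmax(np.abs(rir)):]

def match_loudness(x, ref):
    """Scale x to the RMS level of ref (a stand-in for the LKFS
    loudness matching used by the real pipeline)."""
    return x * (np.sqrt(np.mean(ref ** 2)) / (np.sqrt(np.mean(x ** 2)) + 1e-12))

def make_reverberant(clean, rir, rt60_s, max_rt60_s=2.0):
    """Convolve clean speech with a trimmed RIR; reject long reverbs."""
    if rt60_s > max_rt60_s:        # RIRs with RT60 over 2 s are discarded
        return None
    rir = trim_rir(rir)
    wet = np.convolve(clean, rir)[:len(clean)]  # same length as clean
    return match_loudness(wet, clean)
```

Trimming the RIR at its peak puts the direct path at index 0, so the wet signal stays time-aligned with the clean one sample for sample.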

2 papers · 5 benchmarks · Speech

NeuroVoz (NeuroVoz: a Castillian Spanish corpus of parkinsonian speech)

The NeuroVoz dataset is a resource for computational linguistics and biomedical research, designed to improve the diagnosis and understanding of Parkinson's Disease (PD) through speech analysis. It is the first such corpus to be made publicly available in Castilian Spanish, addressing a gap in the linguistic and dialectal diversity of PD research.

2 papers · 0 benchmarks · Audio, Speech

LongVALE

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech), with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models.

2 papers · 0 benchmarks · Audio, Speech, Texts, Videos

Silent Speech EMG

Facial electromyography recordings during both silent and vocalized speech.

1 paper · 0 benchmarks · Speech

FT Speech

FT Speech is a speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data.

1 paper · 0 benchmarks · Speech

Kite

The Kite database is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs). There are three modalities present in the dataset:

1 paper · 0 benchmarks · Images, Speech, Texts

JSS Dataset (Jejueo Single Speaker Speech)

The Jejueo Single Speaker Speech (JSS) dataset consists of 10k high-quality audio files recorded by a native Jejueo speaker and a transcript file.

1 paper · 0 benchmarks · Speech, Texts

Spot the Difference Corpus

Spot the Difference Corpus is a corpus of task-oriented spontaneous dialogues which contains 54 interactions between pairs of subjects interacting to find differences in two very similar scenes. The corpus includes rich transcriptions, annotations, audio and video.

1 paper · 0 benchmarks · Speech

The Spoken Wikipedia Corpora

The SWC is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia. This corpus has several outstanding characteristics:

1 paper · 2 benchmarks · Audio, Speech

Parkinson Speech Dataset

Parkinson Speech Dataset is an audio dataset consisting of recordings of 20 Parkinson's Disease (PD) patients and 20 healthy subjects. Twenty-six sound recordings of multiple types were taken from each subject. The goal is to classify which subjects have Parkinson's.

1 paper · 0 benchmarks · Audio, Speech

NISP (A Multi-lingual Multi-accent Dataset for Speaker Profiling)

We announce the release of a new multilingual speaker dataset, the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The dataset contains speech in six languages: five Indian languages along with Indian English, recorded from 345 bilingual speakers in India. Each speaker contributed about 4-5 minutes of data, including recordings in both English and their mother tongue. Transcripts are provided in UTF-8 format. For every speaker, the dataset contains metadata such as L1, native place, medium of instruction, and current place of residence. In addition, the dataset contains physical parameters of the speakers, such as age, height, shoulder size, and weight. We hope the dataset is useful for a diverse set of research activities, including multilingual speaker recognition, language and accent recognition, and automatic speech recognition.

1 paper · 0 benchmarks · Speech

twitter politicians data

Dataset based on Twitter usernames of American politicians. Data extracted from Wikidata.

1 paper · 0 benchmarks · Speech

Kinect-WSJ

Kinect-WSJ is a multi-channel, multi-speaker, reverberated, noisy dataset that extends the single-channel, non-reverberated, noiseless WSJ0-2mix dataset to strong reverberation and noise conditions and to the Kinect-like microphone array geometry used in CHiME-5.

1 paper · 0 benchmarks · Audio, Speech
Page 7 of 10