Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

486 machine learning datasets (filtered by the Audio modality)

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

486 dataset results

USM-SED

USM-SED is a dataset for polyphonic sound event detection in urban sound monitoring use cases. Based on isolated sounds taken from the FSD50k dataset, 20,000 polyphonic soundscapes were synthesized, with sounds randomly positioned in the stereo panorama at different loudness levels.

1 paper · 0 benchmarks · Audio
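
As a rough illustration of this synthesis recipe (not the actual USM-SED generation code), the sketch below mixes mono event clips into a stereo soundscape, giving each event a random onset, loudness, and constant-power pan position. The gain range and pan law are assumptions for illustration only.

```python
import numpy as np

def mix_stereo_soundscape(events, duration_s=10.0, sr=44100, rng=None):
    """Mix mono clips (e.g. isolated FSD50k sounds) into one stereo soundscape."""
    rng = rng or np.random.default_rng()
    out = np.zeros((2, int(duration_s * sr)))
    for clip in events:                                  # clips assumed shorter than the scene
        onset = rng.integers(0, out.shape[1] - len(clip))       # random position in time
        gain = 10 ** (rng.uniform(-20.0, 0.0) / 20.0)           # assumed loudness range in dB
        pan = rng.uniform(0.0, 1.0)                             # 0 = hard left, 1 = hard right
        l, r = np.cos(0.5 * np.pi * pan), np.sin(0.5 * np.pi * pan)  # constant-power pan law
        out[0, onset:onset + len(clip)] += gain * l * clip
        out[1, onset:onset + len(clip)] += gain * r * clip
    return out
```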

Cleft

The Cleft dataset is a collection of ultrasound tongue imaging and audio data, gathered from children with cleft lip and palate by a research speech and language therapist working in a hospital environment.

1 paper · 0 benchmarks · Audio

Dataset of Context information for Zero Interaction Security

We release both the processed data and evaluation results from our own experiments, and the underlying raw data that can be used for future experiments and schemes in the domain of Zero-Interaction Security. Find more details in the dataset description on Zenodo.

1 paper · 0 benchmarks · Audio, Environment

GPLA-12

GPLA-12 is an acoustic leakage dataset for gas pipelines, covering 12 categories across 684 training/testing acoustic signals. The leakage signals were collected from an intact gas pipe system with externally introduced artificial leaks, then preprocessed with structured tailoring to form GPLA-12. The dataset is intended to serve as a feature learning benchmark for time-series classification tasks.

1 paper · 0 benchmarks · Audio
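
Since the entries are 1-D acoustic signals intended for classification, a toy baseline might reduce each signal to a small feature vector. The band-energy sketch below is purely illustrative and unrelated to the dataset's own preprocessing.

```python
import numpy as np

def band_energies(signal, n_bands=12):
    """Log energy in equal-width frequency bands of a 1-D acoustic signal."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2      # power spectrum
    bands = np.array_split(spectrum, n_bands)        # equal-width frequency bands
    return np.log(np.array([b.sum() for b in bands]) + 1e-12)

x = np.random.randn(16000)       # placeholder for one acoustic signal
features = band_energies(x)      # 12-dimensional feature vector
```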

CHORD (CHOrus Recognition Dataset)

CHORD is the first publicly available chorus recognition dataset, containing 627 songs.

1 paper · 0 benchmarks · Audio, Texts

USC (Uzbek Speech Corpus)

The Uzbek speech corpus (USC) comprises 958 different speakers with a total of 105 hours of transcribed audio recordings. This is the first open-source Uzbek speech corpus dedicated to the ASR task.

1 paper · 0 benchmarks · Audio

Well-being Dataset (Cambridge Well-being Dataset for Psychological Distress Analysis)

Well-being is a private dataset collected for the automatic analysis of psychological distress. It contains self-reported distress labels provided by human volunteers and consists of 30-minute interview recordings of participants.

1 paper · 1 benchmark · Audio, Speech, Time series, Videos

ChMusic

ChMusic is a traditional Chinese music dataset for model training and performance evaluation in musical instrument recognition. The dataset covers 11 musical instruments: Erhu, Pipa, Sanxian, Dizi, Suona, Zhuiqin, Zhongruan, Liuqin, Guzheng, Yangqin, and Sheng.

1 paper · 0 benchmarks · Audio, Music

ICASSP 2021 Acoustic Echo Cancellation Challenge

The ICASSP 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report good performance on synthetic datasets where the train and test samples come from the same underlying distribution. However, AEC performance often degrades significantly on real recordings. Also, most conventional objective metrics, such as echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ), do not correlate well with subjective speech quality tests in the presence of the background noise and reverberation found in realistic environments. In this challenge, we open-source two large datasets to train AEC models under both single-talk and double-talk scenarios. These datasets consist of recordings from more than 2,500 real audio devices and human speakers in real environments.

1 paper · 0 benchmarks · Audio, Speech
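
For reference, the ERLE metric mentioned above is conventionally the ratio, in dB, between the energy of the microphone signal and the energy of the residual after echo cancellation, usually measured during far-end single talk. A minimal sketch:

```python
import numpy as np

def erle_db(mic, residual, eps=1e-12):
    """Echo return loss enhancement in dB (higher = more echo removed)."""
    return 10.0 * np.log10((np.mean(mic ** 2) + eps) /
                           (np.mean(residual ** 2) + eps))
```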

INTERSPEECH 2021 Acoustic Echo Cancellation Challenge

The INTERSPEECH 2021 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important part of speech enhancement and still a top issue in audio communication and conferencing systems. Many recent AEC studies report reasonable performance on synthetic datasets where the train and test samples come from the same underlying distribution. However, AEC performance often degrades significantly on real recordings. Also, most conventional objective metrics, such as echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ), do not correlate well with subjective speech quality tests in the presence of the background noise and reverberation found in realistic environments. In this challenge, we open-source two large datasets to train AEC models under both single-talk and double-talk scenarios. These datasets consist of recordings from more than 5,000 real audio devices and human speakers in real environments.

1 paper · 0 benchmarks · Audio, Speech
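
Likewise, PESQ scores can be computed with the third-party pesq package (pip install pesq); the sketch below follows that package's documented usage, with placeholder file names.

```python
from pesq import pesq
from scipy.io import wavfile

sr_ref, ref = wavfile.read("reference.wav")   # placeholder: clean near-end speech
sr_deg, deg = wavfile.read("processed.wav")   # placeholder: echo canceller output
assert sr_ref == sr_deg == 16000              # wide-band ('wb') mode expects 16 kHz

print(f"PESQ (wb): {pesq(16000, ref, deg, 'wb'):.2f}")   # scores roughly 1.0-4.5
```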

AVASpeech-SMAD (AVASpeech-SMAD: A Strongly Labelled Speech and Music Activity Detection Dataset with Label Co-Occurrence)

We propose a dataset, AVASpeech-SMAD, to assist speech and music activity detection research. With frame-level music labels, the proposed dataset extends the existing AVASpeech dataset, which originally consists of 45 hours of audio and speech activity labels. To the best of our knowledge, the proposed AVASpeech-SMAD is the first open-source dataset that features strong polyphonic labels for both music and speech. The dataset was manually annotated and verified via an iterative cross-checking process. A simple automatic examination was also implemented to further improve the quality of the labels. Evaluation results from two state-of-the-art SMAD systems are also provided as a benchmark for future reference.

1 paper · 0 benchmarks · Audio, Music, Speech
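
To make the strong (frame-level) polyphonic labels concrete: each frame carries independent speech and music decisions, so the two activities can co-occur. The toy sketch below represents them as per-frame boolean masks and measures their overlap; the frame rate and activity spans are invented for illustration.

```python
import numpy as np

fps = 10                                    # assumed label rate: 10 frames per second
speech = np.zeros(60 * fps, dtype=bool)     # one minute of audio
music = np.zeros(60 * fps, dtype=bool)
speech[5 * fps:30 * fps] = True             # speech active from 5 s to 30 s
music[20 * fps:55 * fps] = True             # music active from 20 s to 55 s

print(f"speech:        {speech.mean():.0%} of frames")
print(f"music:         {music.mean():.0%} of frames")
print(f"co-occurrence: {(speech & music).mean():.0%} of frames")
```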

Banglish

A Bilingual Dataset for Bangla and English Voice Commands

1 paper · 1 benchmark · Audio

Cross-cultural pop song mood ratings (US, KR, BR)

Mood ratings of 8 emotions gathered across 360 pop songs from 166 raters in the US, South Korea, and Brazil, along with MIR features from Spotify.

1 paper · 0 benchmarks · Audio

EmoFilm (Emotional speech from Films)

EmoFilm is a multilingual emotional speech corpus comprising 1,115 audio instances in English, Italian, and Spanish. The audio clips (mean length 3.5 s, std 1.2 s) were extracted in wave format (uncompressed, mono, 48 kHz sample rate, 16-bit) from 43 films (originals in English and their over-dubbed Italian and Spanish versions). The genres considered include comedy, drama, horror, and thriller; the emotional states covered are anger, contempt, happiness, fear, and sadness. EmoFilm was presented at Interspeech 2018: Emilia Parada-Cabaleiro, Giovanni Costantini, Anton Batliner, Alice Baird, and Björn Schuller (2018), Categorical vs Dimensional Perception of Italian Emotional Speech, in Proc. of Interspeech, Hyderabad, India, pp. 3638-3642.

1 paper · 0 benchmarks · Audio

SES (Spanish Emotional Speech)

An essential issue in current speech synthesis is addressing the variability of human speech, one of whose main sources is the emotional state of the speaker. Most recent work in this area has focused on the prosodic aspects of speech and on rule-based formant synthesis experiments. Even when adopting an improved voice source, a smiling happy voice or the menacing quality of cold anger cannot be achieved. For this reason, we have performed two experiments aimed at developing a concatenative emotional synthesiser: a synthesiser that can copy the quality of an emotional voice without an explicit mathematical model.

1 paper · 0 benchmarks · Audio, Texts

AESI (Athens Emotional States Inventory)

The Athens Emotional States Inventory (AESI) addresses the development of ecologically valid procedures for collecting reliable and unbiased emotional data, towards computer interfaces with social and affective intelligence targeting patients with mental disorders. AESI covers the design, recording, and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness, and neutral. The items of the AESI consist of sentences whose content is indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants with a questionnaire following a Latin square design. The emotional sentences correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of AESI was performed through automatic emotion recognition experiments on speech. The resulting database contains 696 recorded utterances in the Greek language.

1 paper · 0 benchmarks · Audio, Texts

EmoSpeech

EmoSpeech contains keywords with diverse emotions and background sounds, presented to explore new challenges in audio analysis.

1 paper · 0 benchmarks · Audio, Speech

EGDB

EGDB contains transcriptions of electric guitar performances of 240 tablatures, rendered with different tones. The goal is to contribute to automatic music transcription (AMT) of guitar music, a technically challenging task.

1 paper · 0 benchmarks · Audio

MC_GRID (Multi_Channel_Grid)

Here we release the dataset (Multi_Channel_Grid, abbreviated as MC_Grid) used in our paper "LiMuSE: Lightweight Multi-Modal Speaker Extraction".

1 paper · 0 benchmarks · Audio, Speech, Videos

WHAMR_ext

WHAMR_ext is an extension of the WHAMR corpus with larger reverberation times (RT60 values between 1 s and 3 s).

1 paper · 5 benchmarks · Audio, Speech
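
For context, RT60 is the time it takes the sound field to decay by 60 dB. A standard way to estimate it from a room impulse response is Schroeder backward integration; the sketch below fits the -5 dB to -25 dB portion of the decay and extrapolates to -60 dB. It is an illustrative estimate, not the WHAMR_ext generation pipeline.

```python
import numpy as np

def rt60_from_ir(ir, sr):
    """Estimate RT60 (seconds) from an impulse response via Schroeder integration."""
    edc = np.cumsum(ir[::-1] ** 2)[::-1]                 # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)
    i5 = np.argmax(edc_db <= -5.0)                       # start of the fitted decay
    i25 = np.argmax(edc_db <= -25.0)                     # end of the fitted decay
    t = np.arange(len(ir)) / sr
    slope, _ = np.polyfit(t[i5:i25], edc_db[i5:i25], 1)  # decay rate, dB per second
    return -60.0 / slope                                 # extrapolate to -60 dB
```
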
Page 17 of 25