Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

MCCSD (Mandarin Chinese Cued Speech Dataset)

MCCSD is the first large-scale Mandarin Chinese Cued Speech dataset. It covers 23 major scenario categories (e.g., communication, transportation, and shopping) and 72 subcategories (e.g., meeting, dating, and introduction). It was recorded by four skilled native Mandarin Chinese Cued Speech cuers using portable mobile-phone cameras. The Cued Speech videos are recorded at 30 fps and 1280×720 resolution. We provide the raw Cued Speech videos, a text file (with 1,000 sentences), and corresponding annotations, which contain two kinds of data annotation: continuous video annotation with ELAN and discrete audio annotation with Praat.

0 papers · 0 benchmarks · Actions, Audio, Speech, Videos

HEADSET (HEADSET: Human Emotion Awareness under Partial Occlusions Multimodal DataSET)

The volumetric representation of human interactions is one of the fundamental domains in the development of immersive media productions and telecommunication applications. Particularly in the context of the rapid advancement of Extended Reality (XR) applications, volumetric data has proven to be an essential technology for future XR development. In this work, we present a new multimodal database to help advance the development of immersive technologies. Our proposed database provides ethically compliant and diverse volumetric data: 27 participants displaying posed facial expressions and subtle body movements while speaking, plus 11 participants wearing head-mounted displays (HMDs). The recording system consists of a volumetric capture (VoCap) studio comprising 31 synchronized modules with 62 RGB cameras and 31 depth cameras. In addition to textured meshes, point clouds, and multi-view RGB-D data, we use one Lytro Illum camera to provide light field (LF) data simultaneously.

0 papers · 0 benchmarks · 3D, 3D meshes, Audio, Images, Point cloud, RGB Video, RGB-D, Videos

DREAMING Inpainting Dataset (Diminished Reality for Emerging Applications in Medicine through Inpainting Dataset)

Dataset for the DREAMING (Diminished Reality for Emerging Applications in Medicine through Inpainting) Challenge.

0 papers · 0 benchmarks · Biomedical, Images, Medical, RGB Video, Videos

L-SVD (Large-Scale Selfie Video Dataset: A Benchmark for Emotion Recognition)

Welcome to L-SVD. L-SVD is an extensive and rigorously curated video dataset aimed at transforming the field of emotion recognition. It features more than 20,000 short video clips, each carefully annotated to represent a range of human emotions. L-SVD stands at the intersection of cognitive science, psychology, computer science, and medical science, providing a unique tool for both research and application in these fields.

0 papers · 0 benchmarks · RGB Video, Videos

CTV-Dataset (Cyclist Top-View Dataset)

The CTV-Dataset (CTV stands for Cyclist Top-View) is a trajectory dataset for cyclist behaviour in mixed-traffic environments (also known as shared spaces). It is meant to enlarge the datasets available to the community, focusing on cyclists as the main road users, to help research in understanding and predicting cyclist behaviour in shared spaces. The dataset results from an experiment conducted at TU Clausthal to extract data from possible interaction scenarios with other road users, such as pedestrians and cars, in shared spaces. The scenarios were captured using a drone at 4K (3840×2160) resolution and 29.97 fps to ensure high-quality results. The trajectories were extracted using an in-house computer vision algorithm.
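For intuition, here is a minimal sketch of deriving per-step speeds from frame-indexed trajectories like these; the (x, y) point format, metre units, and one-point-per-frame sampling are illustrative assumptions, not the dataset's actual file layout.

```python
import math

FPS = 29.97  # drone capture rate reported for the dataset

def speeds(track, fps=FPS):
    """Per-step speed (m/s) from a list of (x, y) positions sampled
    once per frame; positions are assumed to be in metres."""
    dt = 1.0 / fps
    return [math.hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(track, track[1:])]

# Hypothetical cyclist track in metres, one point per frame.
track = [(0.0, 0.0), (0.1, 0.0), (0.25, 0.05), (0.4, 0.1)]
print([round(v, 2) for v in speeds(track)])
```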

0 papers · 0 benchmarks · Texts, Videos

fish (fishway)

The data was captured from an overhead perspective, showcasing the swimming behavior of fish in a simulated flowing-water channel. This angle provides a panoramic view of the channel from above, enabling researchers to better observe and analyze fish swimming patterns, group behavior, and adaptation to water dynamics. Moreover, the overhead perspective offers more accurate spatial positioning and motion tracking, providing valuable data for studying fish behavior and ecology. Observing and analyzing this data yields a deeper understanding of fish ecological adaptability, migration patterns, and interactions with environmental factors in simulated flowing-water channels, serving as a scientific basis and decision support for areas such as aquaculture, ecological conservation, and hydraulic research. E-mail: peifei122@gmail.com

0 papers · 0 benchmarks · Videos

ABODA (Abandoned Object Dataset)

ABandoned Objects DAtaset (ABODA) is a public dataset for abandoned-object detection. It comprises 11 sequences labeled with various real-application scenarios that are challenging for abandoned-object detection, including crowded scenes, marked changes in lighting conditions, night-time detection, and both indoor and outdoor environments.

0 papers · 0 benchmarks · Images, Videos

Vript (🎬 Vript: A Video Is Worth Thousands of Words)

We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400K clips). The annotation of this dataset is inspired by the video script: to make a video, one first writes a script organizing how to shoot the scenes. Shooting a scene requires deciding the content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene has a caption of ~145 words. Besides the vision modality, we transcribe the voice-over into text and provide it along with the video title to give more background information for annotating the videos.
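To make the annotation format concrete, here is a minimal sketch of what one scene's record could look like; the field names are hypothetical and do not reflect Vript's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    """Hypothetical record for one annotated scene; fields are
    illustrative, not Vript's actual schema."""
    video_id: str
    scene_index: int
    caption: str          # dense scene caption (~145 words on average)
    shot_type: str        # e.g. "medium shot", "close-up"
    camera_movement: str  # e.g. "panning", "tilting"
    voice_over: str       # transcribed voice-over text
    video_title: str      # background context used during annotation

clip = ClipAnnotation(
    video_id="example", scene_index=0,
    caption="A chef plates a dish on a marble counter...",
    shot_type="close-up", camera_movement="panning",
    voice_over="Today we're making...", video_title="Cooking 101")
print(clip.shot_type)
```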

0 papers · 0 benchmarks · Texts, Videos

OpenEQA

The OpenEQA dataset is a significant contribution to the field of Embodied Question Answering (EQA).

0 papers · 0 benchmarks · Videos

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

The GenAI-Bench benchmark consists of 1,600 challenging real-world text prompts sourced from professional designers. Compared to benchmarks such as PartiPrompt and T2I-CompBench, GenAI-Bench captures a wider range of aspects of compositional text-to-visual generation, ranging from basic (scene, attribute, relation) to advanced (counting, comparison, differentiation, logic). The benchmark also collects human alignment ratings (1-to-5 Likert scale) on images and videos generated by ten leading models, such as Stable Diffusion, DALL-E 3, Midjourney v6, Pika v1, and Gen2.

0 papers · 0 benchmarks · Images, Texts, Videos

Temporal Logic Video (TLV) Dataset

The Temporal Logic Video (TLV) Dataset addresses the scarcity of state-of-the-art video datasets for long-horizon, temporally extended activity and object detection. It comprises two main components.

0 papers · 0 benchmarks · Images, Videos

MS-EVS Dataset (Multispectral Event-based Face detection dataset)

The MS-EVS Dataset is the first large-scale event-based dataset for face detection.

0 papers · 0 benchmarks · Hyperspectral images, Images, Videos

DeepSpeak Dataset v1.0

The DeepSpeak dataset contains over 43 hours of real and deepfake footage of people talking and gesturing in front of their webcams. The source data was collected from a diverse set of participants in their natural environments and the deepfakes were generated using state-of-the-art open-source lip-sync and face-swap software.

0 papers · 0 benchmarks · Videos

Gap Pattern Detection (Gap Pattern (Gap Up and Gap Down) Detection in Candlestick Trading Charts for Technical Analysis)

1. Candlestick Charts: Candlestick charts are a type of financial chart used to represent the price movement of an asset (e.g., stocks, cryptocurrencies) over time. Each "candlestick" consists of:
  • Body: represents the opening and closing prices.
  • Wicks (or shadows): represent the highest and lowest prices during the time period.
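As a concrete illustration, here is a minimal sketch of gap classification from consecutive OHLC candles, using one common definition (opening above the previous high is a gap up, opening below the previous low is a gap down); the dataset's exact labeling rule may differ.

```python
def classify_gap(prev_candle, candle):
    """Classify the gap between two consecutive OHLC candles,
    given as (open, high, low, close) tuples."""
    _, prev_high, prev_low, _ = prev_candle
    open_, _, _, _ = candle
    if open_ > prev_high:
        return "gap up"
    if open_ < prev_low:
        return "gap down"
    return "no gap"

# Hypothetical daily candles: (open, high, low, close).
print(classify_gap((100, 105, 99, 104), (108, 110, 107, 109)))  # gap up
print(classify_gap((100, 105, 99, 101), (96, 98, 95, 97)))      # gap down
```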

0 papers · 0 benchmarks · Images, Videos

THVD (Talking Head Video Dataset)


0 papers · 0 benchmarks · 3D, Actions, Audio, Environment, Speech, Videos

FortisAVQA

We introduce FortisAVQA, a dataset designed to assess the robustness of AVQA models. Its construction involves two key processes: rephrasing and splitting. Rephrasing modifies questions from the test set of MUSIC-AVQA to enhance linguistic diversity, thereby mitigating the reliance of models on spurious correlations between key question terms and answers. Splitting entails the automatic and reasonable categorization of questions into frequent (head) and rare (tail) subsets, enabling a more comprehensive evaluation of model performance in both in-distribution and out-of-distribution scenarios.
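For intuition, here is a minimal sketch of a frequency-based head/tail split; the grouping key (a question template) and the head fraction are illustrative assumptions, not the paper's actual procedure.

```python
from collections import Counter

def head_tail_split(questions, key, head_fraction=0.2):
    """Split questions into frequent (head) and rare (tail) subsets
    by the frequency of a grouping key; `head_fraction` is an
    illustrative threshold, not FortisAVQA's actual rule."""
    counts = Counter(key(q) for q in questions)
    ranked = [k for k, _ in counts.most_common()]
    head_keys = set(ranked[:max(1, int(len(ranked) * head_fraction))])
    head = [q for q in questions if key(q) in head_keys]
    tail = [q for q in questions if key(q) not in head_keys]
    return head, tail

questions = [
    {"template": "how many", "text": "How many instruments are playing?"},
    {"template": "how many", "text": "How many people are singing?"},
    {"template": "which", "text": "Which instrument sounds first?"},
]
head, tail = head_tail_split(questions, key=lambda q: q["template"])
print(len(head), len(tail))  # 2 1
```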

0 papers · 0 benchmarks · Audio, Images, Texts, Videos

Storytelling Video Dataset (Russian, Emotion, Gesture, Speech)

The Storytelling Video Dataset is a high-quality, human-reviewed multimodal dataset featuring over 700 full-body video recordings of native Russian speakers. Each video is 10+ minutes long and includes synchronized speech, facial expressions, gestures, and emotional variation. The dataset is ideal for multimodal research and development in areas such as emotion, gesture, and speech analysis.

0 papers · 0 benchmarks · Audio, Speech, Texts, Videos

Shanghai2020 (Shanghai-2020 Dataset)

Released by the Shanghai Central Meteorological Observatory (SCMO) in 2020, this dataset records several years of historical precipitation events in the Yangtze River Delta area. It contains a total of 43,000 samples of precipitation events, of which 40,000 are for training and 3,000 for testing. Each sample consists of 20 consecutive radar echo frames and lasts 3 hours, where the first 10 frames are at 6-minute intervals and the last 10 frames are at 12-minute intervals. Each echo frame has a 460×460 resolution and covers a 460 km × 398 km region. We additionally split 3,000 samples from the training set and use them for validation.
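To make the sample layout concrete, here is a minimal sketch; the array format, the timestamp convention, and the 10-in/10-out nowcasting split are assumptions for illustration, not the dataset's official API.

```python
import numpy as np

# One hypothetical Shanghai-2020 sample: 20 radar echo frames of 460x460.
FRAMES, H, W = 20, 460, 460
sample = np.zeros((FRAMES, H, W), dtype=np.float32)

# Minutes since the first frame: 10 frames at 6-minute intervals,
# then 10 frames at 12-minute intervals (~3 hours in total).
minutes = [6 * i for i in range(10)] + [54 + 12 * (i + 1) for i in range(10)]

# A common nowcasting setup: first 10 frames as input, last 10 as target.
inputs, targets = sample[:10], sample[10:]
print(inputs.shape, targets.shape, minutes[-1])  # (10, 460, 460) (10, 460, 460) 174
```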

0 papers · 0 benchmarks · Images, Videos

SICS-155 (Phase Recognition in Small Incision Cataract Surgery Videos)

Cataract is the leading cause of blindness worldwide, most affecting people in low- and middle-income countries (LMICs). The most widely used, most appropriate, and most cost-effective cataract surgical technique for LMICs is small incision cataract surgery (SICS). While algorithms have been developed for automated video analysis of surgical performance parameters for the cataract surgical technique predominantly used in high-income settings, no datasets or algorithms for SICS have been available so far. This MICCAI challenge introduces the first SICS video dataset and offers teams the opportunity to evaluate the effectiveness of their phase recognition algorithms. The dataset of 155 patients was recruited at Sankara Eye Hospital in India.

0 papers · 0 benchmarks · Medical, Videos
Page 51 of 51