Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results (modality filter: Videos)

DAVIS-Edit

DAVIS-Edit is a curated testing benchmark for video editing. The dataset contains two evaluation settings, i.e., text- and image-based editing. In addition, it offers two types of annotations for both prompt modalities, covering editing scenarios with similar (DAVIS-Edit-S) and changing (DAVIS-Edit-C) shapes, so as to address the shape-inconsistency problem in video-to-video editing.

1 paper · 0 benchmarks · Images, Texts, Videos

DeVAn (Dense Video Annotation for Video-Language Models)

DeVAn is a multi-modal dataset containing 8.5K video clips, carefully selected from previously published YouTube-based video datasets (YouTube-8M and YT-Temporal-1B), that integrate visual and auditory information. Over the span of 10 months, a team of 24 human annotators (college- and graduate-level students) created 5 short captions (1 sentence each) and 5 long summaries (3-10 sentences) for each video clip, resulting in a rich human-annotated dataset that serves as a robust ground truth for subsequent model training and evaluation.

1 paper · 0 benchmarks · Videos

3D Flow Shapes

The dataset consists of high-resolution three-dimensional (3D) turbulent flow simulations. It captures intricate vortex structures induced by a variety of shapes placed within a channel-flow environment. The dataset is generated using OpenFOAM in large eddy simulation (LES) mode, ensuring the preservation of detailed turbulent characteristics across all spatial scales.

1 paper · 0 benchmarks · Time series, Videos

SynoClip

The SynoClip dataset is a comprehensive, standard dataset designed specifically for the video synopsis task. It consists of six videos, ranging from 8 to 45 minutes, captured by outdoor-mounted surveillance cameras. The dataset is annotated with tracking information, making it an ideal resource not only for video synopsis but also for related tasks such as object detection in videos and multi-object tracking.

1 paper · 0 benchmarks · Videos

EGO-CH-Gaze (Learning to Detect Attended Objects in Cultural Sites with Gaze Signals and Weak Object Supervision)

To study the problem of weakly supervised attended object detection in cultural sites, we collected and labeled a dataset of egocentric images acquired from subjects visiting a cultural site. The dataset has been designed to offer a snapshot of the subject’s visual experience while visiting a museum and contains labels for several artworks and details attended by the subjects.

1 paper · 0 benchmarks · Environment, Images, Videos

ConsisID-preview-Data

1 paper · 0 benchmarks · Texts, Videos

ChronoMagic-ProH

1 paper · 0 benchmarks · Texts, Videos

AirLetters

AirLetters is a large collection of over 161,000 labeled video clips showing humans drawing letters and digits in the air. It is used to evaluate a model's ability to correctly classify articulated motions. Unlike existing video datasets, accurate classification on AirLetters relies on discerning motion patterns and integrating information presented over time, i.e., across many frames of video. The accompanying study revealed that, while trivial for humans, accurate representation of complex articulated motions remains an open problem for end-to-end video understanding models.

1 paper · 0 benchmarks · Videos

TimberVision

The TimberVision dataset consists of more than 2k annotated RGB images containing a total of 51k trunk components, including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in both quantity and detail by a large margin. It can be used to train oriented object detection and instance segmentation models and to evaluate the influence of multiple scene parameters on model performance. Additionally, a generic framework is provided that fuses the components detected by the models for both tasks into unified trunk representations. Geometric properties are derived automatically, and multi-object tracking is applied to further enhance robustness.

1 paper · 0 benchmarks · Images, Videos

Sound of Water 50

We collect a dataset of 805 clean videos showing the action of pouring water into a container. The dataset spans 50 unique containers made of 5 different materials and 4 different shapes, filled with both hot and cold water.

1 paper · 1 benchmark · Audio, Videos

COOOL: Challenge Of Out-Of-Label, A Novel Benchmark for Autonomous Driving

1 paper · 0 benchmarks · Videos

TUMTraffic-VideoQA

TUMTraffic-VideoQA is a novel dataset for spatiotemporal video understanding in complex roadside traffic scenarios. It comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object-captioning annotations, and 5,700 object-grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatiotemporal object expressions (sketched below), TUMTraffic-VideoQA unifies three essential tasks—multiple-choice video question answering, referred-object captioning, and spatiotemporal object grounding—within a cohesive evaluation framework.

1 paper · 0 benchmarks · Images, Texts, Videos
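
As a rough illustration of what a tuple-based spatiotemporal object expression can look like, here is a minimal Python sketch. The field names and types are assumptions for illustration, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class SpatioTemporalTuple:
    """Hypothetical tuple-style object expression: which object, when it
    appears, and where it is per frame. Field names are illustrative only."""
    category: str      # e.g. "car", "pedestrian"
    track_id: int      # stable identity across frames
    frame_range: tuple # (first_frame, last_frame), inclusive
    boxes: dict        # frame index -> (x, y, w, h), normalized

# A grounding answer could then be expressed as a single tuple instance:
answer = SpatioTemporalTuple(
    category="car",
    track_id=7,
    frame_range=(120, 310),
    boxes={120: (0.41, 0.55, 0.08, 0.06)},
)
print(answer.category, answer.frame_range)
```

One structure of this kind can plausibly serve all three unified tasks: captioning describes the object, QA reasons about it, and grounding localizes it.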

VideoDB's OCR Benchmark Public Collection

This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It spans multiple categories—ranging from finance and legal documents to software UI elements and handwritten notes—ensuring a broad representation of real-world text appearances. Each video is annotated with frame indexes to facilitate consistent and reproducible OCR benchmarks (see the extraction sketch below). Currently, the dataset includes over 25 curated videos, yielding thousands of extracted frames that present a variety of text-related challenges.

1 paper · 3 benchmarks · Images, Texts, Videos
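
Because each video is annotated with frame indexes, a reproducible OCR benchmark run reduces to seeking exactly those frames. A minimal sketch using OpenCV; the file name and index list are placeholders, not actual dataset annotations:

```python
import cv2  # pip install opencv-python

def extract_frames(video_path: str, frame_indexes: list) -> dict:
    """Seek to each annotated frame index and return the decoded frames."""
    cap = cv2.VideoCapture(video_path)
    frames = {}
    for idx in sorted(frame_indexes):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # jump to the annotated frame
        ok, frame = cap.read()
        if ok:
            frames[idx] = frame  # BGR ndarray, ready for an OCR engine
    cap.release()
    return frames

# Placeholder path and indexes; real annotations ship with the dataset.
frames = extract_frames("finance_clip_001.mp4", [12, 240, 1024])
```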

VPData

VPData is the largest video inpainting dataset, comprising over 390K clips (> 866.7 hours) with precise masks and detailed video captions.

1 paper · 0 benchmarks · RGB Video, Texts, Tracking, Videos

VPBench

VPBench is the benchmark companion to VPData, the largest video inpainting dataset, which comprises over 390K clips (> 866.7 hours) with precise masks and detailed video captions.

1 paper · 0 benchmarks · RGB Video, Texts, Tracking, Videos

AerialMPT

AerialMPT is a dataset for pedestrian tracking in aerial image sequences, presenting real-world challenges for MOT algorithms such as low frame rate, small moving objects, and complex backgrounds. It consists of 14 sequences and 307 frames with an average size of 425 × 358 pixels. The images were acquired by DLR's 4K camera system from altitudes ranging from 600 m to 1400 m, resulting in spatial resolutions (GSDs) from 8 cm/pixel to 13 cm/pixel (see the sketch below). In a post-processing step, the images were co-registered, geo-referenced, and cropped to each region of interest, resulting in sequences at 2 fps. The images were acquired during different flight campaigns in 2016 and 2017, over different scenes containing pedestrians, with varying crowd densities and movement complexities.

1 paper · 0 benchmarks · Images, Videos
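
The quoted spatial resolutions follow from the standard pinhole relation GSD = altitude × pixel pitch / focal length. A quick sanity-check sketch with illustrative sensor values (not DLR's actual camera specifications):

```python
def gsd_cm_per_px(altitude_m: float, pixel_pitch_um: float, focal_length_mm: float) -> float:
    """Ground sampling distance from the pinhole-camera relation:
    GSD = altitude * pixel_pitch / focal_length."""
    gsd_m = altitude_m * (pixel_pitch_um * 1e-6) / (focal_length_mm * 1e-3)
    return gsd_m * 100.0  # metres -> centimetres

# Illustrative values only: a 6.9 um pixel pitch and 50 mm lens give
# ~8.3 cm/px at 600 m. For fixed optics, GSD grows linearly with altitude,
# so the quoted 8-13 cm/px range suggests differing lens configurations
# across the flight campaigns.
print(round(gsd_cm_per_px(600, pixel_pitch_um=6.9, focal_length_mm=50), 1))
```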

VETRA

VETRA is a dataset for vehicle tracking in aerial image sequences, presenting unique challenges such as low frame rates, small and fast-moving objects, and strong camera movement. These characteristics allow extended tracking of numerous vehicles with varying motion behaviors over large areas and pose new challenges for MOT algorithms. VETRA consists of 52 image sequences captured from airplanes and helicopters using DLR's 3k and 4k camera systems; the acquisition sites are located in Germany and Austria. In addition to the classical training, validation, and test sets, VETRA offers a second test set specifically designed for large-area monitoring (LAM). The LAM sequences are recorded over 7 rural roads and motorways with a fixed camera speed and configuration. Each road section is captured at 4 different times of day, enabling MOT algorithms to be evaluated under different traffic loads in a static environment.

1 paper · 0 benchmarks · Images, RGB Video, Videos

Songdo Traffic (Songdo Traffic: High Accuracy Georeferenced Vehicle Trajectories from a Large-Scale Study in a Smart City)

The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories captured in high-altitude bird's-eye-view (BeV) drone footage over the Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, it is one of the most extensive aerial traffic datasets publicly available, distinguished by an exceptional temporal resolution of 29.97 trajectory points per second that enables fine-grained urban mobility analysis (see the speed-estimation sketch below).

1 paper · 0 benchmarks · Images, Tabular, Time series, Tracking, Videos
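
At 29.97 points per second, consecutive trajectory points are roughly 33 ms apart, dense enough to estimate instantaneous vehicle speed by finite differences. A minimal sketch, assuming positions are given in a local metric coordinate frame (the trajectory fragment is hypothetical):

```python
import math

FPS = 29.97     # trajectory points per second
DT = 1.0 / FPS  # ~33.4 ms between consecutive points

def speeds_kmh(points):
    """Finite-difference speed estimates from consecutive (x_m, y_m)
    positions in a projected, metric coordinate frame."""
    out = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dist_m = math.hypot(x1 - x0, y1 - y0)
        out.append(dist_m / DT * 3.6)  # m/s -> km/h
    return out

# Hypothetical fragment: a vehicle moving ~0.46 m per sample (~50 km/h).
traj = [(0.0, 0.0), (0.46, 0.0), (0.93, 0.01)]
print([round(v) for v in speeds_kmh(traj)])
```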

Mr. HiSum

Mr. HiSum is a large-scale video highlight detection and summarization dataset containing 31,892 videos selected from the YouTube-8M dataset, with reliable frame-importance score labels aggregated from 50,000+ users per video.

1 paper · 2 benchmarks · Videos

LSDBench (Long-video Sampling Dilemma Benchmark)

A benchmark focused on the sampling dilemma in long-video tasks. LSDBench is designed to evaluate the sampling efficiency of long-video VLMs. It consists of multiple-choice question-answer pairs over hour-long videos, focusing on dense, short-duration actions with high Necessary Sampling Density (NSD); a toy sketch of the dilemma follows below.

1 paper · 0 benchmarks · Actions, Images, Texts, Videos
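
To see why a high Necessary Sampling Density is hard on long videos, consider a fixed frame budget spread uniformly over an hour: a short action can fall entirely between samples. A toy sketch (the numbers are illustrative, and this is a paraphrase of the dilemma, not the benchmark's actual protocol):

```python
def uniform_sample(duration_s: float, budget: int) -> list:
    """Evenly spaced sample timestamps over the whole video."""
    step = duration_s / budget
    return [step * (i + 0.5) for i in range(budget)]

def hits(samples, start_s, end_s):
    """How many sampled timestamps fall inside the action interval."""
    return sum(start_s <= t <= end_s for t in samples)

# A 5-second action inside a 1-hour video: 64 uniform frames give one
# sample every ~56 s, so the action is missed entirely here.
samples = uniform_sample(3600, 64)
print(hits(samples, start_s=1800, end_s=1805))  # -> 0
```
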
Page 47 of 51