Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results (Videos filter applied)

X4K1000FPS

A dataset of high-resolution (4096×2160), high-frame-rate (1000 fps) video frames with extreme motion. X-TEST consists of 15 video clips, each a sequence of 33 frames of 4K, 1000-fps footage. X-TRAIN consists of 4,408 clips drawn from 110 scenes of various types; each clip is a sequence of 65 frames of 1000-fps footage.

27 papers · 8 benchmarks · Videos

UVO (Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation)

UVO is a new benchmark for open-world, class-agnostic object segmentation in videos. Besides shifting the problem focus to the open-world setup, UVO is significantly larger, providing approximately 8 times more videos than DAVIS and 7 times more mask (instance) annotations per video than YouTube-VOS and YouTube-VIS. UVO is also more challenging, as it includes many videos with crowded scenes and complex background motions.

27 papers · 4 benchmarks · Images, RGB Video, Videos

ROAD (ROAD: The ROad event Awareness Dataset for Autonomous Driving)

ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs, and the corresponding scene locations. ROAD comprises videos originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event.

27 papers · 0 benchmarks · Videos

IntentQA

IntentQA is a video question-answering dataset featuring diverse intents in daily social activities.

27 papers · 6 benchmarks · Texts, Videos

Street Scene

Street Scene is a dataset for video anomaly detection. It consists of 46 training and 35 testing high-resolution (1280×720) video sequences taken from a USB camera overlooking a two-lane street with bike lanes and pedestrian sidewalks during daytime. The dataset is challenging because of the variety of activity taking place: cars driving, turning, stopping, and parking; pedestrians walking, jogging, and pushing strollers; and bikers riding in bike lanes. In addition, the videos contain changing shadows, moving background such as a flag and trees blowing in the wind, and occlusions caused by trees and large vehicles. There are 56,847 frames for training and 146,410 frames for testing, extracted from the original videos at 15 frames per second. The dataset contains a total of 205 naturally occurring anomalous events, ranging from illegal activities such as jaywalking and illegal U-turns to events that simply do not occur in the training set.

26 papers · 10 benchmarks · Videos
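
The frames mentioned above were extracted from the source videos at 15 frames per second. Below is a rough, hedged sketch of such a preprocessing step; the file path, the fallback frame rate, and the use of OpenCV are assumptions for illustration, not details of the authors' pipeline.

```python
# Sketch: subsample a video to roughly 15 fps, as in the Street Scene
# frame-extraction step described above. The input path is a placeholder.
import cv2

def extract_frames(video_path: str, target_fps: float = 15.0) -> list:
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if FPS is unknown
    step = max(1, round(src_fps / target_fps))          # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```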

CRVD (Captured Raw Video Denoising)

The CRVD dataset consists of 55 groups of noisy-clean videos with ISO values ranging from 1600 to 25600.

26 papers · 5 benchmarks · Videos

Drive&Act

The Drive&Act dataset is a state-of-the-art multimodal benchmark for driver behavior recognition. It includes 3D skeletons in addition to frame-wise hierarchical labels for 9.6 million frames captured from 6 different views and 3 modalities (RGB, IR, and depth).

26 papers · 8 benchmarks · 3D, RGB-D, Videos

Cityscapes-VPS

Cityscapes-VPS is a video extension of the Cityscapes validation split. It provides 2,500 frames of panoptic labels that temporally extend the 500 Cityscapes image-panoptic labels, for a total of 3,000 labeled frames corresponding to the 5th, 10th, 15th, 20th, 25th, and 30th frames of each of the 500 videos, with all instance IDs associated over time. It not only supports the video panoptic segmentation (VPS) task but also provides super-set annotations for the video semantic segmentation (VSS) and video instance segmentation (VIS) tasks.

26 papers · 9 benchmarks · Videos
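
As a small illustration of the labeling scheme described above, the sketch below enumerates which frames carry panoptic labels, assuming each of the 500 videos is a 30-frame snippet with frames indexed 1-30; the indexing convention is an assumption, not the dataset's actual file layout.

```python
# Sketch: enumerate the labeled frames of Cityscapes-VPS as described above.
# Frame indices 1..30 per video are an assumed convention, not the official layout.
LABELED_FRAME_INDICES = [5, 10, 15, 20, 25, 30]  # every 5th frame carries a panoptic label

def labeled_frames(num_videos: int = 500) -> list[tuple[int, int]]:
    """Return (video_id, frame_index) pairs for all labeled frames."""
    return [(vid, idx) for vid in range(num_videos) for idx in LABELED_FRAME_INDICES]

print(len(labeled_frames()))  # 3000 labeled frames in total, matching the description
```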

Animal Kingdom

Animal Kingdom is a large and diverse dataset that provides multiple annotated tasks to enable a more thorough understanding of natural animal behaviors. The wild animal footage used in the dataset records different times of the day in an extensive range of environments containing variations in backgrounds, viewpoints, illumination and weather conditions. More specifically, the dataset contains 50 hours of annotated videos to localize relevant animal behavior segments in long videos for the video grounding task, 30K video sequences for the fine-grained multi-label action recognition task, and 33K frames for the pose estimation task, which correspond to a diverse range of animals with 850 species across 6 major animal classes.

26 papers · 6 benchmarks · Images, Videos

MSU SR-QA Dataset (MSU Super-Resolution Quality Assessment Dataset)

The dataset was built from videos drawn from the MSU Video Upscalers Benchmark, MSU Video Super-Resolution Benchmark, and MSU Super-Resolution for Video Compression Benchmark datasets. It consists of real videos (filmed with two cameras), video game footage, movies, cartoons, and dynamic ads.

26 papers · 12 benchmarks · Videos

RWTH-PHOENIX-Weather 2014

RWTH-PHOENIX-Weather 2014 is a German Sign Language dataset recorded from televised weather forecasts. The signing is captured by a stationary color camera placed in front of the sign language interpreters, who wear dark clothes in front of an artificial grey background with a color transition. All videos are recorded at 25 frames per second with a frame size of 210 by 260 pixels, and each frame shows only the interpreter box.

25 papers · 1 benchmark · Videos

IKEA ASM

A three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.

25 papers · 10 benchmarks · Videos

MSU Video Super Resolution Benchmark: Detail Restoration

This is a dataset for the video super-resolution task. It contains content that is especially hard to restore: faces, text, QR codes, license plates, unpatterned textures, and small details. The videos include different types of motion and two types of degradation: bicubic interpolation (BI) and Gaussian blurring with downsampling (BD). The resolution of all input video sequences is 480×320. Source: https://videoprocessing.ai/benchmarks/video-super-resolution.html

25 papers · 77 benchmarks · Images, Videos
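
The BI and BD degradations named above are the two standard low-resolution synthesis pipelines in the video super-resolution literature. Below is a minimal sketch of both, assuming a 4x scale factor, a particular blur sigma, and OpenCV; the benchmark's exact kernel and scale are not specified here.

```python
# Sketch of the two degradation types mentioned above, under assumed parameters.
import cv2
import numpy as np

def degrade_bi(frame: np.ndarray, scale: int = 4) -> np.ndarray:
    """BI degradation: plain bicubic downsampling."""
    h, w = frame.shape[:2]
    return cv2.resize(frame, (w // scale, h // scale), interpolation=cv2.INTER_CUBIC)

def degrade_bd(frame: np.ndarray, scale: int = 4, sigma: float = 1.6) -> np.ndarray:
    """BD degradation: Gaussian blur followed by direct downsampling."""
    blurred = cv2.GaussianBlur(frame, ksize=(0, 0), sigmaX=sigma)  # kernel size derived from sigma
    return blurred[::scale, ::scale]
```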

BDD-A (Berkeley DeepDrive Attention)

Dataset statistics are summarized and compared with the largest existing dataset, DR(eye)VE [1], in Table 1. The dataset was collected using videos selected from BDD100K [30, 31], a publicly available, large-scale, crowd-sourced driving video dataset containing human-demonstrated dashboard videos and time-stamped sensor measurements collected during urban driving in various weather and lighting conditions. To efficiently collect attention data for critical driving situations, video clips were specifically selected that both included braking events and took place in busy areas (see the supplementary materials for technical details). Videos were then trimmed to include 6.5 seconds before and 3.5 seconds after each braking event; other driving actions, e.g., turning, lane switching, and accelerating, were also included as a result. In total, 1,232 videos (about 3.5 hours) were collected following these procedures.

25 papers · 0 benchmarks · Videos
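
To make the trimming procedure described above concrete, here is a hedged sketch of cutting a 6.5 s pre-event / 3.5 s post-event window around a braking event; the event timestamp and frame rate are placeholder inputs, not details of the authors' pipeline.

```python
# Sketch: frame window spanning 6.5 s before to 3.5 s after a braking event,
# as described above. brake_time_s and fps are hypothetical inputs.
def event_window_frames(brake_time_s: float, fps: float = 30.0,
                        pre_s: float = 6.5, post_s: float = 3.5) -> tuple[int, int]:
    """Return (first_frame, last_frame) indices of the trimmed clip."""
    start = max(0, round((brake_time_s - pre_s) * fps))
    end = round((brake_time_s + post_s) * fps)
    return start, end

print(event_window_frames(brake_time_s=42.0))  # (1065, 1365) at the assumed 30 fps
```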

DHF1K

DHF1K is a video saliency dataset in which each video is annotated with a ground-truth map of binary pixel-wise gaze fixation points and a continuous map of the fixation points blurred by a Gaussian filter. DHF1K contains 1,000 videos in total: 700 are annotated, of which 600 are used for training and 100 for validation, and the remaining 300 form the test set, which is evaluated on a public server.

24 papers · 5 benchmarks · Images, Videos
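
A minimal sketch of the partition described above follows; sequential video IDs 1-1000 are an assumption about the numbering, not the official DHF1K layout.

```python
# Sketch of the DHF1K split described above: 1,000 videos, of which the first
# 700 are annotated (600 train, 100 val) and the remaining 300 form a held-out
# test set evaluated on a public server. IDs are assumed to be sequential.
video_ids = list(range(1, 1001))
train_ids = video_ids[:600]
val_ids = video_ids[600:700]
test_ids = video_ids[700:]
assert len(train_ids) + len(val_ids) + len(test_ids) == 1000
```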

FineAction

FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. FineAction introduces new opportunities and challenges for temporal action localization, thanks to its distinct characteristics of fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes.

24 papers · 21 benchmarks · Videos

MMAct

MMAct is a large-scale dataset for multi/cross-modal action understanding. It was recorded from 20 distinct subjects with seven modalities: RGB videos, keypoints, acceleration, gyroscope, orientation, Wi-Fi, and pressure signal. The dataset consists of more than 36k video clips for 37 action classes, covering a wide range of daily-life activities such as desktop-related and check-in-based ones, in four distinct scenarios.

23 papers · 2 benchmarks · Videos

Toyota Smarthome dataset (Toyota Smarthome Trimmed)

Toyota Smarthome Trimmed has been designed for the activity classification task over 31 activities. The videos were clipped per activity, resulting in a total of 16,115 short RGB+D video samples. Activities were performed in a natural manner, and as a result the dataset poses a unique combination of challenges: high intra-class variation, high class imbalance, and activities with similar motion and high duration variance. Activities are annotated with both coarse and fine-grained labels. These characteristics differentiate Toyota Smarthome Trimmed from other activity classification datasets.

23 papers · 4 benchmarks · 3D, Videos

HOI4D

HOI4D is a large-scale 4D egocentric dataset with rich annotations, intended to catalyze research on category-level human-object interaction. It consists of 2.4M RGB-D egocentric video frames over 4,000 sequences, collected from 4 participants interacting with 800 different object instances from 16 categories in 610 different indoor rooms.

23 papers · 0 benchmarks · Videos

NExT-GQA

We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding. Specifically, by forcing vision-language models (VLMs) to answer questions and simultaneously provide visual evidence, we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content, versus spurious correlations from language or irrelevant visual context. Towards this, we construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA, we scrutinize a variety of state-of-the-art VLMs. Through post-hoc attention analysis, we find that these models are weak in substantiating the answers despite their strong QA performance. This exposes a severe limitation of these models in making reliable predictions.

23 papers · 2 benchmarks · Texts, Videos
Page 12 of 51