1,019 machine learning datasets
DAVIS-Edit is a curated testing benchmark for video editing. The dataset contains two evaluation settings, text- and image-based editing. In addition, it offers two types of annotations for both prompt modalities, covering editing scenarios with similar (DAVIS-Edit-S) and changing (DAVIS-Edit-C) shapes, so as to address the shape-inconsistency problem in video-to-video editing.
DeVAn is a multi-modal dataset containing 8.5K video clips carefully selected from previously published YouTube-based video datasets (YouTube-8M and YT-Temporal-1B) that integrate visual and auditory information. Over the span of 10 months, a team of 24 human annotators (college and graduate level students) created 5 short captions (1 sentence each) and 5 long summaries (3-10 sentences) for each video clip, resulting in a rich and comprehensive human-annotated dataset that serves as a robust ground truth for subsequent model training and evaluation.
The dataset consists of high-resolution three-dimensional (3D) turbulent flow simulations. It captures intricate vortex structures caused by a variety of shapes within a channel flow environment. The dataset is generated using OpenFOAM in large eddy simulation (LES) mode, ensuring the preservation of detailed turbulent characteristics across all spatial scales.
The SynoClip dataset is a comprehensive, standardized dataset designed specifically for the video synopsis task. It consists of six videos, ranging from 8 to 45 minutes in length, captured by outdoor-mounted surveillance cameras. The dataset is annotated with tracking information, making it an ideal resource not only for video synopsis but also for related tasks such as object detection in videos and multi-object tracking.
To study the problem of weakly supervised attended object detection in cultural sites, we collected and labeled a dataset of egocentric images acquired from subjects visiting a cultural site. The dataset has been designed to offer a snapshot of the subject’s visual experience while visiting a museum and contains labels for several artworks and details attended by the subjects.
AirLetters is a large collection of over 161,000 labeled video clips showing humans drawing letters and digits in the air, used to evaluate a model's ability to classify articulated motions correctly. Unlike existing video datasets, accurate classification on AirLetters relies on discerning motion patterns and integrating information over time, i.e., across many frames of video. The accompanying study revealed that, while trivial for humans, accurately representing complex articulated motions remains an open problem for end-to-end video understanding models.
The TimberVision dataset consists of more than 2k annotated RGB images containing a total of 51k trunk components, including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in both quantity and detail by a large margin. The dataset can be used to train oriented object detection and instance segmentation models and to evaluate the influence of multiple scene parameters on model performance. Additionally, a generic framework is provided that fuses the components detected by the models for both tasks into unified trunk representations. Furthermore, geometric properties are derived automatically and multi-object tracking is applied to further enhance robustness.
We collect a dataset of 805 clean videos showing the action of pouring water into a container. The dataset spans over 50 unique containers made of 5 different materials and 4 different shapes, filled with both hot and cold water.
TUMTraffic-VideoQA is a novel dataset designed for spatiotemporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatiotemporal object expressions, TUMTraffic-VideoQA unifies three essential tasks, multiple-choice video question answering, referred object captioning, and spatiotemporal object grounding, within a cohesive evaluation framework.
This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It spans multiple categories, ranging from finance and legal documents to software UI elements and handwritten notes, ensuring a broad representation of real-world text appearances. Each video is annotated with frame indexes to facilitate consistent and reproducible OCR benchmarks. Currently, the dataset includes over 25 curated videos, yielding thousands of extracted frames that present a variety of text-related challenges.
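Given such frame-index annotations, a per-frame OCR benchmark essentially reduces to comparing recognized text against the ground truth at each annotated index. A minimal sketch of such a metric; exact-match accuracy is an illustrative choice here, not the dataset's prescribed protocol:

```python
def ocr_frame_accuracy(ground_truth: dict, predictions: dict) -> float:
    """Exact-match OCR accuracy over annotated frame indexes.

    Both arguments map frame index -> recognized text string.
    Annotated frames missing from `predictions` count as errors.
    """
    if not ground_truth:
        return 0.0
    correct = sum(
        1 for idx, text in ground_truth.items()
        if predictions.get(idx) == text
    )
    return correct / len(ground_truth)
```

For example, `ocr_frame_accuracy({10: "Invoice", 42: "Total"}, {10: "Invoice", 42: "Tota1"})` yields 0.5, since one of the two annotated frames is read correctly.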
VPData, the largest video inpainting dataset, comprises over 390K clips (> 866.7 hours) featuring precise masks and detailed video captions.
The benchmark accompanying VPData, the largest video inpainting dataset, which comprises over 390K clips (> 866.7 hours) featuring precise masks and detailed video captions.
AerialMPT is a dataset for pedestrian tracking in aerial image sequences and presents real-world challenges for MOT algorithms such as low frame rates, small moving objects, and complex backgrounds. AerialMPT consists of 14 sequences and 307 frames with an average size of 425 × 358 pixels. The images were acquired by DLR's 4K camera system from altitudes ranging from 600 m to 1400 m, resulting in spatial resolutions (GSDs) ranging from 8 cm/pixel to 13 cm/pixel. In a post-processing step, the images were co-registered, geo-referenced, and cropped to each region of interest, resulting in sequences at 2 fps. The images were acquired during different flight campaigns between 2016 and 2017, over different scenes containing pedestrians, with varying crowd densities and movement complexities.
VETRA is a dataset for vehicle tracking in aerial image sequences and presents unique challenges such as low frame rates, small and fast-moving objects, and high camera movement. These characteristics allow for extended tracking of numerous vehicles with varying motion behaviors over large areas and pose new challenges for MOT algorithms. VETRA consists of 52 image sequences captured by airplanes and helicopters using DLR's 3k and 4k camera systems. The acquisition sites are located in Germany and Austria. In addition to the classical training, validation, and test sets, VETRA offers a second test set specifically designed for the application of large area monitoring (LAM). The LAM sequences are recorded over 7 rural roads and motorways with a fixed camera speed and configuration. Each road section is captured at 4 different times of the day, enabling the performance of MOT algorithms to be evaluated under different traffic loads in a static environment.
The Songdo Traffic dataset provides precisely georeferenced vehicle trajectories captured from high-altitude bird's-eye-view (BeV) drone footage over the Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, it is one of the most extensive aerial traffic datasets publicly available, distinguished by a high temporal resolution of 29.97 trajectory points per second that enables fine-grained urban mobility analysis.
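At 29.97 trajectory points per second, point indices map directly to timestamps, and consecutive georeferenced points yield instantaneous speeds. A small sketch, under the assumption that trajectory coordinates are given in a metric projection (x, y in meters); the function names are illustrative, not part of the dataset's tooling:

```python
import math

RATE = 29.97  # trajectory points per second

def point_timestamp(point_idx: int, start_time: float = 0.0) -> float:
    """Timestamp in seconds of the i-th point of a trajectory."""
    return start_time + point_idx / RATE

def speed_mps(p0: tuple, p1: tuple) -> float:
    """Instantaneous speed (m/s) from two consecutive points in meters."""
    return math.dist(p0, p1) * RATE
```

For instance, two consecutive points 1 m apart correspond to roughly 29.97 m/s (about 108 km/h).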
Mr. HiSum is a large-scale video highlight detection and summarization dataset, which contains 31,892 videos selected from the YouTube-8M dataset, together with reliable frame-importance score labels aggregated from 50,000+ users per video.
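Aggregating many users' per-frame annotations into a single importance curve can be sketched as follows; the mean is an illustrative aggregation choice here, not necessarily the exact protocol used to build Mr. HiSum:

```python
def aggregate_importance(user_scores: list) -> list:
    """Collapse per-user frame scores into one importance value per frame.

    user_scores: list of equal-length score lists, one per annotating user.
    Returns the per-frame mean across users (an illustrative aggregation).
    """
    n_users = len(user_scores)
    return [sum(frame) / n_users for frame in zip(*user_scores)]
```

A frame that most users marked as important then receives a score near 1, and an unmarked frame a score near 0.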
LSDBench is a benchmark focusing on the sampling dilemma in long-video tasks, designed to evaluate the sampling efficiency of long-video VLMs. It consists of multiple-choice question-answer pairs based on hour-long videos, focusing on dense, short-duration actions with high Necessary Sampling Density (NSD).
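The sampling dilemma can be made concrete with uniform frame sampling: if the sampling rate falls below an action's necessary sampling density, the sampled frames may miss the action entirely. A hypothetical sketch, not LSDBench's evaluation code:

```python
def uniform_sample_indices(n_frames: int, video_fps: float, sample_fps: float) -> list:
    """Frame indexes kept when sampling a video uniformly at `sample_fps`.

    A short action spanning fewer than video_fps / sample_fps source frames
    can fall entirely between two sampled indexes and be missed.
    """
    step = video_fps / sample_fps  # source frames between consecutive samples
    indices, i = [], 0.0
    while round(i) < n_frames:
        indices.append(round(i))
        i += step
    return indices
```

For example, sampling a 30 fps video at 1 fps keeps only every 30th frame, so a half-second action (about 15 source frames) can vanish from the sampled sequence.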