Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets (current filter: Videos modality)

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3D meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • MIDI: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • CAD: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

1,019 dataset results

SEPE 8K

The SEPE 8K dataset comprises 40 different 8K (8192 x 4320) video sequences and 40 companion 8K (8192 x 5464) images. The video sequences were captured at 29.97 frames per second (FPS) and encoded with the AVC/H.264, HEVC/H.265, and AV1 codecs at resolutions from 8K down to 480p. The images, raw sequences, encoded videos, and related media statistics are published and maintained in a GitHub repository for non-commercial use. To the authors' knowledge, this is the first dataset to publish true 8K natural sequences, making it important for the next generation of multimedia applications such as video quality assessment, super-resolution, video coding, and video compression.

2 papers · 1 benchmark · Images, Videos
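
A minimal sketch of sanity-checking one downloaded sequence with OpenCV; the file path below is a hypothetical placeholder, not the repository's actual layout:

```python
# Verify the native resolution and framerate of a SEPE 8K sequence.
import cv2

cap = cv2.VideoCapture("sepe8k/sequences/seq_01_8k.mp4")  # hypothetical path
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
print(f"{width}x{height} @ {fps:.2f} FPS")  # expect 8192x4320 @ ~29.97 for native 8K

ok, frame = cap.read()  # frame: HxWx3 BGR uint8 array
if ok:
    preview = cv2.resize(frame, (854, 480))  # quick 480p preview for inspection
    cv2.imwrite("preview.png", preview)
cap.release()
```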

Sakuga-42M

Sakuga-42M is a large-scale hand-drawn cartoon video dataset for academic research. It comprises 42 million cartoon keyframes covering a range of artistic styles, regions, and years, with comprehensive semantic annotations including video-text description pairs, anime tags, and content taxonomies. The dataset is intended to support researchers exploring more effective and practical solutions for cartoon creation.

2 papers · 0 benchmarks · Videos

E.T. the Exceptional Trajectories


2 papers · 6 benchmarks · 3D, 3D meshes, Texts, Videos

MultiOOD (Multimodal Out-of-Distribution Detection Benchmark)

MultiOOD is the first benchmark for multimodal out-of-distribution (OOD) detection, covering diverse dataset sizes and modalities. It comprises five video datasets with over 85,000 video clips in total. The datasets vary in the number of classes (from 7 to 229) and in size (from 3k to 57k clips), and use video, optical flow, and audio as the different modalities.

2 papers · 0 benchmarks · Audio, Videos
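
An illustrative sketch, not MultiOOD's official protocol: a common OOD baseline scores a clip by maximum softmax probability (MSP), here averaged over hypothetical per-modality classifier logits:

```python
import numpy as np

def msp_score(logits: np.ndarray) -> float:
    """Maximum softmax probability; higher means more in-distribution."""
    z = logits - logits.max()              # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max())

# Hypothetical logits from three single-modality classifiers (229 is the
# largest class count in MultiOOD).
logits_video = np.random.randn(229)
logits_flow = np.random.randn(229)
logits_audio = np.random.randn(229)

score = np.mean([msp_score(l) for l in (logits_video, logits_flow, logits_audio)])
is_ood = score < 0.5  # threshold would be tuned on held-out in-distribution data
print(f"fused MSP score: {score:.3f}, flagged OOD: {is_ood}")
```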

AVSync15

AVSync15 is a high-quality synchronized audio-video dataset curated from VGGSound through a combination of automatic and manual filtering steps.

2 papers · 0 benchmarks · Audio, Videos
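
A hedged sketch of one way audio-video synchronization can be checked automatically (not necessarily the AVSync15 curation pipeline): cross-correlate the audio energy envelope with per-frame motion energy and read off the best-aligning lag. All inputs here are synthetic stand-ins:

```python
import numpy as np

fps = 25                                # assumed frame rate
motion = np.abs(np.random.randn(100))   # per-frame visual motion energy
audio_env = np.roll(motion, 3)          # synthetic audio envelope, lagged 3 frames

# Full cross-correlation of the mean-centred signals; the argmax gives the
# shift at which audio and video align best.
corr = np.correlate(audio_env - audio_env.mean(), motion - motion.mean(), mode="full")
lag_frames = corr.argmax() - (len(motion) - 1)
print(f"estimated A/V offset: {lag_frames} frames ({lag_frames / fps * 1000:.0f} ms)")
```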

TAPVid-3D: A Benchmark for Tracking Any Point in 3D

TAPVid-3D is a dataset and benchmark for evaluating the task of long-range Tracking Any Point in 3D (TAP-3D). The dataset consists of 4,000+ real-world videos and 2.1 million metric 3D point trajectories, spanning a variety of object types, motion patterns, and indoor and outdoor environments.

2 papers · 0 benchmarks · Videos
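
A minimal sketch of one plausible TAP-3D evaluation quantity, the mean Euclidean error between predicted and ground-truth metric trajectories over visible frames; the array shapes and visibility convention are assumptions, not the benchmark's released API:

```python
import numpy as np

def mean_3d_error(pred: np.ndarray, gt: np.ndarray, visible: np.ndarray) -> float:
    """pred, gt: (num_points, num_frames, 3) in metres; visible: boolean mask."""
    err = np.linalg.norm(pred - gt, axis=-1)   # per-point, per-frame distance
    return float(err[visible].mean())

pred = np.random.rand(16, 120, 3)               # 16 points tracked over 120 frames
gt = pred + 0.01 * np.random.randn(16, 120, 3)  # ground truth ~1 cm away
visible = np.random.rand(16, 120) > 0.2         # ~80% of observations visible
print(f"mean 3D error: {mean_3d_error(pred, gt, visible):.4f} m")
```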

AssistTaxi

The availability of high-quality datasets plays a crucial role in advancing research and development, especially for safety-critical and autonomous systems. AssistTaxi is a comprehensive novel dataset of images for runway and taxiway analysis. It comprises more than 300,000 frames of diverse, carefully collected data gathered from the Melbourne (MLB) and Grant-Valkaria (X59) general aviation airports. Its importance lies in its potential to advance autonomous operations, enabling researchers and developers to train and evaluate algorithms for efficient and safe taxiing.

2 papers · 0 benchmarks · Images, Videos

Aria Digital Twin Dataset

A real-world dataset with a hyper-accurate digital counterpart and comprehensive ground-truth annotation.

2 papers · 6 benchmarks · 3D, 3D meshes, Point cloud, RGB Video, Videos

MECD (Multi-Event Causal Discovery)


2 papers · 4 benchmarks · Texts, Videos

I2-2000FPS

I2-2000FPS is the first high-speed video dataset offering an unprecedented temporal resolution of 2000 frames per second (fps). Captured using the commercially available Chronos 1.4 high-speed CMOS camera, the dataset includes a diverse range of objects varying in size, shape, orientation, and motion, as well as various camera movements. This dataset is designed to enable research in areas such as motion analysis, object tracking, and scene understanding at extreme temporal resolutions. Potential applications span fields like sports analysis, robotics, autonomous navigation, and high-speed videography.

2 papers · 0 benchmarks · Images, Videos

RHM (Robot House Multi-View Human Activity Recognition Dataset)

The Robot House Multi-View dataset (RHM) contains four views: Front, Back, Ceiling, and Robot. There are 14 classes with 6,701 video clips per view, for a total of 26,804 video clips across the four views. Clip lengths range between 1 and 5 seconds. Clips with the same index and class are time-synchronized across the different views.

2 papers · 3 benchmarks · Actions, Images, RGB Video, Videos
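
A sketch of grouping clips into synchronized four-view bundles by class and clip index; the directory layout and filename pattern are hypothetical and would need adapting to how the dataset is actually packaged:

```python
from collections import defaultdict
from pathlib import Path

VIEWS = ["front", "back", "ceiling", "robot"]
bundles = defaultdict(dict)  # (class_name, clip_id) -> {view: path}

for view in VIEWS:
    for clip in Path(f"rhm/{view}").glob("*/*.mp4"):  # hypothetical class_dir/clip.mp4
        class_name, clip_id = clip.parent.name, clip.stem
        bundles[(class_name, clip_id)][view] = clip

# Keep only clips present in all four views; per the dataset description,
# same-index, same-class clips are time-synchronized across views.
complete = {k: v for k, v in bundles.items() if len(v) == len(VIEWS)}
print(f"{len(complete)} synchronized four-view clips")
```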

V2VBench

V2VBench is a comprehensive benchmark designed to evaluate video editing methods. It consists of:

  • 50 standardized videos across 5 categories;
  • 3 editing prompts per video, encompassing 4 editing tasks (distributed as Hugging Face datasets);
  • 8 evaluation metrics to assess the quality of edited videos.

2 papers · 0 benchmarks · Images, Texts, Videos

ChronoMagic-Pro


2 papers · 0 benchmarks · Texts, Videos

SynthEVox3D-Tiny (Synthetic Event Camera Voxel 3D Reconstruction Dataset)

Event cameras are sensors that are inspired by biological systems and specialize in capturing changes in brightness. These emerging cameras offer numerous advantages over conventional frame-based cameras, including high dynamic range, high frame rates, and extremely low power consumption. As a result, event cameras are increasingly being used in various fields, such as object detection and tracking, autonomous driving, 3D reconstruction, visual odometry, and SLAM.

2 papers · 3 benchmarks · Images, Videos
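
A hedged sketch of the standard event-to-voxel-grid conversion that event-camera datasets of this kind typically build on; the raw event layout (x, y, t, polarity) is an assumption about the format:

```python
import numpy as np

def events_to_voxel_grid(events: np.ndarray, bins: int, h: int, w: int) -> np.ndarray:
    """events: (N, 4) array with columns (x, y, t, polarity in {-1, +1})."""
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)  # normalise t to [0, 1]
    b = np.clip((t_norm * bins).astype(int), 0, bins - 1)  # temporal bin index
    grid = np.zeros((bins, h, w), dtype=np.float32)
    np.add.at(grid, (b, y, x), p)                          # accumulate polarities
    return grid

# Synthetic event stream for a 640x480 sensor.
n = 10_000
events = np.column_stack([
    np.random.randint(0, 640, n),       # x
    np.random.randint(0, 480, n),       # y
    np.sort(np.random.rand(n)),         # t (monotonically increasing)
    np.random.choice([-1.0, 1.0], n),   # polarity
])
voxels = events_to_voxel_grid(events, bins=5, h=480, w=640)
print(voxels.shape)  # (5, 480, 640)
```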

LongVALE

Despite impressive advances in video understanding, most efforts remain limited to coarse-grained or visual-only tasks. Real-world videos, however, encompass omni-modal information (vision, audio, and speech), with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modal video perception. To address this gap, the authors propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. The result is LongVALE, the first vision-audio-language event understanding benchmark, comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions across 8.4K high-quality long videos. The authors also build a baseline that leverages LongVALE to enable video large language models.

2 papers · 0 benchmarks · Audio, Speech, Texts, Videos
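
A hypothetical sketch of what a single LongVALE-style omni-modal event record could look like; every field name here is an illustrative assumption, not the released annotation schema:

```python
from dataclasses import dataclass

@dataclass
class OmniModalEvent:
    # All fields are illustrative guesses at the annotation contents.
    video_id: str
    start_s: float                 # precise temporal boundary, in seconds
    end_s: float
    modalities: tuple[str, ...]    # subset of ("vision", "audio", "speech")
    caption: str                   # relation-aware event caption

event = OmniModalEvent(
    video_id="vid_0001",
    start_s=12.4,
    end_s=18.9,
    modalities=("vision", "audio"),
    caption="A presenter introduces the band while the crowd cheers.",
)
print(f"{event.video_id}: [{event.start_s:.1f}s, {event.end_s:.1f}s] {event.caption}")
```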

DropletVideo-10M

DropletVideo is a project exploring high-order spatio-temporal consistency in image-to-video generation; its model is trained on DropletVideo-10M. The model supports multi-resolution inputs and dynamic FPS control for motion intensity, and demonstrates potential for 3D consistency. For further details, see the project page and the technical report.

2 papers · 0 benchmarks · Videos

DL3DV-10K

DL3DV-10K is a dataset of real-world videos with scene annotations and camera parameters.

2 papers · 0 benchmarks · Videos

ThermoHands

ThermoHands is the first benchmark dataset specifically designed for egocentric 3D hand pose estimation from thermal images. It addresses the challenges of hand pose estimation in low-light conditions and when the hand is occluded by gloves or other wearables—scenarios where traditional RGB or NIR-based systems struggle.

2 papers · 0 benchmarks · 3D, Images, Videos

LoTE-Animal (A Long Time-span Dataset for Endangered Animal Behavior Understanding)

Understanding and analyzing animal behavior is increasingly essential to protecting endangered animal species. However, the application of advanced computer vision techniques in this area remains minimal, largely due to the lack of large and diverse datasets for training deep models.

2 papers · 2 benchmarks · Images, RGB Video, Videos

Multi-Ego

Multi-Ego is a new multi-view egocentric dataset recorded simultaneously by three cameras, covering a wide variety of real-life scenarios. The footage is annotated by multiple individuals under various summarization configurations, with a consensus analysis ensuring a reliable ground truth.

1 paper · 0 benchmarks · Videos
Page 36 of 51