1,019 machine learning datasets
TREK-150 is a benchmark dataset for object tracking in First Person Vision (FPV) videos composed of 150 densely annotated video sequences.
NTU RGB+D 2D is a curated version of NTU RGB+D often used for skeleton-based action prediction and synthesis. It contains a smaller number of actions.
LSVTD is a large-scale video text dataset for promoting research on video text spotting. It contains 100 text videos from 22 different real-life scenarios, covering 13 indoor (e.g., bookstore, shopping mall) and 9 outdoor scenarios, more than 3 times the scene diversity of IC15.
ACAV100M is built by processing 140 million full-length videos (total duration 1,030 years) to produce a dataset of 100 million 10-second clips (31 years) with high audio-visual correspondence. This is two orders of magnitude larger than the largest video dataset used in the audio-visual learning literature, i.e., AudioSet (8 months), and twice as large as the largest video dataset in the literature, i.e., HowTo100M (15 years).
BS-RSC is a real-world rolling shutter (RS) correction dataset for correcting RS distortion in video. Distorted videos and their ground truth are recorded simultaneously via a well-designed beam-splitter-based acquisition system. BS-RSC contains various motions of both camera and objects in dynamic scenes.
We introduce a new dataset, Watch and Learn Time-lapse (WALT), consisting of multiple (4K and 1080p) cameras capturing urban environments over a year.
A qualitative dataset of real blurred videos, created using a beam-splitter setup in a lab environment.
CHAD (Charlotte Anomaly Dataset) is a high-resolution, multi-camera dataset for surveillance video anomaly detection. It includes bounding box, Re-ID, and pose annotations, as well as frame-level anomaly labels dividing all frames into two groups, anomalous or normal. Full details are given in the paper "CHAD: Charlotte Anomaly Dataset".
The HiREST (HIerarchical REtrieval and STep-captioning) dataset is a benchmark covering hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. It consists of 3.4K text-video pairs from a video dataset, where 1.1K videos are annotated with moment spans relevant to the text query and a breakdown of each moment into key instruction steps with captions and timestamps (totaling 8.6K step captions). The benchmark covers four tasks: video retrieval, moment retrieval, and two novel tasks, moment segmentation and step captioning.
The Human Related version of UBnormal ("UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection," Acsintoae et al.) was introduced by Flaborea et al. in the paper "Contracting Skeletal Kinematics for Human-Related Video Anomaly Detection".
A dataset of videos synthetically degraded with Adobe After Effects to exhibit artifacts resembling those of real-world analog videotapes. The original high-quality videos belong to the Venice scene of the Harmonic dataset. The artifacts taken into account are: 1) tape mistracking; 2) VHS edge waving; 3) chroma loss along the scanlines; 4) tape noise; 5) undersaturation. The dataset comprises a total of 26,392 frames corresponding to 40 clips. The clips are randomly divided into training and test sets with a 75%-25% ratio.
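The exact split assignment ships with the dataset; as a hypothetical illustration only, a 75%-25% random split over the 40 clips could be reproduced along these lines (the clip naming scheme and random seed are assumptions, not the authors' protocol):

```python
import random

# Hypothetical clip identifiers; the released dataset provides its own clip list.
clips = [f"clip_{i:02d}" for i in range(40)]

random.seed(0)  # assumed seed, only so this sketch is reproducible
random.shuffle(clips)

n_train = int(0.75 * len(clips))   # 30 clips for training
train_clips = clips[:n_train]
test_clips = clips[n_train:]       # remaining 10 clips for testing

print(len(train_clips), len(test_clips))  # -> 30 10
```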
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities. This page records results that use LLMs for answer evaluation.
The iPhone dataset is a challenging benchmark for dynamic reconstruction. It consists of a collection of videos of realistic scenes with large object motions, captured with a hand-held iPhone. Evaluation measures rendering quality on novel viewpoints that have low overlap with the training camera views. Unlike previous datasets, it does not suffer from (a) teleporting camera motion or (b) quasi-static scene motion.
Long-RVOS is a large-scale benchmark for long-term referring video object segmentation. It is the first minute-level dataset in the RVOS field, designed to tackle realistic long-video challenges such as frequent occlusion, disappearance-reappearance, and shot changes. Notably, Long-RVOS offers significantly longer video durations than existing datasets, along with the largest number of object classes and mask annotations. Its scale supports comprehensive training and evaluation of RVOS models, and it includes 24,689 high-quality descriptions.
TRECVID is a yearly set of competitions centered on video retrieval and indexing, hosting a variety of video data sets.
The iQIYI-VID dataset comprises video clips from iQIYI variety shows, films, and television dramas. The whole dataset contains 500,000 video clips of 5,000 celebrities. Each video is 1 to 30 seconds long.
A quantitative benchmark for developing video understanding: a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.
Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots, and metadata. Moviescope is based on the IMDB 5000 dataset of 5,043 movie records, augmented by crawling the video trailer associated with each movie from YouTube and the text plot from Wikipedia.
A dataset for text in driving videos, 20 times larger than the existing largest dataset for text in videos. It comprises 1,000 driving video clips, collected without any bias towards text, with annotations for text bounding boxes and transcriptions in every frame.
Contains ~9K videos of human agents performing various actions, annotated with 3 types of commonsense descriptions.