Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

USF (Human ID Gait Challenge Dataset)

The USF Human ID Gait Challenge Dataset is a video dataset for gait recognition. It contains videos of 122 subjects recorded under up to 32 combinations of variation factors.

10 papers · 0 benchmarks · Images, Videos

TAPOS

TAPOS is a dataset of sports videos with manual annotations of sub-actions, developed to support a study of temporal action parsing. A sports activity usually consists of multiple sub-actions, and awareness of such temporal structure is beneficial to action recognition.

10 papers · 1 benchmark · Videos

VideoMem

VideoMem is composed of 10,000 videos annotated with memorability scores. In contrast to previous work on image memorability, where memorability was measured a few minutes after memorization, memory performance is measured twice: a few minutes after memorization and again 24-72 hours later.

10 papers · 0 benchmarks · Videos

CalMS21 (Caltech Mouse Social Interactions)

The Caltech Mouse Social Interactions (CalMS21) dataset is a multi-agent dataset from behavioral neuroscience. The dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay. The CalMS21 dataset is part of the Multi-Agent Behavior Challenge 2021.

10 papers · 0 benchmarks · Images, Videos

DDPM (Deception Detection and Physiological Monitoring)

The Deception Detection and Physiological Monitoring (DDPM) dataset captures an interview scenario in which the interviewee attempts to deceive the interviewer on selected responses. The interviewee is recorded in RGB, near-infrared, and long-wave infrared, along with cardiac pulse, blood oxygenation, and audio. After collection, data were annotated for interviewer/interviewee, curated, ground-truthed, and organized into train/test parts for a set of canonical deception detection experiments. The dataset contains almost 13 hours of recordings of 70 subjects, and over 8 million visible-light, near-infrared, and thermal video frames, along with appropriate meta, audio, and pulse oximeter data.

10 papers · 0 benchmarks · Videos

ImageNet-VidVRD

The ImageNet-VidVRD dataset contains 1,000 videos selected from the ILSVRC2016-VID dataset based on whether the video contains clear visual relations. It is split into a training set of 800 videos and a test set of 200 videos, and covers 35 common subject/object categories and 132 predicate categories. Ten people contributed to labeling the dataset, which includes object trajectory labeling and relation labeling. Since the ILSVRC2016-VID dataset already has object trajectory annotations for 30 categories, we supplemented the annotations by labeling the remaining 5 categories. To save relation-labeling labor, we labeled typical segments of the videos in the training set and the whole of each video in the test set.
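A video visual-relation annotation pairs two object trajectories with a predicate over a temporal extent. A minimal sketch of such a record (the field names and helper below are illustrative, not ImageNet-VidVRD's actual schema):

```python
# Hypothetical sketch of a video visual-relation record: a subject and an
# object trajectory (per-frame bounding boxes) linked by a predicate.
# Field names are illustrative, not ImageNet-VidVRD's actual schema.

def make_relation(subject, predicate, obj, begin_frame, end_frame):
    """Bundle a <subject, predicate, object> triplet with its temporal extent."""
    return {
        "subject": subject,      # e.g. {"category": ..., "trajectory": {frame: box}}
        "predicate": predicate,  # one of the predicate categories
        "object": obj,
        "begin": begin_frame,
        "end": end_frame,
    }

rel = make_relation(
    subject={"category": "dog", "trajectory": {0: (10, 20, 50, 60)}},
    predicate="chase",
    obj={"category": "frisbee", "trajectory": {0: (70, 25, 90, 40)}},
    begin_frame=0,
    end_frame=30,
)
print(rel["predicate"])  # chase
```

Evaluation on such data then amounts to matching predicted triplets against ground-truth triplets whose trajectories overlap sufficiently.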

10 papers · 13 benchmarks · Videos

MUGEN

MUGEN is a large-scale video-audio-text dataset collected using the open-source platform game CoinRun. It can help progress research on many tasks in multimodal understanding and generation.

10 papers · 0 benchmarks · Audio, Texts, Videos

VideoCC3M (Video-Conceptual-Captions)

We propose a new, scalable video-mining pipeline that transfers captioning supervision from image datasets to video and audio. We use this pipeline to mine paired videos and captions, using the Conceptual Captions 3M image dataset as a seed. The resulting dataset, VideoCC3M, consists of millions of weakly paired clips with text captions and will be released publicly.

10 papers · 0 benchmarks · Texts, Videos

Perception Test

Perception Test is a benchmark designed to evaluate the perception and reasoning skills of multimodal models. It introduces real-world videos designed to show perceptually interesting situations and defines multiple tasks that require understanding of memory, abstract patterns, physics, and semantics, across the visual, audio, and text modalities. The benchmark consists of 11.6k videos, 23 seconds long on average, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels: object and point tracks, temporal action and sound segments, multiple-choice video question-answers, and grounded video question-answers. The benchmark probes pre-trained models for their transfer capabilities in a zero-shot, few-shot, or fine-tuning regime.

10 papers · 4 benchmarks · Videos

Dynamic Replica

Dynamic Replica is a synthetic dataset of stereo videos featuring humans and animals in virtual environments. It is a benchmark for dynamic disparity/depth estimation and 3D reconstruction consisting of 145,200 stereo frames (524 videos).
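For stereo benchmarks like this, disparity and metric depth are related by depth = f·B / d, where f is the focal length in pixels and B is the stereo baseline. A minimal sketch of the conversion (the camera parameters below are made-up examples, not Dynamic Replica's):

```python
# Convert stereo disparity (pixels) to metric depth via depth = f * B / d.
# The focal length and baseline are illustrative values, not parameters
# taken from the Dynamic Replica dataset.

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Return depth in meters for a given per-pixel disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Example: 640 px focal length, 10 cm baseline, 32 px disparity -> 2.0 m
depth = disparity_to_depth(32.0, focal_px=640.0, baseline_m=0.1)
print(depth)  # 2.0
```

This is why disparity and depth estimation are often evaluated interchangeably: given calibrated cameras, one determines the other.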

10 papers · 0 benchmarks · RGB-D, Videos

RVSD (Realistic Video DeSnowing Dataset)

The Realistic Video DeSnowing Dataset (RVSD) contains 110 pairs of videos. Each pair contains a snowy, hazy video and its corresponding snow-free, haze-free ground-truth video. We use a rendering engine (Unreal Engine 5) and various augmentation techniques to generate snow and haze with diverse and realistic physical properties. This results in more realistic and varied synthesized videos, which improves models' performance on real-world data.

10 papers · 0 benchmarks · Images, Videos

DiDi (Distractor Distilled Dataset)

DiDi is a distractor-distilled tracking dataset created to address the low distractor presence in current visual object tracking benchmarks. To enhance the evaluation and analysis of tracking performance amidst distractors, we semi-automatically distilled several existing benchmarks into the DiDi dataset. The dataset is available for download at https://go.vicos.si/didi

10 papers · 2 benchmarks · Videos

2024 AI City Challenge

The AI City Challenge, hosted at CVPR 2024, focuses on harnessing AI to enhance operational efficiency in physical settings such as retail and warehouse environments, and Intelligent Traffic Systems (ITS). It aims to utilize AI for actionable insights from sensor data, like camera feeds, to improve traffic safety and transportation outcomes. This year, the challenge spotlights two key areas with significant potential: retail business and ITS.

10 papers · 8 benchmarks · Videos

Collective Activity

The Collective Activity Dataset contains 5 collective activities (crossing, walking, waiting, talking, and queueing) across 44 short video sequences, some of which were recorded with a consumer hand-held digital camera from varying viewpoints.

9 papers · 1 benchmark · Videos

Tai-Chi-HD

Tai-Chi-HD is a high-resolution dataset that can serve as a reference benchmark for evaluating frameworks for image animation and video generation. It consists of cropped videos of full human bodies performing Tai Chi actions.

9 papers · 2 benchmarks · Videos

CAS-VSR-W1k (LRW-1000)

LRW-1000 has been renamed CAS-VSR-W1k. It is a naturally distributed large-scale benchmark for word-level lipreading in the wild, including 1,000 classes with about 718,018 video samples from more than 2,000 individual speakers. There are more than 1,000,000 Chinese character instances in total. Each class corresponds to the syllables of a Mandarin word composed of one or more Chinese characters. The dataset aims to cover natural variability over different speech modes and imaging conditions, incorporating challenges encountered in practical applications.

9 papers · 2 benchmarks · Audio, Texts, Videos

TUM-GAID

TUM-GAID (TUM Gait from Audio, Image and Depth) contains recordings of 305 subjects performing two walking trajectories in an indoor environment: the first traversed from left to right, the second from right to left. Two recording sessions were performed, one in January, when subjects wore heavy jackets and mostly winter boots, and another in April, when subjects wore lighter clothes. The action is captured by a Microsoft Kinect sensor, which provides a video stream at a resolution of 640×480 pixels and a frame rate of around 30 fps.

9 papers · 0 benchmarks · Audio, Images, Videos

FAIR-Play

FAIR-Play is a video-audio dataset consisting of 1,871 video clips and their corresponding binaural audio clips recorded in a music room. The video clip and binaural clip of the same index are roughly aligned.

9 papers · 0 benchmarks · Audio, Videos

How2R

Amazon Mechanical Turk (AMT) was used to collect annotations on HowTo100M videos. 30k 60-second clips were randomly sampled from 9,421 videos, and each clip was presented to workers, who were asked to select a video segment containing a single, self-contained scene. After this segment-selection step, another group of workers wrote descriptions for each selected segment. Narrations were not shown to the workers, ensuring that their written queries are based on visual content only. The final video segments are 10-20 seconds long on average, and query length ranges from 8 to 20 words. This process yielded 51,390 queries for 24k 60-second clips from 9,371 videos in HowTo100M, on average 2-3 queries per clip. The video clips and their associated queries are split into 80% train, 10% val, and 10% test.
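An 80/10/10 split like the one described can be sketched as follows (splitting over shuffled clip IDs with a fixed seed is an assumption; the actual split may group clips by source video):

```python
# Sketch of an 80/10/10 train/val/test split over clip IDs.
# Seeded shuffling and per-clip (rather than per-video) splitting are
# simplifying assumptions, not How2R's documented procedure.
import random

def split_80_10_10(ids, seed=0):
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n = len(ids)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = split_80_10_10(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Giving the remainder to the test slice ensures every ID lands in exactly one split even when the counts do not divide evenly.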

9 papers · 0 benchmarks · Texts, Videos

UCF Sports

The UCF Sports dataset consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. The video sequences were obtained from a wide range of stock footage websites including BBC Motion gallery and GettyImages.

9 papers · 3 benchmarks · Videos
Page 19 of 51