Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets (filtered to the Videos modality)

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results

EMDB

EMDB contains in-the-wild videos of human activity recorded with a hand-held iPhone. It features reference SMPL body pose and shape parameters, as well as global body root and camera trajectories. The reference 3D poses were obtained by jointly fitting SMPL to 12 body-worn electromagnetic sensors and image data. For the latter we fit a neural implicit avatar model to allow for a dense pixel-wise fitting objective.

32 papers · 36 benchmarks · 3D, Images, RGB Video, Videos
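As a rough illustration of how SMPL-format pose and shape parameters such as EMDB's can be consumed, here is a minimal sketch using the smplx Python package. The file name and parameter keys are assumptions for illustration, not EMDB's documented format, and the SMPL model files must be obtained separately:

```python
import pickle
import torch
import smplx  # pip install smplx; SMPL model files are downloaded separately

# Hypothetical parameter file; the keys below are assumptions, not EMDB's schema.
with open("sequence_params.pkl", "rb") as f:
    params = pickle.load(f)

model = smplx.create(model_path="models", model_type="smpl", gender="neutral")

# SMPL takes shape coefficients (betas), axis-angle body pose, and the
# global root orientation/translation that EMDB also provides.
output = model(
    betas=torch.tensor(params["betas"][None], dtype=torch.float32),          # (1, 10)
    body_pose=torch.tensor(params["body_pose"][None], dtype=torch.float32),  # (1, 69)
    global_orient=torch.tensor(params["global_orient"][None], dtype=torch.float32),  # (1, 3)
    transl=torch.tensor(params["transl"][None], dtype=torch.float32),        # (1, 3)
)
vertices = output.vertices  # (1, 6890, 3) posed mesh vertices
joints = output.joints      # (1, J, 3) 3D joint locations
```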

Toyota Smarthome Dataset

A large-scale dataset of daily-living activities performed in a natural manner.

31 papers · 0 benchmarks · Videos

MAFW

MAFW is a large-scale, multi-modal, compound affective database for dynamic facial expression recognition in the wild. It contains 10,045 video-audio clips, annotated with a compound emotional category and a couple of sentences that describe the subjects' affective behaviors in the clip. For the compound emotion annotation, each clip is categorized into one or more of the 11 widely-used emotions, i.e., anger, disgust, fear, happiness, neutral, sadness, surprise, contempt, anxiety, helplessness, and disappointment.

31 papers · 2 benchmarks · Videos

AGQA (Action Genome Question Answering)

Action Genome Question Answering (AGQA) is a benchmark for compositional spatio-temporal reasoning. AGQA contains 192M unbalanced question-answer pairs for 9.6K videos. It also contains a balanced subset of 3.9M question-answer pairs, three orders of magnitude larger than existing benchmarks, that minimizes bias by balancing the answer distributions and the types of question structures.

30 papers · 0 benchmarks · Texts, Videos
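The balanced-subset idea described for AGQA (equalizing answer distributions within each question-structure type) can be pictured with a short sketch. This is an illustrative simplification with hypothetical field names, not the AGQA authors' actual balancing procedure:

```python
import random
from collections import defaultdict

def balance_qa_pairs(qa_pairs, cap_per_answer=100, seed=0):
    """Downsample QA pairs so that, within each question-structure type,
    no single answer appears more than cap_per_answer times.
    Illustrative only; the published AGQA procedure is more involved."""
    rng = random.Random(seed)
    shuffled = qa_pairs[:]
    rng.shuffle(shuffled)
    kept = []
    counts = defaultdict(int)  # (structure, answer) -> occurrences kept so far
    for qa in shuffled:        # each qa: {"structure": ..., "answer": ..., ...}
        key = (qa["structure"], qa["answer"])
        if counts[key] < cap_per_answer:
            counts[key] += 1
            kept.append(qa)
    return kept
```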

VGG-SS (VGG-Sound Source)

VGG-SS (VGG Sound Source) is a benchmark for evaluating sound source localisation in videos. The dataset consists of a new set of annotations for the recently-introduced VGG-Sound dataset, in which the sound sources visible in each video clip are explicitly marked with bounding-box annotations. The dataset is 20 times larger than analogous existing ones, contains 5K videos spanning over 200 categories, and, unlike Flickr SoundNet, is video-based.

30 papers · 0 benchmarks · Videos

VALUE (Video-And-Language Understanding Evaluation)

VALUE is a Video-And-Language Understanding Evaluation benchmark for testing models that generalize to diverse tasks, domains, and datasets. It is an assemblage of 11 VidL (video-and-language) datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.

30 papers · 0 benchmarks · Texts, Videos

VideoInstruct (Video Instruction Dataset)

The Video Instruction Dataset is used to train Video-ChatGPT. It consists of 100,000 high-quality video instruction pairs, produced through a combination of human-assisted and semi-automatic annotation techniques aimed at high-quality video instruction data. These methods create question-answer pairs grounded in the content of each video.

30 papers · 31 benchmarks · Texts, Videos

NVGesture

The NVGesture dataset focuses on touchless driver control. It contains 1,532 dynamic gestures falling into 25 classes, with 1,050 samples for training and 482 for testing. The videos are recorded with three modalities (RGB, depth, and infrared).

29 papers · 2 benchmarks · Images, Videos

TVSeries

A realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances.

29 papers · 1 benchmark · Videos

Spring (Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo)

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes from the open-source Blender movie "Spring", it provides photo-realistic HD datasets with state-of-the-art visual effects and ground truth training data.

29 papers · 5 benchmarks · Images, Videos

G3D (Gaming 3D Dataset)

The Gaming 3D Dataset (G3D) focuses on real-time action recognition in a gaming scenario. It contains 10 subjects performing 20 gaming actions: “punch right”, “punch left”, “kick right”, “kick left”, “defend”, “golf swing”, “tennis swing forehand”, “tennis swing backhand”, “tennis serve”, “throw bowling ball”, “aim and fire gun”, “walk”, “run”, “jump”, “climb”, “crouch”, “steer a car”, “wave”, “flap” and “clap”.

28 papers · 0 benchmarks · 3D, Images, Videos

MannequinChallenge

The MannequinChallenge Dataset (MQC) provides in-the-wild videos of people in static poses while a hand-held camera pans around the scene. The dataset consists of three splits for training, validation and testing.

28 papers · 0 benchmarks · Images, Videos

How2QA

To collect How2QA for the video QA task, the same set of selected video clips was presented to another group of AMT workers for multiple-choice QA annotation. Each worker was assigned one video segment and asked to write one question with four answer candidates (one correct and three distractors). As in the earlier collection, narrations were hidden from the workers to ensure the collected QA pairs are not biased by subtitles. Similar to TVQA, the start and end points of the relevant moment are provided for each question. After filtering low-quality annotations, the final dataset contains 44,007 QA pairs for 22k 60-second clips selected from 9,035 videos.

28 papers · 2 benchmarks · Texts, Videos
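Given the protocol above (one question, four candidates with one correct answer, and a timestamped relevant moment), a How2QA-style example can be pictured as a record like the following. The field names and values are hypothetical, not the dataset's actual schema:

```python
# Hypothetical How2QA-style record; field names and values are illustrative.
example = {
    "video_id": "abc123",
    "segment": [0.0, 60.0],   # 60-second clip boundaries (seconds)
    "question": "What does the person add to the pan?",
    "candidates": [           # one correct answer and three distractors
        "Olive oil",
        "Butter",
        "Water",
        "Sugar",
    ],
    "answer_idx": 0,          # index of the correct candidate
    "moment": [12.4, 18.9],   # start/end of the relevant moment (seconds)
}
```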

Dynamic FAUST

Dynamic FAUST extends the FAUST dataset to dynamic 4D data. It consists of high-resolution 4D scans of human subjects in motion, captured at 60 fps.

28 papers · 1 benchmark · 3D, Videos

Oxford Radar RobotCar Dataset

The Oxford Radar RobotCar Dataset is a radar extension to the Oxford RobotCar Dataset. It adds data from a Navtech CTS350-X millimetre-wave FMCW radar and dual Velodyne HDL-32E LiDARs, with optimised ground-truth radar odometry, for 280 km of driving around Oxford, UK (in addition to all sensors in the original Oxford RobotCar Dataset).

28 papers · 2 benchmarks · Videos

KITTI MOTS (KITTI Multi-Object Tracking and Segmentation (MOTS) Evaluation)

The KITTI Multi-Object Tracking and Segmentation (MOTS) benchmark [2] consists of 21 training sequences and 29 test sequences. It is based on the KITTI Tracking Evaluation 2012 and extends the annotations to the MOTS task by adding dense pixel-wise segmentation labels for every object. Submitted results are evaluated with the HOTA, CLEAR MOT, and MT/PT/ML metrics, and methods are ranked by HOTA [1]. The development kit and GitHub evaluation code (adapted for the segmentation case) provide details about the data format as well as utility functions for reading and writing the label files. Evaluation is performed using the code from the TrackEval repository.

28 papers · 3 benchmarks · Images, Tracking, Videos
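The MOTS development kit documents a plain-text label format with one object per line: frame, object id, class id, image height, image width, and a COCO-style run-length-encoded mask. Assuming that format, a minimal sketch for decoding a label line with pycocotools, plus the mask IoU that segmentation-based metrics such as HOTA build on, could look like:

```python
import numpy as np
from pycocotools import mask as rletools  # pip install pycocotools

def parse_mots_line(line):
    """Parse one MOTS label line: 'frame id class_id height width rle'.
    The format is assumed from the MOTS development kit."""
    frame, obj_id, class_id, h, w, rle = line.strip().split(" ", 5)
    mask = rletools.decode({"size": [int(h), int(w)], "counts": rle.encode("utf-8")})
    return int(frame), int(obj_id), int(class_id), mask  # mask: (h, w) uint8

def mask_iou(a, b):
    """Pixel-wise IoU between two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 0.0
```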

Vid4

The Vid4 dataset is generally used for testing video super-resolution. It consists of four sequences: walk (740x480, 47 frames), foliage (740x480, 49 frames), city (704x576, 34 frames), and calendar (720x576, 41 frames).

27 papers · 0 benchmarks · Videos
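Since Vid4 is a test set for video super-resolution, results on it are typically reported as a fidelity metric such as PSNR averaged over each sequence's frames. A minimal NumPy sketch, assuming 8-bit frames already loaded as arrays:

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two same-sized uint8 frames."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)

def sequence_psnr(pred_frames, gt_frames):
    """Average PSNR over a sequence, e.g. Vid4's 'calendar' (41 frames)."""
    return sum(psnr(p, g) for p, g in zip(pred_frames, gt_frames)) / len(gt_frames)
```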

DeeperForensics-1.0

DeeperForensics-1.0 is the largest face forgery detection dataset to date, with 60,000 videos comprising a total of 17.6 million frames, 10 times larger than existing datasets of the same kind. The full dataset includes 48,475 source videos and 11,000 manipulated videos. The source videos were collected from 100 paid and consenting actors from 26 countries, and the manipulated videos were generated by a newly proposed many-to-many end-to-end face swapping method, DF-VAE. Seven types of real-world perturbations at five intensity levels are employed to ensure larger scale and higher diversity.

27 papers · 0 benchmarks · Images, Videos

KoDF (Korean DeepFake Detection Dataset)

The Korean DeepFake Detection Dataset (KoDF) is a large-scale collection of synthesized and real videos focused on Korean subjects, used for the task of deepfake detection.

27 papers · 0 benchmarks · Videos

RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7,356 files (total size: 24.8 GB). The database features 24 professional actors (12 female, 12 male) vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: audio-only (16-bit, 48 kHz .wav), audio-video (720p H.264, AAC 48 kHz .mp4), and video-only (no sound). Note that there are no song files for Actor_18.

27 papers · 21 benchmarks · Audio, Speech, Videos
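RAVDESS file names encode the recording condition as seven two-digit fields (modality, vocal channel, emotion, intensity, statement, repetition, actor). The sketch below follows that published naming convention as I understand it; treat the exact code mappings as an assumption to verify against the dataset documentation:

```python
# Emotion codes per the RAVDESS naming convention (assumed; verify against the docs).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def parse_ravdess_filename(name):
    """Parse e.g. '03-01-06-01-02-01-12.wav' into its annotation fields."""
    modality, channel, emotion, intensity, statement, repetition, actor = (
        name.rsplit(".", 1)[0].split("-")
    )
    return {
        "modality": {"01": "full-AV", "02": "video-only", "03": "audio-only"}[modality],
        "vocal_channel": {"01": "speech", "02": "song"}[channel],
        "emotion": EMOTIONS[emotion],
        "intensity": {"01": "normal", "02": "strong"}[intensity],
        "actor": int(actor),  # 1-24; odd = male, even = female
    }

print(parse_ravdess_filename("03-01-06-01-02-01-12.wav"))
# {'modality': 'audio-only', 'vocal_channel': 'speech', 'emotion': 'fearful', ...}
```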
Page 11 of 51