1,019 machine learning datasets
The VIPriors Action Recognition Challenge uses a subset of the UCF101 action recognition dataset.
Video class-agnostic segmentation (VCAS) is the task of segmenting objects without regard to their semantics, combining appearance, motion, and geometry from monocular video sequences. The main motivation is to account for unknown objects in the scene and to act as a redundant signal alongside the segmentation of known classes for better safety, as shown in the accompanying figure.
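As an illustrative sketch only: the "redundant signal" idea can be read as taking the union of a known-class segmentation mask and a class-agnostic mask, so that unknown objects still register as obstacles. The function name and the boolean H×W mask format below are assumptions for illustration, not the dataset's API.

```python
import numpy as np

def fuse_masks(known_class_mask: np.ndarray, class_agnostic_mask: np.ndarray) -> np.ndarray:
    """Return the union of two binary masks: a pixel counts as an object
    if either the known-class head or the class-agnostic head fires."""
    return np.logical_or(known_class_mask, class_agnostic_mask)
```

This way, an object missed by the known-class segmenter (e.g. an unseen category) is still flagged by the class-agnostic branch.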
A first-of-its-kind paired win-fail action understanding dataset with samples from the following domains: "General Stunts," "Internet Wins-Fails," "Trick Shots," and "Party Games." The task is to identify successful and failed attempts at various activities. Unlike existing action recognition datasets, intra-class variation is high, making the task challenging yet feasible.
A dataset for multimodal skill assessment, focused on assessing a piano player's skill level. Annotations include the player's skill level and the song's difficulty level. Bounding-box annotations around the pianists' hands are also provided.
AMT Objects is a large dataset of object-centric videos suitable for training and benchmarking models that generate 3D models of objects from a small number of photos. The dataset consists of multiple views of a large collection of object instances.
This is a video and image segmentation dataset for human heads and shoulders, relevant for creating elegant media for videoconferencing and virtual reality applications. The source data includes ten online conference-style green-screen videos. The authors extracted 3,600 frames from the videos, generated ground-truth masks for each person in each video, and then applied virtual backgrounds to the frames to generate the training/testing sets.
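The compositing step described above can be sketched as a simple alpha blend of the masked subject onto a virtual background. This is a minimal sketch, assuming H×W×3 uint8 images and an H×W float alpha mask; the actual array layout of the released data is not specified in the description.

```python
import numpy as np

def apply_virtual_background(frame, mask, background):
    """Composite the masked person onto a virtual background.
    frame, background: HxWx3 uint8 arrays; mask: HxW float alpha in [0, 1]."""
    alpha = mask.astype(np.float32)[..., None]  # HxWx1, broadcasts over channels
    out = alpha * frame.astype(np.float32) + (1.0 - alpha) * background.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)
```

Pixels where the mask is 1 keep the foreground subject; pixels where it is 0 are replaced by the background image.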
WiTA (Writing in The Air) is a dataset for the challenging writing-in-the-air task -- an elaborate task bridging vision and NLP. The dataset consists of five sub-datasets in two languages (Korean and English) and amounts to 209,926 video instances from 122 participants. Finger movement is captured with RGB cameras to ensure wide accessibility and cost-efficiency.
Acticipate is a publicly available dataset of human body-motion and eye-gaze recordings, acquired in an experimental scenario in which an actor interacts with three subjects. It contains synchronised and labelled video, gaze, and body-motion data in a dyadic interaction scenario.
Extended UCF Crime extends the UCF Crime dataset, which consists of 13 anomaly classes. The extension adds two new anomaly classes, "molotov bomb" and "protest", and adds 33 videos to the fighting class. In total, the extension adds 216 videos to the training set and 17 videos to the test set.
Near-Collision is a large-scale dataset of 13,658 egocentric video snippets of humans navigating indoor hallways. To support ground-truth annotation of human pose, each video is provided with the corresponding 3D LiDAR point cloud.
Dense Forest Trail is a UAV dataset collected from a variety of simulated environments in Unreal Engine.
The RBO dataset of articulated objects and interactions is a collection of 358 RGB-D video sequences (67:18 minutes) of humans manipulating 14 articulated objects under varying conditions (light, perspective, background, interaction). All sequences are annotated with ground truth of the poses of the rigid parts and the kinematic state of the articulated object (joint states) obtained with a motion capture system. We also provide complete kinematic models of these objects (kinematic structure and three-dimensional textured shape models). In 78 sequences the contact wrenches during the manipulation are also provided.
D-OCC is a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems. It is used to study common grounding in dynamic environments.
This dataset contains about 2,500 trajectories (with images and actions) of a Sawyer robot interacting with various objects.
Extended-YouTube Faces (E-YTF) extends the well-known YouTube Faces (YTF) dataset and is specifically designed to further push the challenges of face recognition by addressing open-set face identification from heterogeneous data, i.e., still images vs. video.
The TLFM dataset is structured into sequences of at least nine timesteps. It includes 9,696 images of both brightfield (BF) and green fluorescent protein (GFP) channels at a resolution of 256 × 256. It is a dataset for multi-domain (BF and GFP) microscopy image sequence generation.
The Reasonable Crowd dataset evaluates autonomous driving in a limited operating domain. The data consists of 92 traffic scenarios, each with multiple ways of traversing it. Multiple annotators expressed their preference between pairs of scenario traversals.
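The pairwise-preference annotations can be pictured as records naming two traversals of the same scenario and the one an annotator preferred. The record layout and tally below are a hypothetical illustration, not the dataset's actual schema.

```python
from collections import defaultdict

def win_counts(preferences):
    """Tally how often each traversal was preferred across annotators.
    Each record is (traversal_a, traversal_b, preferred)."""
    wins = defaultdict(int)
    for traversal_a, traversal_b, preferred in preferences:
        wins[preferred] += 1
    return dict(wins)
```

Such tallies are one simple way to turn pairwise human preferences into a per-traversal score for evaluating driving behaviour.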
TinyVIRAT-v2 is a benchmark dataset for recognizing real-world low-resolution activities in videos. The dataset comprises naturally occurring low-resolution actions. It extends the TinyVIRAT dataset and consists of actions with multiple labels. The videos are extracted from security footage, which makes them realistic and more challenging.
This is a private dataset collected for automatic analysis of psychological distress. It contains self-reported distress labels provided by human volunteers and consists of 30-minute interview recordings of participants.
Trope Understanding in Movies and Animations (TrUMAn) is a dataset intended for evaluating and developing learning systems that reason beyond visual signals.