VILT is a new benchmark collection of tasks linked to multimodal video content. The video linking collection includes annotations for 10 (recipe) tasks, which the annotators chose from a random subset of a collection of 2,275 high-quality 'Wholefoods' recipes. Linking annotations are provided for the 61 query steps across these tasks that contain cooking techniques, chosen from the 189 total recipe steps. As each method results in approximately 10 videos to annotate, the collection consists of 831 linking judgments.
VISOR is a dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, and it contains 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, and 67K hand-object relations, covering 36 hours of 179 untrimmed videos.
Trailers12k is a movie trailer dataset comprising 12,000 titles associated with ten genres. It is distinguished from other datasets by a collection procedure aimed at providing a high-quality, publicly available dataset.
Virtual-PedCross-4667 is a dataset for pedestrian crossing prediction. It consists of 4667 video sequences, comprising 2862 pedestrian crossing sequences and 1804 non-crossing sequences. In total, 745k video frames with a resolution of 1280×720 are provided.
A dataset of real-world underwater videos annotated with multi-object tracking labels. The data was collected off the coast of the Big Island of Hawaii, and the primary goal is to help scientists studying fish behavior, with the aim of conserving rare and beautiful fish species.
Discovering Interacted Objects (DIO) is a benchmark containing 51 interactions and 1,000+ objects designed for Spatio-temporal Human-Object Interaction (ST-HOI) detection.
INDRA is a dataset capturing videos of Indian roads from the pedestrian point of view. INDRA contains 104 videos comprising 26k 1080p frames, each annotated with a binary road-crossing safety label and vehicle bounding boxes.
Vident-lab is a dataset of dental videos with multi-task labels to facilitate further research in relevant video processing applications. Each sample comprises a low-quality frame, its high-quality counterpart, a teeth segmentation mask, and an inter-frame homography matrix. The homography warps the current frame to the previous frame with respect to the teeth. The dataset has training, validation, and test sets of 300, 29, and 80 videos, respectively.
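A minimal sketch of how such a per-pair homography can be applied is shown below, e.g. with OpenCV's warpPerspective; the file names and storage layout are illustrative assumptions, not the dataset's actual packaging.

```python
# Minimal sketch (assumed file layout): warp the current frame onto the
# previous frame using a per-pair 3x3 homography as described for Vident-lab.
import cv2
import numpy as np

prev_frame = cv2.imread("frame_000.png")            # hypothetical previous frame
curr_frame = cv2.imread("frame_001.png")            # hypothetical current frame
H = np.loadtxt("homography_001.txt").reshape(3, 3)  # assumed storage format

h, w = prev_frame.shape[:2]
# Map the current frame into the previous frame's coordinate system,
# aligning the teeth regions the homography was estimated on.
warped = cv2.warpPerspective(curr_frame, H, (w, h))

# Simple visual consistency check: blend the warped frame with the previous one.
overlay = cv2.addWeighted(prev_frame, 0.5, warped, 0.5, 0.0)
cv2.imwrite("alignment_check.png", overlay)
```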
MOET is a dataset consisting of gaze data from participants tracking specific objects, annotated with labels and bounding boxes, in crowded real-world videos, for training and evaluating attention decoding algorithms.
YouwikiHow is a dataset for Weakly-Supervised temporal Article Grounding (WSAG). It contains 47K videos and an average of 20.8 query sentences for each video.
This dataset provides a collection of 162K images and 70 videos of Meta-Humans. There are 10 highly realistic Meta-Humans expressing 7 facial expressions.
The 42Street dataset is created from a public recording of the 42Street theatre play [42street]. The play is 1.5 hours long and was split into 5 equally long parts of 20 minutes each, with various clothing changes between the different parts.
VISEM-Tracking is a dataset consisting of 20 video recordings (30 s each) of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by domain experts. It is an extension of the previously published VISEM dataset. In addition to the annotated data, unlabeled video clips are provided for easy access and analysis of the data.
STVD is the largest public dataset for the PVCD task. It comprises about 83 thousand videos with a total duration of more than 10 thousand hours, including more than 420 thousand video copy pairs. It offers different test sets for fine-grained performance characterization (frame degradation, global transformation, video speeding, etc.), with frame-level annotations for real-time detection and video alignment. Baseline comparisons are reported and show room for improvement. More information about the STVD dataset can be found in the publications [1, 2].
Accidental Turntables contains a challenging set of 41,212 images of cars with cluttered backgrounds, motion blur, and illumination changes, serving as a benchmark for 3D pose estimation.
Werewolf Among Us is a multimodal dataset for modeling persuasion behaviors. It contains 199 dialogue transcriptions and videos captured in a multi-player social deduction game setting, 26,647 utterance-level annotations of persuasion strategy, and game-level annotations of deduction game outcomes.
CAP-DATA is a large-scale benchmark consisting of 11,727 in-the-wild accident videos with over 2.19 million frames, together with labeled fact-effect-reason-introspection descriptions and temporal accident frame labels. It can support many useful tasks for accident inference, such as accident detection and prediction (AccidentDet/Pre), causal inference of accidents (Accident-Causal), accident classification (Accident-Cla), text-video based accident retrieval (Accident-Retri), and accident question answering (Accident-QA) in driving scenes.
Quality of Experience Evaluation of Interactive Virtual Environments with Audiovisual Scenes (QoEVAVE) provides an initial audiovisual database consisting of 12 sequences capturing real-life nature and urban scenes. The maximum video resolution is 7680×3840 (8K) at 60 frames per second, with 4th-order Ambisonics spatial audio (4OA). All video sequences are recorded with a minimum target duration of 60 seconds and designed to represent real-life settings for systematically evaluating various dimensions of uni-/multimodal perception, cognition, behavior, and quality of experience (QoE) in a controlled virtual environment. This database serves as a novel high-quality reference material with an equal focus on auditory and visual sensory information within the QoE community.
The Robot Tracking Benchmark (RTB) is a synthetic dataset that facilitates the quantitative evaluation of 3D tracking algorithms for multi-body objects. It was created using the procedural rendering pipeline BlenderProc. The dataset contains photo-realistic sequences with HDRi lighting and physically-based materials. Perfect ground truth annotations for camera and robot trajectories are provided in the BOP format. Many physical effects, such as motion blur, rolling shutter, and camera shaking, are accurately modeled to reflect real-world conditions. For each frame, four depth qualities exist to simulate sensors with different characteristics. While the first quality provides perfect ground truth, the second considers measurements with the distance-dependent noise characteristics of the Azure Kinect time-of-flight sensor. Finally, for the third and fourth quality, two stereo RGB images with and without a pattern from a simulated dot projector were rendered. Depth images were then reconstructed from these stereo image pairs.
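Because the ground truth follows the BOP convention, per-frame camera intrinsics and object poses can be read from the standard scene_camera.json and scene_gt.json files. The sketch below assumes the usual BOP directory layout; RTB's exact packaging may differ.

```python
# Minimal sketch, assuming a standard BOP layout (scene_camera.json /
# scene_gt.json per sequence); the sequence path is a hypothetical example.
import json
import numpy as np

seq_dir = "rtb/sequence_000000"  # assumed sequence directory

with open(f"{seq_dir}/scene_camera.json") as f:
    scene_camera = json.load(f)
with open(f"{seq_dir}/scene_gt.json") as f:
    scene_gt = json.load(f)

frame_id = "0"  # BOP keys frames by stringified integer indices
K = np.array(scene_camera[frame_id]["cam_K"]).reshape(3, 3)  # camera intrinsics

# Each entry gives one object's pose in the camera frame for this image.
for gt in scene_gt[frame_id]:
    R = np.array(gt["cam_R_m2c"]).reshape(3, 3)   # rotation, model to camera
    t = np.array(gt["cam_t_m2c"]) / 1000.0        # translation, mm -> m
    print(gt["obj_id"], R, t)
```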
Dubbing Test Set consists of two subsets extracted from the En→De test set of COVOST-2, a large-scale multilingual speech translation corpus based on Common Voice. The first subset (test91) is created by randomly sampling 91 sentences, while the second (test101) comprises 101 sentences randomly sampled from the longest 10% of the German side of the test set.
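The described subset construction can be sketched as follows; the placeholder sentence pairs, field layout, and random seed are illustrative assumptions and not the authors' released scripts.

```python
# Minimal sketch of the described sampling: test91 is a uniform random sample,
# test101 is drawn from the longest 10% of German target sentences.
import random

random.seed(0)

# Hypothetical (English source, German target) pairs standing in for the
# COVOST-2 En->De test set; the real set is read from the corpus release.
pairs = [(f"src sentence {i}", "wort " * (i % 40 + 1)) for i in range(2000)]

# test91: 91 sentence pairs sampled uniformly at random.
test91 = random.sample(pairs, 91)

# test101: keep the longest 10% by German target length, then sample 101 pairs.
by_de_length = sorted(pairs, key=lambda p: len(p[1]), reverse=True)
longest_tenth = by_de_length[: len(pairs) // 10]
test101 = random.sample(longest_tenth, 101)

print(len(test91), len(test101))
```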