This dataset is rendered synthetically using a library of standard 3D objects and tests the ability to recognize compositions of object movements that require long-term reasoning.
The large-scale MUSIC-AVQA dataset of musical performances contains 45,867 question-answer pairs distributed across 9,288 videos totaling over 150 hours. The QA pairs are divided into 3 modal scenarios, covering 9 question types and 33 question templates. Because the AVQA task is open-ended, the 42 possible answers form a fixed set from which models select.
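A minimal sketch of what a single QA entry might look like and how the open-ended task reduces to classification over the fixed answer set; the field names and the truncated vocabulary below are illustrative assumptions, not the released schema.

# Hypothetical MUSIC-AVQA entry; keys are assumptions, not the official JSON schema.
qa_entry = {
    "video_id": "00000093",
    "modal_scenario": "audio-visual",   # one of the 3 modal scenarios
    "question_type": "counting",        # one of the 9 question types
    "template": "How many <object> are sounding in the video?",
    "question": "How many instruments are sounding in the video?",
    "answer": "two",
}

# The 42 possible answers form a closed vocabulary, so open-ended answering can be
# treated as classification: score every candidate answer and pick the argmax.
answer_vocab = ["yes", "no", "one", "two", "three"]  # truncated for illustration
answer_to_index = {a: i for i, a in enumerate(answer_vocab)}
label = answer_to_index[qa_entry["answer"]]
print(label)  # 3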
TAO is a federated dataset for Tracking Any Object, containing 2,907 high-resolution videos captured in diverse environments and averaging half a minute in length. A bottom-up approach was used to discover a large vocabulary of 833 categories, an order of magnitude more than in prior tracking benchmarks.
WildDeepfake is a dataset for real-world deepfake detection consisting of 7,314 face sequences extracted from 707 deepfake videos collected entirely from the internet. WildDeepfake is a small dataset that can be used, in addition to existing datasets, to develop more effective detectors against real-world deepfakes.
CoMplex video Object SEgmentation (MOSE) is a dataset for studying the tracking and segmentation of objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of the MOSE dataset is its complex scenes with crowded and occluded objects.
CityFlow is a city-scale traffic camera dataset consisting of more than 3 hours of synchronized HD videos from 40 cameras across 10 intersections, with the two farthest simultaneous cameras 2.5 km apart. The dataset contains more than 200K annotated bounding boxes covering a wide range of scenes, viewing angles, vehicle models, and urban traffic flow conditions.
The FLIC dataset contains 5,003 images from popular Hollywood movies. The images were obtained by running a state-of-the-art person detector on every tenth frame of 30 movies. People detected with high confidence (roughly 20K candidates) were then sent to the crowdsourcing marketplace Amazon Mechanical Turk to obtain ground-truth labelling. Each image was annotated by five Turkers to label 10 upper-body joints. For each image, the median of the five labellings was taken to be robust to outlier annotations. Finally, images were manually rejected if the person was occluded or severely non-frontal.
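A minimal sketch of the median-of-five aggregation described above, assuming each of the five Turkers provides (x, y) coordinates for the same 10 upper-body joints; the synthetic annotations below are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical annotations: 5 annotators x 10 joints x 2 coordinates (x, y).
annotations = rng.normal(loc=100.0, scale=3.0, size=(5, 10, 2))
annotations[0] += 40.0  # simulate one outlier annotator

# Taking the per-joint median across annotators is robust to such outliers.
ground_truth = np.median(annotations, axis=0)  # shape (10, 2)
print(ground_truth.shape)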
OpenLane is the first real-world and, to date, the largest-scale 3D lane dataset. It builds on the public Waymo Open Dataset perception data and provides lane and closest-in-path object (CIPO) annotations for 1,000 segments. In total, OpenLane contains 200K frames and over 880K carefully annotated lanes. The OpenLane dataset is publicly released to aid the research community in making advancements in 3D perception and autonomous driving technology.
TAP-Vid is a benchmark that contains both real-world videos with accurate human annotations of point tracks and synthetic videos with perfect ground-truth point tracks. It is designed for a new task called Tracking Any Point.
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
The DailyActivity3D dataset is a daily activity dataset captured by a Kinect device. There are 16 activity types: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, and sit down. Where possible, each subject performs an activity in two different poses: “sitting on sofa” and “standing”. The total number of activity samples is 320. The dataset is designed to cover humans’ daily activities in the living room. When the performer stands close to the sofa or sits on the sofa, the 3D joint positions extracted by the skeleton tracker are very noisy. Moreover, most of the activities involve human-object interactions. These factors make the dataset more challenging.
This dataset was prepared from 88 open-source YouTube cooking videos. The YouCook dataset contains videos of people cooking various recipes. The videos were downloaded from YouTube and are all in the third-person viewpoint; they represent a significantly more challenging visual problem than existing cooking and kitchen datasets (the background kitchen/scene differs across many of the videos, and most videos have dynamic camera changes). In addition, frame-by-frame object and action annotations are provided for the training data (as well as a number of precomputed low-level features). Finally, each video has a number of human-provided natural language descriptions (on average, there are eight different descriptions per video). The dataset was created to serve as a benchmark for describing complex real-world videos with natural language descriptions.
BEDLAM is a large-scale synthetic video dataset designed to train and test algorithms on the task of 3D human pose and shape estimation (HPS). It contains diverse body shapes, skin tones, and motions. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation.
The dataset contains over 15K images of 20 people (6 females and 14 males; 4 people were recorded twice). For each frame, a depth image, the corresponding RGB image (both 640x480 pixels), and the annotation are provided. The head pose range covers about ±75 degrees of yaw and ±60 degrees of pitch. Ground truth is provided in the form of the 3D location of the head and its rotation.
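A minimal sketch, assuming the rotation ground truth is given as a 3x3 rotation matrix, of how yaw and pitch (in degrees) could be recovered to check a frame against the stated ±75/±60 degree ranges; the ZYX (yaw-pitch-roll) axis convention used here is an assumption.

import numpy as np

def yaw_pitch_from_rotation(R):
    # Assumed ZYX (yaw-pitch-roll) decomposition of the head rotation matrix.
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    pitch = np.degrees(np.arcsin(-R[2, 0]))
    return float(yaw), float(pitch)

R = np.eye(3)  # hypothetical frontal pose
print(yaw_pitch_from_rotation(R))  # close to (0.0, 0.0) for a frontal pose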
The EPIC-KITCHENS-55 dataset comprises a set of 432 egocentric videos recorded by 32 participants in their kitchens at 60fps with a head-mounted camera. There is no guiding script; participants freely perform activities in their kitchens related to cooking, food preparation, washing up, and so on. Each video is split into short action segments (mean duration 3.7s) with specific start and end times and a verb and noun annotation describing the action (e.g. ‘open fridge‘). There are 125 verb classes and 331 noun classes. The dataset is divided into one train and two test splits.
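A minimal sketch of one action-segment annotation as described above; the field names and the timestamp format are assumptions and may differ from the released annotation files.

# Hypothetical EPIC-KITCHENS-55 action segment; keys are illustrative assumptions.
segment = {
    "video_id": "P01_01",
    "start_timestamp": "00:00:01.23",
    "stop_timestamp": "00:00:04.56",
    "verb": "open",    # one of the 125 verb classes
    "noun": "fridge",  # one of the 331 noun classes
}

def to_seconds(ts):
    # Convert an "HH:MM:SS.ss" timestamp to seconds.
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

duration = to_seconds(segment["stop_timestamp"]) - to_seconds(segment["start_timestamp"])
print(f"{duration:.2f}s")  # 3.33s, close to the 3.7s mean segment length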
A2D (Actor-Action Dataset) is a dataset for simultaneously inferring actors and actions in videos. A2D has seven actor classes (adult, baby, ball, bird, car, cat, and dog) and eight action classes (climb, crawl, eat, fly, jump, roll, run, and walk), plus a no-action class that is also considered. A2D has 3,782 videos with at least 99 instances per valid actor-action tuple, and videos are labeled with both pixel-level actors and actions for sampled frames. The A2D dataset serves as a large-scale testbed for various vision problems: video-level single- and multiple-label actor-action recognition, instance-level object segmentation/co-segmentation, and pixel-level actor-action semantic segmentation, to name a few.
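A minimal sketch of the joint actor-action label space described above; the filtering to "valid" tuples is only indicated, since the official list of valid pairs is not reproduced here.

from itertools import product

actors = ["adult", "baby", "ball", "bird", "car", "cat", "dog"]
actions = ["climb", "crawl", "eat", "fly", "jump", "roll", "run", "walk", "none"]

# Not every pairing is plausible (e.g. a ball cannot eat), so the benchmark keeps
# only valid actor-action tuples, each with at least 99 labeled instances.
candidate_pairs = list(product(actors, actions))
print(len(candidate_pairs))  # 63 candidate pairs before filtering to the valid subset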
AVSpeech is a large-scale audio-visual dataset comprising speech clips with no interfering background signals. The segments are of varying length, between 3 and 10 seconds long, and in each clip the only visible face in the video and audible sound in the soundtrack belong to a single speaking person. In total, the dataset contains roughly 4700 hours of video segments with approximately 150,000 distinct speakers, spanning a wide variety of people, languages and face poses.
DeformingThings4D is a synthetic dataset containing 1,972 animation sequences spanning 31 categories of humanoids and animals. It provides 200 animations for humanoids and 1772 animations for animals.
ACID consists of thousands of aerial drone videos of coastline and nature scenes collected from YouTube. Structure-from-motion is used to obtain camera poses.
The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language (NL) queries. It consists of over 10,000 YouTube videos covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point-scale saliency scores for all query-relevant clips.
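A minimal sketch of what one such annotation might look like; the keys and values below are illustrative assumptions, not the exact released schema.

# Hypothetical QVHighlights-style annotation; field names are assumptions.
annotation = {
    "qid": 1234,
    "query": "A person packs a suitcase before a trip.",
    "vid": "some_youtube_video_id",
    "relevant_windows": [[12.0, 38.0], [62.0, 74.0]],  # (start, end) moments in seconds
    "saliency_scores": [[3, 4, 4], [2, 3, 3]],         # five-point scores per relevant clip
}

# Highlight detection can then be framed as scoring short clips against the query
# and keeping those whose predicted saliency exceeds a threshold.
longest_moment = max(annotation["relevant_windows"], key=lambda w: w[1] - w[0])
print(longest_moment)  # [12.0, 38.0]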