1,019 machine learning datasets
1,019 dataset results
NT-VOT211 consists of 211 diverse videos, offering 211,000 well-annotated frames with 8 attributes including camera motion, deformation, fast motion, motion blur, tiny target, distractors, occlusion and out-of-view. To the best of our knowledge, it is the largest night-time tracking benchmark to-date that is specifically designed to address unique challenges such as adverse visibility, image blur, and distractors inherent to night-time tracking scenarios.
URMP (University of Rochester Multi-Modal Musical Performance) is a dataset for facilitating audio-visual analysis of musical performances. The dataset comprises 44 simple multi-instrument musical pieces assembled from coordinated but separately recorded performances of individual tracks. For each piece the dataset provided the musical score in MIDI format, the high-quality individual instrument audio recordings and the videos of the assembled pieces.
The RGBT234 dataset is a comprehensive video dataset specifically designed for RGB-T (Red-Green-Blue and Thermal) tracking purposes. This dataset addresses the limitations of existing datasets like OSU-CT, LITIV, and GTOT in terms of size. RGBT234 consists of 234 RGB-T videos, each containing both an RGB video and a thermal video. The total number of frames in the dataset is approximately 234,000, with the largest video pair containing up to 8,000 frames.Each frame in the RGBT234 dataset is annotated with a minimum bounding box that covers the target for both the RGB and thermal modalities. The dataset also includes various environmental challenges such as rainy conditions, nighttime scenes, cold and hot weather scenarios. To analyze the performance of different tracking algorithms based on specific attributes, the RGBT234 dataset annotates 12 attributes and provides baseline trackers, including both deep learning and non-deep learning methods like structured SVM, sparse representation
The EgoGesture dataset contains 2,081 RGB-D videos, 24,161 gesture samples and 2,953,224 frames from 50 distinct subjects.
JTA is a dataset for people tracking in urban scenarios by exploiting a photorealistic videogame. It is up to now the vastest dataset (about 500.000 frames, almost 10 million body poses) of human body parts for people tracking in urban scenarios.
MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos or natural language moment retrieval. MAD exploits available audio descriptions of mainstream movies. Such audio descriptions are redacted for visually impaired audiences and are therefore highly descriptive of the visual content being displayed. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video, and provides a unique setup for video grounding as the visual stream is truly untrimmed with an average video duration of 110 minutes. 2 orders of magnitude longer than legacy datasets.
CelebV-HQ is a large-scale video facial attributes dataset with annotations. CelebV-HQ contains 35,666 video clips involving 15,653 identities and 83 manually labeled facial attributes covering appearance, action, and emotion.
Activity recognition research has shifted focus from distinguishing full-body motion patterns to recognizing complex interactions of multiple entities. Manipulative gestures – characterized by interactions between hands, tools, and manipulable objects – frequently occur in food preparation, manufacturing, and assembly tasks, and have a variety of applications including situational support, automated supervision, and skill assessment. With the aim to stimulate research on recognizing manipulative gestures we introduce the 50 Salads dataset. It captures 25 people preparing 2 mixed salads each and contains over 4h of annotated accelerometer and RGB-D video data. Including detailed annotations, multiple sensor types, and two sequences per participant, the 50 Salads dataset may be used for research in areas such as activity recognition, activity spotting, sequence analysis, progress tracking, sensor fusion, transfer learning, and user-adaptation.
The BirdSong dataset consists of audio recordings of bird songs at the H. J. Andrews (HJA) Experimental Forest, using unattended microphones. The goal of the dataset is to provide data to automatically identify the species of bird responsible for each utterance in these recordings. The dataset contains 548 10-seconds audio recordings.
V2V4Real is a large-scale real-world multi-modal dataset for V2V perception. The data is collected by two vehicles equipped with multi modal sensors driving together through diverse scenarios. It covers a driving area of 410 km comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HDMaps that cover all the driving routes.
MOT20 is a dataset for multiple object tracking. The dataset contains 8 challenging video sequences (4 train, 4 test) in unconstrained environments, from crowded places such as train stations, town squares and a sports stadium. Image Source: https://motchallenge.net/vis/MOT20-04
The EgoHands dataset contains 48 Google Glass videos of complex, first-person interactions between two people. The main intention of this dataset is to enable better, data-driven approaches to understanding hands in first-person computer vision. The dataset offers
The Wide Multi Channel Presentation Attack (WMCA) database consists of 1941 short video recordings of both bonafide and presentation attacks from 72 different identities. The data is recorded from several channels including color, depth, infra-red, and thermal.
A new multitask action quality assessment (AQA) dataset, the largest to date, comprising of more than 1600 diving samples; contains detailed annotations for fine-grained action recognition, commentary generation, and estimating the AQA score. Videos from multiple angles provided wherever available.
Contains 68,536 activity instances in 68.8 hours of first and third-person video, making it one of the largest and most diverse egocentric datasets available. Charades-Ego furthermore shares activity classes, scripts, and methodology with the Charades dataset, that consist of additional 82.3 hours of third-person video with 66,500 activity instances.
The Ankara University Turkish Sign Language Dataset (AUTSL) is a large-scale, multimode dataset that contains isolated Turkish sign videos. It contains 226 signs that are performed by 43 different signers. There are 38,336 video samples in total. The samples are recorded using Microsoft Kinect v2 in RGB, depth and skeleton formats. The videos are provided at a resolution of 512×512. The skeleton data contains spatial coordinates, i.e. (x, y), of the 25 junction points on the signer body that are aligned with 512×512 data.
CholecT50 is a dataset of endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with triplet information in the form of <instrument, verb, target>. The dataset is a collection of 50 videos consisting of 45 videos from the Cholec80 dataset and 5 videos from an in-house dataset of the same surgical procedure.
Ego4D is a massive-scale egocentric video dataset and benchmark suite. It offers 3,025 hours of daily life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 855 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, a host of new benchmark challenges are presented, centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversatio
V2X-Sim, short for vehicle-to-everything simulation, is the a synthetic collaborative perception dataset in autonomous driving developed by AI4CE Lab at NYU and MediaBrain Group at SJTU to facilitate collaborative perception between multiple vehicles and roadside infrastructure. Data is collected from both roadside and vehicles when they are presented near the same intersection. With information from both the roadside infrastructure and vehicles, the dataset aims to encourage research on collaborative perception tasks.
Motivation Multi-object tracking (MOT) is a fundamental task in computer vision, aiming to estimate objects (e.g., pedestrians and vehicles) bounding boxes and identities in video sequences.