1,019 machine learning datasets
LVOS is a dataset for long-term video object segmentation (VOS). It consists of 220 videos with a total duration of 421 minutes. The videos in our LVOS last 1.59 minutes on average, which is 20 times longer than videos in existing VOS datasets. Each video includes various attributes, especially challenges arising in the wild, such as long-term reappearing objects and cross-temporal similar objects.
WanJuan is a large-scale training corpus that includes multiple modalities. The dataset incorporates text, image-text, and video modalities, with a total volume exceeding 2TB.
CASIA-B is a large multiview gait database, created in January 2005. There are 124 subjects, and the gait data was captured from 11 views. Three variations, namely view angle, clothing and carrying condition changes, are considered separately. Besides the video files, we also provide human silhouettes extracted from the video files. The detailed information about Dataset B and an evaluation framework can be found in this paper.
The Replay-Mobile Database for face spoofing consists of 1,190 video clips of photo and video attack attempts on 40 clients, under different lighting conditions. These videos were recorded with current devices from the market: an iPad Mini 2 (running iOS) and an LG G4 smartphone (running Android). The database was produced at the Idiap Research Institute (Switzerland) within the framework of a collaboration with the Galician Research and Development Center in Advanced Telecommunications (Gradiant, Spain).
ActivityNet-Entities augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models with this data and, importantly, evaluating how grounded or "true" such models are to the videos they describe.
The Car Crash Dataset (CCD) is collected for traffic accident analysis. It contains real traffic accident videos captured by dashcams mounted on driving vehicles, which is critical to developing safety-guaranteed self-driving systems. CCD is distinguished from existing datasets by its diversified accident annotations, including environmental attributes (day/night, snowy/rainy/good weather conditions), whether the ego-vehicle is involved, accident participants, and accident reason descriptions.
The Fudan-ShanghaiTech dataset (FDST) is a dataset for video crowd counting. It contains 15K frames with about 394K annotated heads captured from 13 different scenes.
VidSitu is a dataset for the task of semantic role labeling in videos (VidSRL). It is a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. Entities are co-referenced across events within a movie clip, and events are connected to each other via event-event relations. Clips in VidSitu are drawn from a large collection of movies (∼3K) and have been chosen to be both complex (∼4.2 unique verbs within a video) and diverse (∼200 verbs have more than 100 annotations each).
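To make the annotation structure concrete, the sketch below shows one plausible way such a clip record could be organized: one verb-plus-roles event per 2-second interval, clip-level co-referenced entities, and event-event relations. This is a hypothetical illustration, not VidSitu's official file format; all field names and values are assumptions.

```python
# Illustrative sketch only (not VidSitu's official schema): a 10-second clip
# annotated with a verb and semantic roles every 2 seconds, with entities
# co-referenced across events and events linked by relations.
clip_annotation = {
    "clip_id": "movie_0421_clip_07",          # hypothetical identifier
    "events": [
        {"start": 0, "end": 2, "verb": "run",   "roles": {"runner": "E1", "scene": "street"}},
        {"start": 2, "end": 4, "verb": "chase", "roles": {"chaser": "E2", "chased": "E1"}},
        # ... one event per 2-second interval, five events per 10-second clip
    ],
    # Entity IDs (E1, E2, ...) are shared across events within the clip.
    "entities": {"E1": "man in red jacket", "E2": "police officer"},
    # Events are connected by event-event relations (types here are illustrative).
    "relations": [{"src": 0, "dst": 1, "type": "reaction to"}],
}

print(clip_annotation["events"][1]["verb"])  # chase
```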
BOBSL is a large-scale dataset of British Sign Language (BSL). It comprises 1,962 episodes (approximately 1,400 hours) of BSL-interpreted BBC broadcast footage accompanied by written English subtitles. BOBSL covers a wide range of topics, from horror, period and medical dramas to history, nature and science documentaries, sitcoms, children's shows, and programs covering cooking, beauty, business and travel. The dataset features a total of 39 signers. Distinct signers appear in the training, validation and test sets for signer-independent evaluation.
Subjective video quality assessment (VQA) strongly depends on semantics, context, and the types of visual distortions. Many existing VQA databases cover small numbers of video sequences with artificial distortions. When testing newly developed Quality of Experience (QoE) models and metrics, they are commonly evaluated against subjective data from such databases, which are the result of perception experiments. However, since the aim of these QoE models is to accurately predict the quality of natural videos, such artificially distorted video databases are an insufficient basis for learning. Additionally, their small size makes them only marginally usable for state-of-the-art learning systems, such as deep learning. In order to give a better basis for the development and evaluation of objective VQA methods, we have created a larger dataset of natural, real-world video sequences with corresponding subjective mean opinion scores (MOS) gathered through crowdsourcing. We took YFCC100m as a baseline database.
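As a minimal sketch of how crowdsourced ratings are typically turned into per-clip MOS values (the dataset's actual aggregation pipeline may additionally screen unreliable raters and report confidence intervals; identifiers below are hypothetical):

```python
# Aggregate crowdsourced ratings into per-clip mean opinion scores (MOS).
from collections import defaultdict
from statistics import mean

def compute_mos(ratings):
    """ratings: iterable of (clip_id, score) pairs, e.g. scores on a 1-5 ACR scale."""
    per_clip = defaultdict(list)
    for clip_id, score in ratings:
        per_clip[clip_id].append(score)
    # MOS is the mean rating per clip.
    return {clip_id: mean(scores) for clip_id, scores in per_clip.items()}

ratings = [("clip_001", 4), ("clip_001", 5), ("clip_001", 3), ("clip_002", 2), ("clip_002", 3)]
print(compute_mos(ratings))  # {'clip_001': 4, 'clip_002': 2.5}
```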
SeaDronesSee is a large-scale dataset aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. The dataset provides three tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own dataset and leaderboard.
The dataset was created for the video quality assessment problem. It comprises 36 clips from Vimeo, selected from 18,000+ open-source clips with high bitrate (licensed CC BY or CC0).
Have you wondered how autonomous mobile robots should share space with humans in public spaces? Are you interested in developing autonomous mobile robots that can navigate within human crowds in a socially compliant manner? Do you want to analyze human reactions and behaviors in the presence of mobile robots of different morphologies?
We create a benchmark dataset named ReVOS. This dataset comprises 35,074 pairs of instruction-mask sequences derived from 1,042 diverse videos. In contrast to traditional referring video segmentation datasets, such as Ref-YouTube-VOS and MeViS, which primarily contain explicit short phrases, ReVOS includes text instructions that necessitate a sophisticated understanding of both video content and general world knowledge.
DAiSEE is a multi-label video classification dataset comprising 9,068 video snippets captured from 112 users for recognizing the user affective states of boredom, confusion, engagement, and frustration "in the wild". The dataset has four levels of labels, namely very low, low, high, and very high, for each of the affective states; these are crowd annotated and correlated with a gold standard annotation created using a team of expert psychologists.
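One plausible way to encode this label scheme, with an ordinal level per affective state for each snippet, is sketched below. The field names and the snippet identifier are illustrative assumptions, not the dataset's official format.

```python
# Illustrative encoding of DAiSEE-style labels: each snippet carries an
# ordinal level (index into LEVELS) for each of the four affective states.
from dataclasses import dataclass

LEVELS = ("very low", "low", "high", "very high")
STATES = ("boredom", "confusion", "engagement", "frustration")

@dataclass
class SnippetLabel:
    snippet_id: str
    levels: dict  # state name -> index into LEVELS

    def as_vector(self):
        # Fixed-order vector of ordinal levels, convenient for multi-output training.
        return [self.levels[s] for s in STATES]

label = SnippetLabel("user112_clip0007",  # hypothetical ID
                     {"boredom": 1, "confusion": 0, "engagement": 3, "frustration": 0})
print(label.as_vector())  # [1, 0, 3, 0]
```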
The VIPL-HR database is a database for remote heart rate (HR) estimation from face videos under less-constrained situations. It contains 2,378 visible light (VIS) videos and 752 near-infrared (NIR) videos of 107 subjects. Nine different conditions, including various head movements and illumination conditions, are taken into consideration. All the videos are recorded using a Logitech C310, a RealSense F200, and the front camera of a HUAWEI P9 smartphone, and the ground-truth HR is recorded using a CONTEC CMS60C BVP sensor (an FDA-approved device).
The One-Minute Gradual-Emotional Behavior (OMG-Emotion) dataset is composed of YouTube videos that are around a minute in length and are annotated with continuous emotional behavior in mind. The videos were selected using a crawler technique that uses specific keywords based on long-term emotional behaviors such as "monologues", "auditions", "dialogues" and "emotional scenes".
The SUN-SEG dataset is a high-quality per-frame annotated VPS dataset, which includes 158,690 frames from the famous SUN dataset. It extends the labels with diverse types, i.e., object mask, boundary, scribble, polygon, and visual attribute. It also introduces the pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.
A memorability dataset with 10,000 3-second videos. Each video has upwards of 90 human annotations, and the split-half consistency of this dataset is 0.73 (best in class for video memorability datasets).
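Split-half consistency is commonly estimated by randomly splitting the annotators for each video into two halves, correlating the two resulting per-video score sets, and averaging over many random splits. The sketch below follows that generic recipe (using a Spearman rank correlation); the exact protocol used for this dataset may differ, and the toy data at the end is fabricated purely for illustration.

```python
# Hedged sketch of a split-half consistency estimate over per-video annotations.
import numpy as np
from scipy.stats import spearmanr

def split_half_consistency(annotations, n_splits=25, seed=0):
    """annotations: list of per-video lists of memorability responses (e.g. 0/1 recalls)."""
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(n_splits):
        half_a, half_b = [], []
        for responses in annotations:
            responses = np.asarray(responses, dtype=float)
            perm = rng.permutation(len(responses))   # random split of annotators
            mid = len(responses) // 2
            half_a.append(responses[perm[:mid]].mean())
            half_b.append(responses[perm[mid:]].mean())
        corrs.append(spearmanr(half_a, half_b).correlation)
    return float(np.mean(corrs))

# Toy usage with fabricated responses (90 annotations per video, as in the description):
toy = [np.random.default_rng(v).integers(0, 2, size=90).tolist() for v in range(50)]
print(round(split_half_consistency(toy), 3))
```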