Please find more details of this dataset at https://alex-xun-xu.github.io/ProjectPage/CVPR_18/index.html
The Florentine dataset is a dataset of facial gestures containing facial clips from 160 subjects (both male and female), in which gestures were either artificially produced on request (posed) or genuinely elicited by a shown stimulus (induced). 1,032 clips were captured for posed expressions and 1,745 clips for induced facial expressions, for a total of 2,777 video clips. Genuine facial expressions were induced using visual stimuli, i.e. videos selected at random from a bank of YouTube videos chosen to elicit a specific emotion.
The Daimler Monocular Pedestrian Detection dataset is a dataset for pedestrian detection in urban environments. The training set contains 15,560 pedestrian samples (image cut-outs at 48×96 resolution) and 6,744 additional full images without pedestrians for extracting negative samples. The test set is an independent sequence of more than 21,790 images with 56,492 pedestrian labels (fully visible or partially occluded), captured from a vehicle during a 27-minute drive through urban traffic.
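Since the negative training data is distributed as full, pedestrian-free images, a minimal sketch of how 48×96 negative crops could be sampled from them is given below; the function name, path handling and crop count are hypothetical and not part of the official tooling.

import random
from PIL import Image

PATCH_W, PATCH_H = 48, 96  # pedestrian cut-out resolution quoted above

def sample_negative_crops(image_path, n_crops=5, seed=0):
    # Randomly crop n_crops 48x96 patches from an image known to contain no pedestrians.
    rng = random.Random(seed)
    img = Image.open(image_path)
    w, h = img.size
    crops = []
    for _ in range(n_crops):
        x = rng.randint(0, w - PATCH_W)
        y = rng.randint(0, h - PATCH_H)
        crops.append(img.crop((x, y, x + PATCH_W, y + PATCH_H)))
    return crops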
The Caltech Resident-Intruder Mouse dataset (CRIM13) consists of 237 pairs of videos (recorded with synchronized top and side views) of pairs of mice engaging in social behavior, catalogued into thirteen different actions. Each video lasts ~10 minutes, for a total of 88 hours of video and 8 million frames. A team of behavior experts annotated each video on a frame-by-frame basis for a state-of-the-art study of the neurophysiological mechanisms involved in aggression and courtship in mice.
The Oxford Town Center dataset is a 5-minute video with 7,500 annotated frames, divided into 6,500 frames for training and 1,000 for testing, used for pedestrian detection; a sketch of such a split is shown below. The data was recorded from a CCTV camera in Oxford for research and development in activity and face recognition.
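A minimal sketch of reproducing such a frame-level train/test split, assuming the per-frame bounding boxes are available as a CSV; the column names and the chronological ordering of the split are assumptions, not the official ground-truth format.

import csv
from collections import defaultdict

def load_town_centre_split(csv_path, n_train=6500):
    # Group boxes by frame index, then take the first 6,500 frames for
    # training and the remaining 1,000 for testing (the chronological
    # ordering of the split is an assumption here).
    boxes_by_frame = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            box = (float(row["x1"]), float(row["y1"]), float(row["x2"]), float(row["y2"]))
            boxes_by_frame[int(row["frame"])].append(box)
    frames = sorted(boxes_by_frame)
    train = {k: boxes_by_frame[k] for k in frames[:n_train]}
    test = {k: boxes_by_frame[k] for k in frames[n_train:]}
    return train, test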
MobiFace is the first dataset for single face tracking in mobile situations. It consists of 80 unedited live-streaming mobile videos captured by 70 different smartphone users in fully unconstrained environments. Over 95K bounding boxes are manually labelled. The videos are carefully selected to cover typical smartphone usage. The videos are also annotated with 14 attributes, including 6 newly proposed attributes and 8 commonly seen in object tracking.
EGO-CH is a dataset of egocentric videos for visitors' behavior understanding. The dataset has been collected in two different cultural sites and includes more than 27 hours of video acquired by 70 subjects, including volunteers and 60 real visitors. The overall dataset includes labels for 26 environments and over 200 Points of Interest (POIs). Specifically, each video of EGO-CH has been annotated with 1) temporal labels specifying the current location of the visitor and the observed POI, and 2) bounding box annotations around POIs. A large subset of the dataset, consisting of 60 videos, is also associated with surveys filled out by the visitors at the end of each visit.
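A minimal sketch of how one EGO-CH video's labels could be represented in code; the fields mirror the annotation types listed above, but the names and the survey representation are hypothetical, not the official schema.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class TemporalLabel:
    start_frame: int
    end_frame: int
    environment: str              # one of the 26 labelled environments
    observed_poi: Optional[str]   # one of the 200+ POIs, if any is observed

@dataclass
class PoiBox:
    frame: int
    poi: str
    box: Tuple[int, int, int, int]  # (x, y, width, height) around the POI

@dataclass
class EgoChVideo:
    video_id: str
    temporal_labels: List[TemporalLabel] = field(default_factory=list)
    poi_boxes: List[PoiBox] = field(default_factory=list)
    survey: Optional[Dict[str, str]] = None  # visitor survey, where available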
MlGesture is a dataset for hand gesture recognition tasks, recorded in a car with 5 different sensor types at two different viewpoints. The dataset contains over 1300 hand gesture videos from 24 participants and features 9 different hand gesture symbols. One sensor cluster with five different cameras is mounted in front of the driver in the center of the dashboard. A second sensor cluster is mounted on the ceiling looking straight down.
METU-VIREF is a video referring expression dataset comprising videos from the VIRAT Ground and ILSVRC2015 VID datasets. VIRAT is a surveillance dataset and contains mainly people and vehicles. To align with this and restrict the domain, only ILSVRC videos that contain vehicles are used. The METU-VIREF dataset does not include the videos themselves (they need to be downloaded from the respective sources), only referring expressions for video sequences containing an object pair. For this, object pairs were chosen whose relation allowed a meaningful referring expression to be written.
EDUVSUM contains educational videos with subtitles from three popular e-learning platforms: Edx, YouTube, and TIB AV-Portal. The videos cover the following topics: a crash course on the history of science and engineering, computer science, Python and web programming, machine learning and computer vision, the Internet of Things (IoT), and software engineering. In total, the current version of the dataset contains 98 videos with ground-truth values annotated by a user with an academic background in computer science.
This dataset consists of 18 movies, ranging in duration from 10 to 104 minutes, taken from the OVSD dataset (Rotman et al., 2016). For these videos, the summary length limit is set to the minimum of 4 minutes and 10% of the video length.
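As a worked example of that rule, the limit is simply the smaller of 240 seconds and a tenth of the movie's duration; the function below is illustrative, not part of OVSD.

def summary_limit_seconds(video_length_seconds):
    # The limit is the smaller of 4 minutes and 10% of the movie's length.
    return min(4 * 60, 0.10 * video_length_seconds)

# A 10-minute movie gets a 60-second limit (10% of 600 s);
# a 104-minute movie is capped at the full 240-second (4-minute) limit.
assert summary_limit_seconds(10 * 60) == 60.0
assert summary_limit_seconds(104 * 60) == 240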
VOT2015 is a visual object tracking dataset. The dataset comprises 60 short sequences showing various objects in challenging backgrounds. The sequences were chosen from a large pool of sequences from different sources.
UCF50 is an action recognition dataset with 50 action categories, consisting of realistic videos taken from YouTube. It is an extension of the YouTube Action dataset (UCF11), which has 11 action categories.
ActioNet is a video task-based dataset collected in a synthetic 3D environment. It contains 3,038 annotated videos and hierarchical task structures over 65 individual household tasks from 120 different scenes. Each task is annotated across three to five different scenes by 10 different annotators. The tasks can be broken down into four categories: living room, bedroom, bathroom, and kitchen.
Consists of 10,000+ video-sentence pairs, each accompanied by an annotated, sentence-specified video thumbnail.
Contains a large number of online videos and subtitles.
A dataset for audio-visual event classification and localization in the context of office environments. The audio-visual dataset is composed of 11 event classes recorded at several realistic positions in two different rooms. Two types of sequences were recorded, depending on the number of events per sequence. The dataset comprises 2,662 single-label sequences and 2,724 multi-label sequences, corresponding to a total of 5.24 hours.
CLAD (Complex and Long Activities Dataset) is an activity dataset which exhibits real-life and diverse scenarios of complex, temporally-extended human activities and actions. The dataset consists of a set of videos of actors performing everyday activities in a natural and unscripted manner. The dataset was recorded using a static Kinect 2 sensor, which is commonly used on many robotic platforms. The dataset comprises RGB-D images, point cloud data, and automatically generated skeleton tracks, in addition to crowdsourced annotations.
A large-scale, comprehensive collection of dashcam videos collected by vehicles on DiDi's platform. D2-City contains more than 10,000 video clips, which reflect the diversity and complexity of real-world traffic scenarios in China.
The Large Scale Movie Description Challenge (LSMDC) - Context is an augmented version of the original LSMDC dataset with movie scripts as contextual text.