The Breakfast Actions Dataset comprises 10 actions related to breakfast preparation, performed by 52 different individuals in 18 different kitchens. It is one of the largest fully annotated datasets available: the actions were recorded "in the wild", as opposed to a single controlled lab environment, and amount to over 77 hours of video recordings.
NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities, e.g., "Why is the boy crying?" and "How does the lady react after the boy falls backward?". It supports both multiple-choice and generative open-ended QA tasks. The videos are untrimmed, and the questions usually invoke local video contents for answers.
BEAT has (i) 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages, and (ii) 32 million frame-level emotion and semantic relevance annotations. Our statistical analysis on BEAT demonstrates the correlation of conversational gestures with facial expressions, emotions, and semantics, in addition to the known correlation with audio, text, and speaker identity. Based on this observation, we propose a baseline model, Cascaded Motion Network (CaMN), which models the above six modalities in a cascaded architecture for gesture synthesis. To evaluate semantic relevancy, we introduce a metric, Semantic Relevance Gesture Recall (SRGR). Qualitative and quantitative experiments demonstrate the validity of the metrics, the quality of the ground-truth data, and the baseline's state-of-the-art performance. To the best of our knowledge, BEAT is the largest motion capture dataset for investigating human gestures.
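A minimal sketch of a semantic-relevance-weighted recall in the spirit of SRGR; the paper's exact formulation may differ, and the array shapes, names, and distance threshold below are illustrative assumptions:

```python
import numpy as np

def semantic_weighted_recall(pred, gt, weights, thresh=0.05):
    """Illustrative SRGR-style metric: recall of gesture joints, weighted
    per frame by the annotated semantic relevance of that frame.

    pred, gt: (frames, joints, 3) predicted and ground-truth joint positions
    weights:  (frames,) semantic-relevance annotations
    thresh:   distance below which a joint counts as recalled (assumed value)
    """
    dist = np.linalg.norm(pred - gt, axis=-1)             # (frames, joints)
    hit = (dist < thresh).mean(axis=-1)                   # per-frame recall
    return float((weights * hit).sum() / weights.sum())   # relevance-weighted mean
```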
The KIT Motion-Language Dataset links human motion and natural language.
The SBU-Kinect-Interaction dataset (version 2.0) comprises RGB-D video sequences of humans performing interaction activities, recorded using the Microsoft Kinect sensor. The dataset was originally recorded for a class project and must be used only for research purposes. If you use it in your work, please cite: Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, and Dimitris Samaras, "Two-person Interaction Detection Using Body-Pose Features and Multiple Instance Learning," The 2nd International Workshop on Human Activity Understanding from 3D Data (HAU3D-CVPRW), CVPR 2012.
SBU-Refine relabels the test set manually and refines the noisy labels in the training set algorithmically: H. Yang, T. Wang, X. Hu, and C.-W. Fu, "SILT: Shadow-Aware Iterative Label Tuning for Learning to Detect Shadows from Noisy Labels," in ICCV, 2023, pp. 12687-12698.
Have you wondered how autonomous mobile robots should share space with humans in public spaces? Are you interested in developing autonomous mobile robots that can navigate within human crowds in a socially compliant manner? Do you want to analyze human reactions and behaviors in the presence of mobile robots of different morphologies?
V-D4RL provides pixel-based analogues of the popular D4RL benchmarking tasks, derived from the dm_control suite, along with natural extensions of two state-of-the-art online pixel-based continuous control algorithms, DrQ-v2 and DreamerV2, to the offline setting.
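A minimal loading sketch for one offline episode; the file path and array keys below ("observation", "action", "reward") are assumptions for illustration, not V-D4RL's documented schema:

```python
import numpy as np

# Hypothetical episode file; the benchmark's actual layout and keys may differ.
episode = np.load("vd4rl/walker_walk/medium/episode_0.npz")

obs = episode["observation"]   # e.g. (T, 84, 84, 3) uint8 pixel frames
act = episode["action"]        # (T, action_dim) continuous controls
rew = episode["reward"]        # (T,) scalar rewards

# Offline transitions for a pixel-based agent such as DrQ-v2 or DreamerV2:
transitions = list(zip(obs[:-1], act[1:], rew[1:], obs[1:]))
print(f"{len(transitions)} transitions loaded")
```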
100 tasks from the LIBERO-100 suite. Note that the datasets are split under the folder names LIBERO-90 and LIBERO-10; LIBERO-10 contains selected tasks that require long-horizon task completion.
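A small sketch of walking that split on disk, assuming the common LIBERO distribution of one HDF5 file per task with demos under a "data" group; the root path is a placeholder:

```python
import glob
import os
import h5py

root = "libero_datasets"  # placeholder path to the downloaded suite
for split in ("LIBERO-90", "LIBERO-10"):
    files = sorted(glob.glob(os.path.join(root, split, "*.hdf5")))
    print(f"{split}: {len(files)} task files")
    for path in files[:1]:                  # peek at one file per split
        with h5py.File(path, "r") as f:
            demos = list(f["data"].keys())  # assumed robomimic-style layout
            print(os.path.basename(path), f"holds {len(demos)} demos")
```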
UI-PRMD is a dataset of movements related to common exercises performed by patients in physical therapy and rehabilitation programs. It consists of 10 rehabilitation exercises: a sample of 10 healthy individuals repeated each exercise 10 times in front of two motion-capture systems, a Vicon optical tracker and a Kinect camera. The data are presented as positions and angles of the body joints in the skeletal models provided by the Vicon and Kinect mocap systems.
Atari-HEAD is a dataset of human actions and eye movements recorded while playing Atari video games. For every game frame, the corresponding image frame, the human keystroke action, the reaction time to make that action, the gaze positions, and the immediate reward returned by the environment were recorded. Gaze data was captured with an EyeLink 1000 eye tracker at 1000 Hz. The human subjects are amateur players who are familiar with the games; they were only allowed to play for 15 minutes at a time and were required to rest for at least 15 minutes before the next trial. Data was collected from 4 subjects, 16 games, and 175 15-minute trials, for a total of 2.97 million frames/demonstrations.
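Each frame thus maps to one multimodal record. Below is a hedged sketch of such a record and a simple aggregate over it; the field names mirror the description above and are not the released file format:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class FrameRecord:
    """One Atari-HEAD sample per game frame (field names are illustrative)."""
    image_path: str          # the recorded game frame
    action: int              # human keystroke action
    reaction_time_ms: float  # time taken to issue that action
    gaze_xy: list            # 1000 Hz gaze samples, as (x, y) tuples
    reward: float            # immediate reward from the environment

records = [
    FrameRecord("frames/000001.png", 3, 212.0, [(41.2, 80.5)], 0.0),
    FrameRecord("frames/000002.png", 3, 198.5, [(42.0, 79.9)], 1.0),
]
print("mean reaction time:", mean(r.reaction_time_ms for r in records), "ms")
```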
WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.
The Sims4Action Dataset: a videogame-based dataset for Synthetic→Real domain adaptation in human activity recognition.
The dataset builds on VGG-Sound, which consists of 10-second clips collected from YouTube for 309 sound classes. A subset of 'temporally sparse' classes is selected using the following procedure: 5–15 videos are randomly picked from each of the 309 VGG-Sound classes and manually annotated as to whether audio-visual cues are only sparsely available. As a result, 12 classes (~4%) are selected, with 6.5k and 0.6k videos in the train and test sets, respectively. The classes include 'dog barking', 'chopping wood', 'lion roaring', 'skateboarding', etc.
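A small sketch of that selection procedure, under the assumption that a class is kept when the majority of its sampled clips are annotated as temporally sparse (the exact decision rule is not stated above):

```python
import random

def select_sparse_classes(class_to_videos, annotate, min_n=5, max_n=15, keep_if=0.5):
    """annotate(video) -> True if audio-visual cues are only sparsely present.
    A class is kept when the annotated fraction exceeds `keep_if` (assumed rule).
    """
    selected = []
    for cls, videos in class_to_videos.items():
        n = min(len(videos), random.randint(min_n, max_n))
        sample = random.sample(videos, n)          # 5-15 random clips per class
        frac_sparse = sum(annotate(v) for v in sample) / n
        if frac_sparse > keep_if:
            selected.append(cls)
    return selected  # for VGG-Sound this yielded 12 of 309 classes (~4%)
```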
BRACE is a dataset for audio-conditioned dance motion synthesis that challenges common assumptions for this task: strong music-dance correlation, controlled motion data, and relatively simple poses and movements.
The Dataset of Multimodal Semantic Egocentric Video (DoMSEV) contains 80 hours of multimodal (RGB-D, IMU, and GPS) data from first-person videos, with annotations for recorder profile, frame scene, activities, interaction, and attention.
The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor data collected includes:
Accurate 3D human pose estimation is essential for sports analytics, coaching, and injury prevention. However, existing datasets for monocular pose estimation do not adequately capture the challenging and dynamic nature of sports movements. In response, we introduce SportsPose, a large-scale 3D human pose dataset consisting of highly dynamic sports movements. With more than 176,000 3D poses from 24 different subjects performing 5 different sports activities, SportsPose provides a diverse and comprehensive set of 3D poses that reflect the complex and dynamic nature of sports movements. Contrary to other markerless datasets, we have quantitatively evaluated the precision of SportsPose by comparing our poses with a commercial marker-based system, achieving a mean error of 34.5 mm across all evaluation sequences. This is comparable to the error reported on the commonly used 3DPW dataset. We further introduce a new metric, local movement, which describes the movement of the wrist and ankle joints in relation to the body.
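The 34.5 mm figure reads as a mean per-joint Euclidean error against the marker-based reference; a worked sketch of that computation, with array shapes assumed:

```python
import numpy as np

def mean_joint_error_mm(pred, ref):
    """Mean Euclidean distance between predicted and reference joints.

    pred, ref: (frames, joints, 3) arrays of joint positions in millimetres.
    """
    return float(np.linalg.norm(pred - ref, axis=-1).mean())

# Toy check: a constant 34.5 mm offset along one axis reproduces the reported error.
ref = np.zeros((100, 17, 3))
pred = ref + np.array([34.5, 0.0, 0.0])
print(mean_joint_error_mm(pred, ref))  # -> 34.5
```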
HowTo100M Adverbs is a subset of HowTo100M with adverbs mined from 83 HowTo100M tasks. The annotations were obtained from automatically transcribed narrations of instructional videos. The dataset originally contains 5,824 clips annotated with action-adverb pairs covering 72 verbs and 6 adverbs. Source: "How Do You Do It? Fine-Grained Action Understanding with Pseudo-Adverbs".
The OREBA dataset aims to provide a comprehensive multi-sensor recording of communal intake occasions for researchers interested in the automatic detection of intake gestures. Two scenarios are included, with 100 participants for a discrete dish and 102 participants for a shared dish, totalling 9,069 intake gestures. The available sensor data consists of synchronized frontal video and IMU data (accelerometer and gyroscope) for both hands.
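"Synchronized" here means the video frames and IMU samples share a common timeline; a minimal nearest-timestamp alignment sketch, with the sampling rates assumed rather than taken from the dataset documentation:

```python
import numpy as np

def align_imu_to_frames(frame_ts, imu_ts, imu_data):
    """Pick the IMU sample nearest in time to each video frame.

    frame_ts: (F,) frame timestamps in seconds
    imu_ts:   (N,) sorted IMU timestamps in seconds
    imu_data: (N, 6) accelerometer + gyroscope channels
    """
    idx = np.searchsorted(imu_ts, frame_ts)
    idx = np.clip(idx, 1, len(imu_ts) - 1)
    left_closer = (frame_ts - imu_ts[idx - 1]) < (imu_ts[idx] - frame_ts)
    idx = np.where(left_closer, idx - 1, idx)
    return imu_data[idx]  # (F, 6): one IMU reading per video frame

frame_ts = np.arange(0, 2, 1 / 30)   # assumed 30 fps video
imu_ts = np.arange(0, 2, 1 / 64)     # assumed 64 Hz IMU
imu = np.random.randn(len(imu_ts), 6)
print(align_imu_to_frames(frame_ts, imu_ts, imu).shape)  # (60, 6)
```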