192 machine learning datasets
192 dataset results
Washington RGB-D is a widely used testbed in the robotic community, consisting of 41,877 RGB-D images organized into 300 instances divided in 51 classes of common indoor objects (e.g. scissors, cereal box, keyboard etc). Each object instance was positioned on a turntable and captured from three different viewpoints while rotating.
First-Person Hand Action Benchmark is a collection of RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations.
The EgoDexter dataset provides both 2D and 3D pose annotations for 4 testing video sequences with 3190 frames. The videos are recorded with body-mounted camera from egocentric viewpoints and contain cluttered backgrounds, fast camera motion, and complex interactions with various objects. Fingertip positions were manually annotated for 1485 out of 3190 frames.
This dataset accompanies our paper on synthesizing the 3D Ken Burns effect from a single image. It consists of 134041 captures from 32 virtual environments where each capture consists of 4 views. Each view contains color-, depth-, and normal-maps at a resolution of 512x512 pixels.
The Hands in action dataset (HIC) dataset has RGB-D sequences of hands interacting with objects.
The HRWSI dataset consists of about 21K diverse high-resolution RGB-D image pairs derived from the Web stereo images. Also, it provides sky segmentation masks, instance segmentation masks as well as invalid pixel masks.
CIRCLE is a dataset containing 10 hours of full-body reaching motion from 5 subjects across nine scenes, paired with ego-centric information of the environment represented in various forms, such as RGBD videos.
Dataset Description The Greek Sign Language (GSL) is a large-scale RGB+D dataset, suitable for Sign Language Recognition (SLR) and Sign Language Translation (SLT). The video captures are conducted using an Intel RealSense D435 RGB+D camera at a rate of 30 fps. Both the RGB and the depth streams are acquired in the same spatial resolution of 848×480 pixels. To increase variability in the videos, the camera position and orientation is slightly altered within subsequent recordings. Seven different signers are employed to perform 5 individual and commonly met scenarios in different public services. The average length of each scenario is twenty sentences.
RGB-D-D is a large-scale dataset for depth map super-resolution (SR). It consists of real-world paired low-resolution (LR) and high-resolution (HR) depth maps. The paired LR and HR depth maps are captured from mobile phone and Lucid Helios respectively ranging from indoor scenes to challenging outdoor scenes.
4D-OR includes a total of 6734 scenes, recorded by six calibrated RGB-D Kinect sensors 1 mounted to the ceiling of the OR, with one frame-per-second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
Dynamic Replica is a synthetic dataset of stereo videos featuring humans and animals in virtual environments. It is a benchmark for dynamic disparity/depth estimation and 3D reconstruction consisting of 145,200 stereo frames (524 videos).
The High-Quality Wide Multi-Channel Attack database (HQ-WMCA) database consists of 2904 short multi-modal video recordings of both bona-fide and presentation attacks. There are 555 bonafide presentations from 51 participants and the remaining 2349 are presentation attacks. The data is recorded from several channels including color, depth, thermal, infrared (spectra), and short-wave infrared (spectra).
Bonn RGB-D Dynamic is a dataset for RGB-D SLAM, containing highly dynamic sequences. We provide 24 dynamic sequences, where people perform different tasks, such as manipulating boxes or playing with balloons, plus 2 static sequences. For each scene we provide the ground truth pose of the sensor, recorded with an Optitrack Prime 13 motion capture system. The sequences are in the same format as the TUM RGB-D Dataset, so that the same evaluation tools can be used. Furthermore, we provide a ground truth 3D point cloud of the static environment recorded using a Leica BLK360 terrestrial laser scanner.
DeepLoc is a large-scale urban outdoor localization dataset. The dataset is currently comprised of one scene spanning an area of 110 x 130 m, that a robot traverses multiple times with different driving patterns. The dataset creators use a LiDAR-based SLAM system with sub-centimeter and sub-degree accuracy to compute the pose labels that provided as groundtruth. Poses in the dataset are approximately spaced by 0.5 m which is twice as dense as other relocalization datasets.
X-Humans consists of 20 subjects (11 males, 9 females) with various clothing types and hair style. The collection of this dataset has been approved by an internal ethics committee. For each subject, we split the motion sequences into a training and test set. In total, there are 29,036 poses for training and 6,439 test poses. X-Humans also contains ground-truth SMPL-X parameters, obtained via a custom SMPL-X registration pipeline specifically designed to deal with low-resolution body parts.
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline s
The 7-Scenes dataset is a collection of tracked RGB-D camera frames. The dataset may be used for evaluation of methods for different applications such as dense tracking and mapping and relocalization techniques. All scenes were recorded from a handheld Kinect RGB-D camera at 640×480 resolution. The dataset creators use an implementation of the KinectFusion system to obtain the ‘ground truth’ camera tracks, and a dense 3D model. Several sequences were recorded per scene by different users, and split into distinct training and testing sequence sets.
The SynthHands dataset is a dataset for hand pose estimation which consists of real captured hand motion retargeted to a virtual hand with natural backgrounds and interactions with different objects. The dataset contains data for male and female hands, both with and without interaction with objects. While the hand and foreground object are synthtically generated using Unity, the motion was obtained from real performances as described in the accompanying paper. In addition, real object textures and background images (depth and color) were used. Ground truth 3D positions are provided for 21 keypoints of the hand.
The HandNet dataset contains depth images of 10 participants' hands non-rigidly deforming in front of a RealSense RGB-D camera. The annotations are generated by a magnetic annotation technique. 6D pose is available for the center of the hand as well as the five fingertips (i.e. position and orientation of each).
This is a 3D action recognition dataset, also known as 3D Action Pairs dataset. The actions in this dataset are selected in pairs such that the two actions of each pair are similar in motion (have similar trajectories) and shape (have similar objects); however, the motion-shape relation is different.