Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

192 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

192 dataset results

UAV-Human

UAV-Human is a large dataset for human behavior understanding with UAVs. It contains 67,428 multi-modal video sequences and 119 subjects for action recognition, 22,476 frames for pose estimation, 41,290 frames and 1,144 identities for person re-identification, and 22,263 frames for attribute recognition. The dataset was collected by a flying UAV in multiple urban and rural districts, in both daytime and nighttime, over three months, and therefore covers extensive diversity in subjects, backgrounds, illumination, weather, occlusion, camera motion, and UAV flying attitudes. It can be used for UAV-based human behavior understanding tasks, including action recognition, pose estimation, re-identification, and attribute recognition.

47 papers · 38 benchmarks · RGB Video, RGB-D

ScanNet200

The ScanNet200 benchmark studies 200-class 3D semantic segmentation, an order of magnitude more class categories than previous 3D scene understanding benchmarks. The scene data is identical to ScanNet's, but it is parsed into a larger vocabulary for semantic and instance segmentation.

45 papers · 18 benchmarks · 3D, 3D meshes, Images, RGB-D

How2Sign (A Large-scale Multimodal Dataset for Continuous American Sign Language)

How2Sign is a multimodal and multiview continuous American Sign Language (ASL) dataset consisting of a parallel corpus of more than 80 hours of sign language videos and a set of corresponding modalities, including speech, English transcripts, and depth. A three-hour subset was additionally recorded in the Panoptic studio, enabling detailed 3D pose estimation.

44 papers · 5 benchmarks · 3D, RGB Video, RGB-D, Texts

12 Scenes

A dataset containing RGB-D data of 4 large scenes, comprising a total of 12 rooms, for the purpose of RGB and RGB-D camera relocalization. The RGB-D data was captured using a Structure.io depth sensor coupled with an iPad color camera. Each room was scanned multiple times, and the multiple sequences were run through a global bundle adjustment to obtain globally aligned camera poses through all sequences of the same scene (a minimal back-projection sketch follows this entry).

41 papers · 0 benchmarks · 3D meshes, RGB-D
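For orientation, here is a minimal NumPy sketch of how a globally aligned camera-to-world pose can be combined with pinhole intrinsics to lift a depth pixel into shared scene coordinates. The intrinsics values and the identity pose below are placeholders, not the dataset's actual calibration.

```python
import numpy as np

def backproject_to_world(u, v, depth_m, K, T_world_cam):
    """Lift pixel (u, v) with metric depth into world coordinates.

    K           : 3x3 pinhole intrinsics (placeholder values below).
    T_world_cam : 4x4 camera-to-world pose, e.g. from a globally
                  bundle-adjusted trajectory of a 12 Scenes sequence.
    """
    # Back-project into the camera frame using the pinhole model.
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x_cam = np.array([(u - cx) * depth_m / fx,
                      (v - cy) * depth_m / fy,
                      depth_m,
                      1.0])
    # A globally aligned pose maps every sequence of the same scene
    # into one shared world frame.
    return (T_world_cam @ x_cam)[:3]

# Example with made-up intrinsics and an identity pose.
K = np.array([[525.0, 0.0, 320.0],
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])
print(backproject_to_world(400, 300, 1.2, K, np.eye(4)))
```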

HO-3D v2

A hand-object interaction dataset with 3D pose annotations of the hand and object. The dataset contains 66,034 training images and 11,524 test images from a total of 68 sequences. The sequences are captured in multi-camera and single-camera setups and contain 10 different subjects manipulating 10 different objects from the YCB dataset. The annotations are obtained automatically using an optimization algorithm. The hand pose annotations for the test set are withheld; the accuracy of algorithms on the test set can be evaluated with standard metrics via the CodaLab challenge submission (see the project page, and the metric sketch after this entry). The object pose annotations for both the test and train sets are provided along with the dataset.

36 papers · 71 benchmarks · Images, RGB-D
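Since the test annotations are withheld, local sanity checks typically run a simple joint-error metric on a held-out slice of the training split. Below is a hedged sketch of mean per-joint position error; the 21-joint layout, meter units, and millimeter reporting are assumptions for illustration, not the official CodaLab protocol.

```python
import numpy as np

def mean_joint_error_mm(pred, gt):
    """Mean Euclidean error over hand joints, in millimeters.

    pred, gt : (N, 21, 3) arrays of 3D joint positions in meters.
    The 21-joint layout and this exact metric are assumptions here;
    the official HO-3D evaluation runs on the CodaLab server.
    """
    err = np.linalg.norm(pred - gt, axis=-1)  # (N, 21) per-joint errors
    return 1000.0 * err.mean()

# Toy example: predictions 5 mm off along x for every joint.
gt = np.zeros((4, 21, 3))
pred = gt + np.array([0.005, 0.0, 0.0])
print(mean_joint_error_mm(pred, gt))  # -> 5.0
```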

WMCA (Wide Multi Channel Presentation Attack)

The Wide Multi Channel Presentation Attack (WMCA) database consists of 1,941 short video recordings of both bonafide and presentation attacks from 72 different identities. The data is recorded from several channels, including color, depth, infrared, and thermal.

33 papers · 2 benchmarks · Images, RGB-D, Videos

MatrixCity

MatrixCity is a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering research. Leveraging the Unreal Engine 5 City Sample project, the authors developed a pipeline to easily collect aerial and street city views with ground-truth camera poses, as well as a series of additional data modalities. The pipeline also offers flexible control over environmental factors such as lighting, weather, and human and car crowds, supporting various tasks covering city-scale neural rendering and beyond. The resulting pilot dataset contains 67k aerial images and 452k street images from two city maps with a total size of 28 km².

33 papers · 0 benchmarks · Images, RGB-D

InteriorNet

InteriorNet is an RGB-D dataset for large-scale interior scene understanding and mapping. The dataset contains 20M images created by an automated rendering pipeline.

30 papers · 0 benchmarks · 3D, Images, RGB-D

OCID (Object Clutter Indoor Dataset)

Developing robot perception systems for handling objects in the real world requires computer vision algorithms to be carefully scrutinized with respect to the expected operating domain. This demands large quantities of ground-truth data to rigorously evaluate the performance of algorithms.

29 papers · 1 benchmark · RGB-D

AVD (Active Vision Dataset)

AVD focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely captured in 9 unique scenes.

29 papers · 4 benchmarks · Images, RGB-D

GraspNet-1Billion

GraspNet-1Billion provides large-scale training data and a standard evaluation platform for the task of general robotic grasping. The dataset contains 97,280 RGB-D images with over one billion grasp poses (a sketch of a commonly used grasp-record layout follows this entry).

29 papers · 4 benchmarks · RGB-D
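For illustration, here is a hedged sketch of decoding one grasp record under the 17-float layout commonly associated with GraspNet-1Billion grasp arrays (score, width, height, depth, a flattened 3x3 rotation, a translation, and an object id). Treat the field ordering as an assumption and check the official graspnetAPI documentation before relying on it.

```python
import numpy as np

def decode_grasp(record):
    """Decode one 17-float grasp record into named fields.

    The ordering assumed here (score, width, height, depth,
    3x3 rotation flattened, translation, object id) is an
    assumption for illustration; verify against graspnetAPI.
    """
    return {
        "score": record[0],
        "width": record[1],
        "height": record[2],
        "depth": record[3],
        "rotation": record[4:13].reshape(3, 3),  # gripper orientation
        "translation": record[13:16],            # grasp center, meters
        "object_id": int(record[16]),
    }

# Toy record: perfect-score grasp, identity orientation, at the origin.
rec = np.concatenate([[1.0, 0.08, 0.02, 0.03],
                      np.eye(3).ravel(), [0.0, 0.0, 0.0], [7]])
print(decode_grasp(rec)["rotation"].shape)  # (3, 3)
```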

SBU / SBU-Refine (SBU-Kinect-Interaction dataset v2.0)

The SBU-Kinect-Interaction dataset version 2.0 comprises RGB-D video sequences of humans performing interaction activities, recorded using the Microsoft Kinect sensor. The dataset was originally recorded for a class project and must be used only for research purposes. If you use this dataset in your work, please cite: Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, and Dimitris Samaras, in The 2nd International Workshop on Human Activity Understanding from 3D Data (HAU3D-CVPRW), CVPR 2012. SBU-Refine relabels the test set manually and algorithmically refines the noisy labels in the training set: H. Yang, T. Wang, X. Hu, and C.-W. Fu, "SILT: Shadow-aware iterative label tuning for learning to detect shadows from noisy labels," in ICCV, 2023, pp. 12687–12698.

28 papers · 16 benchmarks · Actions, RGB-D

THuman2.0 Dataset

THuman2.0 contains 500 high-quality human scans captured by a dense DSLR rig. For each scan, the 3D model (.obj) and the corresponding texture map (.jpeg) are provided.

28 papers · 4 benchmarks · 3D, Images, RGB-D

Drive&Act

The Drive&Act dataset is a state-of-the-art multimodal benchmark for driver behavior recognition. The dataset includes 3D skeletons in addition to frame-wise hierarchical labels for 9.6 million frames captured from 6 different views and 3 modalities (RGB, IR, and depth).

26 papers · 8 benchmarks · 3D, RGB-D, Videos

VOID (Visual Odometry with Inertial and Depth)

The dataset was collected using the Intel RealSense D435i camera, configured to produce synchronized accelerometer and gyroscope measurements at 400 Hz along with synchronized VGA-size (640 x 480) RGB and depth streams at 30 Hz. The depth frames are acquired using active stereo and are aligned to the RGB frames using the sensor's factory calibration. All measurements are timestamped (a minimal timestamp-association sketch follows this entry).

26 papers · 4 benchmarks · Images, Point cloud, RGB Video, RGB-D
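Because the 400 Hz inertial stream and the 30 Hz image streams are only timestamp-synchronized, a common first step is associating each frame with its temporally nearest IMU sample. A minimal sketch, assuming sorted 1-D timestamp arrays in seconds (the array-based layout is an assumption, not VOID's on-disk format):

```python
import numpy as np

def nearest_imu_indices(frame_ts, imu_ts):
    """For each 30 Hz frame timestamp, index the closest 400 Hz IMU sample.

    frame_ts, imu_ts : 1-D arrays of timestamps in seconds, sorted
    ascending. Both streams are timestamped in VOID, so nearest-neighbor
    association is a reasonable first pass.
    """
    idx = np.searchsorted(imu_ts, frame_ts)          # insertion points
    idx = np.clip(idx, 1, len(imu_ts) - 1)
    left, right = imu_ts[idx - 1], imu_ts[idx]
    # Pick whichever neighbor is temporally closer to the frame.
    return np.where(frame_ts - left < right - frame_ts, idx - 1, idx)

# Toy streams: 30 Hz frames against 400 Hz IMU.
frames = np.arange(0.0, 1.0, 1 / 30)
imu = np.arange(0.0, 1.0, 1 / 400)
print(nearest_imu_indices(frames, imu)[:5])
```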

REALY (Region-aware benchmark based on the LYHM)

The REALY benchmark introduces a region-aware evaluation pipeline to measure the fine-grained normalized mean square error (NMSE) of 3D face reconstruction methods on under-controlled image sets (a minimal per-region NMSE sketch follows this entry).

24 papers · 25 benchmarks · 3D, 3D meshes, Images, RGB-D
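As a rough illustration of a region-aware error, the sketch below computes a normalized mean square error over corresponding 3D points of one face region. The choice of normalizer and the assumption of precomputed point correspondences are placeholders, not REALY's exact pipeline.

```python
import numpy as np

def region_nmse(pred_pts, gt_pts, norm_scale):
    """Normalized mean square error for one face region.

    pred_pts, gt_pts : (M, 3) corresponding 3D points of a region
                       (e.g., nose, mouth, forehead, cheek).
    norm_scale       : a per-subject normalizer in the same units;
                       which scale REALY uses exactly is not stated
                       here, so treat this argument as an assumption.
    """
    sq_err = np.sum((pred_pts - gt_pts) ** 2, axis=-1)  # per-point SE
    return sq_err.mean() / (norm_scale ** 2)

# Toy example: every point off by 1 mm, normalized by a 100 mm scale.
gt = np.zeros((50, 3))
pred = gt + np.array([0.001, 0.0, 0.0])
print(region_nmse(pred, gt, 0.1))  # -> 1e-4
```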

ReDWeb (Relative Depth from Web)

The ReDWeb dataset consists of 3,600 RGB and relative-depth (RD) image pairs collected from the Web. The dataset covers a wide range of scenes and features various non-rigid objects.

22 papers · 0 benchmarks · RGB-D

SCAND (Socially CompliAnt Navigation Dataset)

Have you wondered how autonomous mobile robots should share space with humans in public spaces? Are you interested in developing autonomous mobile robots that can navigate within human crowds in a socially compliant manner? Do you want to analyze human reactions and behaviors in the presence of mobile robots of different morphologies?

18 papers · 0 benchmarks · Actions, LiDAR, Point cloud, RGB Video, RGB-D, Videos

CDTB (Color-and-Depth Tracking)

The CDTB (color-and-depth visual object tracking) dataset is recorded by several passive and active RGB-D setups and contains indoor as well as outdoor sequences acquired in direct sunlight. The sequences were recorded to contain significant object pose change, clutter, occlusion, and periods of long-term target absence, enabling tracker evaluation under realistic conditions. Sequences are annotated per frame with 13 visual attributes for detailed analysis. The dataset contains around 100,000 samples. Source: https://www.vicos.si/Projects/CDTB

17 papers · 0 benchmarks · Images, RGB-D

ViViD++ (Vision for Visibility Dataset)

A dataset capturing diverse visual data formats targeting varying luminance conditions, recorded from alternative vision sensors, handheld or mounted on a car, repeatedly in the same space but under different conditions.

17 papers · 0 benchmarks · 3D, Images, LiDAR, RGB Video, RGB-D
Page 2 of 10