192 machine learning datasets
A large-scale synthetic dataset containing accurate ground-truth depth for various photo-realistic scenes.
The Dataset of Multimodal Semantic Egocentric Video (DoMSEV) contains 80 hours of multimodal (RGB-D, IMU, and GPS) first-person video data, annotated with recorder profile, frame scene, activities, interaction, and attention.
The FINO-Net dataset is a multimodal (RGB, depth, and audio) dataset containing 229 real-world manipulation recordings of 5 different manipulation types, recorded with a Baxter robot.
The Few-Shot Object Learning (FewSOL) dataset can be used for object recognition with a few images per object. It contains 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses, and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. The FewSOL dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences, and attribute recognition.
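For illustration, the sketch below shows how the nine views per object could be turned into N-way K-shot episodes for few-shot classification; the in-memory index, directory layout, and file names are hypothetical, not the dataset's official API.

```python
import random

# Hypothetical index: object_id -> list of 9 RGB-D view file paths (layout assumed).
object_views = {f"object_{i:03d}": [f"object_{i:03d}/view_{v}.png" for v in range(9)]
                for i in range(336)}

def sample_episode(index, n_way=5, k_shot=1, n_query=4, seed=None):
    """Sample an N-way K-shot episode: support and query view paths per class."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(index), n_way)
    episode = {}
    for cls in classes:
        views = rng.sample(index[cls], k_shot + n_query)
        episode[cls] = {"support": views[:k_shot], "query": views[k_shot:]}
    return episode

episode = sample_episode(object_views, n_way=5, k_shot=1, n_query=4, seed=0)
for cls, split in episode.items():
    print(cls, len(split["support"]), "support /", len(split["query"]), "query views")
```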
The MUAD dataset (Multiple Uncertainties for Autonomous Driving) consists of 10,413 realistic synthetic images with diverse adverse weather conditions (night, fog, rain, snow), out-of-distribution objects, and annotations for semantic segmentation, depth estimation, and object and instance detection. Predictive uncertainty estimation is essential for the safe deployment of Deep Neural Networks in real-world autonomous systems, and MUAD allows a better assessment of the impact of different sources of uncertainty on model performance.
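To make the uncertainty-evaluation use case concrete, here is a minimal sketch that scores per-pixel predictive entropy from softmax outputs and checks how well it separates out-of-distribution pixels with AUROC; the array shapes, the binary OOD mask, and the metric choice are illustrative assumptions, not MUAD's official protocol.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def predictive_entropy(probs):
    """Per-pixel entropy of softmax probabilities, shape (C, H, W) -> (H, W)."""
    return -np.sum(probs * np.log(np.clip(probs, 1e-12, 1.0)), axis=0)

# Toy example: 3-class softmax map and a binary OOD mask (1 = out-of-distribution pixel).
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 64, 64))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
ood_mask = rng.integers(0, 2, size=(64, 64))

entropy = predictive_entropy(probs)
auroc = roc_auc_score(ood_mask.ravel(), entropy.ravel())
print(f"OOD-detection AUROC from entropy: {auroc:.3f}")
```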
TRansPose is a large-scale multispectral dataset that combines stereo RGB-D images, thermal infrared (TIR) images, and object poses to promote transparent object research. The dataset includes 99 transparent objects, encompassing 43 household items, 27 recyclable trash items, and 29 pieces of chemical laboratory equipment, plus 12 non-transparent objects. It comprises a vast collection of 333,819 images and 4,000,056 annotations, providing instance-level segmentation masks, ground-truth poses, and completed depth information.
MMToM-QA is the first multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. MMToM-QA consists of 600 questions. Each question is paired with a clip of the full activity in a video (as RGB-D frames), as well as a text description of the scene and the actions taken by the person in that clip. All questions have two choices. The questions are categorized into seven types, evaluating belief inference and goal inference in rich and diverse situations. Each belief inference type has 100 questions, totaling 300 belief questions; each goal inference type has 75 questions, totaling 300 goal questions. The questions are paired with 134 videos of a person looking for daily objects in household environments.
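Because every question is two-choice and belongs to one of the seven types, evaluation reduces to per-type accuracy; a minimal sketch follows, where the record fields (`type`, `answer`, `prediction`) are hypothetical names rather than the benchmark's released schema.

```python
from collections import defaultdict

def per_type_accuracy(records):
    """Accuracy per question type for two-choice questions.
    Each record is a dict with hypothetical fields: 'type', 'answer', 'prediction'."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        correct[r["type"]] += int(r["prediction"] == r["answer"])
    return {t: correct[t] / total[t] for t in total}

records = [
    {"type": "belief_1", "answer": "A", "prediction": "A"},
    {"type": "belief_1", "answer": "B", "prediction": "A"},
    {"type": "goal_1", "answer": "B", "prediction": "B"},
]
print(per_type_accuracy(records))  # {'belief_1': 0.5, 'goal_1': 1.0}
```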
From PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects, Sec. 5.1 (Dataset):
Synthetic dataset. The synthetic 3D models we use for evaluation are from the PartNet-Mobility dataset [49, 27, 4], a large-scale dataset of articulated objects across 46 categories. We select instances from 10 categories to conduct our experiments. For each articulation state, we randomly sample 64-100 views covering the upper hemisphere of the object to simulate capturing in the real world (see the view-sampling sketch below). We then render RGB images and acquire camera parameters and object masks using Blender [6] to create our training data.
Real-world dataset. The real data we use for experiments is from the MultiScan dataset [25], which scans real-world indoor scenes with articulated objects in multiple states. We use the reconstructed mesh of an object in two states as ground truth for evaluation, and the real RGB frames as training data.
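As a sketch of the upper-hemisphere view sampling described above, the snippet below draws random camera positions at a fixed radius with uniform area coverage of the upper hemisphere; the radius and the exact sampling scheme are assumptions, not the paper's precise procedure.

```python
import numpy as np

def sample_upper_hemisphere_views(n_views, radius=2.0, seed=0):
    """Sample camera centers uniformly over the upper hemisphere around the origin."""
    rng = np.random.default_rng(seed)
    azimuth = rng.uniform(0.0, 2.0 * np.pi, n_views)
    # Uniform-area sampling: z = cos(polar angle) drawn uniformly in [0, 1].
    cos_theta = rng.uniform(0.0, 1.0, n_views)
    sin_theta = np.sqrt(1.0 - cos_theta ** 2)
    return radius * np.stack([sin_theta * np.cos(azimuth),
                              sin_theta * np.sin(azimuth),
                              cos_theta], axis=1)  # (n_views, 3), all z >= 0

views = sample_upper_hemisphere_views(n_views=80)  # e.g. a draw within the 64-100 range
print(views.shape, "min z:", views[:, 2].min())
```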
MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types of illumination (daytime, nighttime). Each image includes high-quality 2D pixel-level panoptic annotations and class-level and novel instance-level uncertainty annotations. Further, each adverse-condition image has a corresponding image of the same scene taken under clear-weather, daytime conditions. The annotation process for MUSES utilizes all available sensor data, allowing the annotators to also reliably label degraded image regions that are still discernible in other modalities. This results in better pixel coverage in the annotations and creates a more challenging evaluation setup.
The UAVA (UAV-Assistant) dataset is specifically designed to foster applications which consider UAVs and humans as cooperative agents. We employ a real-world 3D scanned dataset (Matterport3D, https://niessner.github.io/Matterport/), physically-based rendering, and a gamified simulator for realistic drone navigation trajectory collection to generate realistic multimodal data from both the user's exocentric view of the drone and the drone's egocentric view.
The Composable Activities dataset consists of 693 videos containing activities in 16 classes performed by 14 actors. Each activity is composed of 3 to 11 atomic actions. RGB-D data for each sequence is captured using a Microsoft Kinect sensor, and the estimated positions of relevant body joints are provided.
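To picture how a composed activity relates to its atomic actions and per-frame joint data, a minimal record structure is sketched below; the field names and layout are illustrative assumptions, not the dataset's released format.

```python
from dataclasses import dataclass, field

@dataclass
class AtomicAction:
    label: str          # hypothetical atomic-action label
    start_frame: int
    end_frame: int

@dataclass
class ActivitySequence:
    activity_class: str                                  # one of the 16 activity classes
    actor_id: int                                         # one of the 14 actors
    atomic_actions: list = field(default_factory=list)    # 3 to 11 per activity
    joints: dict = field(default_factory=dict)             # joint name -> (T, 3) positions

seq = ActivitySequence(activity_class="hypothetical_activity", actor_id=1)
seq.atomic_actions.append(AtomicAction("hypothetical_action", 0, 45))
print(seq.activity_class, len(seq.atomic_actions))
```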
UASOL is an RGB-D stereo dataset that contains 160,902 frames filmed at 33 different scenes, each with between 2k and 10k frames. The frames show different paths from the perspective of a pedestrian, including sidewalks, trails, roads, etc. The images were extracted from video recorded at 15 fps in HD2K resolution, with a size of 2280 × 1282 pixels. The dataset also provides a GPS geolocalization tag for each second of the sequences and reflects different climatological conditions. Up to 4 different people filmed the dataset at different moments of the day.
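Since frames are extracted at 15 fps while GPS tags are given per second, pairing a frame with its geolocation is a simple index computation; the sketch below assumes zero-based frame indices and a per-second list of tags, which is an assumption about the layout.

```python
FPS = 15  # frames per second used when extracting images from the videos

def gps_for_frame(frame_idx, gps_per_second):
    """Return the GPS tag recorded for the second in which this frame falls."""
    second = frame_idx // FPS
    return gps_per_second[min(second, len(gps_per_second) - 1)]

# Toy example: three seconds of hypothetical (lat, lon) tags.
gps_tags = [(38.385, -0.513), (38.386, -0.512), (38.387, -0.511)]
print(gps_for_frame(0, gps_tags))    # first second
print(gps_for_frame(44, gps_tags))   # 44 // 15 == 2 -> third second
```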
A synthetic depth estimation benchmark dataset rendered from a high-quality CAD indoor environment.
AnoVox is a large-scale benchmark for ANOmaly detection in autonomous driving. AnoVox incorporates multimodal sensor data and spatial VOXel ground truth, allowing methods to be compared independently of the sensor they use. AnoVox contains both content and temporal anomalies.
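To illustrate how voxel ground truth lets methods be compared independently of the sensor they use, here is a minimal sketch that scores a voxelized anomaly prediction against a ground-truth voxel grid with intersection-over-union; the grid shape and boolean label convention are assumptions, not AnoVox's official evaluation code.

```python
import numpy as np

def voxel_anomaly_iou(pred_voxels, gt_voxels):
    """IoU between predicted and ground-truth anomaly occupancy grids (boolean arrays)."""
    intersection = np.logical_and(pred_voxels, gt_voxels).sum()
    union = np.logical_or(pred_voxels, gt_voxels).sum()
    return intersection / union if union > 0 else 1.0

# Toy 16x16x4 voxel grid: any sensor's anomaly prediction is first voxelized, then compared.
rng = np.random.default_rng(0)
gt = rng.random((16, 16, 4)) > 0.95
pred = rng.random((16, 16, 4)) > 0.95
print(f"voxel anomaly IoU: {voxel_anomaly_iou(pred, gt):.3f}")
```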
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment to enhance surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for multimodal scene graph generation in the OR.
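A semantic scene graph of the kind annotated here can be viewed as a set of (subject, relation, object) triples over scene entities; the entity and relation names below are made-up placeholders, not MM-OR's label set.

```python
# Hypothetical scene-graph snippet as (subject, relation, object) triples.
scene_graph = [
    ("surgeon", "holding", "instrument"),
    ("instrument", "touching", "patient"),
    ("nurse", "close_to", "operating_table"),
]

# Simple query: which relations involve the surgeon as the subject?
related = [(rel, obj) for subj, rel, obj in scene_graph if subj == "surgeon"]
print(related)
```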
The ICL-NUIM dataset aims at benchmarking RGB-D, visual odometry, and SLAM algorithms. Two different scenes (a living room and an office room) are provided with ground truth. The living room scene has 3D surface ground truth together with depth maps and camera poses, so it is suited not only for benchmarking camera trajectories but also for reconstruction. The office room scene comes with trajectory data only and has no explicit 3D model.
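As a concrete example of trajectory benchmarking, the snippet below computes a root-mean-square absolute trajectory error between estimated and ground-truth camera positions after a least-squares rigid alignment; this is a generic metric sketch, not the benchmark's official evaluation tooling.

```python
import numpy as np

def ate_rmse(est, gt):
    """RMS absolute trajectory error after rigid alignment of (N, 3) position arrays."""
    est_c = est - est.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    # Least-squares rotation aligning estimated to ground-truth positions (SVD).
    u, _, vt = np.linalg.svd(est_c.T @ gt_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = (rot @ est_c.T).T + gt.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))

# Toy trajectories: ground truth and a rotated, offset estimate of the same path.
gt = np.cumsum(np.random.default_rng(0).normal(size=(100, 3)), axis=0)
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
est = gt @ rot_z.T + 0.01
print(f"ATE RMSE: {ate_rmse(est, gt):.4f} m")
```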
The Rendered Handpose Dataset contains 41,258 training and 2,728 testing samples. Each sample provides:
The Robo-VLN dataset is a continuous-control formulation of the VLN-CE dataset by Krantz et al., which is itself ported over from the Room-to-Room (R2R) dataset created by Anderson et al. The details of converting the discrete VLN dataset into a continuous-control formulation can be found in our paper.
This is the dataset for the CGF 2021 paper "DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks".
Pano3D is a new benchmark for depth estimation from spherical panoramas. Its goal is to drive progress for this task in a consistent and holistic manner. The Pano3D 360 depth estimation benchmark provides a standard Matterport3D train and test split, as well as a secondary GibsonV2 partitioning for both training and testing. The latter is used to assess zero-shot cross-dataset transfer performance and is decomposed into 3 different splits, each one focusing on a specific generalization axis.
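For reference, here is a minimal sketch of standard monocular depth metrics (absolute relative error, RMSE, and δ < 1.25 accuracy) that such a benchmark typically reports; it assumes dense prediction and ground-truth depth maps with a validity mask and is not Pano3D's official evaluation code.

```python
import numpy as np

def depth_metrics(pred, gt, mask=None):
    """Common monocular depth metrics computed over valid pixels."""
    if mask is None:
        mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}

# Toy spherical depth map and a prediction with 5% multiplicative noise.
rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 10.0, size=(256, 512))
pred = gt * rng.normal(1.0, 0.05, size=gt.shape)
print({k: round(float(v), 4) for k, v in depth_metrics(pred, gt).items()})
```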