192 machine learning datasets
Includes several sets of synthetic stereo images labelled with grasp rectangles representing parallel-jaw grasps (Cornell-like format).
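In the Cornell convention, each grasp rectangle is stored as four consecutive "x y" vertex lines; a minimal parser under that assumption (the exact file layout of this dataset may differ) could look like:

```python
from pathlib import Path

def load_grasp_rectangles(path):
    """Parse a Cornell-style annotation file: four 'x y' vertex lines
    per parallel-jaw grasp rectangle (file layout is an assumption)."""
    points = []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line:
            x, y = line.split()
            points.append((float(x), float(y)))
    # Group every four consecutive vertices into one rectangle.
    return [points[i:i + 4] for i in range(0, len(points) - 3, 4)]
```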
PLAD is a dataset in which sparse depth is provided by line-based visual SLAM; it is used to verify StructMDC.
This dataset includes multi-spectral acquisitions of vegetation for the design of new DeepIndices. The images were acquired with the Airphen (Hyphen, Avignon, France) six-band multi-spectral camera configured with the 450/570/675/710/730/850 nm bands at 10 nm FWHM. The dataset was acquired at the INRAe site in Montoldre (Allier, France, at 46°20'30.3"N 3°26'03.6"E) within the framework of the “RoSE challenge” funded by the French National Research Agency (ANR), and in Dijon (Burgundy, France, at 47°18'32.5"N 5°04'01.8"E) on the AgroSup Dijon site. Images of bean and corn, containing various natural weeds (yarrow, amaranth, geranium, plantago, etc.) and sown ones (mustard, goosefoot, mayweed, and ryegrass), were acquired in top-down view at 1.8 meters from the ground under very distinct illumination conditions (shadow, morning, evening, full sun, cloudy, rain, ...). (2020-05-01)
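As a point of reference for what a learned DeepIndex generalizes, the classic NDVI can be computed from the 675 nm (red) and 850 nm (near-infrared) bands; a minimal sketch, assuming the two bands are already loaded as aligned arrays:

```python
import numpy as np

def ndvi(red_675, nir_850, eps=1e-6):
    """Normalized Difference Vegetation Index from the 675 nm and
    850 nm bands; eps guards against division by zero."""
    red = red_675.astype(np.float32)
    nir = nir_850.astype(np.float32)
    return (nir - red) / (nir + red + eps)
```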
In this dataset, two robots, Baxter and UR5, perform 8 behaviors (look, grasp, pick, hold, shake, lower, drop, and push) on 95 objects that vary by 5 colors (blue, green, red, white, and yellow), 6 contents (wooden buttons, plastic dice, glass marbles, nuts & bolts, pasta, and rice), and 4 weights (empty, 50 g, 100 g, and 150 g). There are 90 objects with contents (5 colors x 3 weights x 6 contents) and 5 objects without any content that vary only by color. Both robots perform 5 trials on each object, resulting in 7,600 interactions (2 robots x 8 behaviors x 95 objects x 5 trials).
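The interaction count follows directly from the full factorial design; a quick sanity check in Python:

```python
from itertools import product

robots = ["Baxter", "UR5"]
behaviors = ["look", "grasp", "pick", "hold", "shake", "lower", "drop", "push"]
objects = range(95)
trials = range(5)

# Every (robot, behavior, object, trial) combination is one interaction.
interactions = list(product(robots, behaviors, objects, trials))
assert len(interactions) == 7600  # 2 x 8 x 95 x 5
```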
HOWS-CL-25 (Household Objects Within Simulation dataset for Continual Learning) is a synthetic dataset designed specifically for object classification on mobile robots operating in a changing environment (such as a household), where it is important to learn new, never-seen objects on the fly. The dataset can also be used for other learning use cases, such as instance segmentation or depth estimation, or wherever household objects or continual learning are of interest.
One-Shot Affordance Part Segmentation variant of the UMD dataset. Each object instance in the dataset contains a single image.
The Robot Tracking Benchmark (RTB) is a synthetic dataset that facilitates the quantitative evaluation of 3D tracking algorithms for multi-body objects. It was created using the procedural rendering pipeline BlenderProc. The dataset contains photo-realistic sequences with HDRi lighting and physically-based materials. Perfect ground-truth annotations for camera and robot trajectories are provided in the BOP format. Many physical effects, such as motion blur, rolling shutter, and camera shaking, are accurately modeled to reflect real-world conditions. For each frame, four depth qualities exist to simulate sensors with different characteristics. While the first quality provides perfect ground truth, the second considers measurements with the distance-dependent noise characteristics of the Azure Kinect time-of-flight sensor. Finally, for the third and fourth qualities, two stereo RGB images with and without a pattern from a simulated dot projector were rendered; depth images were then reconstructed from these stereo pairs.
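Since the annotations follow the BOP format, the per-frame object poses live in scene_gt.json files; a minimal reader, with the file path as the only assumption (field names follow the public BOP specification):

```python
import json
import numpy as np

def load_bop_scene_gt(path):
    """Read BOP-format ground truth: scene_gt.json maps each frame id
    to a list of per-object pose annotations."""
    with open(path) as f:
        scene_gt = json.load(f)
    poses = {}
    for frame_id, annotations in scene_gt.items():
        poses[int(frame_id)] = [
            (ann["obj_id"],
             np.array(ann["cam_R_m2c"]).reshape(3, 3),  # rotation, model-to-camera
             np.array(ann["cam_t_m2c"]))                # translation in mm
            for ann in annotations
        ]
    return poses
```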
In this dataset, a UR5 robot used 6 tools (metal-scissor, metal-whisk, plastic-knife, plastic-spoon, wooden-chopstick, and wooden-fork) to perform 6 behaviors: look, stirring-slow, stirring-fast, stirring-twist, whisk, and poke. The robot explored 15 objects kept in cylindrical containers: cane-sugar, chia-seed, chickpea, detergent, empty, glass-bead, kidney-bean, metal-nut-bolt, plastic-bead, salt, split-green-pea, styrofoam-bead, water, wheat, and wooden-button. The robot performed 10 trials on each object with each tool, resulting in 5,400 interactions (6 tools x 6 behaviors x 15 objects x 10 trials). The robot recorded multiple sensory data (audio, RGB images, depth images, haptics, and touch images) while interacting with the objects.
The dataset includes synthetic data generated by rendering the 3D meshes of LM objects and several household objects in Blender for training 6D pose estimation algorithms. The whole dataset contains synthetic data for 18 objects (13 from LM and 5 household objects), with 20,000 data samples per object. Each data sample includes an RGB image in .png format and a depth image in .exr format, along with mask labels in .png format and ground-truth pose labels saved in .json files. Apart from the training data, the 3D meshes of the objects and pre-trained models of the 6D pose estimation algorithm are also included. The whole dataset takes approximately 1 TB of storage.
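Given the stated formats (.png RGB and mask, .exr depth, .json pose), one way to load a sample with OpenCV might look like the sketch below; the file-naming scheme is a hypothetical placeholder:

```python
import os
os.environ["OPENCV_IO_ENABLE_OPENEXR"] = "1"  # allow OpenCV to read .exr depth
import json
import cv2

def load_sample(stem):
    """Load one training sample; only the formats are given by the
    dataset description, the '<stem>_*' naming is an assumption."""
    rgb = cv2.imread(f"{stem}_rgb.png", cv2.IMREAD_COLOR)
    depth = cv2.imread(f"{stem}_depth.exr", cv2.IMREAD_UNCHANGED)
    mask = cv2.imread(f"{stem}_mask.png", cv2.IMREAD_GRAYSCALE)
    with open(f"{stem}_pose.json") as f:
        pose = json.load(f)
    return rgb, depth, mask, pose
```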
VR-Folding contains garment meshes of 4 categories from the CLOTH3D dataset, namely Shirt, Pants, Top, and Skirt. For the flattening task there are 5,871 videos containing 585K frames in total; for the folding task there are 3,896 videos containing 204K frames in total. The data for each frame include multi-view RGB-D images, object masks, full garment meshes, and hand poses.
ARKitTrack is a new RGB-D tracking dataset for both static and dynamic scenes, captured with the consumer-grade LiDAR scanners built into Apple's iPhone and iPad. ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total, with 123.9K pixel-level target masks along with bounding-box annotations and frame-level attributes.
NBMOD is a dataset created for researching specific-object grasp detection by robots in noisy environments. It comprises three subsets: the Simple-background Single-object Subset (SSS), the Noisy-background Single-object Subset (NSS), and the Multi-Object grasp detection Subset (MOS). The SSS subset contains 13,500 images, the NSS subset 13,000 images, and the MOS subset 5,000 images.
This dataset contains 6,700 executed scoops (excavations), spanning a wide range of materials, terrain topographies, and compositions.
InfraParis is a novel and versatile dataset supporting multiple tasks across three modalities: RGB, depth, and infrared. From the city to the suburbs, it covers a variety of scene styles across the greater Paris region, providing rich semantic information. InfraParis contains 7,301 images with bounding boxes and full semantic annotations (19 classes). We assess various state-of-the-art baseline techniques, encompassing models for semantic segmentation, object detection, and depth estimation.
We introduce an RGB+S dataset named the “Industrial Human Action Recognition Dataset” (InHARD), collected in a real-world setting for industrial human action recognition, with over 2 million frames from 16 distinct subjects. This dataset contains 13 different industrial action classes and over 4,800 action samples. It should enable the study and development of various learning techniques for analyzing human actions in industrial environments involving human-robot collaboration.
The dataset is recorded with an on-vehicle ZED stereo camera in both urban and rural environments.
Understanding comprehensive assembly knowledge from videos is critical for a futuristic, ultra-intelligent industry. To enable technological breakthroughs, we present HA-ViD, an assembly video dataset that features representative industrial assembly scenarios, a natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and fine-grained action annotations decomposed into subject, action verb, manipulated object, target object, and tool. We provide 3,222 multi-view and multi-modality videos, 1.5M frames, 96K temporal labels, and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection, and multi-object tracking. Importantly, we analyze their performance and the further reasoning steps required for comprehending knowledge of assembly progress and process efficiency.
A large-scale isolated Indian sign language dataset. It contains 2,002 common words used in daily communication among the Indian deaf community. The dataset contains 40,033 videos across the 2,002 words, with a total duration of around 36.2 hours and 7.8 million frames.
A large-scale, egocentric, multimodal, and context-aware dataset of human demonstrations of social navigation.
The ViCoS Towel Dataset is a state-of-the-art benchmark for grasp point localization on cloth objects, specifically towels. Designed to advance research in robotic grasping and perception for textile objects, this dataset includes a collection of 8,000 high-resolution RGB-D images (1920×1080) captured with a Kinect V2 under a variety of conditions. Each image provides detailed depth information, making it ideal for training deep learning models and conducting thorough benchmarking.
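A predicted grasp point (u, v) on such RGB-D data is typically lifted to 3-D via pinhole back-projection; a minimal sketch, with the Kinect V2 intrinsics left as calibration inputs rather than hard-coded values:

```python
import numpy as np

def backproject(depth_m, fx, fy, cx, cy):
    """Back-project a depth map (in meters) to an (H, W, 3) point cloud
    using the pinhole model; intrinsics come from sensor calibration."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.stack([x, y, depth_m], axis=-1)
```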