3,275 machine learning datasets
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline s
A benchmark to evaluate the tool-use capabilities of LLM-based agents in real-world scenarios.
The HInt dataset is frequently used as a generalizability benchmark for 3D hand reconstruction. It features three subsets (HInt-NewDays, HInt-VISOR, and HInt-Ego4D) and aims to complement existing datasets used for training and evaluating 3D hand pose estimation. HInt annotates 2D keypoint locations and occlusion labels for 21 keypoints on the hand. It is built on three existing datasets (Hands23, Epic-Kitchens VISOR, and Ego4D) and provides annotations for images drawn from them.
RFUND is a relabeled version of FUNSD and XFUND datasets, tackling the following issues in their original annotations:
English subset of RFUND
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
Autonomous trucking is a promising technology that can greatly impact modern logistics and the environment. Ensuring its safety on public roads is one of the main duties that requires an accurate perception of the environment. To achieve this, machine learning methods rely on large datasets, but to this day, no such datasets are available for autonomous trucks. In this work, we present MAN TruckScenes, the first multimodal dataset for autonomous trucking. MAN TruckScenes allows the research community to come into contact with truck-specific challenges, such as trailer occlusions, novel sensor perspectives, and terminal environments for the first time. It comprises more than 740 scenes of 20s each within a multitude of different environmental conditions. The sensor set includes 4 cameras, 6 lidar, 6 radar sensors, 2 IMUs, and a high-precision GNSS. The dataset's 3D bounding boxes were manually annotated and carefully reviewed to achieve a high quality standard. Bounding boxes are availa
ZEB is an evaluation benchmark for image matching, built by merging 8 real-world datasets and 4 simulated datasets with diverse image resolutions, scene conditions, and viewpoints.
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to s
OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, which incorporate both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, which separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 14 representative S2V models, highlighting their strengths and weaknesses across different content.
Are current 3D object tracking methods truly robust enough for low-fidelity depth sensors like the iPhone LiDAR? We introduce DTTD-Mobile (fully compatible with the YCB toolbox), a new benchmark built on real-world data captured from mobile devices: 18 objects observed in 100 videos with 47,668 sampled frames and 114,143 object annotations. We evaluate several popular methods, including BundleSDF, ES6D, MegaPose, and DenseFusion, and highlight their limitations in this challenging setting.
This dataset contains 5955 painting images (from WikiCommons): a train set of 2978 images and a test set of 2977 images (for the classification task). 1480 of the 2977 test-set images are annotated with bounding boxes for 7 iconographic classes: ‘angel’, ‘Child_Jesus’, ‘crucifixion_of_Jesus’, ‘Mary’, ‘nudity’, ‘ruins’, ‘Saint_Sebastien’.
The PROBA-V Super-Resolution dataset is the official dataset of ESA's Kelvins competition for "PROBA-V Super Resolution". It contains satellite data from 74 hand-selected regions around the globe at different points in time. The data is composed of radiometrically and geometrically corrected Top-Of-Atmosphere (TOA) reflectances for the RED and NIR spectral bands at 300m and 100m resolution in Plate Carrée projection. The 300m resolution data is delivered as 128x128 grey-scale pixel images, the 100m resolution data as 384x384 grey-scale pixel images. Additionally, a quality map is provided for each image, indicating whether its pixels are concealed (e.g. by clouds, ice, water, or missing information).
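The two image sizes in the PROBA-V entry differ by a factor of 3 per side (300m to 100m ground sampling, 128 to 384 pixels). As a minimal sketch of that relationship, the pure-Python nearest-neighbour upsampler below produces a grid of the high-resolution size from a low-resolution image. The function name `upsample_nearest` and the nested-list image representation are illustrative assumptions, not part of the dataset's tooling or the competition's baseline.

```python
# Image side lengths taken from the dataset description.
LOW_RES_SIDE = 128    # 300m resolution images
HIGH_RES_SIDE = 384   # 100m resolution images
SCALE = HIGH_RES_SIDE // LOW_RES_SIDE  # upscaling factor per side


def upsample_nearest(image, factor):
    """Repeat every pixel `factor` times along both axes.

    `image` is a list of equal-length rows (hypothetical representation).
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out = []
    for row in image:
        # Widen the row, then stack `factor` independent copies of it.
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


if __name__ == "__main__":
    low = [[0] * LOW_RES_SIDE for _ in range(LOW_RES_SIDE)]
    high = upsample_nearest(low, SCALE)
    print(len(high), len(high[0]))
```

Nearest-neighbour repetition only illustrates the size relationship; it adds no detail, which is why the task is posed as super-resolution rather than resizing.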
Horse-10 is an animal pose estimation dataset. It comprises 30 diverse Thoroughbred horses, for which 22 body parts were labeled by an expert in 8,114 frames. The horses have various coat colors, and the “in-the-wild” nature of the data, collected at various Thoroughbred yearling sales and farms, adds further complexity. The authors introduce Horse-C to contrast the domain shift inherent in the Horse-10 dataset with domain shift induced by common image corruptions.
The Stanford Light Field Archive is a collection of several light fields for research in computer graphics and vision.
The Oxford-Affine dataset is a small dataset containing 8 scenes with a sequence of 6 images per scene. The images in a sequence are related by homographies.
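To illustrate what "related by homographies" means in the Oxford-Affine entry, here is a minimal sketch that maps a pixel coordinate from one image into another through a 3x3 homography matrix. The function name `apply_homography` and the example matrix are hypothetical; the dataset's own file format and matrices are not shown here.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (list of three rows).

    The point is lifted to homogeneous coordinates (x, y, 1), multiplied
    by H, and divided by the resulting third component.
    """
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return (xh / w, yh / w)


if __name__ == "__main__":
    # Hypothetical homography: a pure translation by (+5, -2) pixels.
    H_example = [
        [1.0, 0.0, 5.0],
        [0.0, 1.0, -2.0],
        [0.0, 0.0, 1.0],
    ]
    print(apply_homography(H_example, (3.0, 4.0)))
```

With a ground-truth homography per image pair, a matcher can be scored by mapping its keypoints from the first image and checking how close they land to their matches in the second.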
The Retrieval-SFM dataset is used for instance image retrieval. The dataset contains 28559 images from 713 locations in the world. Each image has a label indicating the location it belongs to. Most locations are famous man-made architectures such as palaces and towers, which are relatively static and positively contribute to visual place recognition. The training dataset contains various perceptual changes including variations in viewing angles, occlusions and illumination conditions, etc.
Freiburg Groceries is a groceries classification dataset consisting of 5000 images of size 256x256, divided into 25 categories. It has imbalanced class sizes ranging from 97 to 370 images per class. Images were taken in various aspect ratios and padded to squares.
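The Freiburg Groceries entry says images were padded to squares. A minimal sketch of such padding follows, assuming centered placement and a constant fill value; the source does not state which placement or fill the authors used, and `pad_to_square` is a hypothetical helper operating on nested lists. Resizing the padded square to 256x256 is not shown.

```python
def pad_to_square(image, fill=0):
    """Pad a rectangular image (list of equal-length rows) to a square.

    Assumption: the original content is centered and the border is filled
    with `fill`; any odd leftover pixel goes to the bottom or right side.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    side = max(height, width)
    top = (side - height) // 2
    left = (side - width) // 2

    # Start from a fully filled square, then copy the original pixels in.
    out = [[fill] * side for _ in range(side)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            out[top + r][left + c] = value
    return out


if __name__ == "__main__":
    wide = [[1, 2, 3]]  # 1x3 image
    for row in pad_to_square(wide):
        print(row)
```

Padding rather than stretching keeps the aspect ratio of the product in each image, at the cost of some constant-valued border pixels.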
MobilityAids is a dataset for perception of people and their mobility aids. The annotated dataset contains five classes: pedestrian, person in wheelchair, pedestrian pushing a person in a wheelchair, person using crutches, and person using a walking frame. In total the hospital dataset has over 17,000 annotated RGB-D images, containing people categorized according to the mobility aids they use. The images were collected in the facilities of the Faculty of Engineering of the University of Freiburg and in a hospital in Frankfurt.