UBody is a large-scale Upper-Body dataset that provides multiple types of annotations.
ISTD+ consists of shadow images, shadow-free images, and shadow masks, with 1,330 training images and 540 testing images from 135 unique background scenes. ISTD suffers from color and luminosity inconsistencies between shadow and shadow-free images, which ISTD+ corrects with a color compensation mechanism to ensure uniform pixel colors across the ground-truth images.
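To make the triplet structure concrete, the sketch below iterates over matched (shadow, mask, shadow-free) images. The `<split>_A/_B/_C` folder names and the PNG extension are assumptions about the on-disk layout, not something stated in the description above.

```python
from pathlib import Path
from PIL import Image

def load_istd_triplets(root, split="train"):
    """Yield (shadow, mask, shadow_free) PIL images for one split.

    Assumes a common ISTD-style layout: <split>_A holds shadow images,
    <split>_B shadow masks, <split>_C shadow-free images, with matching
    file names across the three folders.
    """
    root = Path(root) / split
    for shadow_path in sorted((root / f"{split}_A").glob("*.png")):
        mask_path = root / f"{split}_B" / shadow_path.name
        free_path = root / f"{split}_C" / shadow_path.name
        yield (Image.open(shadow_path).convert("RGB"),
               Image.open(mask_path).convert("L"),
               Image.open(free_path).convert("RGB"))
```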
The MHP dataset contains multiple persons captured in real-world scenes with pixel-level fine-grained semantic annotations in an instance-aware setting.
The KITTI-Depth dataset includes depth maps from projected LiDAR point clouds that were matched against the depth estimates from the stereo cameras. The depth maps are highly sparse, with only about 5% of pixels carrying a value and the rest missing. The dataset has 86k training images, 7k validation images, and 1k test images hosted on the benchmark server, for which the ground truth is withheld.
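As a concrete illustration, the following sketch decodes one such sparse depth map and measures its sparsity. It assumes the usual KITTI depth-completion encoding (16-bit PNG, depth in meters multiplied by 256, zero marking missing pixels); the file name is hypothetical.

```python
import numpy as np
from PIL import Image

def read_kitti_depth(png_path):
    """Decode a KITTI depth map stored as a 16-bit PNG.

    Per the KITTI depth-completion convention, pixel values are depth
    in meters times 256, and 0 marks pixels without a measurement.
    """
    raw = np.asarray(Image.open(png_path), dtype=np.uint16)
    depth = raw.astype(np.float32) / 256.0
    valid = raw > 0  # roughly 5% of pixels carry a LiDAR return
    return depth, valid

depth, valid = read_kitti_depth("0000000005.png")  # hypothetical file
print(f"sparsity: {valid.mean():.1%} of pixels valid")
```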
4DFAB is a large-scale database of dynamic high-resolution 3D faces, consisting of recordings of 180 subjects captured in four different sessions spanning a five-year period (2012–2017), for a total of over 1,800,000 3D meshes. It contains 4D videos of subjects displaying both spontaneous and posed facial behaviours. The database can be used for both face and facial expression recognition, as well as behavioural biometrics, and to learn very powerful blendshapes for parametrising facial behaviour.
Multi-View Operating Room (MVOR) is a dataset recorded during real clinical interventions. It consists of 732 synchronized multi-view frames recorded by three RGB-D cameras in a hybrid OR, and it captures the visual challenges typical of such environments, such as occlusions and clutter.
The REFUGE Challenge provides a dataset of 1,200 fundus images with ground-truth segmentations and clinical glaucoma labels, currently the largest of its kind.
Created for multi-view stereo (MVS) tasks, this is a large-scale multi-view aerial dataset generated from a highly accurate 3D digital surface model, itself produced from thousands of real aerial images with precise camera parameters.
VegFru is a domain-specific dataset for fine-grained visual categorization. VegFru categorizes vegetables and fruits according to their eating characteristics, and each image contains at least one edible part of a vegetable or fruit with the same cooking usage. Notably, all the images are labelled hierarchically. The current version covers 25 upper-level categories and 292 subordinate classes of vegetables and fruits, and contains more than 160,000 images in total, with at least 200 images per subordinate class.
Food2K is a large food recognition dataset with 2,000 categories and over 1 million images. Compared with existing food recognition datasets, Food2K surpasses them in both categories and images by one order of magnitude, and thus establishes a new challenging benchmark for developing advanced models for food visual representation learning. Food2K can be further explored to benefit more food-relevant tasks, including emerging and more complex ones (e.g., nutritional understanding of food), and models trained on Food2K can serve as backbones to improve performance on further food-relevant tasks.
PPR10K (the Portrait Photo Retouching dataset) is a large-scale and diverse dataset for portrait photo retouching (PPR), a task that aims to enhance the visual quality of collections of flat-looking portrait photos.
TransNAS-Bench-101 is a Neural Architecture Search (NAS) benchmark dataset containing network performance across seven tasks, covering classification, regression, pixel-level prediction, and self-supervised tasks. This diversity provides opportunities to transfer NAS methods among tasks and allows more complex transfer schemes to evolve. We explore two fundamentally different types of search space: a cell-level search space and a macro-level search space. With 7,352 backbones evaluated on seven tasks, 51,464 trained models with detailed training information are provided. With TransNAS-Bench-101, we hope to encourage the advent of exceptional NAS algorithms that raise cross-task search efficiency and generalizability to the next level.
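To illustrate why such a tabular benchmark accelerates NAS research, here is a toy sketch in which "evaluating" an architecture is a lookup into precomputed results rather than a training run. The table contents, scores, and function names are invented for illustration and are not the actual TransNAS-Bench-101 API.

```python
# Hypothetical illustration of querying a tabular NAS benchmark:
# architectures map to precomputed metrics per task, so evaluating a
# candidate is a dictionary lookup instead of training from scratch.
from typing import Dict

# metric table: architecture id -> {task name -> validation score}
# (entries below are made-up placeholders, not real benchmark values)
TABLE: Dict[str, Dict[str, float]] = {
    "arch-0001": {"class_scene": 54.9, "jigsaw": 94.5},
    "arch-0002": {"class_scene": 55.3, "jigsaw": 95.1},
}

def query(arch: str, task: str) -> float:
    """Return the precomputed score for (architecture, task)."""
    return TABLE[arch][task]

best = max(TABLE, key=lambda a: query(a, "class_scene"))
print(best, query(best, "class_scene"))
```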
2021 Hotel-ID is a dataset for hotel recognition to help raise awareness of human trafficking and generate novel approaches. The dataset consists of hotel room images that have been crowd-sourced and uploaded through the TraffickCam mobile application.
A dataset provided by the Image Matching Workshop.
HiFiMask is a large-scale High-Fidelity Mask dataset, namely CASIA-SURF HiFiMask (HiFiMask for short). It contains a total of 54,600 videos recorded from 75 subjects wearing 225 realistic masks, captured by 7 new kinds of sensors.
PASS is a large-scale image dataset, containing 1.4 million images, that does not include any humans and which can be used for high-quality pretraining while significantly reducing privacy concerns.
The IIIT5K dataset contains 5,000 text instance images: 2,000 for training and 3,000 for testing. It contains words from street scenes and from originally-digital images. Every image is associated with a 50-word lexicon and a 1,000-word lexicon.
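The lexicons enable constrained recognition: a raw prediction is snapped to the closest lexicon entry. Below is a minimal sketch of that decoding step using plain Levenshtein distance; the example strings are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def constrain_to_lexicon(raw_pred: str, lexicon: list[str]) -> str:
    """Snap a raw recognition output to the closest lexicon word."""
    return min(lexicon, key=lambda w: edit_distance(raw_pred.lower(), w.lower()))

print(constrain_to_lexicon("hou5e", ["house", "horse", "mouse"]))  # -> "house"
```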
MIntRec is a novel dataset for multimodal intent recognition. It formulates coarse-grained and fine-grained intent taxonomies based on the data collected from the TV series Superstore. The dataset consists of 2,224 high-quality samples with text, video, and audio modalities and has multimodal annotations among twenty intent categories.
The dataset contains 2,624 samples of $300\times300$ pixel, 8-bit grayscale images of functional and defective solar cells with varying degrees of degradation, extracted from 44 different solar modules. The defects in the annotated images are of either intrinsic or extrinsic type and are known to reduce the power efficiency of solar modules.
Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute detection performance of several vision-language models.
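As a rough illustration of the kind of probing the benchmark targets, the sketch below scores a handful of attribute prompts against an object crop using an off-the-shelf CLIP model via Hugging Face transformers. The model choice, prompt template, attribute list, and file name are illustrative assumptions, and this is not the official OVAD evaluation code.

```python
# Minimal zero-shot attribute probing with a vision-language model,
# in the spirit of the OVAD task (illustrative, not the official code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

attributes = ["striped", "metallic", "wooden", "red"]
prompts = [f"a photo of a {a} object" for a in attributes]  # toy template

image = Image.open("example_crop.jpg")  # hypothetical object crop
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    # softmax over the attribute prompts, used here only to rank them
    scores = model(**inputs).logits_per_image.softmax(dim=-1)[0]

for attr, score in zip(attributes, scores.tolist()):
    print(f"{attr}: {score:.3f}")
```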