Datasets

3,275 machine learning datasets

3,275 dataset results

SALICON (Salicency in Context)

The SALIency in CONtext (SALICON) dataset contains 10,000 training images, 5,000 validation images and 5,000 test images for saliency prediction. This dataset has been created by annotating saliency in images from MS COCO. The ground-truth saliency annotations include fixations generated from mouse trajectories. To improve the data quality, isolated fixations with low local density have been excluded. The training and validation sets, provided with ground truth, contain the following data fields: image, resolution and gaze. The testing data contains only the image and resolution fields.

161 papers21 benchmarksImages

CrowdHuman

CrowdHuman is a large and rich-annotated human detection dataset, which contains 15,000, 4,370 and 5,000 images collected from the Internet for training, validation and testing respectively. The number is more than 10× boosted compared with previous challenging pedestrian detection dataset like CityPersons. The total number of persons is also noticeably larger than the others with ∼340k person and ∼99k ignore region annotations in the CrowdHuman training subset.

161 papers10 benchmarksImages

Objects365

Objects365 is a large-scale object detection dataset, Objects365, which has 365 object categories over 600K training images. More than 10 million, high-quality bounding boxes are manually labeled through a three-step, carefully designed annotation pipeline. It is the largest object detection dataset (with full annotation) so far and establishes a more challenging benchmark for the community.

161 papers11 benchmarksImages

V-COCO (Verbs in COCO)

Verbs in COCO (V-COCO) is a dataset that builds off COCO for human-object interaction detection. V-COCO provides 10,346 images (2,533 for training, 2,867 for validating and 4,946 for testing) and 16,199 person instances. Each person has annotations for 29 action categories and there are no interaction labels including objects.

159 papers4 benchmarksImages

VisDial (Visual Dialog)

Visual Dialog (VisDial) dataset contains human annotated questions based on images of MS COCO dataset. This dataset was developed by pairing two subjects on Amazon Mechanical Turk to chat about an image. One person was assigned the job of a ‘questioner’ and the other person acted as an ‘answerer’. The questioner sees only the text description of an image (i.e., an image caption from MS COCO dataset) and the original image remains hidden to the questioner. Their task is to ask questions about this hidden image to “imagine the scene better”. The answerer sees the image, caption and answers the questions asked by the questioner. The two of them can continue the conversation by asking and answering questions for 10 rounds at max.

159 papers4 benchmarksDialog, Images, Texts

Total-Text

Total-Text is a text detection dataset that consists of 1,555 images with a variety of text types including horizontal, multi-oriented, and curved text instances. The training split and testing split have 1,255 images and 300 images, respectively.

156 papers6 benchmarksImages

IJB-A (IARPA Janus Benchmark A)

The IARPA Janus Benchmark A (IJB-A) database is developed with the aim to augment more challenges to the face recognition task by collecting facial images with a wide variations in pose, illumination, expression, resolution and occlusion. IJB-A is constructed by collecting 5,712 images and 2,085 videos from 500 identities, with an average of 11.4 images and 4.2 videos per identity.

156 papers23 benchmarksImages

ObjectNet

ObjectNet is a test set of images collected directly using crowd-sourcing. ObjectNet is unique as the objects are captured at unusual poses in cluttered, natural scenes, which can severely degrade recognition performance. There are 50,000 images in the test set which controls for rotation, background and viewpoint. There are 313 object classes with 113 overlapping ImageNet.

155 papers7 benchmarksImages

SID (See-in-the-Dark)

The See-in-the-Dark (SID) dataset contains 5094 raw short-exposure images, each with a corresponding long-exposure reference image. Images were captured using two cameras: Sony α7SII and Fujifilm X-T2.

155 papers3 benchmarksImages

AFW (Annotated Faces in the Wild)

AFW (Annotated Faces in the Wild) is a face detection dataset that contains 205 images with 468 faces. Each face image is labeled with at most 6 landmarks with visibility labels, as well as a bounding box.

154 papers0 benchmarksImages

AFLW (Annotated Facial Landmarks in the Wild)

The Annotated Facial Landmarks in the Wild (AFLW) is a large-scale collection of annotated face images gathered from Flickr, exhibiting a large variety in appearance (e.g., pose, expression, ethnicity, age, gender) as well as general imaging and environmental conditions. In total about 25K faces are annotated with up to 21 landmarks per image.

154 papers8 benchmarksImages

In-Shop (In-shop Clothes Retrieval Benchmark)

In-shop Clothes Retrieval Benchmark evaluates the performance of in-shop Clothes Retrieval. This is a large subset of DeepFashion, containing large pose and scale variations. It also has large diversities, large quantities, and rich annotations, including:

154 papers2 benchmarksImages

A-OKVQA

A-OKVQA is crowdsourced visual question answering dataset composed of a diverse set of about 25K questions requiring a broad base of commonsense and world knowledge to answer.

154 papers2 benchmarksImages, Texts

dSprites (Disentanglement testing Sprites dataset)

dSprites is a dataset of 2D shapes procedurally generated from 6 ground truth independent latent factors. These factors are color, shape, scale, rotation, x and y positions of a sprite.

153 papers0 benchmarksImages

CC12M (Conceptual 12M)

Conceptual 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training.

153 papers0 benchmarksImages, Texts

MegaDepth

The MegaDepth dataset is a dataset for single-view depth prediction that includes 196 different locations reconstructed from COLMAP SfM/MVS.

152 papers0 benchmarksImages

VRD (Visual Relationship Detection dataset)

The Visual Relationship Dataset (VRD) contains 4000 images for training and 1000 for testing annotated with visual relationships. Bounding boxes are annotated with a label containing 100 unary predicates. These labels refer to animals, vehicles, clothes and generic objects. Pairs of bounding boxes are annotated with a label containing 70 binary predicates. These labels refer to actions, prepositions, spatial relations, comparatives or preposition phrases. The dataset has 37993 instances of visual relationships and 6672 types of relationships. 1877 instances of relationships occur only in the test set and they are used to evaluate the zero-shot learning scenario.

151 papers7 benchmarksImages, Texts

MOT16 (Multiple Object Tracking 2016)

The MOT16 dataset is a dataset for multiple object tracking. It a collection of existing and new data (part of the sources are from and ), containing 14 challenging real-world videos of both static scenes and moving scenes, 7 for training and 7 for testing. It is a large-scale dataset, composed of totally 110407 bounding boxes in training set and 182326 bounding boxes in test set. All video sequences are annotated under strict standards, their ground-truths are highly accurate, making the evaluation meaningful.

149 papers6 benchmarksImages, Videos

DISFA (Denver Intensity of Spontaneous Facial Action)

The Denver Intensity of Spontaneous Facial Action (DISFA) dataset consists of 27 videos of 4844 frames each, with 130,788 images in total. Action unit annotations are on different levels of intensity, which are ignored in the following experiments and action units are either set or unset. DISFA was selected from a wider range of databases popular in the field of facial expression recognition because of the high number of smiles, i.e. action unit 12. In detail, 30,792 have this action unit set, 82,176 images have some action unit(s) set and 48,612 images have no action unit(s) set at all.

148 papers22 benchmarksImages, Videos

SBD (Semantic Boundaries Dataset)

The Semantic Boundaries Dataset (SBD) is a dataset for predicting pixels on the boundary of the object (as opposed to the inside of the object with semantic segmentation). The dataset consists of 11318 images from the trainval set of the PASCAL VOC2011 challenge, divided into 8498 training and 2820 test images. This dataset has object instance boundaries with accurate figure/ground masks that are also labeled with one of 20 Pascal VOC classes.

147 papers5 benchmarksImages

PreviousPage 10 of 164Next