Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,275 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,275 dataset results

IconQA (Icon Question Answering)

Current visual question answering (VQA) tasks mainly consider answering human-annotated questions for natural images in the daily-life context. Icon question answering (IconQA) is a benchmark which aims to highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning in real-world diagram word problems. For this benchmark, a large-scale IconQA dataset is built that consists of three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. Compared to existing VQA benchmarks, IconQA requires not only perception skills like object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning.

42 papers · 16 benchmarks · Images, Texts

C-GQA (Compositional GQA)

We propose a split built on top of the Stanford GQA dataset, originally proposed for VQA, and name it the Compositional GQA (C-GQA) dataset (see the supplementary material for details). C-GQA contains over 9.5k compositional labels, making it the most extensive dataset for compositional zero-shot learning (CZSL). With cleaner labels and a larger label space, our hope is that this dataset will inspire further research on the topic.

42 papers · 0 benchmarks · Images

I-HAZE

The I-HAZE dataset contains 25 indoor hazy images (2833×4657 pixels) for training, plus 5 hazy images for validation, along with their corresponding ground-truth images.

41 papers · 0 benchmarks · Images

KITTI Road

KITTI Road is a road and lane estimation benchmark that consists of 289 training and 290 test images. It contains three different categories of road scenes:

  • uu - urban unmarked (98/100)
  • um - urban marked (95/96)
  • umm - urban multiple marked lanes (96/94)
  • urban - combination of the three above

Ground truth has been generated by manual annotation of the images and is available for two different road terrain types: road, the road area, i.e., the composition of all lanes; and lane, the ego-lane, i.e., the lane the vehicle is currently driving on (only available for category "um"). Ground truth is provided for training images only.

41 papers · 0 benchmarks · Images, Point cloud
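The KITTI Road devkit distributes images whose filenames begin with a category prefix (e.g. `um_000000.png`); assuming that naming convention, a minimal sketch for grouping files by scene category could look like:

```python
# Hypothetical sketch: group KITTI Road image filenames by scene category.
# The prefix-based naming convention is an assumption drawn from the devkit.
from collections import defaultdict

CATEGORIES = {
    "uu": "urban unmarked",
    "um": "urban marked",
    "umm": "urban multiple marked lanes",
}

def group_by_category(filenames):
    """Map each filename to its road-scene category via its prefix."""
    groups = defaultdict(list)
    for name in filenames:
        prefix = name.split("_", 1)[0]
        groups[CATEGORIES.get(prefix, "unknown")].append(name)
    return dict(groups)

files = ["uu_000001.png", "um_000002.png", "umm_000003.png"]
print(group_by_category(files))
```

Splitting on the first underscore keeps `umm` distinct from `um`, since the whole prefix is compared at once.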

Partial-REID

Partial-REID is a specially designed partial person re-identification dataset that includes 600 images of 60 people, with 5 full-body images and 5 occluded images per person. The images were collected on a university campus by 6 cameras with different viewpoints, backgrounds and types of occlusion.

41 papers · 1 benchmark · Images

ExpW (Expression in-the-Wild)

The Expression in-the-Wild (ExpW) dataset is for facial expression recognition and contains 91,793 faces manually labeled with expressions. Each of the face images is annotated as one of the seven basic expression categories: “angry”, “disgust”, “fear”, “happy”, “sad”, “surprise”, or “neutral”.

41 papers · 6 benchmarks · Images

Million-AID

Million-AID is a large-scale benchmark dataset containing a million instances for remote-sensing (RS) scene classification. There are 51 semantic scene categories in Million-AID, customized to match land-use classification standards, which greatly enhances the practicability of the dataset. Unlike existing scene classification datasets, whose categories are organized with parallel or uncertain relationships, the scene categories in Million-AID are organized in a systematic relationship architecture, making the dataset easier to manage and scale. Specifically, the scene categories are organized as a hierarchical three-level tree: 51 leaf nodes fall into 28 parent nodes at the second level, which are grouped into 8 nodes at the first level, representing 8 underlying scene categories such as agriculture land, commercial land, industrial land, public service land, residential land, and transportation land.

41 papers · 0 benchmarks · Images

JFT-3B

JFT-3B is an internal Google dataset and a larger version of the JFT-300M dataset. It consists of nearly 3 billion images, annotated with a class-hierarchy of around 30k labels via a semi-automatic pipeline. In other words, the data and associated labels are noisy.

41 papers · 0 benchmarks · Images

NT-VOT211

NT-VOT211 consists of 211 diverse videos, offering 211,000 well-annotated frames with 8 attributes: camera motion, deformation, fast motion, motion blur, tiny target, distractors, occlusion and out-of-view. To the best of our knowledge, it is the largest night-time tracking benchmark to date that is specifically designed to address unique challenges such as adverse visibility, image blur, and distractors inherent to night-time tracking scenarios.

41 papers · 4 benchmarks · Images, Videos

Florence (Florence 3D Faces)

The Florence 3D Faces dataset consists of high-resolution 3D face scans together with images of each subject.

40 papers · 30 benchmarks · 3D, Images

VeRi-Wild

VeRi-Wild is the largest vehicle re-identification dataset (as of CVPR 2019). It was captured by a large CCTV surveillance system of 174 cameras over one month (30 × 24 h) under unconstrained scenarios. The dataset comprises 416,314 vehicle images of 40,671 identities. Evaluation is split across three subsets (small, medium and large), comprising 3,000, 5,000 and 10,000 identities respectively in the probe and gallery sets.

40 papers · 0 benchmarks · Images

Visual Wake Words

Visual Wake Words represents a common microcontroller vision use-case of identifying whether a person is present in the image or not, and provides a realistic benchmark for tiny vision models.

40 papers · 1 benchmark · Images
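Visual Wake Words labels are derived from COCO annotations: an image is labeled positive when at least one person occupies more than a small fraction of it (roughly 0.5% of the image area, per the VWW paper). A minimal sketch of that rule, where the annotation field names are illustrative rather than the official format:

```python
# Illustrative sketch (not the official conversion script): derive a
# Visual Wake Words style binary label from COCO-style box annotations.
# The 0.5% area threshold follows the VWW paper; field names are assumed.

def vww_label(annotations, image_w, image_h, threshold=0.005):
    """Return 1 if any 'person' box covers more than `threshold` of the image."""
    image_area = image_w * image_h
    for ann in annotations:
        x, y, w, h = ann["bbox"]  # COCO convention: [x, y, width, height]
        if ann["category"] == "person" and (w * h) / image_area > threshold:
            return 1
    return 0

anns = [{"category": "person", "bbox": [10, 10, 100, 200]},
        {"category": "dog", "bbox": [0, 0, 50, 50]}]
print(vww_label(anns, 640, 480))  # large person box -> 1
```

The area threshold filters out images where a person is present but too small to be a meaningful "wake" signal for a tiny vision model.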

Samanantar

Samanantar is the largest publicly available parallel corpora collection for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil and Telugu. The corpus has 49.6M sentence pairs between English and these 11 Indic languages.

40 papers · 0 benchmarks · Images, Speech, Texts

AID (Aerial Image Dataset)

AID is a new large-scale aerial image dataset built by collecting sample images from Google Earth imagery. Note that although the Google Earth images are post-processed using RGB renderings of the original optical aerial images, it has been shown that there is no significant difference between Google Earth images and real optical aerial images, even for pixel-level land use/cover mapping. Thus, Google Earth images can be used as aerial images for evaluating scene classification algorithms.

40 papers · 5 benchmarks · Images

RedCaps

RedCaps is a large-scale dataset of 12M image-text pairs collected from Reddit. Images and captions from Reddit depict and describe a wide variety of objects and scenes. The data is collected from a manually curated set of subreddits (350 total), which give coarse image labels and allow steering of the dataset composition without labeling individual instances.

40 papers · 0 benchmarks · Images, Texts

TheoremQA

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics and Finance. The dataset was curated by human experts to ensure very high quality. We provide it as a new benchmark to test the limits of large language models in applying theorems to solve challenging university-level questions, along with a pipeline to prompt LLMs and evaluate their outputs with WolframAlpha.

40 papers · 1 benchmark · Images, Texts
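Evaluating LLM outputs on theorem questions requires tolerant comparison of numeric answers, since models rarely emit exact decimals. A hypothetical checker in that spirit (not the authors' official evaluation code) might look like:

```python
# Hedged sketch of a numeric-answer checker such as a TheoremQA-style
# pipeline might use. The 1% relative tolerance is an assumption, not
# the paper's official setting.
import math

def answers_match(predicted, gold, rel_tol=1e-2):
    """Compare a model's answer to the gold answer, numerically if possible."""
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=rel_tol)
    except (TypeError, ValueError):
        # Fall back to case-insensitive string match for non-numeric answers.
        return str(predicted).strip().lower() == str(gold).strip().lower()

print(answers_match("3.1416", 3.14159))  # True within 1% tolerance
print(answers_match("True", "true"))     # True
```

Attempting the float conversion first lets one function handle both numeric and categorical (e.g. True/False, multiple-choice) answer types.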

SIXray

The SIXray dataset was constructed by the Pattern Recognition and Intelligent System Development Laboratory, University of Chinese Academy of Sciences. It contains 1,059,231 X-ray images collected from several subway stations. There are six common categories of prohibited items: gun, knife, wrench, pliers, scissors and hammer. It has three subsets, called SIXray10, SIXray100 and SIXray1000. Image-level annotations are provided by human security inspectors for the whole dataset. In addition, the images in the test set are annotated with a bounding box for each prohibited item to evaluate object-localization performance.

39 papers · 5 benchmarks · Images

TDIUC (Task Directed Image Understanding Challenge)

The Task Directed Image Understanding Challenge (TDIUC) dataset is a Visual Question Answering dataset consisting of 1.6M questions and 170K images sourced from MS COCO and the Visual Genome dataset. The image-question pairs are split into 12 categories, with 4 additional evaluation metrics that help assess models' robustness against answer imbalance and their ability to answer questions requiring higher reasoning capability. TDIUC divides the VQA paradigm into 12 task-directed question types, ranging from simpler tasks (e.g., object presence, color attribute) to more complex ones (e.g., counting, positional reasoning). The dataset also includes an "Absurd" question category, in which questions are irrelevant to the image contents, to help balance the dataset.

39 papers · 1 benchmark · Images
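TDIUC's per-question-type accuracies are aggregated with arithmetic and harmonic mean-per-type (MPT) scores, which keep rare question types from being swamped by dominant ones. A small sketch with illustrative (made-up) accuracy numbers:

```python
# Sketch of TDIUC-style aggregate metrics: arithmetic and harmonic means
# of per-question-type accuracies (MPT). The accuracy values below are
# illustrative, not results from any real model.
from statistics import mean, harmonic_mean

per_type_accuracy = {
    "object_presence": 0.95,
    "counting": 0.55,
    "positional_reasoning": 0.30,
}

arithmetic_mpt = mean(per_type_accuracy.values())
harmonic_mpt = harmonic_mean(per_type_accuracy.values())
print(round(arithmetic_mpt, 3), round(harmonic_mpt, 3))
```

The harmonic mean is dragged down by the weakest question type, so it rewards models that are uniformly competent rather than strong only on easy, frequent categories.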

BUFF (Bodies Under Flowing Fashion)

BUFF consists of 5 subjects, 3 male and 2 female wearing 2 clothing styles: a) t-shirt and long pants and b) a soccer outfit. They perform 3 different motions i) hips ii) tilt_twist_left iii) shoulders_mill.

39 papers · 6 benchmarks · 3D, Images

RoadTracer

RoadTracer is a dataset for extraction of road networks from aerial images. It consists of a large corpus of high-resolution satellite imagery and ground truth road network graphs covering the urban core of forty cities across six countries. For each city, the dataset covers a region of approximately 24 sq km around the city center. The satellite imagery is obtained from Google at 60 cm/pixel resolution, and the road network from OSM.

39 papers · 0 benchmarks · Images
Page 26 of 164