Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,275 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,275 dataset results

KVQA (Knowledge-aware VQA)

KVQA contains 183K manually verified question-answer pairs about more than 18K persons, spanning 24K images. The questions require multi-entity, multi-relation, and multi-hop reasoning over a knowledge graph (KG) to arrive at an answer. To enable visual named entity linking, the dataset also provides a support set of reference images of 69K persons harvested from Wikidata.

31 papers · 0 benchmarks · Images, Texts

IGLUE (Image-Grounded Language Understanding Evaluation)

The Image-Grounded Language Understanding Evaluation (IGLUE) benchmark brings together—by both aggregating pre-existing datasets and creating new ones—visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. The benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups.

31 papers · 0 benchmarks · Images, Texts

MMC4 (Multimodal C4)

Multimodal C4 (MMC4) augments the popular text-only C4 corpus with interleaved images. The corpus contains 103M documents with 585M images interleaved among 43B English tokens.

31 papers · 0 benchmarks · Images, Texts

DeepFashion2

DeepFashion2 is a versatile benchmark spanning four tasks: clothes detection, pose estimation, segmentation, and retrieval. It has 801K clothing items, each with rich annotations such as style, scale, viewpoint, occlusion, bounding box, dense landmarks, and masks. There are also 873K commercial-consumer clothes pairs.

30 papers · 0 benchmarks · Images

General-100

The General-100 dataset is a dataset for image super-resolution. It contains 100 uncompressed BMP-format images, whose sizes range from 710 × 704 (largest) to 131 × 112 (smallest).

30 papers · 0 benchmarks · Images

InteriorNet

InteriorNet is an RGB-D dataset for large-scale interior scene understanding and mapping. The dataset contains 20M images created by a rendering pipeline.

30 papers · 0 benchmarks · 3D, Images, RGB-D

ShoeV2

ShoeV2 is a dataset of 2,000 photos and 6,648 sketches of shoes, designed for fine-grained sketch-based image retrieval.

30 papers · 0 benchmarks · Images

BRACS (BReAst Carcinoma Subtyping)

The BReAst Carcinoma Subtyping (BRACS) dataset is a large cohort of annotated Hematoxylin & Eosin (H&E)-stained images that facilitates the characterization of breast lesions. BRACS contains 547 Whole-Slide Images (WSIs) and 4,539 Regions of Interest (ROIs) extracted from the WSIs. Each WSI, and its respective ROIs, is annotated by the consensus of three board-certified pathologists into different lesion categories. Specifically, BRACS includes three lesion types, i.e., benign, malignant, and atypical, which are further subtyped into seven categories.

30 papers · 0 benchmarks · Images, Medical

KITTI-C

KITTI-C, part of the Robo3D suite, is an evaluation benchmark for robust and reliable 3D object detection in autonomous driving. It probes the robustness of 3D detectors under out-of-distribution (OoD) scenarios, against corruptions that occur in real-world environments.

30 papers · 6 benchmarks · Images, Point cloud

Comic2k

Comic2k is a dataset for cross-domain object detection containing 2k comic images with image- and instance-level annotations. Image source: https://naoto0804.github.io/cross_domain_detection/

29 papers · 16 benchmarks · Images

NVGesture

The NVGesture dataset focuses on touchless driver control. It contains 1,532 dynamic gestures falling into 25 classes, with 1,050 samples for training and 482 for testing. The videos are recorded in three modalities (RGB, depth, and infrared).

29 papers · 2 benchmarks · Images, Videos

AVD (Active Vision Dataset)

AVD focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset includes 20,000+ RGB-D images and 50,000+ 2D bounding boxes of object instances densely captured in 9 unique scenes.

29 papers · 4 benchmarks · Images, RGB-D

MaRVL (Multicultural Reasoning over Vision and Language)

Multicultural Reasoning over Vision and Language (MaRVL) is a dataset based on an ImageNet-style hierarchy representative of many languages and cultures (Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish). The selection of both concepts and images is entirely driven by native speakers, who were then asked to write statements about pairs of images. The task is to determine whether each grounded statement is true or false.

29 papers · 2 benchmarks · Images, Texts

PartImageNet

PartImageNet is a large, high-quality dataset with part segmentation annotations. It consists of 158 classes from ImageNet with approximately 24,000 images. PartImageNet offers part-level annotations on a general set of classes with non-rigid, articulated objects, while being an order of magnitude larger than existing datasets. It can be used in multiple vision tasks, including but not limited to part discovery, semantic segmentation, and few-shot learning.

29 papers · 0 benchmarks · Images

OmniBenchmark

The Omni-Realm Benchmark (OmniBenchmark) is a diverse (21 semantic realm-wise datasets) and concise (realm-wise datasets have no overlapping concepts) benchmark for evaluating how well pre-trained models generalize across semantic super-concepts/realms, e.g., from mammals to aircraft.

29 papers · 1 benchmark · Images

Spring (Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo)

Spring is a large, high-resolution and high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes from the open-source Blender movie "Spring", it provides photo-realistic HD datasets with state-of-the-art visual effects and ground truth training data.

29 papers · 5 benchmarks · Images, Videos

G3D (Gaming 3D Dataset)

The Gaming 3D Dataset (G3D) focuses on real-time action recognition in a gaming scenario. It contains 10 subjects performing 20 gaming actions: “punch right”, “punch left”, “kick right”, “kick left”, “defend”, “golf swing”, “tennis swing forehand”, “tennis swing backhand”, “tennis serve”, “throw bowling ball”, “aim and fire gun”, “walk”, “run”, “jump”, “climb”, “crouch”, “steer a car”, “wave”, “flap” and “clap”.

28 papers · 0 benchmarks · 3D, Images, Videos

MannequinChallenge

The MannequinChallenge Dataset (MQC) provides in-the-wild videos of people holding static poses while a hand-held camera pans around the scene. The dataset consists of three splits for training, validation, and testing.

28 papers · 0 benchmarks · Images, Videos

CelebA-Spoof

CelebA-Spoof is a large-scale face anti-spoofing dataset.

28 papers · 0 benchmarks · Images

MedICaT

MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Figures and captions are extracted from open access articles in PubMed Central, and corresponding reference text is derived from S2ORC. The dataset consists of:

  • 217,060 figures from 131,410 open access papers
  • 7,507 subcaption and subfigure annotations for 2,069 compound figures
  • Inline references for ~25K figures in the ROCO dataset

28 papers · 0 benchmarks · Images, Medical
Page 31 of 164