P-DukeMTMC-reID is a modified version of the DukeMTMC-reID dataset. There are 12,927 images (665 identities) in the training set, 2,163 images (634 identities) for querying, and 9,053 images in the gallery set.
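As a rough illustration of how such a re-ID split can be indexed, the sketch below groups images by identity, assuming DukeMTMC-style filenames of the form `0005_c2_f0046985.jpg` (person ID, camera, frame); the directory layout and naming are assumptions, not the dataset's documented format.

```python
from collections import defaultdict
from pathlib import Path

def index_split(split_dir: str) -> dict:
    """Group a split's images by person identity, assuming
    DukeMTMC-style names such as 0005_c2_f0046985.jpg."""
    by_identity = defaultdict(list)
    for img in Path(split_dir).glob("*.jpg"):
        person_id = int(img.stem.split("_")[0])  # leading field = identity
        by_identity[person_id].append(img)
    return by_identity

# Sanity-check against the published statistics (paths are hypothetical):
# train = index_split("P-DukeMTMC-reID/train")
# assert len(train) == 665                              # identities
# assert sum(len(v) for v in train.values()) == 12927   # images
```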
4D-OR includes a total of 6,734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted on the ceiling of the OR at one frame per second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses, and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
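To make the per-scene contents concrete, here is a minimal sketch of what one annotated 4D-OR scene bundles together; all field names and types are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ORScene:
    """One 1 fps scene as described above (illustrative only)."""
    rgb: list[Any]            # six synchronized RGB images, one per Kinect
    depth: list[Any]          # six aligned depth maps
    point_cloud: Any          # fused point cloud of the entire scene
    human_poses: list[Any]    # automatically annotated 6D human poses
    object_boxes: list[Any]   # 3D bounding boxes for OR objects
    scene_graph: list[tuple]  # SSG triples, e.g. ("head surgeon", "cutting", "patient")
    roles: dict[int, str] = field(default_factory=dict)  # human id -> clinical role
```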
Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most of the images are sourced from videos, and both the descriptions and the retrievals are produced by humans.
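A minimal sketch of how this retrieval task can be scored automatically, assuming precomputed, L2-normalized text and image embeddings (the embedding model and any official metric are not specified here):

```python
import numpy as np

def retrieval_accuracy(text_embs, candidate_sets, targets):
    """For each description, rank its 10 candidate images by cosine
    similarity and count a hit when the top-ranked image is the target."""
    hits = 0
    for text, candidates, target in zip(text_embs, candidate_sets, targets):
        scores = candidates @ text        # (10,) similarities for unit vectors
        hits += int(np.argmax(scores) == target)
    return hits / len(targets)
```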
Four pathologists from Longhua Hospital, Shanghai University of Traditional Chinese Medicine, provided 600 gastric cancer pathology images of size 2048$\times$2048 pixels. These images were scanned using a NewUsbCamera and digitized at $\times$20 magnification, and tissue-level labels were given by the four experienced pathologists. Based on that, five biomedical researchers from Northeastern University cropped them into 245,196 sub-sized gastric cancer pathology images, and two experienced pathologists from Liaoning Cancer Hospital and Institute performed the calibration. The 245,196 images were split into three sizes (160$\times$160, 120$\times$120, 80$\times$80) across two categories: abnormal and normal.
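The exact cropping scheme behind the 245,196 sub-images is not specified; as one plausible reading, the sketch below tiles a 2048$\times$2048 image into a non-overlapping grid of patches:

```python
import numpy as np

def crop_patches(image: np.ndarray, size: int) -> list:
    """Tile an image into non-overlapping size x size patches."""
    h, w = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

# Under this scheme, one 2048x2048 image yields (2048 // 160) ** 2 = 144
# patches at 160x160, 289 at 120x120, and 625 at 80x80.
```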
UnrealEgo is a dataset that provides in-the-wild stereo images with a large variety of motions for 3D human pose estimation. The data consists of stereo fisheye images and depth maps at a resolution of 1024×1024 pixels each, captured at 25 frames per second, for a total of 450k stereo views (900k images). Metadata is provided for each frame, including 3D joint positions, camera positions, and 2D coordinates of the joints reprojected into the fisheye views.
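For a concrete picture of the per-frame contents, here is an illustrative record type; the key names and array shapes are assumptions based on the description, not the released file format:

```python
import numpy as np
from typing import TypedDict

class EgoFrame(TypedDict):
    """One UnrealEgo frame as described above (illustrative only)."""
    left_rgb: np.ndarray      # 1024x1024 fisheye image, left stereo view
    right_rgb: np.ndarray     # 1024x1024 fisheye image, right stereo view
    left_depth: np.ndarray    # matching depth map, left view
    right_depth: np.ndarray   # matching depth map, right view
    joints_3d: np.ndarray     # (J, 3) 3D joint positions
    camera_poses: np.ndarray  # camera positions for both views
    joints_2d: np.ndarray     # (2, J, 2) joints reprojected into each fisheye view
```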
ImageNet-X is a set of human annotations pinpointing failure types for the popular ImageNet dataset. ImageNet-X labels distinguishing object factors such as pose, size, color, lighting, occlusion, co-occurrence, etc. for each image in the validation set and for a random subset of 12,000 training samples. It is designed to study the types of mistakes a model makes as a function of its architecture, learning paradigm, and training procedure.
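One way to use such annotations, sketched below under assumed column names (not ImageNet-X's actual schema), is to compare a model's error rate on images exhibiting each factor against its overall error rate:

```python
import pandas as pd

def error_ratio_by_factor(factors_df: pd.DataFrame, correct: pd.Series) -> pd.Series:
    """Per-factor error rate divided by overall error rate; ratios above
    1.0 flag factors (pose, occlusion, ...) the model struggles with.
    factors_df holds one binary column per factor; correct holds 0/1
    per-image prediction outcomes on the same index."""
    overall_error = max(1.0 - correct.mean(), 1e-12)  # guard against zero errors
    ratios = {}
    for factor in factors_df.columns:
        mask = factors_df[factor] == 1
        ratios[factor] = (1.0 - correct[mask].mean()) / overall_error
    return pd.Series(ratios).sort_values(ascending=False)
```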
DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, giving it several key features tailored to these tasks.
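To illustrate what Key Information Localization and Extraction asks for, here is a hypothetical annotation record; the field names, values, and normalization convention are assumptions, not DocILE's published schema:

```python
# One hypothetical KILE-style target: a field type grounded to a page
# region, plus the transcribed value to extract.
kile_annotation = {
    "page": 0,
    "fieldtype": "invoice_id",          # assumed label name
    "bbox": [0.12, 0.08, 0.34, 0.11],   # normalized (left, top, right, bottom)
    "text": "INV-2021-0042",            # made-up value for illustration
}
```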
OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attributes of traffic elements and the topology relationships among the detected objects.
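As a rough illustration of the outputs this implies, the snippet below represents detected centerlines, attributed traffic elements, and their topology as plain adjacency pairs; the structure and values are invented for illustration and are not the benchmark's actual format:

```python
# Illustrative scene-structure prediction (all values made up).
scene = {
    "lane_centerlines": {                # id -> ordered 3D points
        0: [(5.0, -1.8, 0.0), (25.0, -1.7, 0.0)],
        1: [(25.0, -1.7, 0.0), (45.0, -1.5, 0.0)],
    },
    "traffic_elements": {                # id -> (class, attribute)
        7: ("traffic_light", "red"),
    },
    "topology_lane_lane": [(0, 1)],      # centerline 0 flows into centerline 1
    "topology_lane_element": [(0, 7)],   # traffic light 7 governs centerline 0
}
```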
Unlike previous datasets that focus on the diversity of defect categories (such as MVTec AD and VisA), AeBAD is centered on the diversity of domains within a single data category.
MultI-Modal In-Context Instruction Tuning (MIMIC-IT) is a dataset for instruction tuning of multi-modal models, motivated by the upstream interleaved-format pretraining data of the Flamingo model. Each data sample consists of a queried image-instruction-answer triplet, with the instruction and answer tailored to the image, together with context. The context contains a series of image-instruction-answer triplets that contextually correlate with the queried triplet, emulating the relationship between the context and the queried image-text pair found in the MMC4 dataset.
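An illustrative (unofficial) sketch of one such sample, with placeholder file names and strings:

```python
sample = {
    "query": {   # the queried image-instruction-answer triplet
        "image": "query.jpg",
        "instruction": "What is the person in the image doing?",
        "answer": "They are repairing a bicycle tire.",
    },
    "context": [  # triplets that contextually correlate with the query
        {"image": "ctx_0.jpg", "instruction": "...", "answer": "..."},
        {"image": "ctx_1.jpg", "instruction": "...", "answer": "..."},
    ],
}
```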
Casia V1 is a dataset for forgery classification. Casia V1+ is a modification of the Casia V1 dataset proposed by Chen et al. that replaces authentic images that also exist in Casia V2 with images from the COREL dataset, in order to avoid data contamination.
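A minimal sketch of the replacement step described above, matching images by filename for simplicity (the actual matching criterion used by Chen et al. may differ):

```python
def build_casia_v1_plus(v1_authentic, v2_images, corel_pool):
    """Swap any authentic V1 image that also appears in V2 for a COREL
    image, so models trained on V2 are not evaluated on seen data.
    Arguments are sequences of pathlib.Path objects."""
    v2_names = {img.name for img in v2_images}
    replacements = iter(corel_pool)
    return [img if img.name not in v2_names else next(replacements)
            for img in v1_authentic]
```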
Multi-Modal Reading (MMR) Benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, with carefully designed evaluation metrics.
This is the home of a collaborative data collection effort by U. Chicago and TTI-Chicago researchers. To our knowledge, this is the first collection of American Sign Language fingerspelling data "in the wild," that is, in naturally occurring (online) video. The collection consists of two dataset releases, ChicagoFSWild and ChicagoFSWild+.
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
SK-LARGE is a benchmark dataset for object skeleton detection, built on the MS COCO dataset. It contains 1,491 images: 746 for training and 745 for testing.
Office-Caltech-10 is a standard benchmark for domain adaptation, consisting of the Office 10 and Caltech 10 datasets. It contains the 10 categories shared between the Office dataset and the Caltech256 dataset. SURF bag-of-words (BoW) histogram features, vector-quantized to 800 dimensions, are also available for this dataset.
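For readers unfamiliar with this feature type, the sketch below vector-quantizes local SURF descriptors against an 800-word codebook into a normalized bag-of-words histogram; SURF extraction and codebook construction (e.g., k-means) are omitted, and this is not the exact pipeline used to produce the released features.

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """descriptors: (N, D) local SURF descriptors for one image;
    codebook: (800, D) visual words. Returns an 800-dim histogram."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)   # L1-normalize
```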
The USF Human ID Gait Challenge Dataset is a dataset of videos for gait recognition. It contains videos from 122 subjects, captured under up to 32 possible combinations of variation factors.
CLEVR-Dialog is a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, the authors construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs.
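The stated total follows directly from the breakdown, as a quick check confirms:

```python
images, dialogs_per_image, rounds_per_dialog = 85_000, 5, 10
qa_pairs = images * dialogs_per_image * rounds_per_dialog
assert qa_pairs == 4_250_000   # 4.25M question-answer pairs
```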
This dataset was created with images provided by the United States National Archives and FamilySearch.