P-DukeMTMC-reID is a modified version of the DukeMTMC-reID dataset. There are 12,927 images (665 identities) in the training set, 2,163 images (634 identities) for querying, and 9,053 images in the gallery set.
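As a rough illustration of how such a re-ID split can be indexed, the sketch below groups images by identity, assuming DukeMTMC-style filenames of the form `0005_c2_f0046985.jpg` (person ID, camera, frame); the directory layout and naming are assumptions, not the dataset's documented format.

```python
from collections import defaultdict
from pathlib import Path

def index_split(split_dir: str) -> dict:
    """Group a split's images by person identity, assuming
    DukeMTMC-style names such as 0005_c2_f0046985.jpg."""
    by_identity = defaultdict(list)
    for img in Path(split_dir).glob("*.jpg"):
        person_id = int(img.stem.split("_")[0])  # leading field = identity
        by_identity[person_id].append(img)
    return by_identity

# Sanity-check against the published statistics (paths are hypothetical):
# train = index_split("P-DukeMTMC-reID/train")
# assert len(train) == 665                              # identities
# assert sum(len(v) for v in train.values()) == 12927   # images
```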
4D-OR includes a total of 6,734 scenes, recorded by six calibrated RGB-D Kinect sensors mounted on the ceiling of the OR at one frame per second, providing synchronized RGB and depth images. We provide fused point cloud sequences of entire scenes, automatically annotated human 6D poses, and 3D bounding boxes for OR objects. Furthermore, we provide SSG annotations for each step of the surgery together with the clinical roles of all the humans in the scenes, e.g., nurse, head surgeon, anesthesiologist.
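To make the per-scene contents concrete, here is a minimal sketch of what one annotated 4D-OR scene bundles together; all field names and types are assumptions for illustration, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ORScene:
    """One 1 fps scene as described above (illustrative only)."""
    rgb: list[Any]            # six synchronized RGB images, one per Kinect
    depth: list[Any]          # six aligned depth maps
    point_cloud: Any          # fused point cloud of the entire scene
    human_poses: list[Any]    # automatically annotated 6D human poses
    object_boxes: list[Any]   # 3D bounding boxes for OR objects
    scene_graph: list[tuple]  # SSG triples, e.g. ("head surgeon", "cutting", "patient")
    roles: dict[int, str] = field(default_factory=dict)  # human id -> clinical role
```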
Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most of the images are sourced from videos, and both the descriptions and the retrievals are produced by humans.
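A minimal sketch of how this retrieval task can be scored automatically, assuming precomputed, L2-normalized text and image embeddings (the embedding model and any official metric are not specified here):

```python
import numpy as np

def retrieval_accuracy(text_embs, candidate_sets, targets):
    """For each description, rank its 10 candidate images by cosine
    similarity and count a hit when the top-ranked image is the target."""
    hits = 0
    for text, candidates, target in zip(text_embs, candidate_sets, targets):
        scores = candidates @ text        # (10,) similarities for unit vectors
        hits += int(np.argmax(scores) == target)
    return hits / len(targets)
```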
Four pathologists from Longhua Hospital, Shanghai University of Traditional Chinese Medicine, provided 600 gastric cancer pathology images of size 2048$\times$2048 pixels. These images were scanned using a NewUsbCamera and digitized at $\times$20 magnification, and tissue-level labels were given by the four experienced pathologists. Based on that, five biomedical researchers from Northeastern University cropped them into 245,196 sub-sized gastric cancer pathology images, and two experienced pathologists from Liaoning Cancer Hospital and Institute performed the calibration. The 245,196 images were split into three sizes (160$\times$160, 120$\times$120, 80$\times$80) across two categories: abnormal and normal.
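The exact cropping scheme behind the 245,196 sub-images is not specified; as one plausible reading, the sketch below tiles a 2048$\times$2048 image into a non-overlapping grid of patches:

```python
import numpy as np

def crop_patches(image: np.ndarray, size: int) -> list:
    """Tile an image into non-overlapping size x size patches."""
    h, w = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]

# Under this scheme, one 2048x2048 image yields (2048 // 160) ** 2 = 144
# patches at 160x160, 289 at 120x120, and 625 at 80x80.
```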
UnrealEgo is a dataset that provides in-the-wild stereo images with a large variety of motions for 3D human pose estimation. The data consists of stereo fisheye images and depth maps at a resolution of 1024×1024 pixels each, captured at 25 frames per second, for a total of 450k stereo views (900k images). Metadata is provided for each frame, including 3D joint positions, camera positions, and 2D coordinates of the joints reprojected into the fisheye views.
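For a concrete picture of the per-frame contents, here is an illustrative record type; the key names and array shapes are assumptions based on the description, not the released file format:

```python
import numpy as np
from typing import TypedDict

class EgoFrame(TypedDict):
    """One UnrealEgo frame as described above (illustrative only)."""
    left_rgb: np.ndarray      # 1024x1024 fisheye image, left stereo view
    right_rgb: np.ndarray     # 1024x1024 fisheye image, right stereo view
    left_depth: np.ndarray    # matching depth map, left view
    right_depth: np.ndarray   # matching depth map, right view
    joints_3d: np.ndarray     # (J, 3) 3D joint positions
    camera_poses: np.ndarray  # camera positions for both views
    joints_2d: np.ndarray     # (2, J, 2) joints reprojected into each fisheye view
```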
ImageNet-X is a set of human annotations pinpointing failure types for the popular ImageNet dataset. ImageNet-X labels distinguishing object factors such as pose, size, color, lighting, occlusion, co-occurrence, etc. for each image in the validation set and for a random subset of 12,000 training samples. It is designed to study the types of mistakes a model makes as a function of its architecture, learning paradigm, and training procedure.
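One way to use such annotations, sketched below under assumed column names (not ImageNet-X's actual schema), is to compare a model's error rate on images exhibiting each factor against its overall error rate:

```python
import pandas as pd

def error_ratio_by_factor(factors_df: pd.DataFrame, correct: pd.Series) -> pd.Series:
    """Per-factor error rate divided by overall error rate; ratios above
    1.0 flag factors (pose, occlusion, ...) the model struggles with.
    factors_df holds one binary column per factor; correct holds 0/1
    per-image prediction outcomes on the same index."""
    overall_error = max(1.0 - correct.mean(), 1e-12)  # guard against zero errors
    ratios = {}
    for factor in factors_df.columns:
        mask = factors_df[factor] == 1
        ratios[factor] = (1.0 - correct[mask].mean()) / overall_error
    return pd.Series(ratios).sort_values(ascending=False)
```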
DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction (KILE) and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, giving it several key features tailored to these tasks.
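To illustrate what Key Information Localization and Extraction asks for, here is a hypothetical annotation record; the field names, values, and normalization convention are assumptions, not DocILE's published schema:

```python
# One hypothetical KILE-style target: a field type grounded to a page
# region, plus the transcribed value to extract.
kile_annotation = {
    "page": 0,
    "fieldtype": "invoice_id",          # assumed label name
    "bbox": [0.12, 0.08, 0.34, 0.11],   # normalized (left, top, right, bottom)
    "text": "INV-2021-0042",            # made-up value for illustration
}
```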
OpenLane-V2 is the world's first perception and reasoning benchmark for scene structure in autonomous driving. The primary task of the dataset is scene structure perception and reasoning, which requires the model to recognize the dynamic drivable states of lanes in the surrounding environment. The challenge of this dataset includes not only detecting lane centerlines and traffic elements but also recognizing the attributes of traffic elements and the topology relationships among the detected objects.
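As a rough illustration of the outputs this implies, the snippet below represents detected centerlines, attributed traffic elements, and their topology as plain adjacency pairs; the structure and values are invented for illustration and are not the benchmark's actual format:

```python
# Illustrative scene-structure prediction (all values made up).
scene = {
    "lane_centerlines": {                # id -> ordered 3D points
        0: [(5.0, -1.8, 0.0), (25.0, -1.7, 0.0)],
        1: [(25.0, -1.7, 0.0), (45.0, -1.5, 0.0)],
    },
    "traffic_elements": {                # id -> (class, attribute)
        7: ("traffic_light", "red"),
    },
    "topology_lane_lane": [(0, 1)],      # centerline 0 flows into centerline 1
    "topology_lane_element": [(0, 7)],   # traffic light 7 governs centerline 0
}
```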
Unlike previous datasets that focus on the diversity of defect categories (such as MVTec AD and VisA), AeBAD is centered on the diversity of domains within a single data category.
MultI-Modal In-Context Instruction Tuning (MIMIC-IT) is a dataset for instruction tuning of multi-modal models, motivated by the upstream interleaved-format pretraining data of the Flamingo model. Each data sample consists of a queried image-instruction-answer triplet, with the instruction and answer tailored to the image, together with context. The context contains a series of image-instruction-answer triplets that contextually correlate with the queried triplet, emulating the relationship between the context and the queried image-text pair found in the MMC4 dataset.
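An illustrative (unofficial) sketch of one such sample, with placeholder file names and strings:

```python
sample = {
    "query": {   # the queried image-instruction-answer triplet
        "image": "query.jpg",
        "instruction": "What is the person in the image doing?",
        "answer": "They are repairing a bicycle tire.",
    },
    "context": [  # triplets that contextually correlate with the query
        {"image": "ctx_0.jpg", "instruction": "...", "answer": "..."},
        {"image": "ctx_1.jpg", "instruction": "...", "answer": "..."},
    ],
}
```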
Casia V1 is a dataset for forgery classification. Casia V1+ is a modification of the Casia V1 dataset proposed by Chen et al. that replaces authentic images that also exist in Casia V2 with images from the COREL dataset, in order to avoid data contamination.
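A minimal sketch of the replacement step described above, matching images by filename for simplicity (the actual matching criterion used by Chen et al. may differ):

```python
def build_casia_v1_plus(v1_authentic, v2_images, corel_pool):
    """Swap any authentic V1 image that also appears in V2 for a COREL
    image, so models trained on V2 are not evaluated on seen data.
    Arguments are sequences of pathlib.Path objects."""
    v2_names = {img.name for img in v2_images}
    replacements = iter(corel_pool)
    return [img if img.name not in v2_names else next(replacements)
            for img in v1_authentic]
```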
Multi-Modal Reading (MMR) Benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, with carefully designed evaluation metrics.
This is the home of a collaborative data collection effort by U. Chicago and TTI-Chicago researchers. To our knowledge, this is the first collection of American Sign Language fingerspelling data "in the wild," that is, in naturally occurring (online) video. The collection consists of two dataset releases, ChicagoFSWild and ChicagoFSWild+.
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
SK-LARGE is a benchmark dataset for object skeleton detection, built on the MS COCO dataset. It contains 1,491 images: 746 for training and 745 for testing.
Office-Caltech-10 is a standard benchmark for domain adaptation, consisting of the Office 10 and Caltech 10 datasets. It contains the 10 categories shared between the Office dataset and the Caltech256 dataset. SURF bag-of-words (BoW) histogram features, vector-quantized to 800 dimensions, are also available for this dataset.
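For readers unfamiliar with this feature type, the sketch below vector-quantizes local SURF descriptors against an 800-word codebook into a normalized bag-of-words histogram; SURF extraction and codebook construction (e.g., k-means) are omitted, and this is not the exact pipeline used to produce the released features.

```python
import numpy as np

def bow_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """descriptors: (N, D) local SURF descriptors for one image;
    codebook: (800, D) visual words. Returns an 800-dim histogram."""
    # Assign each descriptor to its nearest visual word.
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)   # L1-normalize
```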
The USF Human ID Gait Challenge Dataset is a dataset of videos for gait recognition. It contains videos from 122 subjects, captured under up to 32 possible combinations of variation factors.
CLEVR-Dialog is a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, the authors construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs.
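The stated total follows directly from the breakdown, as a quick check confirms:

```python
images, dialogs_per_image, rounds_per_dialog = 85_000, 5, 10
qa_pairs = images * dialogs_per_image * rounds_per_dialog
assert qa_pairs == 4_250_000   # 4.25M question-answer pairs
```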
This dataset was created with images provided by the United States National Archives and FamilySearch.