3,275 machine learning datasets
The CLEVR-Hans dataset is a novel confounded visual scene dataset that captures complex compositions of different objects. It consists of CLEVR images divided into several classes.
The Casual Conversations dataset is designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones, and ambient lighting conditions.
NewsCLIPpings is a dataset for detecting mismatched images and captions. Unlike previous misinformation datasets, in NewsCLIPpings both the images and captions are unmanipulated, but some pairs are mismatched.
The SketchyCOCO dataset consists of two parts:
e-SNLI-VE is a large vision-language (VL) dataset with natural language explanations (NLEs), containing over 430k instances whose explanations rely on the image content. It was built by merging the explanations from e-SNLI with the image-sentence pairs from SNLI-VE.
A large-scale database for Text-to-Image Person Re-identification, i.e., Text-based Person Retrieval.
An evaluation protocol for face verification focusing on large intra-pair image quality differences.
Wukong is a large-scale Chinese cross-modal dataset for benchmarking multi-modal pre-training methods, intended to facilitate vision-language pre-training (VLP). It contains 100 million Chinese image-text pairs collected from the web. The base query list is filtered according to the frequency of Chinese words and phrases.
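The frequency-based query filtering described above could be sketched as follows; the function name, threshold, and inputs are illustrative assumptions, not the dataset's actual collection pipeline.

```python
from collections import Counter

def filter_query_list(queries, corpus_tokens, min_freq=5):
    """Keep only queries that occur at least min_freq times in the corpus.

    min_freq is a hypothetical threshold for illustration; the real
    pipeline's cutoff is not stated in the description above.
    """
    counts = Counter(corpus_tokens)
    return [q for q in queries if counts[q] >= min_freq]

# Toy usage: only the frequent word survives the filter.
frequent = filter_query_list(
    ["猫", "狗", "量子"],
    ["猫"] * 10 + ["狗"] * 2,
    min_freq=5,
)
```

Note that `Counter` returns 0 for unseen tokens, so queries absent from the corpus are dropped automatically.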
The SKM-TEA dataset pairs raw quantitative knee MRI (qMRI) data, image data, and dense labels of tissues and pathology for end-to-end exploration and evaluation of the MR imaging pipeline. This 1.6TB dataset consists of raw-data measurements of ~25,000 slices (155 patients) of anonymized patient knee MRI scans, the corresponding scanner-generated DICOM images, manual segmentations of four tissues, and bounding box annotations for sixteen clinically relevant pathologies.
CholecT45 is a subset of CholecT50 consisting of 45 videos from the Cholec80 dataset; it is the first public release of part of the CholecT50 dataset. CholecT50 is a dataset of 50 endoscopic videos of laparoscopic cholecystectomy surgery, introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with 100 triplet classes in the form <instrument, verb, target>.
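A triplet class in the <instrument, verb, target> form could be handled with a small parser like this; the exact string format of the annotation files is an assumption based on the description above, not the dataset's documented schema.

```python
def parse_triplet(label: str) -> tuple:
    """Split a '<instrument, verb, target>' label into its three parts."""
    return tuple(part.strip() for part in label.strip("<>").split(","))

# Hypothetical example label in the stated format:
triplet = parse_triplet("<grasper, retract, gallbladder>")
# → ('grasper', 'retract', 'gallbladder')
```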
The General Robust Image Task (GRIT) Benchmark is an evaluation-only benchmark for measuring the performance and robustness of vision systems across multiple image prediction tasks, concepts, and data sources. GRIT aims to encourage the research community to pursue the following research directions:
VideoLQ consists of videos downloaded from various video hosting sites such as Flickr and YouTube, with a Creative Commons license.
GSV-Cities is a large-scale dataset for training deep neural networks for the task of Visual Place Recognition.
V-D4RL provides pixel-based analogues of the popular D4RL benchmarking tasks, derived from the dm_control suite, along with natural extensions of two state-of-the-art online pixel-based continuous control algorithms, DrQ-v2 and DreamerV2, to the offline setting.
The GeneCIS benchmark is designed to measure models' ability to adapt to a range of similarity conditions; it is intended for zero-shot evaluation only.
This dataset, based on Flickr30K, was introduced in Learning with Noisy Correspondence for Cross-modal Matching. Noisy correspondence is simulated by randomly shuffling the captions of a specified percentage of training images; this percentage is called the noise ratio.
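The caption-shuffling procedure can be simulated as below; the function and argument names are assumptions for illustration, not the paper's released code.

```python
import random

def inject_noise(captions, noise_ratio, seed=0):
    """Randomly permute the captions of a noise_ratio fraction of images,
    producing mismatched (noisy) image-caption pairs."""
    rng = random.Random(seed)
    n = len(captions)
    selected = rng.sample(range(n), int(n * noise_ratio))
    permuted = selected[:]
    rng.shuffle(permuted)
    noisy = list(captions)
    for i, j in zip(selected, permuted):
        noisy[i] = captions[j]  # caption j now (mis)matches image i
    return noisy
```

With a noise ratio of 0 the pairs are untouched; at higher ratios a growing fraction of images carry a caption drawn from a different image.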
The realistic and dynamic scenes (REDS) dataset was proposed in the NTIRE19 challenge. It is composed of 300 video sequences at a resolution of 720×1,280, each with 100 frames; the training, validation, and test sets contain 240, 30, and 30 videos, respectively.
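The 240/30/30 split can be reproduced over sequence indices as a quick sanity check; the contiguous index ordering here is an assumption, and the official split files should be preferred in practice.

```python
# 300 REDS sequences, split 240 / 30 / 30 (ordering assumed contiguous).
seq_ids = [f"{i:03d}" for i in range(300)]
train, val, test = seq_ids[:240], seq_ids[240:270], seq_ids[270:]
```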
The dataset is constructed from images of defective production items that were provided and annotated by Kolektor Group d.o.o. The images were captured in a controlled industrial environment in a real-world case.
QED is a linguistically principled framework for explanations in question answering. Given a question and a passage, QED represents an explanation of the answer as a combination of discrete, human-interpretable steps: (1) sentence selection: identification of a sentence implying an answer to the question; (2) referential equality: identification of noun phrases in the question and the answer sentence that refer to the same thing; (3) predicate entailment: confirmation that the predicate in the sentence entails the predicate in the question once referential equalities are abstracted away. The QED dataset is an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset.
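A single QED-style explanation might be represented roughly as follows; the field names are a hypothetical sketch of the three steps, not the dataset's actual annotation schema.

```python
# Hypothetical structure mirroring the three QED steps; the field names
# are assumptions, not the released dataset's JSON format.
qed_explanation = {
    "question": "who wrote the novel moby dick",
    # (1) sentence selection: the passage sentence implying the answer
    "selected_sentence": (
        "Moby-Dick is an 1851 novel by American writer Herman Melville."
    ),
    # (2) referential equality: phrases in the question and the sentence
    #     that refer to the same entity
    "referential_equalities": [
        {"question_phrase": "the novel moby dick",
         "sentence_phrase": "Moby-Dick"},
    ],
    # (3) predicate entailment holds once the equated phrases are
    #     abstracted away, yielding the answer
    "answer": "Herman Melville",
}
```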
Syn2Real is a synthetic-to-real visual domain adaptation benchmark meant to encourage further development of robust domain transfer methods. The goal is to train a model on a synthetic "source" domain and then update it so that its performance improves on a real "target" domain, without using any target annotations. It includes three tasks, illustrated in the figures above: the more traditional closed-set classification task with a known set of categories; the less studied open-set classification task with unknown object categories in the target domain; and the object detection task, which involves localizing instances of objects by predicting their bounding boxes and corresponding class labels.