We manually labelled 3,359 images from the RWTH-PHOENIX-Weather 2014 Development set.
PDFVQA: A New Dataset for Real-World VQA on PDF Documents
Bongard-OpenWorld is a new benchmark for evaluating real-world few-shot reasoning for machine vision. We hope it can help us better understand the limitations of current visual intelligence and facilitate future research on visual agents with stronger few-shot visual reasoning capabilities.
We introduce FUNSD-r and CORD-r in Token Path Prediction: revised VrD-NER datasets that reflect the real-world scenarios of NER on scanned VrDs.
Dataset of restaurant reviews from TripAdvisor that includes images and texts uploaded in reviews by users. Reviews in six different cities are included: Gijón (Spain), Barcelona (Spain), Madrid (Spain), New York City (USA), Paris (France) and London (United Kingdom). In the original publication, the following task is proposed: Can we explain, using the existing image or text from a different user, why a given restaurant was recommended to a certain user?
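To make the structure concrete, here is a minimal sketch of one review record; the field names and example values are illustrative assumptions for exposition, not the dataset's actual schema.

```python
# Hypothetical record layout for one TripAdvisor review (field names are
# assumptions, not the dataset's actual schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Review:
    user_id: str          # anonymized author of the review
    restaurant_id: str    # reviewed restaurant
    city: str             # one of the six covered cities
    text: str             # review body uploaded by the user
    image_paths: List[str] = field(default_factory=list)  # images attached to the review

example = Review(
    user_id="u_0042",
    restaurant_id="r_0815",
    city="Madrid",
    text="Great tapas and friendly staff.",
    image_paths=["images/r_0815/u_0042_0.jpg"],
)
```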
A Large Dataset for Remote Sensing Image Change Captioning. The LEVIR-CC dataset contains 10,077 pairs of bi-temporal remote sensing images and 50,385 sentences (five per image pair) describing the differences between images.
OpenCHAIR is a benchmark for evaluating open-vocabulary hallucinations in image captioning models. By leveraging the linguistic knowledge of LLMs, OpenCHAIR performs fine-grained hallucination measurements and significantly increases the number of objects that can be measured, especially compared to the existing CHAIR benchmark. To exploit the LLM's full potential, we construct a new dataset by generating 5000 captions with highly diverse objects and letting a powerful text-to-image model generate images for them. We find that we not only increase the benchmark's diversity but also improve evaluation accuracy relative to CHAIR.
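The sketch below illustrates the kind of open-vocabulary hallucination measurement described above, assuming a toy string-matching object extractor in place of the LLM-based extraction the benchmark actually uses.

```python
# Minimal sketch of an open-vocabulary hallucination check in the spirit of
# OpenCHAIR. The extract_objects helper and its exact-matching rule are
# simplifying assumptions; the benchmark itself relies on an LLM.
from typing import Set

def extract_objects(caption: str, vocabulary: Set[str]) -> Set[str]:
    """Naive object extraction: intersect caption tokens with a vocabulary."""
    tokens = {t.strip(".,").lower() for t in caption.split()}
    return tokens & vocabulary

def hallucination_rate(pred_caption: str, gt_objects: Set[str],
                       vocabulary: Set[str]) -> float:
    """Fraction of mentioned objects that do not appear in the image."""
    mentioned = extract_objects(pred_caption, vocabulary)
    if not mentioned:
        return 0.0
    hallucinated = mentioned - gt_objects
    return len(hallucinated) / len(mentioned)

vocab = {"dog", "frisbee", "cat", "beach"}
print(hallucination_rate("A dog catches a frisbee while a cat watches.",
                         gt_objects={"dog", "frisbee"},
                         vocabulary=vocab))  # 1 of 3 mentioned objects hallucinated
```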
We collect a total of 13,380 images captured on 2,210 different scenes. We use different objects and backgrounds to build our dataset.
CelebA-Spoof is a large-scale face anti-spoofing dataset recently introduced in [53]. The dataset contains 625,537 images of 10,177 celebrities captured under different spoof mediums, environments and illumination conditions. The original dataset proposes three different evaluation protocols. For our experimentation, we focus on the most general "intra" protocol, in which different spoof types, environments and illumination conditions are used for both training and testing.
SiW (Spoofing in the Wild) is a face anti-spoofing dataset recently introduced in [29] where images are extracted from short videos captured at high resolution and 30 frames per second. In total, 4,478 videos are collected from 165 subjects including variations in spoof type, recording device, illumination condition, pose and facial expression.
We design an all-day semantic segmentation benchmark, all-day CityScapes. It is the first semantic segmentation benchmark that contains samples from all-day scenarios, i.e., from dawn to night. Our dataset will be made publicly available at [https://isis-data.science.uva.nl/cv/1ADcityscape.zip].
We propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in 12 optical illusion categories. Each instance consists of an image with an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. The task consists of 1000 instances following a similar format to the first task. We refer to this task as IllusionVQA-Soft-Localization.
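For illustration, a hypothetical Comprehension instance might look like the following; the key names and values are assumptions for exposition, not the released format.

```python
# Illustrative shape of one IllusionVQA-Comprehension instance (keys are
# assumptions; consult the released dataset for the actual fields).
instance = {
    "image": "illusions/ebbinghaus_012.png",  # optical-illusion image
    "category": "size illusion",              # one of the 12 categories
    "question": "Which orange circle is larger?",
    "options": ["Left", "Right", "Both are the same size"],  # 3 to 6 options
    "answer": "Both are the same size",       # exactly one correct option
}
```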
DAD-3DHeads dataset consists of 44,898 images collected from various sources (37,840 in the training set, 4,312 in the validation set, and 2,746 in the test set).
The dataset, comprising 1204 meticulously curated images, serves as a comprehensive resource for advancing real-time mosquito detection models. The dataset is strategically divided into training, validation, and test sets, accounting for 87%, 8%, and 5% of the images, respectively. A rigorous preprocessing phase involves auto-orientation and resizing to standardize dimensions at 640x640 pixels. To ensure dataset integrity, the filter null criterion mandates that all images must contain annotations. Augmentations, including flips, rotations, crops, and grayscale applications, enhance the dataset's diversity, fostering robust model training. With a focus on quality and variety, this dataset provides a solid foundation for evaluating and enhancing real-time mosquito detection models.
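As a rough sketch, an equivalent preprocessing and augmentation pipeline could be written with torchvision; the library choice and all parameter values here are assumptions, since the dataset card does not specify an implementation.

```python
# Sketch of a preprocessing/augmentation pipeline matching the description
# above (resize to 640x640; flips, rotations, crops, grayscale). Parameter
# values are illustrative assumptions.
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((640, 640)),  # standardize dimensions to 640x640
])

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                     # flips
    transforms.RandomRotation(degrees=15),                 # rotations
    transforms.RandomResizedCrop(640, scale=(0.8, 1.0)),   # crops
    transforms.RandomGrayscale(p=0.1),                     # occasional grayscale
])
```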
MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.
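To illustrate how such problems can be consumed, the sketch below shows one hypothetical problem record and a minimal I/O-based pass/fail check; the field names and the stdin/stdout test format are assumptions for exposition, not MMCode's actual schema.

```python
# Hypothetical layout of one MMCode-style problem and a minimal test runner
# (field names and test format are illustrative assumptions).
import subprocess

problem = {
    "question": "Given the grid shown in the figure, count the shaded cells.",
    "images": ["problems/p_1234/fig1.png", "problems/p_1234/fig2.png"],
    "solution": "print(sum(int(x) for x in input().split()))",  # reference Python solution
    "tests": [{"input": "1 0 1\n", "output": "2\n"}],
}

def passes_tests(code: str, tests) -> bool:
    """Run the candidate program on each test's stdin and compare stdout."""
    for t in tests:
        run = subprocess.run(["python", "-c", code], input=t["input"],
                             capture_output=True, text=True, timeout=10)
        if run.stdout != t["output"]:
            return False
    return True

print(passes_tests(problem["solution"], problem["tests"]))  # True
```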
This dataset is proposed in the ICLR 2024 paper: Measuring Vision-Language STEM Skills of Neural Models. Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, ours requires understanding multimodal vision-language information about STEM, and it is one of the largest and most comprehensive datasets for this challenge. It includes 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also benchmark state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo.