3,275 machine learning datasets
Orchid2024 is a fine-grained classification dataset specifically designed for Chinese Cymbidium orchid cultivars. It includes data collected from 20 cities across 12 provincial administrative regions in China and encompasses 1,269 cultivars from 8 Chinese Cymbidium orchid species and 6 additional categories, totaling 156,630 images. The dataset covers nearly all common Chinese Cymbidium cultivars currently found in China; its fine granularity and real-world focus make it a unique and practical resource for researchers and practitioners.
ALLO is an anomaly detection and localization dataset for space stations in lunar orbit. Synthetically rendered using Blender, ALLO provides realistic images of what a robotic manipulator on a space station will encounter, including possible anomalies.
The Calgary-Campinas public brain magnetic resonance (MR) image dataset originated from a collaboration between researchers at the Vascular Imaging Lab at the University of Calgary and the Medical Image Computing Lab at the University of Campinas (UNICAMP).
VisArgs is a densely annotated benchmark for visual argument understanding. It contains 1,611 images annotated with 5,112 visual premises (with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction.
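As an illustration only, annotations of this shape could be organized along the following lines; the schema below (field names, bounding-box format, edge encoding) is a hypothetical sketch, not the actual VisArgs release format:

```python
from dataclasses import dataclass, field

@dataclass
class VisualPremise:
    text: str
    region: tuple  # assumed (x, y, w, h) bounding box in the image

@dataclass
class ArgumentTree:
    """Hypothetical container for one annotated visual argument."""
    image_id: str
    visual_premises: list[VisualPremise]
    commonsense_premises: list[str]
    conclusion: str
    # Edges of the reasoning tree, e.g. ("vp0", "cp1") feeding a node;
    # the real annotation format may encode this differently.
    edges: list[tuple] = field(default_factory=list)
```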
MediConfusion is a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical Multimodal Large Language Models (MLLMs) from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct for medical experts. Our benchmark consists of 176 confusing pairs. A confusing pair is a set of two images that share the same question and corresponding answer options, but the correct answer differs between the images. We evaluate models based on their ability to answer *both* questions correctly within a confusing pair, which we call **set accuracy**. This metric indicates how well models can tell the two images apart, as a model that selects the same answer option for both images for all pairs will receive 0% set accuracy. We also report **confusion**, a metric that describes the proportion of confusing pairs where the model has selected the same answer option for both images.
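A minimal sketch of how these two metrics could be computed, assuming each confusing pair is represented as a dict; the field names (`pred_a`, `gold_a`, etc.) are hypothetical, not the benchmark's actual schema:

```python
def evaluate_pairs(pairs):
    """Compute set accuracy and confusion over a list of confusing pairs.

    Assumed (hypothetical) pair schema:
    {"pred_a": ..., "gold_a": ..., "pred_b": ..., "gold_b": ...}
    """
    both_correct = 0  # pairs where the model answers both images correctly
    same_answer = 0   # pairs where the model picks the same option for both images
    for p in pairs:
        if p["pred_a"] == p["gold_a"] and p["pred_b"] == p["gold_b"]:
            both_correct += 1
        if p["pred_a"] == p["pred_b"]:
            same_answer += 1
    set_accuracy = both_correct / len(pairs)
    confusion = same_answer / len(pairs)
    return set_accuracy, confusion
```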
The dataset includes annotations for burned area delineation and land cover segmentation, with a focus on European regions. The dataset is curated from various sources, including the Copernicus Emergency Management Service (EMS) and Sentinel-2 feeds.
IVM-Mix-1M provides over 1M image-instruction pairs with corresponding instruction-relevant mask labels. The dataset consists of three parts: HumanLabelData, RobotMachineData, and VQAMachineData. For HumanLabelData and RobotMachineData, we provide well-organized images, mask labels, and language instructions. For VQAMachineData, we provide only mask labels and language instructions; please refer to https://huggingface.co/datasets/2toINF/IVM-Mix-1M and download the images from the constituent datasets.
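A hedged loading sketch using the Hugging Face `datasets` library; the split name and per-sample field names below are assumptions, not confirmed by the dataset card:

```python
from datasets import load_dataset  # pip install datasets

# Assumed usage: the repo may ship custom loading scripts or archives instead.
ds = load_dataset("2toINF/IVM-Mix-1M", split="train")

sample = ds[0]
instruction = sample.get("instruction")  # language instruction (assumed field name)
mask = sample.get("mask")                # instruction-relevant mask label (assumed field name)
# For VQAMachineData entries, images must be fetched separately from the
# constituent source datasets, as noted above.
```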
Dataset for Land Cover segmentation from sparse labels, using Sentinel-2 as source imagery.
A synthetic VQA natural language explanation (NLE) dataset, built with LLaVA-1.5 using features from the GQA dataset. Total number of unique samples: 66,684.
YesBut Dataset (https://yesbut-dataset.github.io). Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options such that the complete image is satirical), and we release a high-quality dataset, YesBut, consisting of 2547 images (1084 satirical and 1463 non-satirical) containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Shot settings.
The SCARED-C dataset is introduced in the context of assessing robustness in endoscopic depth prediction models. It is part of the EndoDepth benchmark, which is designed to evaluate the performance of monocular depth prediction models specifically for endoscopic scenarios. The dataset features 16 different types of image corruptions, each with five levels of severity, encompassing challenges like lens distortion, resolution alterations, specular reflection, and color changes that are typical in endoscopic imaging. Ground truth comes from the original SCARED test set.
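To illustrate the corruption-times-severity protocol, here is a sketch using the generic `imagecorruptions` package; note this library covers the 15 ImageNet-C-style corruptions, not the exact SCARED-C set, which includes endoscopy-specific effects:

```python
import numpy as np
from imagecorruptions import corrupt, get_corruption_names  # pip install imagecorruptions

# Placeholder for a test frame; in practice this would be a SCARED image.
image = np.zeros((256, 256, 3), dtype=np.uint8)

# Build a corrupted test set: every corruption type at every severity level,
# mirroring the 16-corruptions-by-5-severities structure described above.
corrupted_set = {
    (name, severity): corrupt(image, corruption_name=name, severity=severity)
    for name in get_corruption_names()  # 15 generic corruption types
    for severity in range(1, 6)         # five severity levels, as in SCARED-C
}
```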
Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have several limitations.
A Racial Fairness Benchmark Dataset for Face Forgery Detection.
We conducted a large crowdsourcing study of click patterns in an interactive segmentation scenario and collected 475K real-user clicks. Drawing on ideas from saliency tasks, we develop a clickability model that enables sampling clicks which closely resemble actual user inputs. Using our model and dataset, we propose the RClicks benchmark for a comprehensive comparison of existing interactive segmentation methods on realistic clicks. Specifically, we evaluate not only the average quality of methods, but also their robustness w.r.t. click patterns.
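A minimal sketch of the click-sampling step, assuming the clickability model outputs a per-pixel probability map; the actual RClicks model that produces this map is not reproduced here:

```python
import numpy as np

def sample_clicks(clickability_map: np.ndarray, n_clicks: int, rng=None):
    """Sample click coordinates from a per-pixel clickability distribution.

    clickability_map: non-negative (H, W) array from a clickability model.
    Returns an (n_clicks, 2) array of (row, col) click positions.
    """
    rng = rng or np.random.default_rng()
    h, w = clickability_map.shape
    probs = clickability_map.ravel()
    probs = probs / probs.sum()  # normalize into a valid distribution
    flat_idx = rng.choice(h * w, size=n_clicks, p=probs)
    return np.stack(np.unravel_index(flat_idx, (h, w)), axis=1)
```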
SAT-MTB-VSR is a large-scale dataset for satellite video super-resolution built from original Jilin-1 videos; it is a subset of the satellite video multitasking dataset SAT-MTB. The dataset is cropped from 18 videos captured by the Jilin-1 video satellite at a resolution of about 1 m, covering a wide range of terrains such as cities, docks, airports, suburbs, forests, and deserts. The videos contain dynamic scenes with moving cars, airplanes, trains, and ships, which test the ability of VSR methods to handle moving targets of different sizes and speeds. Because of the satellite's own motion, the videos also contain changes in viewing angle and lighting.
ReALFRED is an embodied instruction-following benchmark.
A synthetic dataset covering driving under adverse weather conditions | Autonomous Driving
The Wikidata Reference Logo Dataset (WiRLD) is a comprehensive collection of reference logos specifically designed to address the challenges of large-scale logo identification. Recognizing the limitations of existing logo datasets, which often have a restricted number of logo classes or lack public availability, the authors curated WiRLD to facilitate research on more realistic, large-scale logo identification tasks. WiRLD contains 100,000 reference logo images sourced from Wikidata, representing 100,000 distinct logo classes; each entity in the dataset has one corresponding logo image. The dataset's focus on providing a vast and readily accessible collection of reference logos makes it particularly valuable for evaluating one-shot logo identification methods, especially in large-scale scenarios.
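A sketch of one-shot identification against WiRLD-style references via nearest-neighbor cosine similarity; the `identify_logo` helper and the choice of embedding model (e.g., any pretrained vision encoder) are hypothetical, not part of the dataset:

```python
import numpy as np

def identify_logo(query_emb: np.ndarray, reference_embs: np.ndarray, class_ids):
    """Return the class of the nearest reference logo by cosine similarity.

    reference_embs: one embedding per class, shape (100_000, D) for WiRLD.
    How the embeddings are produced is an assumption, not specified by WiRLD.
    """
    q = query_emb / np.linalg.norm(query_emb)
    refs = reference_embs / np.linalg.norm(reference_embs, axis=1, keepdims=True)
    scores = refs @ q  # cosine similarity to every reference logo
    return class_ids[int(np.argmax(scores))]
```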