3,275 machine learning datasets
This is an image splicing dataset that includes different types of preprocessing and postprocessing techniques. Foreground objects are taken from the HRSOD dataset and background images from the BG20K dataset. 95,000 training and 5,000 test images are provided.
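A minimal sketch of how one such spliced sample could be composed, assuming HRSOD-style saliency masks for the foreground objects (file names, mask format, and paste position are illustrative assumptions, not details of the dataset):

```python
from PIL import Image

def compose_splice(fg_path, mask_path, bg_path, position=(0, 0)):
    """Paste a foreground object onto a background using its
    saliency mask as a per-pixel alpha channel (illustrative only)."""
    fg = Image.open(fg_path).convert("RGB")
    mask = Image.open(mask_path).convert("L")   # assumed binary saliency mask
    bg = Image.open(bg_path).convert("RGB")
    bg.paste(fg, position, mask)                # mask controls which pixels are pasted
    return bg

# Hypothetical file layout, for illustration only.
spliced = compose_splice("hrsod/obj_0001.jpg",
                         "hrsod/obj_0001_mask.png",
                         "bg20k/bg_0001.jpg",
                         position=(120, 80))
spliced.save("spliced_0001.png")
```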
This dataset contains images and annotations for scene text detection and recognition. It consists of two parts: (1) 1,175 images manually labeled with a total of 59,588 text instances at the line and word levels; and (2) 929 signboard images collected from the VinText, Total-Text, and ICDAR15 datasets. Each text instance in the first part has a quadrilateral bounding box and an associated ground-truth character sequence. In the second part, images were selected if they contain signboards; this portion comprises 20,261 text instances at the word level, bringing the total number of text instances in the final dataset to 79,814. Following the ICDAR15 standard, we annotated each image with all text instances present, including their polygons and content. All images were annotated manually.
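Since the annotations follow the ICDAR15 standard, each ground-truth line is typically eight comma-separated quadrilateral coordinates followed by the transcription, with "###" marking illegible text. A minimal parser sketch (the file name and BOM handling are common ICDAR15 conventions, assumed rather than confirmed for this dataset):

```python
def parse_icdar15_line(line):
    """Parse one ICDAR15-style ground-truth line:
    x1,y1,x2,y2,x3,y3,x4,y4,transcription"""
    parts = line.strip().split(",", 8)          # transcription itself may contain commas
    coords = list(map(int, parts[:8]))
    quad = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    text = parts[8]
    return {"quad": quad, "text": text, "ignore": text == "###"}

# Hypothetical ground-truth file; ICDAR15 files often start with a BOM.
with open("gt_img_1.txt", encoding="utf-8-sig") as f:
    instances = [parse_icdar15_line(ln) for ln in f if ln.strip()]
```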
The GenAI-Bench benchmark consists of 1,600 challenging real-world text prompts sourced from professional designers. Compared to benchmarks such as PartiPrompt and T2I-CompBench, GenAI-Bench captures a wider range of aspects of compositional text-to-visual generation, ranging from basic (scene, attribute, relation) to advanced (counting, comparison, differentiation, logic). GenAI-Bench also collects human alignment ratings (1-to-5 Likert scales) on images and videos generated by ten leading models, such as Stable Diffusion, DALL-E 3, Midjourney v6, Pika v1, and Gen2.
Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, attracting broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and we develop a protocol to automatically generate labels for sub-image-level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their ability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup requires an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs.
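To illustrate the stitching protocol, the sketch below tiles sub-images into a grid and records which cell holds the needle. The grid size, tile size, and (row, column) label format are illustrative assumptions, not the benchmark's exact configuration:

```python
import random
from PIL import Image

def stitch_haystack(image_paths, rows=4, cols=4, tile=(256, 256)):
    """Stitch sub-images into a rows x cols grid; return the stitched
    image, the needle's path, and its (row, col) retrieval label."""
    assert len(image_paths) == rows * cols
    canvas = Image.new("RGB", (cols * tile[0], rows * tile[1]))
    for idx, path in enumerate(image_paths):
        r, c = divmod(idx, cols)
        canvas.paste(Image.open(path).convert("RGB").resize(tile),
                     (c * tile[0], r * tile[1]))
    needle_idx = random.randrange(rows * cols)
    label = divmod(needle_idx, cols)            # automatic sub-image-level label
    return canvas, image_paths[needle_idx], label
```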
The Temporal Logic Video (TLV) Dataset addresses the scarcity of state-of-the-art video datasets for long-horizon, temporally extended activity and object detection. It comprises two main components.
The MS-EVS Dataset is the first large-scale event-based dataset for face detection.
The Lusitano dataset was collected over a three-month period, from January to March, at Paulo de Oliveira, S.A., a prominent textile company based in Covilhã, Portugal, renowned for its innovative contributions to the textile industry. To collect the images, we placed a camera in front of a fabric inspection machine, along with a strong and nearly uniform light source. The dataset comprises 4096 × 1024 images captured by an industrial-grade Teledyne Dalsa Linea camera, whose high resolution and precision ensure an accurate depiction of textile samples with the level of detail necessary for defect analysis. None of the defects depicted in this dataset are artificially generated; they stem from genuine occurrences observed during the collection period and thus represent real-world challenges encountered in textile production. The dataset also includes defect-free (normal) images. We provide two folders, train and test, sharing the same folder structure.
We release AntM2C, a large-scale Multi-Scenario Multi-Modal CTR dataset built from real industrial data from Alipay. The dataset offers impressive breadth and depth, covering CTR data from four diverse business scenarios: advertisements, consumer coupons, mini-programs, and videos. Unlike existing datasets, AntM2C provides not only ID-based features but also five textual features and one image feature for both users and items, supporting more fine-grained multi-modal CTR prediction.
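A sketch of what one AntM2C-style record might look like; all field names here are invented for illustration, and only the feature counts and scenario list come from the description above:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AntM2CRecord:
    # ID-based features (illustrative names)
    user_id: str
    item_id: str
    scenario: str            # one of: ads, coupons, mini-programs, videos
    # five textual features (names assumed; provided for users and items)
    text_features: List[str] = field(default_factory=list)
    # one image feature, e.g. a path or a precomputed embedding reference
    item_image: str = ""
    # click-through label
    clicked: int = 0
```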
The deliberate manipulation of public opinion, especially through altered images, poses a significant danger to society. To fight this issue on a technical level, we support the research community by releasing the Digital Forensics 2023 (DF2023) training and validation dataset.
Introduced by Khan et al. in "Divide and conquer: Ill-light image enhancement via hybrid deep network" (https://www.sciencedirect.com/science/article/abs/pii/S0957417421004759).
InpaintCOCO is a benchmark for probing fine-grained concept understanding in vision-language models, similar to Winoground. To our knowledge, InpaintCOCO is the first benchmark consisting of image pairs with minimal differences, so that visual representations can be analyzed in a more standardized setting.
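A minimal sketch of a Winoground-style image score over such minimal-difference pairs, assuming a generic score(image, caption) similarity function (e.g., CLIP cosine similarity), which the dataset itself does not prescribe:

```python
def image_score(score, pair):
    """Winoground-style image score: for each caption, the matching
    image must score higher than the mismatched one."""
    i0, i1, c0, c1 = pair  # (image_0, image_1, caption_0, caption_1)
    return int(score(i0, c0) > score(i1, c0) and
               score(i1, c1) > score(i0, c1))

def evaluate(score, pairs):
    """Fraction of pairs where the model prefers the correct image."""
    return sum(image_score(score, p) for p in pairs) / len(pairs)
```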
Medical report generation (MRG), which aims to automatically generate a textual description of a specific medical image (e.g., a chest X-ray), has recently received increasing research interest. Building on the success of image captioning, MRG has become achievable. However, generating language-specific radiology reports poses a challenge for data-driven models due to their reliance on paired image-report chest X-ray datasets, which are labor-intensive, time-consuming, and costly to create. In this paper, we introduce a chest X-ray benchmark dataset, CASIA-CXR, consisting of high-resolution chest radiographs accompanied by narrative reports originally written in French. To the best of our knowledge, this is the first public chest radiograph dataset with medical reports in this language. Importantly, we propose a simple yet effective multimodal encoder-decoder contextually-guided framework for medical report generation in French. We validated our framework through intra-language evaluation.
Overview: The IITKGP_Fence dataset is designed for tasks related to fence-like occlusion detection, defocus blur, depth mapping, and object segmentation. The captured data varies in scene composition, background defocus, and object occlusions. The dataset comprises both labeled and unlabeled data, as well as additional video and RGB-D data. It contains ground-truth occlusion masks (GT) for the corresponding images; we created the ground-truth occlusion labels semi-automatically with user interaction.
Post-Spraying Image Evaluation: This dataset accompanies the paper "Deep Learning for Precision Agriculture: Post-Spraying Evaluation and Deposition Estimation" (https://arxiv.org/abs/2409.16213).
CodeSCAN is the first large-scale and diverse dataset of coding screenshots with pixel-perfect annotations.
SimNICT is the first dataset for training universal non-ideal measurement CT (NICT) enhancement models.
This dataset is part of my bachelor thesis project. It was created by combining multiple open-source datasets from RoboFlow Universe with additional manual annotation.
We introduce the low-light image enhancement benchmark dataset "Low-light Images of Streets (LoLI-Street)," which contains three subsets: train, validation, and test. The train and validation sets consist of 30k and 3k paired low-light and high-light images, respectively, and the real low-light test set (RLLT) contains 1k images captured under real-world low-light conditions, for a total of 33k images.
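Evaluation on the paired train/validation splits would typically use a full-reference metric; below is a minimal PSNR sketch for one enhanced/ground-truth pair (the metric choice and file names are assumptions, not specified by the dataset):

```python
import numpy as np
from PIL import Image

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio between two same-sized uint8 images."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)

# Hypothetical file names for one paired sample.
enhanced = Image.open("enhanced_0001.png")
reference = Image.open("highlight_0001.png")   # paired high-light ground truth
print(f"PSNR: {psnr(enhanced, reference):.2f} dB")
```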
The Media-Text dataset comprises images of banners, posters, covers, and other imagery characteristic of the media industry.