3,275 machine learning datasets
The ISP-AD Dataset is a large-scale anomaly detection dataset, representing a real-world industrial use case. It contains 312,674 fault-free and 246,375 defective samples, including 245,664 synthetic defects and 711 real defects collected on the factory floor.
MICCAI Challenge 2024
The HDRT dataset is a large-scale dataset designed for infrared-guided high dynamic range (HDR) imaging. It includes aligned infrared (IR), standard dynamic range (SDR), and HDR images to facilitate research in multi-modal fusion, HDR imaging, and related areas.
AerialMPT is a dataset for pedestrian tracking in aerial image sequences and presents real-world challenges for MOT algorithms such as a low frame rate, small moving objects, and complex backgrounds. AerialMPT consists of 14 sequences and 307 frames with an average size of 425 × 358 pixels. The images were acquired by DLR's 4K camera system from altitudes ranging from 600 m to 1400 m, resulting in ground sampling distances (GSDs) ranging from 8 cm/pixel to 13 cm/pixel. In a post-processing step, the images were co-registered, geo-referenced, and cropped to each region of interest, yielding sequences at 2 fps. The images were acquired during different flight campaigns in 2016 and 2017, over different scenes containing pedestrians with varying crowd densities and movement complexities.
Pick-a-Filter is a semi-synthetic dataset constructed from Pick-a-Pic v1 to measure how well text-to-image models adapt to heterogeneous user preferences. Users from v1 are randomly assigned to one of two groups: those who prefer cooler, blue image tones (G1) and those who prefer warmer, red image tones (G2). After constructing this split, a tone-based filtering rule is applied to build the dataset (sketched below).
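The exact filtering rule is not reproduced here; the following is a minimal sketch of what such a split-and-filter pipeline could look like. The channel gains, group names, and helper functions are illustrative assumptions, not the dataset's actual transform:

```python
import random

import numpy as np
from PIL import Image


def apply_tone(img: Image.Image, warm: bool, strength: float = 0.15) -> Image.Image:
    """Shift an RGB image toward warmer (red) or cooler (blue) tones.

    Hypothetical filter: scales the red and blue channels in opposite
    directions; the real Pick-a-Filter transform may differ.
    """
    arr = np.asarray(img.convert("RGB")).astype(np.float32)
    r_gain, b_gain = (1 + strength, 1 - strength) if warm else (1 - strength, 1 + strength)
    arr[..., 0] *= r_gain  # red channel
    arr[..., 2] *= b_gain  # blue channel
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))


def assign_groups(user_ids, seed=0):
    """Randomly split users into a cool-tone group (G1) and a warm-tone group (G2)."""
    rng = random.Random(seed)
    return {uid: ("G1" if rng.random() < 0.5 else "G2") for uid in user_ids}
```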
The dataset consists of images of bananas and apples, collected from Flickr under Creative Commons licenses. The images show bananas and apples with variations in color, placement, size, and background. The motivation for constructing this dataset stems from studies in cognitive science, where human perception is investigated using examples with discrete properties of bananas and apples. It can be used in the context of explainable/interpretable image classification, as in: Dimas, G., Cholopoulou, E., & Iakovidis, D. K. (2023). E pluribus unum interpretable convolutional neural networks. Scientific Reports, 13(1), 11421. https://www.nature.com/articles/s41598-023-38459-1
VETRA is a dataset for vehicle tracking in aerial image sequences and presents unique challenges such as low frame rates, small and fast-moving objects, and strong camera movement. These characteristics allow numerous vehicles with varying motion behaviors to be tracked over large areas and pose new challenges for MOT algorithms. VETRA consists of 52 image sequences captured by airplanes and helicopters using DLR's 3K and 4K camera systems. The acquisition sites are located in Germany and Austria. In addition to the standard training, validation, and test sets, VETRA offers a second test set specifically designed for large-area monitoring (LAM). The LAM sequences are recorded over 7 rural roads and motorways with a fixed camera speed and configuration. Each road section is captured at 4 different times of the day, enabling the performance of MOT algorithms to be evaluated under different traffic loads in a static environment.
The structure of the dataset is as follows:
MeshFleet is a filtered and annotated dataset of high-quality vehicle 3D models derived from Objaverse-XL. It contains the SHA-256 hashes of the objects together with consistent object captions and vehicle parameters.
The Songdo Traffic dataset delivers precisely georeferenced vehicle trajectories extracted from high-altitude bird's-eye-view (BeV) drone footage over the Songdo International Business District, South Korea. Comprising approximately 700,000 unique trajectories, it is one of the most extensive aerial traffic datasets publicly available, and its temporal resolution of 29.97 trajectory points per second enables fine-grained urban mobility analysis.
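As a small worked example of what this temporal resolution affords, the sketch below estimates per-point speeds along a single trajectory; the two-column metric coordinate layout is an assumption for illustration, not the dataset's documented schema:

```python
import numpy as np

FPS = 29.97  # trajectory sampling rate (points per second)


def point_speeds(xy: np.ndarray) -> np.ndarray:
    """Finite-difference speed estimates (m/s) along one trajectory.

    xy: array of shape (N, 2) holding georeferenced positions, assumed
    to be in a metric CRS (an assumption; check the dataset docs).
    """
    dt = 1.0 / FPS
    return np.linalg.norm(np.diff(xy, axis=0), axis=1) / dt
```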
The Songdo Vision dataset provides high-resolution (4K, 3840×2160 pixels) RGB images annotated with categorized axis-aligned bounding boxes (BBs) for vehicle detection from a high-altitude bird's-eye-view (BeV) perspective. Captured over the Songdo International Business District, South Korea, the dataset consists of 5,419 annotated video frames featuring approximately 300,000 vehicle instances categorized into four classes.
The BIRDeep Audio Annotations dataset is a collection of bird vocalizations from Doñana National Park, Spain. It was created as part of the BIRDeep project, which aims to optimize the detection and classification of bird species in audio recordings using deep learning techniques. The dataset is intended for use in training and evaluating models for bird vocalization detection and identification.
The NCSE v2.0 is a digitized collection of six 19th-century English periodicals.
A publicly available corpus of nineteenth-century newspaper text focused on crime in London, derived from parts 1 and 2 of the Gale British Library Newspapers (BLN) corpus. The corpus comprises 600 newspaper excerpts; for each excerpt it contains the original source image, the machine transcription of that image as found in the BLN, and a gold-standard manual transcription.
GroundCap is a novel grounded image captioning dataset derived from MovieNet, containing 52,350 movie frames with detailed grounded captions. The dataset uniquely features an ID-based system that maintains object identity throughout captions, enables tracking of object interactions, and grounds not only objects but also actions and locations in the scene.
Synthetic soccer players rendered on top of real-world stadium images in 4K, each image covering half a pitch. Ground-truth annotations comprise the precise locations of players on the pitch, the 3D location of each player's pelvis, and image bounding boxes.
A benchmark that focuses on the sampling dilemma in long-video tasks. The LSDBench dataset is designed to evaluate the sampling efficiency of long-video VLMs. It consists of multiple-choice question-answer pairs based on hour-long videos, focusing on dense and short-duration actions with high Necessary Sampling Density (NSD).
Existing multi-modality image fusion datasets lack comprehensive coverage of adverse weather scenarios. To address this, we introduce AWMM-100k, a benchmark dataset constructed by selecting samples from RoadScene, MSRS, M3FD, and LLVIP and applying controlled degradation processing to simulate adverse weather conditions. Combined with real-world data captured using a DJI M30T drone equipped with high-resolution visible and thermal cameras, AWMM-100k comprises 187,699 images covering rain, haze, and snow, each categorized into heavy, medium, and light intensities. This dataset supports research on multi-modality image fusion under challenging weather conditions and is also applicable to image restoration tasks such as dehazing, deraining, and desnowing. We thank the authors of the original datasets for their contributions. We believe this dataset significantly expands the scope of multimodal image processing and computer vision research, facilitating advancements in both image fusion and image restoration.
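The exact degradation pipeline is not spelled out here; a common way to simulate haze at controlled intensities is the atmospheric scattering model, sketched below. The per-level beta values and airlight are illustrative assumptions, not AWMM-100k's actual parameters:

```python
import numpy as np

# Illustrative haze strengths for the three intensity levels
# (assumed values, not the dataset's actual parameters).
BETA = {"light": 0.5, "medium": 1.0, "heavy": 2.0}


def add_synthetic_haze(img: np.ndarray, depth: np.ndarray,
                       level: str = "medium", airlight: float = 0.9) -> np.ndarray:
    """Atmospheric scattering model: I = J * t + A * (1 - t),
    with transmission t = exp(-beta * depth).

    img:   float32 RGB image in [0, 1], shape (H, W, 3)
    depth: relative scene depth in [0, 1], shape (H, W)
    """
    t = np.exp(-BETA[level] * depth)[..., None]  # transmission map, (H, W, 1)
    return img * t + airlight * (1.0 - t)
```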
GeoJEPAD is a multimodal dataset combining OpenStreetMap (OSM) data (attributes and geometries) with high-resolution aerial imagery from diverse urban areas. It is sourced from NAIP and OSM, then processed, tiled, and cropped. Geometries and their relations are represented as graphs with optional visibility edges.
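As an illustration of one possible "visibility edge" construction (the centroid line-of-sight criterion below is an assumption; GeoJEPAD's actual edge definition may differ):

```python
import networkx as nx
from shapely.geometry import LineString


def build_visibility_graph(geoms: list) -> nx.Graph:
    """Toy graph: one node per geometry, plus a visibility edge when the
    segment between two centroids is not blocked by any third geometry."""
    g = nx.Graph()
    cents = [p.centroid for p in geoms]
    for i, p in enumerate(geoms):
        g.add_node(i, area=p.area)
    for i in range(len(geoms)):
        for j in range(i + 1, len(geoms)):
            sight = LineString([cents[i], cents[j]])
            blocked = any(sight.crosses(geoms[k])
                          for k in range(len(geoms)) if k not in (i, j))
            if not blocked:
                g.add_edge(i, j, kind="visibility")
    return g
```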