CV-Cities comprises $223,736$ ground panoramic images and an equal number of satellite images, all accompanied by high-precision GPS coordinates. The images cover sixteen representative cities across five continents. The ground images are $360^{\circ}$ panoramas with a resolution of $4,096 \times 2,048$ pixels, while the satellite images have a resolution of $746 \times 746$ pixels and are captured at zoom level $20$. The spatial resolution is $0.298$ m, corresponding to a latitude and longitude range of $0.002^{\circ} \times 0.002^{\circ}$ (about $222 \times 222$ m). The images of each city in the dataset can be used for both training and testing.
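A quick sanity check of the stated numbers: $746$ pixels at $0.298$ m per pixel and $0.002^{\circ}$ of latitude both come out to roughly $222$ m per tile side. The sketch below is purely illustrative arithmetic, not part of the dataset's tooling.

```python
# Illustrative check relating the satellite tile size, spatial resolution,
# and geographic extent quoted for CV-Cities.
PIXELS = 746             # satellite image width/height in pixels
M_PER_PIXEL = 0.298      # stated spatial resolution
DEG_EXTENT = 0.002       # stated latitude/longitude range per tile
M_PER_DEG_LAT = 111_320  # metres per degree of latitude (approximate)

coverage_m = PIXELS * M_PER_PIXEL      # ~222.3 m per tile side
extent_m = DEG_EXTENT * M_PER_DEG_LAT  # ~222.6 m for 0.002 deg of latitude
print(f"tile coverage ~ {coverage_m:.1f} m, 0.002 deg latitude ~ {extent_m:.1f} m")
```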
Cross-View Geo-Localisation within urban regions is challenging, in part due to the lack of geo-spatial structuring in current datasets and techniques. We propose using graph representations to model sequences of local observations and the connectivity of the target location. Modelling the data as a graph enables generating previously unseen sequences by sampling with new parameter configurations. SpaGBOL contains 98,855 panoramic streetview images across different seasons and 19,771 corresponding satellite images from 10 international cities, most of them densely populated. This translates to five panoramic images and one satellite image per graph node. Downloading instructions are below.
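To make the graph idea concrete, here is a minimal sketch of a node structure and random-walk sequence sampling over street connectivity. The node fields, file names, and sampling routine are hypothetical illustrations, not the SpaGBOL data format or API.

```python
import random
from typing import Dict, List

# Hypothetical node layout: each graph node pairs one satellite tile with
# several panoramas, and edges encode street connectivity (not the SpaGBOL API).
graph: Dict[str, Dict] = {
    "node_a": {"panoramas": ["a_0.jpg", "a_1.jpg"], "satellite": "a_sat.jpg",
               "neighbours": ["node_b"]},
    "node_b": {"panoramas": ["b_0.jpg"], "satellite": "b_sat.jpg",
               "neighbours": ["node_a", "node_c"]},
    "node_c": {"panoramas": ["c_0.jpg"], "satellite": "c_sat.jpg",
               "neighbours": ["node_b"]},
}

def sample_walk(graph: Dict[str, Dict], start: str, length: int) -> List[str]:
    """Random walk over street connectivity, yielding a node sequence."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]["neighbours"]
        walk.append(random.choice(neighbours))
    return walk

print(sample_walk(graph, "node_a", length=4))
```

Sampling walks with different lengths or branching rules is what allows previously unseen observation sequences to be generated from the same underlying graph.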
This dataset is from the VCIP 2020 Grand Challenge on NIR Image Colorization. See https://jchenhkg.github.io/projects/NIR2RGB_VCIP_Challenge/ for a detailed description of the dataset. If you find this dataset helpful, please cite our paper:
@inproceedings{yang2023cooperative,
  title={Cooperative Colorization: Exploring Latent Cross-Domain Priors for NIR Image Spectrum Translation},
  author={Yang, Xingxing and Chen, Jie and Yang, Zaifeng},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={2409--2417},
  year={2023}
}
We present datasets containing urban traffic and rural road scenes recorded using hyperspectral snapshot sensors mounted on a moving car. The novel hyperspectral cameras used can capture whole spectral cubes at up to 15 Hz. This emerging sensor modality enables hyperspectral scene analysis for autonomous driving tasks. To the best of the authors' knowledge, no such dataset has been published so far. The datasets contain synchronized 3-D laser, spectrometer, and hyperspectral data. Dense ground-truth annotations are provided for semantic classes, materials, and traversability. The hyperspectral data ranges from visible to near-infrared wavelengths. We describe our recording platform and method and the associated data format, along with a code library for easy data consumption. The datasets are publicly available for download.
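The on-disk format is documented with the dataset's own code library; as a minimal sketch, assuming one snapshot frame can be represented as an (H, W, bands) array, per-pixel spectra and single-band images can be pulled out as follows. The array shape and contents below are dummies, not the actual file layout.

```python
import numpy as np

# Minimal sketch: assume one hyperspectral frame is an (H, W, bands) cube;
# the real format is described by the dataset's accompanying code library.
cube = np.random.rand(216, 409, 25).astype(np.float32)  # dummy snapshot cube

def pixel_spectrum(cube: np.ndarray, row: int, col: int) -> np.ndarray:
    """Return the spectrum of a single pixel across all bands."""
    return cube[row, col, :]

def band_image(cube: np.ndarray, band: int) -> np.ndarray:
    """Return a single spectral band as a 2-D image."""
    return cube[:, :, band]

print(pixel_spectrum(cube, 100, 200).shape)  # (25,)
print(band_image(cube, 0).shape)             # (216, 409)
```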
High-quality underwater coral detection dataset for machine learning and computer vision research.
The dataset was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step.
This dataset contains individual-level data from a randomized controlled trial (RCT) conducted in northern Uganda, along with associated satellite imagery. It is designed to investigate how treatment effects may vary across different geographical and contextual settings by leveraging both tabular and image-based variables.
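As a rough sketch of how tabular trial records and image-derived features might be joined to study treatment-effect heterogeneity, the example below merges per-participant data with per-location satellite-image embeddings and computes a naive subgroup contrast. All column names and values are hypothetical, and a real analysis would use a proper conditional-average-treatment-effect estimator.

```python
import pandas as pd

# Hypothetical layout (column names are illustrative, not the dataset schema):
# one row per participant, plus a precomputed image embedding per location.
trial = pd.DataFrame({
    "participant_id": [1, 2, 3, 4],
    "treated": [1, 0, 1, 0],
    "outcome": [2.3, 1.1, 2.9, 1.4],
    "location_id": ["loc_a", "loc_a", "loc_b", "loc_b"],
})
image_features = pd.DataFrame({
    "location_id": ["loc_a", "loc_b"],
    "img_feat_0": [0.12, 0.87],  # e.g. a vegetation-related embedding dimension
})

merged = trial.merge(image_features, on="location_id")

# Naive contrast: treated-minus-control mean outcome within strata defined by
# an image feature (illustration only, not a causal estimator).
strata = merged["img_feat_0"] > 0.5
for label, group in merged.groupby(strata):
    effect = (group.loc[group.treated == 1, "outcome"].mean()
              - group.loc[group.treated == 0, "outcome"].mean())
    print(f"high-feature stratum={label}: naive effect = {effect:.2f}")
```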
This dataset supports Ye et al., 2024, Nature Communications.
The newly introduced UP-COUNT dataset includes drone footage captured with cameras of the DJI Mini 2 family of UAVs. It encompasses diverse environments, including streets, plazas, public transport stops, parks, and other green recreational areas. We recorded 202 unique videos and then extracted frames at one-second intervals, resulting in 10,000 images with a resolution of 3840 × 2160 pixels. The recordings were taken at different flight altitudes and speeds, and with various crowd densities. Acquisition conditions vary in time of day and lighting, creating challenging shadows. Altitude information is additionally provided for each image. People's heads were then labelled by hand, resulting in 352,487 instances. During the labelling process, each image was annotated and checked by two different people, and the continuity of labels within each sequence was reviewed. The lowest and highest altitudes recorded among the sequences are 26.0 m and 101.0 m, respectively, with an average of 60 m.
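A minimal sketch of how one frame of such a point-annotated crowd-counting dataset might be represented, with head points and flight altitude attached to each image. The field names and file paths are hypothetical, not the UP-COUNT annotation format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record for one frame (field names are illustrative only):
# head annotations are 2-D points and each image carries its flight altitude.
@dataclass
class CrowdFrame:
    image_path: str
    head_points: List[Tuple[float, float]]  # (x, y) in pixel coordinates
    altitude_m: float                       # flight altitude in metres

    @property
    def count(self) -> int:
        return len(self.head_points)

frame = CrowdFrame(
    image_path="seq_001/frame_0042.jpg",
    head_points=[(1021.5, 540.0), (2033.0, 1310.2)],
    altitude_m=60.0,
)
print(frame.count, frame.altitude_m)
```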
Operating rooms (ORs) are complex, high-stakes environments requiring precise understanding of interactions among medical staff, tools, and equipment to enhance surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a new benchmark for multimodal scene understanding in the OR.
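To illustrate what a semantic scene graph is in this context, here is a minimal sketch with entities as nodes and (subject, predicate, object) triplets as labelled edges. The entity and predicate names are made up for illustration and do not reflect the MM-OR annotation schema.

```python
from typing import List, Tuple

# Minimal scene-graph sketch (not the MM-OR annotation schema): entities are
# nodes and each relation is a (subject, predicate, object) triplet.
entities: List[str] = ["surgeon", "scrub_nurse", "drill", "patient"]
relations: List[Tuple[str, str, str]] = [
    ("surgeon", "holding", "drill"),
    ("surgeon", "operating_on", "patient"),
    ("scrub_nurse", "assisting", "surgeon"),
]

def outgoing(entity: str,
             relations: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Return (predicate, object) pairs for relations originating at entity."""
    return [(p, o) for s, p, o in relations if s == entity]

print(outgoing("surgeon", relations))  # [('holding', 'drill'), ('operating_on', 'patient')]
```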
The automated recognition of different vehicle classes and their orientation in aerial images is an important task in traffic research and also finds applications in disaster management, among other fields. For the further development of corresponding algorithms that deliver reliable results not only under laboratory conditions but also in real scenarios, training datasets that are as extensive and versatile as possible play a decisive role. For this purpose, we present our dataset EAGLE (oriEnted vehicle detection using Aerial imaGery in real-worLd scEnarios).
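Oriented detection typically represents each vehicle as a rotated box. Below is a small sketch of one common parameterisation (centre, size, heading angle) and its corner computation; the actual EAGLE annotation format may differ.

```python
import math
from typing import List, Tuple

# Illustrative oriented-box representation (centre, size, heading angle);
# this is a generic sketch, not necessarily EAGLE's annotation format.
def obb_corners(cx: float, cy: float, w: float, h: float,
                angle_rad: float) -> List[Tuple[float, float]]:
    """Corners of an oriented bounding box, rotated about its centre."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    corners = []
    for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)]:
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners

print(obb_corners(100.0, 50.0, 20.0, 8.0, math.radians(30)))
```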
BRIGHT is the first open-access, globally distributed, event-diverse multimodal dataset specifically curated to support AI-based disaster response. It covers five types of natural disasters and two types of man-made disasters across 14 disaster events in 23 regions worldwide, with a particular focus on developing countries.
We propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. To the best of our knowledge, this is the largest underwater instance segmentation dataset available, and it can serve as a benchmark for evaluating underwater segmentation methods.
The Indoor-6 dataset was created from multiple sessions captured in six indoor scenes over multiple days. The pseudo ground truth (pGT) 3D point clouds and camera poses for each scene are computed using COLMAP, and the COLMAP reconstruction uses only the training images. Compared to 7-Scenes, the scenes in Indoor-6 are larger, have multiple rooms, and contain illumination variations, as the images span multiple days and different times of day.
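Localization results on such datasets are usually compared against the pGT poses with translation and rotation errors. The sketch below shows these standard metrics; the pose convention (camera-to-world vs. world-to-camera) is an assumption and must match how the pGT is stored.

```python
import numpy as np

# Standard pose-error computation against pseudo ground truth (pGT):
# translation error in metres and rotation error in degrees.
def pose_errors(R_est: np.ndarray, t_est: np.ndarray,
                R_gt: np.ndarray, t_gt: np.ndarray):
    t_err = float(np.linalg.norm(t_est - t_gt))
    # Angle of the relative rotation between estimate and pGT.
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    r_err = float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))
    return t_err, r_err

R = np.eye(3)
print(pose_errors(R, np.array([0.0, 0.0, 0.05]), R, np.zeros(3)))  # (0.05, 0.0)
```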
The Udacity dataset is mainly composed of video frames taken on urban roads. It provides a total of 404,916 video frames for training and 5,614 video frames for testing. The dataset is challenging due to severe lighting changes, sharp road curves, and busy traffic.
The CID (Campus Image Dataset) is a dataset captured in low-light environments using an Android-based capture program. Its basic unit is the group, which is named by capture time and contains 8 raw images with varying exposure times shot in a burst.
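As a rough illustration of working with such a group of 8 exposure-bracketed raw frames, the sketch below applies a simple exposure-normalised average as a naive burst-merging baseline. The arrays, exposure times, and merging rule are all illustrative assumptions, not the dataset's files or the method proposed with it.

```python
import numpy as np

# Illustrative sketch of a CID-style capture group: 8 frames of the same scene
# with varying exposure times (arrays and times here are dummies).
exposure_times_s = [1/500, 1/250, 1/125, 1/60, 1/30, 1/15, 1/8, 1/4]
frames = [np.random.rand(1024, 768).astype(np.float32) * t * 500
          for t in exposure_times_s]

# Naive baseline: divide each frame by its exposure time, then average.
normalised = [f / t for f, t in zip(frames, exposure_times_s)]
merged = np.mean(normalised, axis=0)
print(merged.shape)  # (1024, 768)
```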
4D Light Field Dataset is a light field benchmark consisting of 24 carefully designed synthetic, densely sampled 4D light fields with highly accurate disparity ground truth.
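Disparity estimates on such benchmarks are commonly scored against the ground-truth disparity with mean squared error (often reported scaled by 100) and BadPix, the fraction of pixels whose absolute error exceeds a threshold. The sketch below shows these generic metrics on dummy data; it is not the benchmark's official evaluation code.

```python
import numpy as np

# Generic disparity-evaluation metrics (illustrative, not the official toolkit).
def mse_x100(d_est: np.ndarray, d_gt: np.ndarray) -> float:
    """Mean squared disparity error, scaled by 100."""
    return float(np.mean((d_est - d_gt) ** 2) * 100.0)

def badpix(d_est: np.ndarray, d_gt: np.ndarray, thresh: float = 0.07) -> float:
    """Fraction of pixels whose absolute disparity error exceeds thresh."""
    return float(np.mean(np.abs(d_est - d_gt) > thresh))

gt = np.random.rand(512, 512).astype(np.float32)
est = gt + np.random.normal(scale=0.05, size=gt.shape).astype(np.float32)
print(mse_x100(est, gt), badpix(est, gt))
```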
ePillID is a benchmark for developing and evaluating computer vision models for pill identification. The ePillID benchmark is designed as a low-shot, fine-grained benchmark, reflecting real-world challenges in developing image-based pill identification systems. The characteristics of the ePillID benchmark include:
* Reference and consumer images: The reference images are taken with controlled lighting and backgrounds and with professional equipment. The consumer images are taken in real-world settings with varying lighting, backgrounds, and equipment. For most of the pills, one image per side (two images per pill type) is available from the NIH Pillbox dataset.
* Low-shot and fine-grained setting: 13k images representing 9,804 appearance classes (two sides for each of 4,902 pill types). For most of the appearance classes, only one reference image exists, making this a challenging low-shot recognition setting.
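The class count follows directly from treating each pill side as its own appearance class, and evaluation is naturally framed as retrieval from a small reference gallery. The sketch below illustrates both points; class names and file names are hypothetical, not the benchmark's actual protocol files.

```python
# Each pill type contributes two appearance classes (front and back sides).
pill_types = 4902
appearance_classes = pill_types * 2
print(appearance_classes)  # 9804

# Hypothetical low-shot evaluation layout: reference images form the gallery
# and consumer images are the queries (names are illustrative only).
gallery = {"cls_0001_front": ["ref_0001_front.jpg"]}   # ~1 reference per class
queries = {"cls_0001_front": ["cons_0001_a.jpg", "cons_0001_b.jpg"]}
```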
The dataset contains two subsets of synthetic, semantically segmented road-scene images, which have been created for developing and applying the methodology described in the paper "A Sim2Real Deep Learning Approach for the Transformation of Images from Multiple Vehicle-Mounted Cameras to a Semantically Segmented Image in Bird’s Eye View" (IEEE Xplore, arXiv, YouTube)
CholecT40 is the first endoscopic dataset introduced to enable research on fine-grained action recognition in laparoscopic surgery.