19,997 machine learning datasets
19,997 dataset results
ScenicOrNot (SoN) is a dataset of 185,548 images with associated natural beauty rating histograms. Each image in the dataset was rated at least five times. The images also have metadata like title and location.
WHU-Hi dataset (Wuhan UAV-borne hyperspectral image) is collected and shared by the RSIDEA research group of Wuhan University, and it could serve as a benchmark dataset for precise crop classification and hyperspectral image classification studies. The WHU-Hi dataset contains three individual UAV-borne hyperspectral datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu. All the datasets were acquired in farming areas with various crop types in Hubei province, China, via a Headwall Nano-Hyperspec sensor mounted on a UAV platform. Compared with spaceborne and airborne hyperspectral platforms, unmanned aerial vehicle (UAV)-borne hyperspectral systems can acquire hyperspectral imagery with a high spatial resolution (which we refer to here as H2 imagery). The research was published in Remote Sensing of Environment.
Botswana is a hyperspectral image classification dataset. The NASA EO-1 satellite acquired a sequence of data over the Okavango Delta, Botswana in 2001-2004. The Hyperion sensor on EO-1 acquires data at 30 m pixel resolution over a 7.7 km strip in 242 bands covering the 400-2500 nm portion of the spectrum in 10 nm windows. Preprocessing of the data was performed by the UT Center for Space Research to mitigate the effects of bad detectors, inter-detector miscalibration, and intermittent anomalies. Uncalibrated and noisy bands that cover water absorption features were removed, and the remaining 145 bands were included as candidate features: [10-55, 82-97, 102-119, 134-164, 187-220]. The data analyzed in this study, acquired May 31, 2001, consist of observations from 14 identified classes representing the land cover types in seasonal swamps, occasional swamps, and drier woodlands located in the distal portion of the Delta.
A novel traffic flow dataset from publicly available web cameras in the suburbs of Chicago, IL.
CC-DBP is a dataset for knowledge base population research using Common Crawl and DBpedia.
Kuzushiji-Kanji is an imbalanced dataset of total 3832 Kanji characters (64x64 grayscale, 140,426 images), ranging from 1,766 examples to only a single example per class. Kuzushiji is a Japanese cursive writing style.
A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.
CocoDoom is a collection of pre-recorded data extracted from Doom gaming sessions along with annotations in the MS Coco format.
MuMu is a new dataset of more than 31k albums classified into 250 genre classes.
Wikipedia Generation is a dataset for article generation from Wikipedia from references at the end of Wikipedia page and the top 10 search results for the Wikipedia topic.
Fundus Image Registration Dataset (FIRE) is a dataset consisting of 129 retinal images forming 134 image pairs. These image pairs are split into 3 different categories depending on their characteristics. The images were acquired with a Nidek AFC-210 fundus camera, which acquires images with a resolution of 2912x2912 pixels and a FOV of 45° both in the x and y dimensions. Images were acquired at the Papageorgiou Hospital, Aristotle University of Thessaloniki, Thessaloniki from 39 patients.
Imp1k is a new dataset of designs annotated with importance information.
The MERL-RAV (MERL Reannotation of AFLW with Visibility) Dataset contains over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is unoccluded, self-occluded (due to extreme head poses), or externally occluded. The images were annotated by professional labelers, supervised by researchers at Mitsubishi Electric Research Laboratories (MERL).
The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality.
The METU Trademark Dataset is a large dataset (the largest publicly available logo dataset as of 2014, and the largest one not requiring any preprocessing as of 2017), which is composed of more than 900K real logos belonging to real companies worldwide. The dataset also includes query sets of varying difficulties, allowing Trademark Retrieval researchers to benchmark their methods against other methods to progress the field.
The DeepMind Alchemy environment is a meta-reinforcement learning benchmark that presents tasks sampled from a task distribution with deep underlying structure. It was created to test for the ability of agents to reason and plan via latent state inference, as well as useful exploration and experimentation.
ARC Direct Answer Questions (ARC-DA) dataset consists of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.
Advising Corpus is a dataset based on an entirely new collection of dialogues in which university students are being advised which classes to take. These were collected at the University of Michigan with IRB approval. They were released as part of DSTC 7 track 1 and used again in DSTC 8 track 2.
CEDAR Signature is a database of off-line signatures for signature verification. Each of 55 individuals contributed 24 signatures thereby creating 1,320 genuine signatures. Some were asked to forge three other writers’ signatures, eight times per subject, thus creating 1,320 forgeries. Each signature was scanned at 300 dpi gray-scale and binarized using a gray-scale histogram. Salt pepper noise removal and slant normalization were two steps involved in image preprocessing. The database has 24 genuines and 24 forgeries available for each writer.
Chickenpox Cases in Hungary is a spatio-temporal dataset of weekly chickenpox (childhood disease) cases from Hungary. It can be used as a longitudinal dataset for benchmarking the predictive performance of spatiotemporal graph neural network architectures. The dataset consists of a county-level adjacency matrix and time series of the county-level reported cases between 2005 and 2015. There are 2 specific related tasks: