Datasets

19,997 machine learning datasets

ScenicOrNot

ScenicOrNot (SoN) is a dataset of 185,548 images with associated natural beauty rating histograms. Each image in the dataset was rated at least five times. The images also have metadata like title and location.

4 papers · 0 benchmarks · Images

WHU-Hi (Wuhan UAV-borne hyperspectral image)

The WHU-Hi dataset (Wuhan UAV-borne hyperspectral image) was collected and shared by the RSIDEA research group of Wuhan University, and it can serve as a benchmark dataset for precise crop classification and hyperspectral image classification studies. It contains three individual UAV-borne hyperspectral datasets: WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu. All three were acquired in farming areas with various crop types in Hubei province, China, via a Headwall Nano-Hyperspec sensor mounted on a UAV platform. Compared with spaceborne and airborne hyperspectral platforms, unmanned aerial vehicle (UAV)-borne hyperspectral systems can acquire hyperspectral imagery with a high spatial resolution (which we refer to here as H2 imagery). The research was published in Remote Sensing of Environment.

4 papers · 0 benchmarks · Hyperspectral images, Images

Botswana

Botswana is a hyperspectral image classification dataset. The NASA EO-1 satellite acquired a sequence of data over the Okavango Delta, Botswana in 2001-2004. The Hyperion sensor on EO-1 acquires data at 30 m pixel resolution over a 7.7 km strip in 242 bands covering the 400-2500 nm portion of the spectrum in 10 nm windows. Preprocessing of the data was performed by the UT Center for Space Research to mitigate the effects of bad detectors, inter-detector miscalibration, and intermittent anomalies. Uncalibrated and noisy bands that cover water absorption features were removed, and the remaining 145 bands were included as candidate features: [10-55, 82-97, 102-119, 134-164, 187-220]. The data analyzed in this study, acquired May 31, 2001, consist of observations from 14 identified classes representing the land cover types in seasonal swamps, occasional swamps, and drier woodlands located in the distal portion of the Delta.
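The retained band list above can be checked with a few lines of Python. This is only a sketch: it assumes the bracketed ranges are inclusive band numbers, and the names `BAND_RANGES` and `retained_bands` are hypothetical, not part of any released loader.

```python
# Inclusive band ranges quoted in the dataset description (assumed inclusive).
BAND_RANGES = [(10, 55), (82, 97), (102, 119), (134, 164), (187, 220)]


def retained_bands(ranges):
    """Expand inclusive (first, last) pairs into a flat list of band numbers."""
    return [band for first, last in ranges for band in range(first, last + 1)]


bands = retained_bands(BAND_RANGES)
print(len(bands))  # 145, matching the band count stated in the description
```

If the band numbers are 1-based, which is an assumption here, subtract 1 from each before using them to index an array of the original 242 bands.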

4 papers · 3 benchmarks · Hyperspectral images

STREETS

A novel traffic flow dataset from publicly available web cameras in the suburbs of Chicago, IL.

4 papers · 0 benchmarks

CC-DBP

CC-DBP is a dataset for knowledge base population research using Common Crawl and DBpedia.

4 papers · 0 benchmarks · Texts

Kuzushiji-Kanji

Kuzushiji-Kanji is an imbalanced dataset covering a total of 3,832 Kanji characters (64x64 grayscale, 140,426 images), with class sizes ranging from 1,766 examples down to a single example. Kuzushiji is a Japanese cursive writing style.

4 papers · 0 benchmarks · Images

CCPE-M (Coached Conversational Preference Elicitation dataset for Movies)

A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language.

4 papers · 0 benchmarks · Texts

CocoDoom

CocoDoom is a collection of pre-recorded data extracted from Doom gaming sessions along with annotations in the MS Coco format.

4 papers · 0 benchmarks · Images

MuMu

MuMu is a new dataset of more than 31k albums classified into 250 genre classes.

4 papers · 0 benchmarks · Audio, Images, Texts

Wikipedia Generation

Wikipedia Generation is a dataset for generating Wikipedia articles from the references listed at the end of each Wikipedia page and the top 10 search results for the article's topic.

4 papers · 0 benchmarks

FIRE (Fundus Image Registration Dataset)

Fundus Image Registration Dataset (FIRE) is a dataset consisting of 129 retinal images forming 134 image pairs. These image pairs are split into 3 different categories depending on their characteristics. The images were acquired with a Nidek AFC-210 fundus camera, which acquires images with a resolution of 2912x2912 pixels and a FOV of 45° both in the x and y dimensions. Images were acquired at the Papageorgiou Hospital, Aristotle University of Thessaloniki, Thessaloniki from 39 patients.

4 papers · 1 benchmark · Images, Medical

Imp1k

Imp1k is a new dataset of designs annotated with importance information.

4 papers · 0 benchmarks · Images

MERL-RAV (MERL Reannotation of AFLW with Visibility)

The MERL-RAV (MERL Reannotation of AFLW with Visibility) Dataset contains over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is unoccluded, self-occluded (due to extreme head poses), or externally occluded. The images were annotated by professional labelers, supervised by researchers at Mitsubishi Electric Research Laboratories (MERL).

4 papers · 22 benchmarks · Images

Interspeech 2021 Deep Noise Suppression Challenge

The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality.

4 papers · 0 benchmarks

METU Trademark

The METU Trademark Dataset is a large dataset (the largest publicly available logo dataset as of 2014, and the largest one not requiring any preprocessing as of 2017), which is composed of more than 900K real logos belonging to real companies worldwide. The dataset also includes query sets of varying difficulties, allowing Trademark Retrieval researchers to benchmark their methods against other methods to progress the field.

4 papers · 0 benchmarks · Images

Alchemy

The DeepMind Alchemy environment is a meta-reinforcement learning benchmark that presents tasks sampled from a task distribution with deep underlying structure. It was created to test for the ability of agents to reason and plan via latent state inference, as well as useful exploration and experimentation.

4 papers · 0 benchmarks · Environment

ARC-DA (ARC Direct Answer Questions)

The ARC Direct Answer Questions (ARC-DA) dataset consists of 2,985 grade-school level, direct-answer ("open response", "free form") science questions derived from the ARC multiple-choice question set released as part of the AI2 Reasoning Challenge in 2018.

4 papers · 0 benchmarks · Texts

Advising Corpus

Advising Corpus is a dataset based on an entirely new collection of dialogues in which university students are being advised which classes to take. These were collected at the University of Michigan with IRB approval. They were released as part of DSTC 7 track 1 and used again in DSTC 8 track 2.

4 papers · 3 benchmarks · Texts

CEDAR Signature

CEDAR Signature is a database of off-line signatures for signature verification. Each of 55 individuals contributed 24 signatures, creating 1,320 genuine signatures. Some were asked to forge three other writers’ signatures, eight times per subject, creating 1,320 forgeries. Each signature was scanned at 300 dpi in gray-scale and binarized using a gray-scale histogram. Image preprocessing involved two steps: salt-and-pepper noise removal and slant normalization. The database has 24 genuine signatures and 24 forgeries available for each writer.
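The description does not say which histogram-based binarization was used. As an illustration only, the sketch below uses Otsu's method, a common histogram-based threshold, on an 8-bit gray-scale array; the function name `otsu_threshold` and the toy image are hypothetical and not taken from the CEDAR pipeline.

```python
import numpy as np


def otsu_threshold(gray: np.ndarray) -> int:
    """Return the 8-bit level that maximizes between-class variance (Otsu)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    omega = np.cumsum(prob)        # fraction of pixels at or below each level
    mu = np.cumsum(prob * levels)  # cumulative intensity mean
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / denom
    sigma_b[denom == 0] = 0.0      # levels that leave one class empty
    return int(np.argmax(sigma_b))


# Toy image (assumed): dark ink (50) on the left, light paper (200) on the right.
gray = np.full((4, 8), 200, dtype=np.uint8)
gray[:, :4] = 50

t = otsu_threshold(gray)
ink = gray <= t  # True where the pixel is treated as ink
```

Treating pixels at or below the threshold as ink assumes dark strokes on a light background, which is typical for scanned signatures but is not stated in the description.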

4 papers · 1 benchmark · Images

Chickenpox Cases in Hungary

Chickenpox Cases in Hungary is a spatio-temporal dataset of weekly chickenpox (childhood disease) cases from Hungary. It can be used as a longitudinal dataset for benchmarking the predictive performance of spatiotemporal graph neural network architectures. The dataset consists of a county-level adjacency matrix and time series of the county-level reported cases between 2005 and 2015. There are 2 specific related tasks:

4 papers · 0 benchmarks · Graphs
