19,997 machine learning datasets
19,997 dataset results
Is one of the largest egocentric datasets in the object search task with eyetracking information available
The DeepMind Q&A Dataset consists of two datasets for Question Answering, CNN and DailyMail. Each dataset contains many documents (90k and 197k each), and each document companies on average 4 questions approximately. Each question is a sentence with one missing word/phrase which can be found from the accompanying document/context.
Intended to provide freely available data sets in various formats together with basic annotation to be useful for applications in computational linguistics, translation studies and cross-linguistic corpus studies.
The ECUST Food Dataset is a food recognition dataset that contains 2978 images
esXNLI is a bilingual NLI dataset. It comprises 2,490 examples from 5 different genres that were originally annotated in Spanish, and translated into English by professional translators. It serves as a counterpoint to XNLI, which was originally annotated in English and translated into 14 other languages, including Spanish. The dataset was conceived to be used in conjunction with the XNLI development set to analyse the effect of translation in cross-lingual transfer learning.
A massive, deduplicated corpus of 7.4M Python files from GitHub.
The EXEQ-300k dataset contains 290,479 detailed questions with corresponding math headlines from Mathematics Stack Exchange. The dataset can be used to generate concise math headlines from detailed math questions.
The FB15k-237-low dataset is a variation of the FB15k-237 dataset where relations with a low number of triplets are kept.
Consists of 76 million geo-tagged images in 16 cosmopolitan cities.
Fraxtil is an audio dataset where given a raw audio track, the goal is to produce a choreography step chart, similar to those used in the Dance Dance Revolution video game. It contains 90 songs choreographed by a single author, with 450 charts for the 90 songs.
A high-quality dataset for machine translation evaluation that aims at being one of the first non-synthetic gender-balanced test datasets.
Goldfinch is a dataset for fine-grained recognition challenges. It contains a list of bird, butterfly, aircraft, and dog categories with relevant Google image search and Flickr search URLs. In addition, it also includes a set of active learning annotations on dog categories.
HASY is a dataset of single symbols similar to MNIST. It contains 168,233 instances of 369 classes. HASY contains two challenges: A classification challenge with 10 pre-defined folds for 10-fold cross-validation and a verification challenge.
The dataset consists of the features associated with 402 5-second sound samples. The 402 sounds range from easily identifiable everyday sounds to intentionally obscured artificial ones. The dataset aims to lower the barrier for the study of aural phenomenology as the largest available audio dataset to include an analysis of causal attribution. Each sample has been annotated with crowd-sourced descriptions, as well as familiarity, imageability, arousal, and valence ratings.
A dataset for benchmarking action recognition algorithms in natural environments, while making use of 3D information. The dataset contains around 650 video clips, across 14 classes. In addition, two state of the art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies are also proposed, that extend to the 3D data.
Publicly available dataset in the hotel domain (50M versus 0.9M) and additionally, the largest recommendation dataset in a single domain and with textual reviews (50M versus 22M).
An annotated dataset is released to enable dynamic scene classification that includes 80 hours of diverse high quality driving video data clips collected in the San Francisco Bay area. The dataset includes temporal annotations for road places, road types, weather, and road surface conditions.
iFakeFaceDB is a face image dataset for the study of synthetic face manipulation detection, comprising about 87,000 synthetic face images generated by the Style-GAN model and transformed with the GANprintR approach. All images were aligned and resized to the size of 224 x 224.
The Image and Video Advertisements collection consists of an image dataset of 64,832 image ads, and a video dataset of 3,477 ads. The data contains rich annotations encompassing the topic and sentiment of the ads, questions and answers describing what actions the viewer is prompted to take and the reasoning that the ad presents to persuade the viewer ("What should I do according to this ad, and why should I do it? "), and symbolic references ads make (e.g. a dove symbolizes peace).
The IndicNLP corpus is a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.