19,997 machine learning datasets
EyeCar is a dataset of driving videos of vehicles involved in rear-end collisions, paired with eye fixation data captured from human subjects. It contains 21 front-view videos captured in various traffic, weather, and daylight conditions. Each video is 30 seconds long and contains typical driving tasks (e.g., lane keeping, merging in, and braking) ending in a rear-end collision.
The Movie Dialog dataset (MDD) is designed to measure how well models perform at goal-oriented and non-goal-oriented dialog centered on the topic of movies (question answering, recommendation, and discussion).
An enormous question-answer pair corpus produced by applying a novel neural network architecture to the knowledge base Freebase to transduce facts into natural language questions.
360-SOD contains 500 high-resolution equirectangular images.
An exhaustive list of stop lemmas created from 12 corpora across multiple domains, consisting of over 13 million words, from which more than 200,000 lemmas were generated, and 11 publicly available stop word lists comprising over 1000 words, from which nearly 400 unique lemmas were generated.
A set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge`ez (Ethiopic), Vai, Osmanya, and N'Ko. These datasets serve as "drop-in" replacements for MNIST.
Repository of a generative art dataset by computer artist Andy Lomas.
An English-Arabic named entity transliteration and classification dataset built from freely available parallel translation corpora. The dataset contains 79,924 instances. Each instance is a triplet (e, a, c), where e is the English named entity, a is its Arabic transliteration, and c is its class, which can be Person, Location, or Organization. The ANETAC dataset is aimed mainly at researchers working on Arabic named entity transliteration, but it can also be used for named entity classification.
The APE dataset is used to evaluate automatic post-editing (APE) of machine translation, which is the task of improving the output of a black-box MT system by automatically fixing its mistakes. The act of post-editing text can be fully specified as a sequence of delete and insert actions at given positions.
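As an illustration of how a post-edit can be written as delete and insert actions, here is a minimal Python sketch. The action encoding (`("delete", position)` and `("insert", position, token)`), the `apply_edits` helper, and the rule that each position refers to the sequence as it stands after the previous actions are all assumptions made for this example; they are not the dataset's actual file format.

```python
from typing import List, Tuple, Union

# Hypothetical action encoding, chosen for this sketch only:
#   ("delete", position)         removes the token at that position
#   ("insert", position, token)  inserts the token before that position
Action = Union[Tuple[str, int], Tuple[str, int, str]]


def apply_edits(tokens: List[str], actions: List[Action]) -> List[str]:
    """Apply delete/insert actions in order to a copy of the token list.

    Positions are interpreted against the sequence as it stands after
    all earlier actions have been applied (an assumption of this sketch).
    """
    out = list(tokens)
    for action in actions:
        kind, pos = action[0], action[1]
        if kind == "delete":
            if not 0 <= pos < len(out):
                raise IndexError(f"delete position {pos} out of range")
            del out[pos]
        elif kind == "insert":
            if not 0 <= pos <= len(out):
                raise IndexError(f"insert position {pos} out of range")
            out.insert(pos, action[2])
        else:
            raise ValueError(f"unknown action kind: {kind!r}")
    return out


# A made-up MT output and the edits that repair it.
mt_output = "he go to school yesterday".split()
actions = [("delete", 1), ("insert", 1, "went")]
post_edited = apply_edits(mt_output, actions)
print(" ".join(post_edited))
```

A substitution is expressed here as a delete followed by an insert at the same position, which is why two action types are enough to describe any post-edit.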
The APT Malware dataset is used to train classifiers to predict whether a given malware sample belongs to the “Advanced Persistent Threat” (APT) type. It contains 3,131 samples spread over 24 distinct malware classes.
A new underwater dataset recorded in a harbor that provides several sequences with synchronized measurements from a monocular camera, a MEMS-IMU, and a pressure sensor.
AU-AIR is a multi-modal aerial dataset captured by a UAV. It provides visual data, object annotations, and flight data (time, GPS, altitude, IMU sensor data, velocities), bringing together vision and robotics for UAVs.
The Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR) is a dataset for automatic speech recognition for Sorani Kurdish.
Bangladeshi Sign Language Image Dataset (BdSLImset) is a dataset that contains images of different Bangladeshi sign letters.
The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per person.
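The per-person averages quoted for the Blog Authorship Corpus can be checked with a few lines of Python. This sketch uses only the figures stated in the description; because the word count is given as "over 140 million", the words-per-blogger value computed here is a lower bound rather than an exact figure.

```python
# Figures taken from the corpus description.
bloggers = 19_320
posts = 681_288
words_lower_bound = 140_000_000  # stated as "over 140 million"

posts_per_blogger = posts / bloggers
words_per_blogger = words_lower_bound / bloggers  # lower bound only

print(f"{posts_per_blogger:.1f} posts per blogger")
print(f"at least {words_per_blogger:.0f} words per blogger")
```

Both results are consistent with the rounded values of roughly 35 posts and 7,250 words per person.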
A video database for testing change detection algorithms.
The dataset is annotated with stance towards one topic, namely, the independence of Catalonia.
The Climate Change Claims dataset for generating fact checking summaries contains claims broadly related to climate change and global warming from climatefeedback.org. It contains 1k documents from 104 different claims from 97 different domains.
CLUENER2020 is a well-defined fine-grained dataset for named entity recognition in Chinese. CLUENER2020 contains 10 categories.
Colorectal Adenoma contains 177 whole slide images (156 contain adenoma) gathered and labelled by pathologists from the Department of Pathology, The Chinese PLA General Hospital.