Datasets

19,997 machine learning datasets

EyeCar

EyeCar is a dataset of driving videos of vehicles involved in rear-end collisions, paired with eye fixation data captured from human subjects. It contains 21 front-view videos captured in various traffic, weather, and daylight conditions. Each video is 30 seconds long and contains typical driving tasks (e.g., lane keeping, merging in, and braking) ending in a rear-end collision.

2 papers · 0 benchmarks · Images

MDD (Movie Dialog dataset)

The Movie Dialog dataset (MDD) is designed to measure how well models perform at goal-oriented and non-goal-oriented dialog centered on movies (question answering, recommendation, and discussion).

2 papers · 0 benchmarks · Texts

30MQA (30M Factoid Question-Answer Corpus)

A very large corpus of question-answer pairs, produced by applying a neural network architecture to the Freebase knowledge base to transduce facts into natural language questions.

2 papers · 0 benchmarks · Texts

360-SOD

360-SOD contains 500 high-resolution equirectangular images.

2 papers · 0 benchmarks · Images

Aesthetics Text Corpus

An exhaustive list of stop lemmas created from 12 corpora across multiple domains, consisting of over 13 million words, from which more than 200,000 lemmas were generated, and 11 publicly available stop word lists comprising over 1000 words, from which nearly 400 unique lemmas were generated.

2 papers · 0 benchmarks

AfroMNIST

A set of synthetic MNIST-style datasets for four orthographies used in Afro-Asiatic and Niger-Congo languages: Ge'ez (Ethiopic), Vai, Osmanya, and N'Ko. These datasets serve as "drop-in" replacements for MNIST.

2 papers · 0 benchmarks

ALGAD (Andy Lomas Generative Art Dataset)

Repository of a generative art dataset by computer artist Andy Lomas.

2 papers · 0 benchmarks · Images

ANETAC (Arabic Named Entity Transliteration and Classification)

An English-Arabic named entity transliteration and classification dataset built from freely available parallel translation corpora. The dataset contains 79,924 instances. Each instance is a triplet (e, a, c), where e is the English named entity, a is its Arabic transliteration, and c is its class, which is one of Person, Location, or Organization. ANETAC is aimed mainly at researchers working on Arabic named entity transliteration, but it can also be used for named entity classification.

2 papers · 0 benchmarks
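The (e, a, c) triplet in the ANETAC entry can be modeled as a small record type. The following is a minimal sketch, not the dataset's official loader: the tab-separated line layout, the label strings `PER`, `LOC`, and `ORG`, and the names `Triplet` and `parse_triplet` are all assumptions made for illustration.

```python
from typing import NamedTuple

# Hypothetical label strings for the three classes named in the description
# (Person, Location, Organization). The real files may encode them differently.
ALLOWED_CLASSES = {"PER", "LOC", "ORG"}


class Triplet(NamedTuple):
    """One instance: English entity, Arabic transliteration, class."""
    english: str
    arabic: str
    label: str


def parse_triplet(line: str, sep: str = "\t") -> Triplet:
    """Parse one line of an assumed 'e<sep>a<sep>c' layout into a Triplet."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    english, arabic, label = (field.strip() for field in fields)
    if label not in ALLOWED_CLASSES:
        raise ValueError(f"unknown class label: {label!r}")
    return Triplet(english, arabic, label)


if __name__ == "__main__":
    example = parse_triplet("London\tلندن\tLOC")
    print(example.english, example.arabic, example.label)
```

Keeping the class as a validated field makes the same record usable for both tasks mentioned in the entry: transliteration uses the (e, a) pair, classification uses (e, c) or (a, c).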

APE (Automatic Post-Editing)

The APE dataset is used to evaluate automatic post-editing (APE) of machine translation, which is the task of improving the output of a black-box MT system by automatically fixing its mistakes. The act of post-editing text can be fully specified as a sequence of delete and insert actions at given positions.

2 papers · 0 benchmarks · Texts
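The idea in the APE entry, that a post-edit is a sequence of delete and insert actions at given positions, can be sketched as follows. This is an illustrative example only: the tuple encoding of actions, the function name `apply_actions`, and the convention that each position refers to the token sequence as it stands when the action is applied are assumptions, not the dataset's own format.

```python
from typing import List, Sequence, Tuple, Union

# Hypothetical action encoding:
#   ("delete", position)          removes the token at that position
#   ("insert", position, token)   inserts a token before that position
Action = Union[Tuple[str, int], Tuple[str, int, str]]


def apply_actions(tokens: Sequence[str], actions: Sequence[Action]) -> List[str]:
    """Apply delete/insert actions in order and return the post-edited tokens.

    Positions are interpreted against the current state of the sequence,
    so earlier actions shift the indices seen by later ones.
    """
    out = list(tokens)
    for action in actions:
        kind, pos = action[0], action[1]
        if kind == "delete":
            if not 0 <= pos < len(out):
                raise IndexError(f"delete position out of range: {pos}")
            del out[pos]
        elif kind == "insert":
            if not 0 <= pos <= len(out):
                raise IndexError(f"insert position out of range: {pos}")
            out.insert(pos, action[2])
        else:
            raise ValueError(f"unknown action: {kind!r}")
    return out


if __name__ == "__main__":
    mt_output = ["the", "cat", "sit", "mat"]
    edits = [("delete", 2), ("insert", 2, "sat"), ("insert", 3, "on")]
    print(" ".join(apply_actions(mt_output, edits)))  # prints "the cat sat on mat"
```

A substitution needs no action of its own in this scheme: it is a delete followed by an insert at the same position, which is why the two action types suffice.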

APT-Malware

The APT Malware dataset is used to train classifiers to predict whether a given malware sample belongs to the "Advanced Persistent Threat" (APT) type. It contains 3,131 samples spread over 24 distinct malware classes.

2 papers · 0 benchmarks

Aqualoc

An underwater dataset recorded in a harbor that provides several sequences with synchronized measurements from a monocular camera, a MEMS-IMU, and a pressure sensor.

2 papers · 0 benchmarks · Images

AU-AIR

AU-AIR is a multi-modal aerial dataset captured by a UAV. It provides visual data, object annotations, and flight data (time, GPS, altitude, IMU sensor data, velocities), bringing together computer vision and robotics for UAVs.

2 papers · 0 benchmarks · Images

BD-4SK-ASR (Basic Dataset for Sorani Kurdish Automatic Speech Recognition)

The Basic Dataset for Sorani Kurdish Automatic Speech Recognition (BD-4SK-ASR) is a dataset for automatic speech recognition for Sorani Kurdish.

2 papers · 0 benchmarks · Speech

BdSLImset (Bangladeshi Sign Language Image Dataset)

Bangladeshi Sign Language Image Dataset (BdSLImset) is a dataset that contains images of different Bangladeshi sign letters.

2 papers · 0 benchmarks · Images

Blog Authorship Corpus

The Blog Authorship Corpus consists of the collected posts of 19,320 bloggers gathered from blogger.com in August 2004. The corpus contains a total of 681,288 posts and over 140 million words, or approximately 35 posts and 7,250 words per blogger.

2 papers · 0 benchmarks

CDNET (Change Detection)

A video database for testing change detection algorithms.

2 papers · 0 benchmarks

CIC (Catalonia Independence Corpus)

The dataset is annotated with stance towards one topic, namely, the independence of Catalonia.

2 papers · 0 benchmarks · Texts

Climate Claims

The Climate Change Claims dataset for generating fact-checking summaries contains claims broadly related to climate change and global warming, collected from climatefeedback.org. It contains 1k documents from 104 different claims from 97 different domains.

2 papers · 0 benchmarks · Texts

CLUENER2020

CLUENER2020 is a well-defined fine-grained dataset for named entity recognition in Chinese. CLUENER2020 contains 10 categories.

2 papers · 0 benchmarks

Colorectal Adenoma

Colorectal Adenoma contains 177 whole slide images (156 contain adenoma) gathered and labelled by pathologists from the Department of Pathology, The Chinese PLA General Hospital.

2 papers · 0 benchmarks · Images, Medical
