19,997 machine learning datasets
19,997 dataset results
EuRoC MAV is a visual-inertial datasets collected on-board a Micro Aerial Vehicle (MAV). The dataset contains stereo images, synchronized IMU measurements, and accurate motion and structure ground-truth. The datasets facilitates the design and evaluation of visual-inertial localization algorithms on real flight data
The TUT Acoustic Scenes 2017 dataset is a collection of recordings from various acoustic scenes all from distinct locations. For each recording location 3-5 minute long audio recordings are captured and are split into 10 seconds which act as unit of sample for this task. All the audio clips are recorded with 44.1 kHz sampling rate and 24 bit resolution.
BRATS 2016 is a brain tumor segmentation dataset. It shares the same training set as BRATS 2015, which consists of 220 HHG and 54 LGG. Its testing dataset consists of 191 cases with unknown grades. Image Source: https://sites.google.com/site/braintumorsegmentation/home/brats_2016
The KLEJ benchmark (Kompleksowa Lista Ewaluacji Językowych) is a set of nine evaluation tasks for the Polish language understanding task.
EVE (End-to-end Video-based Eye-tracking) is a dataset for eye-tracking. It is collected from 54 participants and consists of 4 camera views, over 12 million frames and 1327 unique visual stimuli (images, video, text), adding up to approximately 105 hours of video data in total.
JEC-QA is a LQA (Legal Question Answering) dataset collected from the National Judicial Examination of China. It contains 26,365 multiple-choice and multiple-answer questions in total. The task of the dataset is to predict the answer using the questions and relevant articles. To do well on JEC-QA, both retrieving and answering are important.
Provides a wide range of raw sensor data that is accessible on almost any modern-day smartphone together with a high-quality ground-truth track.
ANTIQUE is a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset contains 34,011 manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing.
This is a dataset for evaluating summarisation methods for research papers.
A large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018).
FewGLUE consists of a random selection of 32 training examples from the SuperGLUE training sets and up to 20,000 unlabeled examples for each SuperGLUE task.
A corpus of real-world texts.
HINT3 is a dataset for intent detection. It consists of 3 different datasets each containing a diverse set of intents in a single domain - mattress products retail, fitness supplements retail and online gaming named SOFMattress, Curekart and Powerplay11.
A benchmark which bridges the gap between freely available, documented, and motivated artificial benchmarks and properties of real industrial problems. The resulting industrial benchmark (IB) has been made publicly available to the RL community by publishing its Java and Python code, including an OpenAI Gym wrapper, on Github.
A dataset of ~19K questions that are elicited while a person is reading through a document.
Logo-2K+:A Large-Scale Logo Dataset for Scalable Logo Classification The Logo-2K+ dataset contains a diverse range of logo classes from real-world logo images. It contains 167,140 images with 10 root categories and 2,341 leaf categories. The 10 different root categories are: Food, Clothes, Institution, Accessories, Transportation, Electronic, Necessities, Cosmetic, Leisure and Medical.
The first summarization collection containing question-driven summaries of answers to consumer health questions. This dataset can be used to evaluate single or multi-document summaries generated by algorithms using extractive or abstractive approaches.
Provides detailed, graph-based annotations of social situations depicted in movie clips. Each graph consists of several types of nodes, to capture who is present in the clip, their emotional and physical attributes, their relationships (i.e., parent/child), and the interactions between them. Most interactions are associated with topics that provide additional details, and reasons that give motivations for actions.
Contains 300 scans of 10 people in a wide range of poses together with an evaluation methodology.
PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.