19,997 machine learning datasets
19,997 dataset results
STACKEX expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks.
Maternal and Infant (MATINF) Dataset is a large-scale dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby caring in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
The RUN dataset is based on OpenStreetMap (OSM). The map contains rich layers and an abundance of entities of different types. Each entity is complex and can contain (at least) four labels: name, type, is building=y/n, and house number. An entity can spread over several tiles. As the maps do not overlap, only very few entities are shared among them. The RUN dataset aligns NL navigation instructions to coordinates of their corresponding route on the OSM map.
word2word contains easy-to-use word translations for 3,564 language pairs.
The Messidor database has been established to facilitate studies on computer-assisted diagnoses of diabetic retinopathy. The research community is welcome to test its algorithms on this database. In this section, you will find instructions on how to download the database.
Cata7 is the first cataract surgical instrument dataset for semantic segmentation. The dataset consists of seven videos while each video records a complete cataract surgery. All videos are from Beijing Tongren Hospital. Each video is split into a sequence of images, where resolution is 1920×1080 pixels. To reduce redundancy, the videos are downsampled from 30 fps to 1 fps. Also, images without surgical instruments are manually removed. Each image is labeled with precise edges and types of surgical instruments. This dataset contains 2,500 images, which are divided into training and test sets. The training set consists of five video sequences and test set consists of two video sequence.
ARID is a large-scale, multi-view object dataset collected with an RGB-D camera mounted on a mobile robot.
NYU-VP is a new dataset for multi-model fitting, vanishing point (VP) estimation in this case. Each image is annotated with up to eight vanishing points, and pre-extracted line segments are provided which act as data points for a robust estimator. Due to its size, the dataset is the first to allow for supervised learning of a multi-model fitting task.
WikiHowQA is a Community-based Question Answering dataset, which can be used for both answer selection and abstractive summarization tasks. It contains 76,687 questions in the train set, 8,000 in the development set and 22,354 in the test set.
This dataset contains speech recordings along with speaker physical parameters (height, weight, shoulder size, age ) as well as regional information and linguistic information.
The UCLA Aerial Event Dataest has been captured by a low-cost hex-rotor with a GoPro camera, which is able to eliminate the high frequency vibration of the camera and hold in air autonomously through a GPS and a barometer. It can also fly 20 ∼ 90m above the ground and stays 5 minutes in air.
The ADL Piano MIDI is a dataset of 11,086 piano pieces from different genres. This dataset is based on the Lakh MIDI dataset, which is a collection on 45,129 unique MIDI files that have been matched to entries in the Million Song Dataset. Most pieces in the Lakh MIDI dataset have multiple instruments, so for each file the authors of ADL Piano MIDI dataset extracted only the tracks with instruments from the "Piano Family" (MIDI program numbers 1-8). This process generated a total of 9,021 unique piano MIDI files. Theses 9,021 files were then combined with other approximately 2,065 files scraped from publicly-available sources on the internet. All the files in the final collection were de-duped according to their MD5 checksum.
AI2D-RST is a multimodal corpus of 1000 English-language diagrams that represent topics in primary school natural sciences, such as food webs, life cycles, moon phases and human physiology. The corpus is based on the Allen Institute for Artificial Intelligence Diagrams (AI2D) dataset, a collection of diagrams with crowd-sourced descriptions, which was originally developed to support research on automatic diagram understanding and visual question answering.
Dataset to address the problem of detecting people Looking At Each Other (LAEO) in video sequences.
The BanglaWriting dataset contains single-page handwritings of 260 individuals of different personalities and ages. Each page includes bounding-boxes that bounds each word, along with the unicode representation of the writing. This dataset contains 21,234 words and 32,787 characters in total. Moreover, this dataset includes 5,470 unique words of Bangla vocabulary. Apart from the usual words, the dataset comprises 261 comprehensible overwriting and 450 incomprehensible overwriting. All of the bounding boxes and word labels are manually-generated. The dataset can be used for complex optical character/word recognition, writer identification, and handwritten word segmentation. Furthermore, this dataset is suitable for extracting age-based and gender-based variation of handwriting.
The corpus represents the largest existing corpus of Catalan containing 687 million words, which is a significant increase given that until now the biggest corpus of Catalan, CuCWeb, counts 166 million words.
Chinese dataset on COVID-19 misinformation. CHECKED provides ground-truth on credibility, carefully obtained by ensuring the specific sources are used. CHECKED includes microblogs related to COVID-19, identified by using a specific list of keywords, covering a total 2120 microblogs published from December 2019 to August 2020. The dataset contains a rich set of multimedia information for each microblog including ground-truth label, textual, visual, response, and social network information.
Chinese Text in the Wild is a dataset of Chinese text with about 1 million Chinese characters from 3850 unique ones annotated by experts in over 30000 street view images. This is a challenging dataset with good diversity containing planar text, raised text, text under poor illumination, distant text, partially occluded text, etc.
ClovaCall is a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people. The raw dataset of ClovaCall includes approximately 112,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain.
Coached Conversational Preference Elicitation is a dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences in natural language. It was collected using a Wizard-of-Oz methodology between two paid crowd-workers, where one worker plays the role of an 'assistant', while the other plays the role of a 'user'.