19,997 machine learning datasets
CoDEx comprises a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false.
A richer dataset based on real items on Craigslist.
Cube++ is a novel dataset for the color constancy problem that extends the Cube+ dataset. It includes 4890 images of different scenes under various conditions. To calculate the ground-truth illumination, a calibration object with known surface colors was placed in every scene.
Dem@Care provides the following datasets, collected during lab and home experiments. Data collection took place at the Greek Alzheimer’s Association for Dementia and Related Disorders in Thessaloniki, Greece, and in participants’ homes. The datasets include video and audio recordings as well as data from physiological sensors, along with data from sleep, motion, and plug sensors.
Europeana Newspapers consists of four datasets of 100 pages each for the languages Dutch, French, and German (including Austrian). Created as part of the Europeana Newspapers project, it is expected to contribute to the further development and improvement of named entity recognition systems with a focus on historical content.
The Flickr Cropping Dataset consists of high quality cropping and pairwise ranking annotations used to evaluate the performance of automatic image cropping approaches.
1000 query triples on 120 tables.
GolfDB is a high-quality video dataset created for general recognition applications in the sport of golf, and specifically for the task of golf swing sequencing.
The dataset consists of 3640 bursts (made up of 28461 images in total), organized into subfolders, plus the results of an image processing pipeline. Each burst consists of the raw burst input (in DNG format) and certain metadata not present in the images, as sidecar files.
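As a quick sanity check on the figures quoted above, the average burst length can be derived from the two totals. This is only a sketch of the arithmetic; the variable names are illustrative and are not part of the dataset's tooling.

```python
# Totals quoted in the description above.
num_bursts = 3640
num_images = 28461

# Mean number of raw frames per burst.
avg_images_per_burst = num_images / num_bursts
print(f"{avg_images_per_burst:.2f} images per burst on average")
```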
The Headlines dataset for sarcasm detection is collected from two news websites. TheOnion aims at producing sarcastic versions of current events; the dataset includes all the headlines from its News in Brief and News in Photos categories (which are sarcastic), together with real (non-sarcastic) news headlines from HuffPost. This dataset has the following advantages over the existing Twitter datasets:
The UCLA Human-Human-Object Interaction (HHOI) dataset is a new RGB-D video dataset that includes 3 types of human-human interactions (shake hands, high-five, pull up) and 2 types of human-object-human interactions (throw and catch, hand over a cup). On average, there are 23.6 instances per interaction, performed by a total of 8 actors and recorded from various views. Each interaction lasts 2-7 seconds and is presented at 10-15 fps.
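The numbers in the HHOI description imply a rough overall size, which can be worked out as follows. This is a back-of-the-envelope sketch: the total is an estimate from the stated average, and the frame bounds assume the shortest clip coincides with the lowest frame rate and the longest clip with the highest, which the description does not state.

```python
# Figures quoted in the description above.
human_human_types = 3
human_object_human_types = 2
avg_instances_per_type = 23.6

# Approximate total number of recorded interaction instances.
interaction_types = human_human_types + human_object_human_types
approx_total_instances = round(interaction_types * avg_instances_per_type)

# Loose bounds on frames per interaction (assumption: extremes coincide).
min_frames = 2 * 10   # 2 seconds at 10 fps
max_frames = 7 * 15   # 7 seconds at 15 fps

print(approx_total_instances, min_frames, max_frames)
```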
HindEnCorp, a parallel corpus of Hindi and English, and HindMonoCorp, a monolingual corpus of Hindi, in their release version 0.5. Both corpora were collected from web sources and preprocessed primarily for the training of statistical machine translation systems. HindEnCorp consists of 274k parallel sentences (3.9 million Hindi and 3.8 million English tokens). HindMonoCorp amounts to 787 million tokens in 44 million sentences.
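The corpus sizes above translate into average sentence lengths, which is often the more useful figure when planning translation experiments. The sketch below simply divides the rounded totals quoted in the description, so the results are approximate; the variable names are illustrative.

```python
# Rounded totals quoted in the description above.
parallel_sentences = 274_000
hindi_tokens = 3_900_000
english_tokens = 3_800_000
mono_tokens = 787_000_000
mono_sentences = 44_000_000

# Approximate mean tokens per sentence for each corpus side.
avg_hindi_parallel = hindi_tokens / parallel_sentences
avg_english_parallel = english_tokens / parallel_sentences
avg_hindi_mono = mono_tokens / mono_sentences

print(f"parallel Hindi:   {avg_hindi_parallel:.1f} tokens/sentence")
print(f"parallel English: {avg_english_parallel:.1f} tokens/sentence")
print(f"monolingual Hindi: {avg_hindi_mono:.1f} tokens/sentence")
```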
HJDataset is a large dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements. The dataset is constructed using a combination of human and machine efforts.
An abnormal activity dataset for research use that contains 483,566 annotated frames.
Intrinsic Images in the Wild is a large scale, public dataset for intrinsic image decompositions of real-world scenes selected from the OpenSurfaces dataset. Each image is annotated with crowdsourced pairwise comparisons of material properties.
The IPN Hand dataset is a benchmark video dataset with sufficient size, variation, and real-world elements able to train and evaluate deep neural networks for continuous Hand Gesture Recognition (HGR).
Consists of user-generated aerial videos from social media with annotations of instance-level building damage masks. This provides the first benchmark for quantitative evaluation of models to assess building damage using aerial videos.
Dataset for lyrics alignment and transcription evaluation. It contains 20 music pieces under CC license from the Jamendo website along with their lyrics, with:
Libri-Adapt aims to support unsupervised domain adaptation research on speech recognition models.
A large-scale Indonesian summarization dataset consisting of harvested articles from Liputan6.com, an online news portal, resulting in 215,827 document-summary pairs.