19,997 machine learning datasets
19,997 dataset results
The DEAP dataset consists of two parts:
Complementary Commonsense (Com2Sense) is a dataset for benchmarking commonsense reasoning ability of NLP models. This dataset contains 4k statement true/false sentence pairs. The dataset is crowdsourced and enhanced with an adversarial model-in-the-loop setup to incentivize challenging samples. To facilitate a systematic analysis of commonsense capabilities, the dataset is designed along the dimensions of knowledge domains, reasoning scenarios and numeracy.
TimeDial presents a crowdsourced English challenge set, for temporal commonsense reasoning, formulated as a multiple choice cloze task with around 1.5k carefully curated dialogs. The dataset is derived from the DailyDialog, which is a multi-turn dialog corpus.
SEDE is a dataset comprised of 12,023 complex and diverse SQL queries and their natural language titles and descriptions, written by real users of the Stack Exchange Data Explorer out of a natural interaction. These pairs contain a variety of real-world challenges which were rarely reflected so far in any other semantic parsing dataset. The goal of this dataset is to take a significant step towards evaluation of Text-to-SQL models in a real-world setting. Compared to other Text-to-SQL datasets, SEDE contains at least 10 times more SQL queries templates (queries after canonization and anonymization of values) than other datasets, and has the most diverse set of utterances and SQL queries (in terms of 3-grams) out of all single-domain datasets. SEDE introduces real-world challenges, such as under-specification, usage of parameters in queries, dates manipulation and more.
Place Pulse is a crowdsourcing effort that aims to map which areas of a city are perceived as safer, livelier, wealthier, more active, beautiful and friendly. By asking users to select images from a pair, Place Pulse collected more than 1.5 million reports that evaluate more than 100,000 images from 56 cities.
UDIS-D is a large image dataset for image stitching or image registration. It contains different overlap rates, varying degrees of parallax, and variable scenes such as indoor, outdoor, night, dark, snow, and zooming.
AIT-QA is a dataset for Table Question Answering (Table-QA) which is specific to the airline industry. The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings of major airline companies for the fiscal years 2017-2019. It also contains annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms.
JetNet is a particle cloud dataset, containing gluon, top quark, light quark jets saved in .csv format.
iMiGUE is a dataset for emotional artificial intelligence research: identity-free video dataset for Micro-Gesture Understanding and Emotion analysis (iMiGUE). Different from existing public datasets, iMiGUE focuses on nonverbal body gestures without using any identity information, while the predominant researches of emotion analysis concern sensitive biometric data, like face and speech. Most importantly, iMiGUE focuses on micro-gestures, i.e., unintentional behaviors driven by inner feelings, which are different from ordinary scope of gestures from other gesture datasets which are mostly intentionally performed for illustrative purposes. Furthermore, iMiGUE is designed to evaluate the ability of models to analyze the emotional states by integrating information of recognized micro-gesture, rather than just recognizing prototypes in the sequences separately (or isolatedly).
RailSem19 offers 8500 unique images taken from a the ego-perspective of a rail vehicle (trains and trams). Extensive semantic annotations are provided, both geometry-based (rail-relevant polygons, all rails as polylines) and dense label maps with many Cityscapes-compatible road labels. Many frames show areas of intersection between road and rail vehicles (railway crossings, trams driving on city streets). RailSem19 is usefull for rail applications and road applications alike.
The first large demoire dataset. The dataset contains 135,000 image pairs, each containing an image contaminated with moire patterns and its corresponding uncontaminated reference image.
Crello dataset consists of design templates obtained from online design service, crello.com. The dataset contains designs for various display formats, such as social media posts, banner ads, blog headers, or printed posters, all in a vector format. In dataset construction, design templates and associated resources (e.g., linked images) from crello.com were first downloaded. After the initial data acquisition, the data structure was inspected and identified useful vector graphic information in each template. Next, mal-formed templates or those having more than 50 elements were eliminated, resulting in 23,182 templates. The data was paritioned to 18,768 / 2,315 / 2,278 examples for train, validation, and test splits.
ACS PUMS stands for American Community Survey (ACS) Public Use Microdata Sample (PUMS) and has been used to construct several tabular datasets for studying fairness in machine learning:
MultiEURLEX is a multilingual dataset for topic classification of legal documents. The dataset comprises 65k European Union (EU) laws, officially translated in 23 languages, annotated with multiple labels from the EUROVOC taxonomy. The dataset covers 23 official EU languages from 7 language families.
Depth in the Wild is a dataset for single-image depth perception in the wild, i.e., recovering depth from a single image taken in unconstrained settings. It consists of images in the wild annotated with relative depth between pairs of random points.
Hilti SLAM Challenge is a dataset for Simultaneous Localization and Mapping (SLAM) algorithms due to sparsity, varying illumination conditions, and dynamic objects. The sensor platform used to collect this dataset contains a number of visual, lidar and inertial sensors which have all been rigorously calibrated. All data is temporally aligned to support precise multi-sensor fusion. Each dataset includes accurate ground truth to allow direct testing of SLAM results. Raw data as well as intrinsic and extrinsic sensor calibration data from twelve datasets in various environments is provided. Each environment represents common scenarios found in building construction sites in various stages of completion.
BuildingNet is a large-scale dataset of 3D building models whose exteriors are consistently labeled. The dataset consists on 513K annotated mesh primitives, grouped into 292K semantic part components across 2K building models. The dataset covers several building categories, such as houses, churches, skyscrapers, town halls, libraries, and castles.
Overall duration per microphone: about 36 hours (31 hrs train / 2.5 hrs dev / 2.5 hrs test) Count of microphones: 3 (Microsoft Kinect, Yamaha, Samson) Count of wave-files per microphone: about 14500 Overall count of participations: 180 (130 male / 50 female)
Breast cancer (BC) has become the greatest threat to women’s health worldwide. Clinically, identification of axillary lymph node (ALN) metastasis and other tumor clinical characteristics such as ER, PR, and so on, are important for evaluating the prognosis and guiding the treatment for BC patients.
PTR is a new large-scale diagnostic visual reasoning dataset for research around part-based conceptual, relational and physical reasoning. PTR contains around 70k RGBD synthetic images with ground truth object and part level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning types.