3,148 machine learning datasets
Rico is a public UI corpus with 72K Android UI screens mined from 9.7K Android apps (Deka et al., 2017). Each screen in Rico comes with a screenshot image and a view hierarchy of a collection of UI objects. The authors manually removed screens whose view hierarchies do not match their screenshots by asking annotators to visually verify whether the bounding boxes of the view hierarchy leaves match each UI object on the corresponding screenshot image. This filtering results in 25K unique screens.
The Discovery dataset consists of adjacent sentence pairs (s1, s2) with a discourse marker (y) that occurred at the beginning of s2. The pairs were extracted from the depcc web corpus.
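The (s1, s2, y) structure described above can be illustrated with a minimal sketch. It is not the authors' extraction pipeline: the marker list, the function name `extract_pairs`, and the choice to strip the marker from s2 are all assumptions made for illustration.

```python
# Hypothetical subset of discourse markers; the real inventory is not given here.
MARKERS = ("however", "therefore", "for example")


def extract_pairs(sentences, markers=MARKERS):
    """Return (s1, s2, y) triples for adjacent sentences where s2 starts with a marker y."""
    pairs = []
    for s1, s2 in zip(sentences, sentences[1:]):
        lowered = s2.lower()
        # Try longer markers first so multi-word markers are not shadowed.
        for y in sorted(markers, key=len, reverse=True):
            next_char = lowered[len(y):len(y) + 1]
            # Require a word boundary after the marker ("therefore" vs "thereforeX").
            if lowered.startswith(y) and not next_char.isalnum():
                # Assumption: the marker and a following comma are stripped from s2.
                rest = s2[len(y):].lstrip(" ,")
                pairs.append((s1, rest, y))
                break
    return pairs


sentences = [
    "It rained all day.",
    "However, the match went ahead.",
    "The crowd stayed.",
]
pairs = extract_pairs(sentences)
print(pairs)
```

Only the first adjacent pair yields a triple here, because the third sentence does not begin with any marker in the assumed list.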
CLEVR-Dialog is a large diagnostic dataset for studying multi-round reasoning in visual dialog. Specifically, the authors construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset. This combination results in a dataset where all aspects of the visual dialog are fully annotated. In total, CLEVR-Dialog contains 5 instances of 10-round dialogs for about 85k CLEVR images, totaling 4.25M question-answer pairs.
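The 4.25M figure follows from the other numbers in the description, assuming one question-answer pair per dialog round and taking "about 85k" as exactly 85,000. A quick check:

```python
# Figures quoted in the description; 85_000 stands in for "about 85k".
dialogs_per_image = 5
rounds_per_dialog = 10  # assumption: one question-answer pair per round
images = 85_000

qa_pairs = dialogs_per_image * rounds_per_dialog * images
print(f"{qa_pairs:,}")  # 4,250,000
```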
Chart2Text is a dataset that was crawled from 23,382 freely accessible pages from statista.com in early March of 2020, yielding a total of 8,305 charts and associated summaries. For each chart, the chart image, the underlying data table, the title, the axis labels, and a human-written summary describing the statistic were downloaded.
ClaimBuster consists of 23,533 statements extracted from all U.S. general election presidential debates and annotated by human coders. The ClaimBuster dataset can be leveraged in building computational methods to identify claims that are worth fact-checking from the myriad of sources of digital or traditional media.
LC-QuAD is a Large Question Answering dataset with 30,000 pairs of questions and their corresponding SPARQL queries. The target knowledge bases are Wikidata and DBpedia, specifically the 2018 version.
LectureBank Dataset is a manually collected dataset of lecture slides. It contains 1,352 online lecture files from 60 courses covering 5 different domains: Natural Language Processing (nlp), Machine Learning (ml), Artificial Intelligence (ai), Deep Learning (dl) and Information Retrieval (ir). It also contains the corresponding annotations for each slide.
Multi-XScience is a large-scale dataset for multi-document summarization of scientific articles. It has 30,369, 5,066 and 5,093 samples for the train, validation and test splits, respectively. The average document length is 778.08 words and the average summary length is 116.44 words.
WikiAtomicEdits is a corpus of 43 million atomic edits across 8 languages. These edits are mined from Wikipedia edit history and consist of instances in which a human editor has inserted a single contiguous phrase into, or deleted a single contiguous phrase from, an existing sentence.
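To make the notion of an atomic edit concrete, the following sketch checks whether two versions of a sentence differ by exactly one contiguous inserted or deleted phrase. It is an illustration only, not the corpus's mining procedure: the function name `atomic_edit` and the whitespace tokenization are assumptions.

```python
from typing import Optional, Tuple


def atomic_edit(before: str, after: str) -> Optional[Tuple[str, str]]:
    """Return ("insert" | "delete", phrase) if the edit is atomic, else None."""
    a, b = before.split(), after.split()  # assumption: whitespace tokens
    if len(a) == len(b):
        return None  # same length: neither a pure insertion nor a pure deletion
    short, long_ = (a, b) if len(a) < len(b) else (b, a)

    # Longest common prefix of the two token lists.
    p = 0
    while p < len(short) and short[p] == long_[p]:
        p += 1

    # Longest common suffix that does not overlap the prefix in the shorter list.
    s = 0
    while s < len(short) - p and short[-1 - s] == long_[-1 - s]:
        s += 1

    # The edit is atomic only if prefix + suffix account for the whole shorter sentence.
    if p + s != len(short):
        return None

    phrase = " ".join(long_[p:len(long_) - s])
    kind = "insert" if len(b) > len(a) else "delete"
    return kind, phrase


print(atomic_edit("She went to the store", "She went quickly to the store"))
# ('insert', 'quickly')
print(atomic_edit("She went quickly to the store", "She went to the store"))
# ('delete', 'quickly')
print(atomic_edit("a b c", "a x c y"))
# None
```

The third call returns `None` because the change touches two separate positions, so it is not a single contiguous phrase.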
A corpus that encompasses the complete history of conversations between contributors to Wikipedia, one of the largest online collaborative communities. By recording the intermediate states of conversations (including not only comments and replies, but also their modifications, deletions and restorations), this data offers an unprecedented view of online conversation.
The Retrieval Question-Answering (ReQA) benchmark tests a model’s ability to retrieve relevant answers efficiently from a large set of documents.
Bigram Relatedness Dataset (BiRD) is a large, fine-grained bigram relatedness dataset built using a comparative annotation technique called Best-Worst Scaling. Each of BiRD's 3,345 English term pairs involves at least one bigram. BiRD is made freely available to foster further research on how meaning can be represented and how meaning can be composed.
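Best-Worst Scaling annotations are commonly turned into real-valued scores by simple counting: the fraction of times an item was chosen as best minus the fraction of times it was chosen as worst. The sketch below shows that counting procedure on made-up data; the function name `bws_scores`, the annotation format, and the assumption that BiRD's scores were derived exactly this way are illustrative, not taken from the description above.

```python
from collections import Counter


def bws_scores(annotations):
    """Score each item as (#best - #worst) / #appearances, giving values in [-1, 1].

    `annotations` is an iterable of (items, best, worst), where `items` is the
    tuple shown to the annotator (hypothetical format).
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in annotations:
        seen.update(items)
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


# Two made-up annotations over the same 4-tuple of items.
annotations = [
    (("a", "b", "c", "d"), "a", "d"),
    (("a", "b", "c", "d"), "a", "c"),
]
scores = bws_scores(annotations)
print(scores)  # {'a': 1.0, 'b': 0.0, 'c': -0.5, 'd': -0.5}
```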
Rainbow is a multi-task benchmark for common-sense reasoning that uses six existing QA datasets: aNLI, Cosmos QA, HellaSWAG, Physical IQa, Social IQa, and WinoGrande.
Action-Based Conversations Dataset (ABCD) is a fully labeled, goal-oriented dialogue dataset with over 10K human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by policies to achieve task success. The dataset is proposed to study customer service dialogue systems in more realistic settings.
XFORMAL is a multilingual formal style transfer benchmark of multiple formal reformulations of informal text in Brazilian Portuguese, French, and Italian.
e-ViL is a benchmark for explainable vision-language tasks. e-ViL spans three datasets of human-written NLEs (natural language explanations) and provides a unified evaluation framework that is designed to be reusable for future work.
OpenMEVA is a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples.
Chinese Medical Named Entity Recognition, a dataset first released in CHIP2020, is used for the CMeEE task. Given a pre-defined schema, the task is to identify and extract entities from the given sentence and classify them into nine categories: disease, clinical manifestations, drugs, medical equipment, medical procedures, body, medical examinations, microorganisms, and department.
PointQA is a set of datasets for Visual Question Answering (VQA) that require a pointer to an object in the image to be answered correctly. The different datasets are: PointQA-Local, PointQA-LookTwice and PointQA-General.
ReaSCAN is a synthetic navigation task that requires models to reason about surroundings over syntactically difficult languages.