3,148 machine learning datasets
This resource contains 10.5 million paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the first paragraph immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark).
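As a rough illustration, a corpus laid out this way (class-named subdirectories, one paragraph per file, one sentence per line) could be loaded with a short script; all paths, file names, and the function name below are hypothetical, not part of the dataset's documented interface:

```python
import os

def load_labeled_paragraphs(root):
    """Yield (label, sentences) pairs from a tree where each subdirectory
    of `root` is named after a class and each file inside it holds one
    paragraph, one sentence per line."""
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue
        for fname in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, fname)
            with open(path, encoding="utf-8") as f:
                # Strip trailing newlines; skip blank lines.
                sentences = [line.strip() for line in f if line.strip()]
            yield label, sentences
```

Each yielded item pairs a class label (the subdirectory name) with the paragraph's sentence list, which is convenient input for a text classifier.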
This is the second public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,232,186 HTML5 scientific documents from the arXiv.org preprint archive, converted from their respective TeX sources, a 13% increase in available articles over the 08.2017 release.
Scroll Readability Dataset contains scroll interactions of 598 participants reading advanced and elementary texts from the OneStopEnglish corpus.
The Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE) is a collection of resources for training, evaluating, and analyzing natural language understanding systems specifically designed for spoken language. All datasets are in English and cover a large variety of domains (e.g. daily life, scripted scenarios, joint task completion, phone call conversations, and television dialogue). Some datasets additionally include emotion and/or sentiment labels.
This dataset gathers three types of pairs from PubMed: Title-to-Abstract (training: 22,811 / development: 2,095 / test: 2,095), Abstract-to-Conclusion and Future Work (training: 22,811 / development: 2,095 / test: 2,095), and Conclusion and Future Work-to-Title (training: 15,902 / development: 2,095 / test: 2,095). Each sample contains an input and output pair as well as the corresponding terms (from the original KB and link prediction results).
This dataset gathers 14,857 entities, 133 relations, and the tokenized text corresponding to each entity from PubMed. It contains 875,698 training pairs, 109,462 development pairs, and 109,462 test pairs.
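A minimal sketch of one way such entity-relation data might be represented in code; the class, field names, and example values are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KBPair:
    """One knowledge-base fact (head, relation, tail) plus the tokenized
    text associated with each entity. Hypothetical schema for illustration."""
    head: str
    relation: str
    tail: str
    head_tokens: tuple
    tail_tokens: tuple

# Illustrative instance; the entity and relation names are made up.
example = KBPair(
    head="aspirin",
    relation="treats",
    tail="headache",
    head_tokens=("aspirin",),
    tail_tokens=("head", "ache"),
)
```

A frozen dataclass keeps each fact immutable and hashable, which is handy when deduplicating triples across the train/development/test splits.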
The ReviewRobot Dataset contains the data for the paper "ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis".
NewsMTSC is a dataset for target-dependent sentiment classification (TSC) on news articles reporting on policy issues. The dataset consists of more than 11k labeled sentences sampled from articles published by online US news outlets.
WikiBioCTE is a dataset for controllable text edition based on the existing dataset WikiBio (originally created for table-to-text generation). In the task of controllable text edition, the input is a long text, a question, and a target answer, and the output is a minimally modified text that fits the target answer. This task is important in many situations, such as changing some conditions, consequences, or properties in a legal document, or changing some key information of an event in a news text.
CEREC is a large-scale corpus for entity resolution in email conversations. The corpus consists of 6,001 email threads from the Enron Email Corpus containing 36,448 email messages and 60,383 entity coreference chains. The annotation was carried out as a two-step process with minimal manual effort.
Green family of datasets for emergent communications on relations.
We introduce the first Vietnamese spelling correction dataset, containing manually labeled mistakes and their corresponding correct words.
DiaKG is a high-quality Chinese dataset for diabetes knowledge graph construction.
D-OCC is a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems. It is used to study common grounding in dynamic environments.
LIGHT-Quests is an extension of LIGHT, a large-scale crowd-sourced fantasy text-game, to generate a dataset of quests. These contain natural language motivations paired with in-game goals and human demonstrations; completing a quest might require dialogue or actions (or both).
The MT40K dataset for predicting malware threat intelligence is a collection of 40,000 triples generated from 27,354 unique entities and 34 relations. The corpus consists of approximately 1,100 de-identified plain-text threat reports written between 2006 and 2021 and all CVE vulnerability descriptions created between 1990 and 2021. The annotated keyphrases were classified into entities derived from semantic categories defined in malware threat ontologies.
Dataset of 5,591 labeled issue tickets, originally created by Herzig et al. in "It’s Not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction".
MLQuestions is a domain-adaptation dataset for the machine learning domain containing 50K unaligned passages and 35K unaligned questions, and 3K aligned passage and question pairs.
Webis-ConcluGen-21 is a large-scale corpus of 136,996 samples of argumentative texts and their conclusions used for the task of generating informative conclusions.