3,148 machine learning datasets
This resource contains 10.5 million paragraphs with associated statement labels, realized as one paragraph per file, one sentence per line. Each file is placed in a subdirectory named after its annotated class. The statements were extracted from author-annotated environments, where we selected only the first paragraph immediately following the heading. Headings include both structural sections (e.g. Introduction) and scholarly statement annotations (e.g. Definition, Proof, Remark).
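As a rough illustration, a corpus laid out this way (class-named subdirectories, one paragraph per file, one sentence per line) could be loaded with a short script; all paths, file names, and the function name below are hypothetical, not part of the dataset's documented interface:

```python
import os

def load_labeled_paragraphs(root):
    """Yield (label, sentences) pairs from a tree where each subdirectory
    of `root` is named after a class and each file inside it holds one
    paragraph, one sentence per line."""
    for label in sorted(os.listdir(root)):
        class_dir = os.path.join(root, label)
        if not os.path.isdir(class_dir):
            continue
        for fname in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, fname)
            with open(path, encoding="utf-8") as f:
                # Strip trailing newlines; skip blank lines.
                sentences = [line.strip() for line in f if line.strip()]
            yield label, sentences
```

Each yielded item pairs a class label (the subdirectory name) with the paragraph's sentence list, which is convenient input for a text classifier.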
This is the second public release of the arXMLiv dataset generated by the KWARC research group. It contains 1,232,186 HTML5 scientific documents from the arXiv.org preprint archive, converted from their respective TeX sources, a 13% increase in available articles over the 08.2017 release.
Scroll Readability Dataset contains scroll interactions of 598 participants reading advanced and elementary texts from the OneStopEnglish corpus.
The Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE (SILICONE) is a collection of resources for training, evaluating, and analyzing natural language understanding systems specifically designed for spoken language. All datasets are in English and cover a large variety of domains (e.g. daily life, scripted scenarios, joint task completion, phone call conversations, and television dialogue). Some datasets additionally include emotion and/or sentiment labels.
This dataset gathers three types of pairs from PubMed: Title-to-Abstract (training: 22,811 / development: 2,095 / test: 2,095), Abstract-to-Conclusion and Future Work (training: 22,811 / development: 2,095 / test: 2,095), and Conclusion and Future Work-to-Title (training: 15,902 / development: 2,095 / test: 2,095). Each sample contains an input and output pair as well as the corresponding terms (from the original KB and link prediction results).
This dataset gathers 14,857 entities, 133 relations, and the tokenized text corresponding to each entity from PubMed. It contains 875,698 training pairs, 109,462 development pairs, and 109,462 test pairs.
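A minimal sketch of one way such entity-relation data might be represented in code; the class, field names, and example values are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KBPair:
    """One knowledge-base fact (head, relation, tail) plus the tokenized
    text associated with each entity. Hypothetical schema for illustration."""
    head: str
    relation: str
    tail: str
    head_tokens: tuple
    tail_tokens: tuple

# Illustrative instance; the entity and relation names are made up.
example = KBPair(
    head="aspirin",
    relation="treats",
    tail="headache",
    head_tokens=("aspirin",),
    tail_tokens=("head", "ache"),
)
```

A frozen dataclass keeps each fact immutable and hashable, which is handy when deduplicating triples across the train/development/test splits.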
The ReviewRobot Dataset contains the data for the paper "ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis".
NewsMTSC is a dataset for target-dependent sentiment classification (TSC) on news articles reporting on policy issues. The dataset consists of more than 11k labeled sentences sampled from articles published by online US news outlets.
WikiBioCTE is a dataset for controllable text edition based on the existing dataset WikiBio (originally created for table-to-text generation). In the task of controllable text edition, the input is a long text, a question, and a target answer, and the output is a minimally modified text that fits the target answer. This task is important in many situations, such as changing some conditions, consequences, or properties in a legal document, or changing some key information of an event in a news text.
CEREC is a large-scale corpus for entity resolution in email conversations. The corpus consists of 6,001 email threads from the Enron Email Corpus containing 36,448 email messages and 60,383 entity coreference chains. The annotation was carried out as a two-step process with minimal manual effort.
Green family of datasets for emergent communications on relations.
We introduce the first Vietnamese spelling correction dataset, containing manually labeled mistakes and their corresponding correct words.
DiaKG is a high-quality Chinese dataset for diabetes knowledge graph construction.
D-OCC is a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems. It is used to study common grounding in dynamic environments.
LIGHT-Quests is an extension of LIGHT, a large-scale crowd-sourced fantasy text-game, to generate a dataset of quests. These contain natural language motivations paired with in-game goals and human demonstrations; completing a quest might require dialogue or actions (or both).
The MT40K dataset for predicting malware threat intelligence is a collection of 40,000 triples generated from 27,354 unique entities and 34 relations. The corpus consists of approximately 1,100 de-identified plain-text threat reports written between 2006 and 2021 and all CVE vulnerability descriptions created between 1990 and 2021. The annotated keyphrases were classified into entities derived from semantic categories defined in malware threat ontologies.
Dataset of 5,591 labeled issue tickets, originally created by Herzig et al. in "It’s Not a Bug, It’s a Feature: How Misclassification Impacts Bug Prediction".
MLQuestions is a domain-adaptation dataset for the machine learning domain containing 50K unaligned passages and 35K unaligned questions, and 3K aligned passage and question pairs.
Webis-ConcluGen-21 is a large-scale corpus of 136,996 samples of argumentative texts and their conclusions used for the task of generating informative conclusions.