Datasets

3,148 machine learning datasets

SSN (Semantic Scholar Network)

SSN (short for Semantic Scholar Network) is a scientific papers summarization dataset which contains 141K research papers in different domains and 661K citation relationships. The entire dataset constitutes a large connected citation graph.

5 papers | 0 benchmarks | Graphs, Texts

NFCorpus

NFCorpus is a full-text English retrieval data set for Medical Information Retrieval. It contains a total of 3,244 natural language queries (written in non-technical English, harvested from the NutritionFacts.org site) with 169,756 automatically extracted relevance judgments for 9,964 medical documents (written in a complex terminology-heavy language), mostly from PubMed.

5 papers | 1 benchmark | Texts

CQADupStack

CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.

5 papers | 1 benchmark | Texts

Vent

The Vent dataset is a large annotated dataset of text, emotions, and social connections. It comprises more than 33 million posts by nearly a million users, together with their social connections. Each post has an associated emotion. There are 705 different emotions, organized into 63 "emotion categories", forming a two-level taxonomy of affects.

5 papers | 0 benchmarks | Graphs, Texts

NELA-GT-2019

NELA-GT-2019 is an updated version of the NELA-GT-2018 dataset. NELA-GT-2019 contains 1.12M news articles from 260 sources collected between January 1st 2019 and December 31st 2019. Just as with NELA-GT-2018, these sources come from a wide range of mainstream news sources and alternative news sources. Included with the dataset are source-level ground truth labels from 7 different assessment sites covering multiple dimensions of veracity.

5 papers | 0 benchmarks | Texts

Twitter Abusive Behavior

A dataset of 80k tweets annotated for inappropriate speech (specifically abusive and hateful speech), as well as for normal and spam content.

5 papers | 0 benchmarks | Texts

QAConv

QAConv is a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions, from 10,259 selected conversations with both human-written and machine-generated questions. We segment long conversations into chunks, and use a question generator and dialogue summarizer as auxiliary tools to collect multi-hop questions. The dataset has two testing scenarios, chunk mode and full mode, depending on whether the grounded chunk is provided or retrieved from a large conversational pool.

5 papers | 0 benchmarks | Texts

DaN+

DaN+ is a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language.

5 papers | 0 benchmarks | Texts

ParaQA

ParaQA is a question answering (QA) dataset with multiple paraphrased responses for single-turn conversation over knowledge graphs (KG). The dataset was created using a semi-automated framework for generating diverse paraphrasing of the answers using techniques such as back-translation. The existing datasets for conversational question answering over KGs (single-turn/multi-turn) focus on question paraphrasing and provide only up to one answer verbalization. However, ParaQA contains 5000 question-answer pairs with a minimum of two and a maximum of eight unique paraphrased responses for each question.

5 papers | 0 benchmarks | Texts

EDT

The EDT dataset is designed as a benchmark for corporate event detection and text-based stock prediction (trading strategy).

5 papers | 0 benchmarks | Financial, Texts

CCPM (Chinese Classical Poetry Matching)

5 papers | 0 benchmarks | Texts

Ruddit

Ruddit is a dataset of English language Reddit comments that has fine-grained, real-valued scores for offensive language detection between -1 (maximally supportive) and 1 (maximally offensive).

5 papers | 0 benchmarks | Texts

GitHub-Python

A dataset for repairing AST parse (syntax) errors in Python code.

5 papers | 2 benchmarks | Texts

Taiga Corpus (An open-source corpus for machine learning.)

Taiga is a corpus in which text sources and their meta-information are collected according to popular ML tasks.

5 papers | 0 benchmarks | Texts

RuCoS (Russian Reading Comprehension with Commonsense Reasoning)

Russian Reading Comprehension with Commonsense Reasoning (RuCoS) is a large-scale reading comprehension dataset that requires commonsense reasoning. RuCoS consists of queries automatically generated from CNN/Daily Mail news articles; the answer to each query is a text span from a summarizing passage of the corresponding news. The goal of RuCoS is to evaluate a machine's ability to perform commonsense reasoning in reading comprehension.

5 papers | 2 benchmarks | Texts

LiDiRus (Linguistic Diagnostic for Russian)

LiDiRus is a diagnostic dataset that covers a large volume of linguistic phenomena, while allowing you to evaluate information systems on a simple test of textual entailment recognition. See the diagnostics for more details.

5 papers | 1 benchmark | Texts

CLIP (CLIP: A Dataset for Extracting Action Items for Physicians from Hospital Discharge Notes)

We created a dataset of clinical action items annotated over MIMIC-III. This dataset, which we call CLIP, is annotated by physicians and covers 718 discharge summaries, representing 107,494 sentences. Annotations were collected as character-level spans over discharge summaries after applying surrogate generation to fill in the anonymized templates from MIMIC-III text with faked data. We release these spans, their aggregation into sentence-level labels, and the sentence tokenizer used to aggregate the spans and label sentences. We also release the surrogate data generator, and the document IDs used for training, validation, and test splits, to enable reproduction. The spans are annotated with 0 or more labels of 7 different types, representing the different actions that may need to be taken: Appointment, Lab, Procedure, Medication, Imaging, Patient Instructions, and Other. We encourage the community to use this dataset to develop methods for automatically extracting clinical action items.

5 papers | 0 benchmarks | Texts

Hummingbird

Hummingbird is a dataset for examining the stylistic lexical cues identified by human perception and by BERT, and for characterizing the discrepancy between them. In Hummingbird, crowd workers relabeled benchmark datasets for style classification tasks.

5 papers | 0 benchmarks | Texts

MindCraft

MindCraft is a fine-grained dataset of collaborative tasks performed by pairs of human subjects in the 3D virtual blocks world of Minecraft. It provides information that captures partners' beliefs of the world and of each other as an interaction unfolds, bringing abundant opportunities to study human collaborative behaviors in situated language communication.

5 papers | 0 benchmarks | Texts

ItaCoLA

ItaCoLA is a corpus for monolingual and cross-lingual acceptability judgments which contains almost 10,000 sentences with acceptability judgments.

5 papers | 2 benchmarks | Texts
