3,148 machine learning datasets
CTFW is a large annotated procedural text dataset in the cybersecurity domain (3154 documents). It is used to generate flow graphs from procedural texts.
Intermediate annotations from the FEVER dataset that describe original facts extracted from Wikipedia and the mutations that were applied, yielding the claims in FEVER.
SciCo is an expert-annotated dataset for hierarchical CDCR (cross-document coreference resolution) for concepts in scientific papers, with the goal of jointly inferring coreference clusters and hierarchy between them.
This data accompanies the paper, under review at Mis2-KDD 2021: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People’s Republic of China.
CMeIE (Chinese Medical Information Extraction), a dataset also released at CHIP2020, is used for the CMeIE task. The task aims to identify both entities and relations in a sentence, subject to the schema constraints. The dataset defines 53 relations, including 10 synonymous sub-relationships and 43 other sub-relationships.
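Schema-constrained extraction of the kind CMeIE targets can be sketched as a check that each predicted (subject, relation, object) triple respects the entity types the schema allows. The relation names, entity types, and the `validate_triple` helper below are illustrative inventions, not part of the dataset release:

```python
# Toy schema: each relation constrains its subject and object entity types.
# These relation/type names are hypothetical examples, not the 53 CMeIE relations.
SCHEMA = {
    "synonym": ("disease", "disease"),
    "drug_treats": ("drug", "disease"),
}

def validate_triple(subj_type, relation, obj_type, schema=SCHEMA):
    """Return True if the typed triple satisfies the schema constraints."""
    if relation not in schema:
        return False
    return (subj_type, obj_type) == schema[relation]

print(validate_triple("drug", "drug_treats", "disease"))     # True
print(validate_triple("disease", "drug_treats", "disease"))  # False
```

A filter like this is typically applied after a joint entity/relation model, discarding triples that violate the schema.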
CHIP-CTC (CHIP Clinical Trial Classification) is a dataset for classifying clinical trial eligibility criteria, the fundamental guidelines defined to determine whether a subject qualifies for a clinical trial. All text data are collected from the website of the Chinese Clinical Trial Registry (ChiCTR), and a total of 44 categories are defined. The task resembles standard text classification; although it is not a new task, studies and corpora for Chinese clinical trial criteria are still limited, and we hope to promote future research for social benefit.
Probing cross-modal capabilities of Vision & Language models with a counting task.
Survey instrument, analysis code, and anonymized responses for the paper on review practices in SE.
The PEDC is a corpus of transcripts from 14 episodes of the This American Life podcast, annotated for events. The corpus contains the dialogue excerpts from these episodes (listed in Table 1). The granularity of annotation is the token: each token is annotated as either an event or a non-event. For more information, please download the corpus; see the annotation guide for how we define an event, and the README for how the annotations are encoded. Much more information about the corpus and its use appears in the paper "Automatic extraction of personal events from dialogue".
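Token-level binary event annotation of the kind described above can be sketched as parallel token and tag sequences. The tag names ("EVENT"/"O") and the example sentence are hypothetical; the corpus README documents the actual encoding:

```python
# Each token carries exactly one binary label: event or non-event.
tokens = ["I", "drove", "to", "school", "yesterday"]
tags   = ["O", "EVENT", "O", "O", "O"]  # hypothetical tag scheme

def event_tokens(tokens, tags):
    """Return the tokens annotated as events."""
    return [tok for tok, tag in zip(tokens, tags) if tag == "EVENT"]

print(event_tokens(tokens, tags))  # ['drove']
```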
WikiPII, an automatically labeled dataset composed of Wikipedia biography pages, annotated for personal information extraction.
This work presents a corpus for the commonsense inference task in Russian. Specifically, we construct event phrases covering a wide range of everyday situations, labelled with the intents and reactions of the event's main participant and the emotions of other people involved.
This Sanskrit speech corpus contains more than 78 hours of audio data: recordings of 45,953 sentences at a sampling rate of 22 kHz. The content is mainly readings of texts spanning various Śāstras of Saṃskṛtam literature, and also includes contemporary stories, radio programs, extempore discourse, etc.
Amharic Error Corpus is a manually annotated spelling error corpus for Amharic, a lingua franca in Ethiopia. The corpus is designed for evaluating spelling error detection and correction. Misspellings are tagged as non-word or real-word errors. In addition, the contextual information available in the corpus makes it useful for handling both types of spelling errors.
We introduce a new task of rephrasing for a more natural virtual assistant. Currently, virtual assistants work in the paradigm of intent-slot tagging, and the slot values are directly passed as-is to the execution engine. However, this setup fails in some scenarios such as messaging, when the query given by the user needs to be changed before repeating it or sending it to another user. For example, for queries like ‘ask my wife if she can pick up the kids’ or ‘remind me to take my pills’, we need to rephrase the content to ‘can you pick up the kids’ and ‘take your pills’. In this paper, we study the problem of rephrasing with messaging as a use case and release a dataset of 3,000 pairs of original query and rephrased query. We show that BART, a pre-trained transformer-based masked language model with auto-regressive decoding, is a strong baseline for the task, and show improvements by adding a copy-pointer and copy loss to it. We analyze different trade-offs of BART-based and LSTM-based seq2seq models.
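The perspective shift the task requires (first/third person query to second person message) can be illustrated with a toy rule-based sketch. The regex patterns below are ad-hoc heuristics covering only the two example queries, not the BART-based model from the paper:

```python
import re

def rephrase(query: str) -> str:
    """Toy perspective-shift rules; illustrative only."""
    # "ask X if she/he/they can VP" -> "can you VP"
    m = re.match(r"ask .+? if (?:she|he|they) can (.+)", query)
    if m:
        return f"can you {m.group(1)}"
    # "remind me to take my NP" -> "take your NP"
    m = re.match(r"remind me to take my (.+)", query)
    if m:
        return f"take your {m.group(1)}"
    return query

print(rephrase("ask my wife if she can pick up the kids"))  # can you pick up the kids
print(rephrase("remind me to take my pills"))               # take your pills
```

The brittleness of such rules is precisely why the paper treats rephrasing as a learned seq2seq problem.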
CHORD is the first chorus recognition dataset containing 627 songs for public use.
Antibody Watch is a dataset of text snippets extracted from over 2000 PubMed articles with annotations denoting specificity of antibodies.
These datasets were used in the paper 'Evaluation of Thematic Coherence in Microblogs' (ACL, 2021). The data is structured as follows: each file represents a cluster of tweets which contains the tweet IDs, the journalist annotations for quality evaluation and issue identification, as well as the metric evaluation scores. Note that a set of 50 clusters, equally split between COVID-19 and Election domains, is shared between the 3 annotators and thus contains 3 labels.
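Based on the structure described (each file is one cluster holding tweet IDs, annotator labels, and metric scores), a cluster file might be read as follows. The field names and JSON layout here are hypothetical placeholders; consult the released files for the actual format:

```python
import json

# Hypothetical cluster record; field names are illustrative only.
cluster_json = """
{
  "tweet_ids": ["1001", "1002"],
  "annotations": [{"annotator": "A1", "quality": 4, "issue": "health"}],
  "metric_scores": {"coherence": 0.72}
}
"""

cluster = json.loads(cluster_json)
print(len(cluster["tweet_ids"]))              # number of tweets in the cluster
print(cluster["metric_scores"]["coherence"])  # one automatic metric score
```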
MultiCite is a dataset of 12,653 citation contexts from over 1,200 computational linguistics papers, used for citation context analysis (CCA). MultiCite contains multi-sentence, multi-label citation contexts within full paper texts.
The ExBAN dataset is a corpus of natural-language explanations generated by crowd-sourced participants who were asked to explain simple Bayesian Network (BN) graphical representations. In a separate collection effort, these explanations are rated for clarity and informativeness.
SBU-WSD-Corpus is a corpus for Persian Word Sense Disambiguation (WSD). It is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory. SBU-WSD-Corpus consists of 19 Persian documents in different domains such as Sports, Science, Arts, etc. It includes 5892 content words of Persian running text and 3371 manually sense annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs).