3,148 machine learning datasets
CTFW is a large annotated procedural text dataset in the cybersecurity domain (3154 documents). It is used to generate flow graphs from procedural texts.
Intermediate annotations from the FEVER dataset that describe original facts extracted from Wikipedia and the mutations that were applied, yielding the claims in FEVER.
SciCo is an expert-annotated dataset for hierarchical CDCR (cross-document coreference resolution) for concepts in scientific papers, with the goal of jointly inferring coreference clusters and hierarchy between them.
This data accompanies the paper, under review at Mis2-KDD 2021: Dataset of Propaganda Techniques of the State-Sponsored Information Operation of the People’s Republic of China.
CMeIE (Chinese Medical Information Extraction), a dataset also released at CHIP2020, is used for the CMeIE task. The task aims to identify both entities and relations in a sentence, subject to the schema constraints. The dataset defines 53 relations, including 10 synonymous sub-relationships and 43 other sub-relationships.
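Schema-constrained extraction of the kind CMeIE targets can be sketched as a check that each predicted (subject, relation, object) triple respects the entity types the schema allows. The relation names, entity types, and the `validate_triple` helper below are illustrative inventions, not part of the dataset release:

```python
# Toy schema: each relation constrains its subject and object entity types.
# These relation/type names are hypothetical examples, not the 53 CMeIE relations.
SCHEMA = {
    "synonym": ("disease", "disease"),
    "drug_treats": ("drug", "disease"),
}

def validate_triple(subj_type, relation, obj_type, schema=SCHEMA):
    """Return True if the typed triple satisfies the schema constraints."""
    if relation not in schema:
        return False
    return (subj_type, obj_type) == schema[relation]

print(validate_triple("drug", "drug_treats", "disease"))     # True
print(validate_triple("disease", "drug_treats", "disease"))  # False
```

A filter like this is typically applied after a joint entity/relation model, discarding triples that violate the schema.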
CHIP-CTC (CHIP Clinical Trial Classification) is a dataset for classifying clinical trial eligibility criteria, the fundamental guidelines defined to determine whether a subject qualifies for a clinical trial. All text data are collected from the website of the Chinese Clinical Trial Registry (ChiCTR), and a total of 44 categories are defined. The task resembles standard text classification; although it is not a new task, studies and corpora for Chinese clinical trial criteria are still limited, and we hope to promote future research for social benefit.
Probing cross-modal capabilities of Vision & Language models with a counting task.
Survey instrument, analysis code, and anonymized responses for the paper on review practices in SE.
The PEDC is a corpus of transcripts from 14 episodes of the This American Life podcast, annotated for events. The corpus contains the dialogue excerpts from these episodes (listed in Table 1). The granularity of annotation is the token: each token is annotated as either an event or a non-event. For more information, please download the corpus; see the annotation guide for how we define an event, and the README for how the annotations are encoded. Much more information about the corpus and its use appears in the paper "Automatic extraction of personal events from dialogue".
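Token-level binary event annotation of the kind described above can be sketched as parallel token and tag sequences. The tag names ("EVENT"/"O") and the example sentence are hypothetical; the corpus README documents the actual encoding:

```python
# Each token carries exactly one binary label: event or non-event.
tokens = ["I", "drove", "to", "school", "yesterday"]
tags   = ["O", "EVENT", "O", "O", "O"]  # hypothetical tag scheme

def event_tokens(tokens, tags):
    """Return the tokens annotated as events."""
    return [tok for tok, tag in zip(tokens, tags) if tag == "EVENT"]

print(event_tokens(tokens, tags))  # ['drove']
```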
WikiPII, an automatically labeled dataset composed of Wikipedia biography pages, annotated for personal information extraction.
This work presents a corpus for the commonsense inference task in Russian. Specifically, we construct event phrases covering a wide range of everyday situations, labelled with the intents and reactions of the event's main participant and the emotions of other people involved.
This Sanskrit speech corpus contains more than 78 hours of audio data: recordings of 45,953 sentences at a sampling rate of 22 kHz. The content is mainly readings of texts spanning various Śāstras of Saṃskṛtam literature, and also includes contemporary stories, radio programs, extempore discourse, etc.
Amharic Error Corpus is a manually annotated spelling error corpus for Amharic, a lingua franca in Ethiopia. The corpus is designed for evaluating spelling error detection and correction. Misspellings are tagged as non-word or real-word errors. In addition, the contextual information available in the corpus makes it useful for handling both types of spelling errors.
We introduce a new task of rephrasing for a more natural virtual assistant. Currently, virtual assistants work in the paradigm of intent-slot tagging, and the slot values are directly passed as-is to the execution engine. However, this setup fails in some scenarios such as messaging, when the query given by the user needs to be changed before repeating it or sending it to another user. For example, for queries like ‘ask my wife if she can pick up the kids’ or ‘remind me to take my pills’, we need to rephrase the content to ‘can you pick up the kids’ and ‘take your pills’. In this paper, we study the problem of rephrasing with messaging as a use case and release a dataset of 3,000 pairs of original query and rephrased query. We show that BART, a pre-trained transformer-based masked language model with auto-regressive decoding, is a strong baseline for the task, and show improvements by adding a copy-pointer and copy loss to it. We analyze different trade-offs of BART-based and LSTM-based seq2seq models.
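The perspective shift the task requires (first/third person query to second person message) can be illustrated with a toy rule-based sketch. The regex patterns below are ad-hoc heuristics covering only the two example queries, not the BART-based model from the paper:

```python
import re

def rephrase(query: str) -> str:
    """Toy perspective-shift rules; illustrative only."""
    # "ask X if she/he/they can VP" -> "can you VP"
    m = re.match(r"ask .+? if (?:she|he|they) can (.+)", query)
    if m:
        return f"can you {m.group(1)}"
    # "remind me to take my NP" -> "take your NP"
    m = re.match(r"remind me to take my (.+)", query)
    if m:
        return f"take your {m.group(1)}"
    return query

print(rephrase("ask my wife if she can pick up the kids"))  # can you pick up the kids
print(rephrase("remind me to take my pills"))               # take your pills
```

The brittleness of such rules is precisely why the paper treats rephrasing as a learned seq2seq problem.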
CHORD is the first chorus recognition dataset containing 627 songs for public use.
Antibody Watch is a dataset of text snippets extracted from over 2000 PubMed articles with annotations denoting specificity of antibodies.
These datasets were used in the paper 'Evaluation of Thematic Coherence in Microblogs' (ACL, 2021). The data is structured as follows: each file represents a cluster of tweets which contains the tweet IDs, the journalist annotations for quality evaluation and issue identification, as well as the metric evaluation scores. Note that a set of 50 clusters, equally split between COVID-19 and Election domains, is shared between the 3 annotators and thus contains 3 labels.
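Based on the structure described (each file is one cluster holding tweet IDs, annotator labels, and metric scores), a cluster file might be read as follows. The field names and JSON layout here are hypothetical placeholders; consult the released files for the actual format:

```python
import json

# Hypothetical cluster record; field names are illustrative only.
cluster_json = """
{
  "tweet_ids": ["1001", "1002"],
  "annotations": [{"annotator": "A1", "quality": 4, "issue": "health"}],
  "metric_scores": {"coherence": 0.72}
}
"""

cluster = json.loads(cluster_json)
print(len(cluster["tweet_ids"]))              # number of tweets in the cluster
print(cluster["metric_scores"]["coherence"])  # one automatic metric score
```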
MultiCite is a dataset of 12,653 citation contexts from over 1,200 computational linguistics papers, used for citation context analysis (CCA). MultiCite contains multi-sentence, multi-label citation contexts within full paper texts.
The ExBAN dataset is a corpus of natural-language explanations generated by crowd-sourced participants who were asked to explain simple Bayesian Network (BN) graphical representations. In a separate collection effort, these explanations are rated for clarity and informativeness.
SBU-WSD-Corpus is a corpus for Persian Word Sense Disambiguation (WSD). It is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory. SBU-WSD-Corpus consists of 19 Persian documents in different domains such as Sports, Science, Arts, etc. It includes 5892 content words of Persian running text and 3371 manually sense annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs).