Datasets

3,148 machine learning datasets

XQA

XQA is a dataset of 90k question-answer pairs in nine languages for cross-lingual open-domain question answering.

6 papers | 0 benchmarks | Texts

TalkSumm

The TalkSumm dataset contains 1705 automatically-generated summaries of scientific papers from ACL, NAACL, EMNLP, SIGDIAL (2015-2018), and ICML (2017-2018).

6 papers | 0 benchmarks | Texts

KnowledgeNet

KnowledgeNet is a benchmark dataset for the task of automatically populating a knowledge base (Wikidata) with facts expressed in natural language text on the web. KnowledgeNet provides text exhaustively annotated with facts, enabling end-to-end evaluation of knowledge base population systems, unlike previous benchmarks that are better suited to evaluating individual subcomponents (e.g., entity linking, relation extraction).

6 papers | 0 benchmarks | Texts

WikiCREM

An unsupervised dataset for coreference resolution, introduced in Kocijan et al., "WikiCREM: A Large Unsupervised Corpus for Coreference Resolution" (EMNLP 2019).

6 papers | 0 benchmarks | Texts

BiPaR

BiPaR is a manually annotated bilingual parallel novel-style machine reading comprehension (MRC) dataset, developed to support monolingual, multilingual and cross-lingual reading comprehension on novels. The biggest difference between BiPaR and existing reading comprehension datasets is that each triple (Passage, Question, Answer) in BiPaR is written in parallel in two languages. BiPaR is diverse in prefixes of questions, answer types and relationships between questions and passages. Answering the questions requires reading comprehension skills of coreference resolution, multi-sentence reasoning, and understanding of implicit causality.

6 papers | 0 benchmarks | Texts

VisPro

The VisPro dataset contains coreference annotations for 29,722 pronouns from 5,000 dialogues.

6 papers | 0 benchmarks | Images, Texts

GICoref (Gender Inclusive Coreference)

GICoref is a fully annotated coreference resolution dataset written by and about trans people.

6 papers | 0 benchmarks | Texts

EHR-Rel

EHR-RelB is a benchmark dataset for biomedical concept relatedness, consisting of 3630 concept pairs sampled from electronic health records (EHRs). EHR-RelA is a smaller dataset of 111 concept pairs, which are mainly unrelated.

6 papers | 0 benchmarks | Biomedical, Texts

ArCOV-19

ArCOV-19 is an Arabic COVID-19 Twitter dataset covering the period from 27 January to 30 April 2020. ArCOV-19 is the first publicly available Arabic Twitter dataset covering the COVID-19 pandemic. It includes over 1M tweets alongside the propagation networks of the most popular subset of them (i.e., the most retweeted and most liked).

6 papers | 0 benchmarks | Texts

CITE

CITE is a crowd-sourced resource for multimodal discourse. It characterises inferences in image-text contexts in the domain of cooking recipes, expressed as coherence relations.

6 papers | 0 benchmarks | Images, Texts

CLUECorpus2020

CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.

6 papers | 0 benchmarks | Texts

CoarseWSD-20

The CoarseWSD-20 dataset is a coarse-grained sense disambiguation dataset built from Wikipedia (nouns only), targeting 2 to 5 senses of 20 ambiguous words. It was specifically designed to provide an ideal setting for evaluating Word Sense Disambiguation (WSD) models (e.g., no senses in the test sets are missing from training), both quantitatively and qualitatively.

6 papers | 0 benchmarks | Texts

DaNE (Danish Dependency Treebank)

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme.

6 papers | 4 benchmarks | Texts

HurricaneEmo

HurricaneEmo is an emotion dataset that contains 15,000 English tweets spanning three hurricanes: Harvey, Irma, and Maria.

6 papers | 0 benchmarks | Texts

MovieFIB (Movie Fill-in-the-Blank)

MovieFIB is a quantitative benchmark for video understanding: a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired.

6 papers | 0 benchmarks | Texts, Videos

Moviescope

Moviescope is a large-scale dataset of 5,000 movies with corresponding video trailers, posters, plots and metadata. Moviescope is based on the IMDB 5000 dataset, which consists of 5,043 movie records. It is augmented by crawling the video trailers associated with each movie from YouTube and text plots from Wikipedia.

6 papers | 0 benchmarks | Audio, Texts, Videos

ReINTEL

ReINTEL contains 10,000 news items collected from a social network in Vietnam.

6 papers | 0 benchmarks | Texts

RELX

RELX is a benchmark dataset for cross-lingual relation classification in English, French, German, Spanish and Turkish.

6 papers | 0 benchmarks | Texts

Scruples

Scruples is a dataset with 625,000 ethical judgments over 32,000 real-life anecdotes. Each anecdote recounts a complex ethical situation, often posing a moral dilemma, and is paired with a distribution of judgments contributed by community members.

6 papers | 0 benchmarks | Texts

TutorialBank

TutorialBank is a publicly available dataset that aims to facilitate NLP education and research. The dataset consists of links to over 6,300 high-quality resources on NLP and related fields. The corpus's magnitude, manual collection, and focus on annotation for education in addition to research differentiate it from other corpora.

6 papers | 0 benchmarks | Texts
