Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

JEC-QA

JEC-QA is a LQA (Legal Question Answering) dataset collected from the National Judicial Examination of China. It contains 26,365 multiple-choice and multiple-answer questions in total. The task of the dataset is to predict the answer using the questions and relevant articles. To do well on JEC-QA, both retrieving and answering are important.

13 papers · 0 benchmarks · Texts

ANTIQUE

ANTIQUE is a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. The dataset contains 34,011 manual relevance annotations. The questions were asked by real users in a community question answering service, i.e., Yahoo! Answers. Relevance judgments for all the answers to each question were collected through crowdsourcing.

13 papers · 0 benchmarks · Texts

arXiv Summarization Dataset

A dataset for evaluating summarization methods on research papers.

13 papers · 8 benchmarks · Texts

BIOMRC

A large-scale cloze-style biomedical MRC dataset. Care was taken to reduce noise, compared to the previous BIOREAD dataset of Pappas et al. (2018).

13 papers · 2 benchmarks · Medical, Texts

HINT3

HINT3 is a dataset for intent detection. It consists of three datasets, each containing a diverse set of intents in a single domain: SOFMattress (mattress products retail), Curekart (fitness supplements retail), and Powerplay11 (online gaming).

13 papers · 0 benchmarks · Texts

MEDIQA-AnS (MEDIQA-Answer Summarization)

The first summarization collection containing question-driven summaries of answers to consumer health questions. This dataset can be used to evaluate single or multi-document summaries generated by algorithms using extractive or abstractive approaches.

13 papers · 0 benchmarks · Texts

PARANMT-50M

PARANMT-50M is a dataset for training paraphrastic sentence embeddings. It consists of more than 50 million English-English sentential paraphrase pairs.

13 papers · 0 benchmarks · Texts

SciTLDR

A new multi-target dataset of 5.4K TLDRs over 3.2K papers. SciTLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden.

13 papers · 0 benchmarks · Texts

KorSTS

KorSTS is a dataset for semantic textual similarity (STS) in Korean, constructed by automatically translating the STS-B dataset. To ensure translation quality, two professional translators with at least seven years of experience, specializing in academic papers/books as well as business contracts, each post-edited half of the dataset and then cross-checked each other's translations. The KorSTS dataset comprises 5,749 training examples translated automatically and 2,879 evaluation examples translated manually.

13 papers · 0 benchmarks · Texts

PersonalDialog

PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.

13 papers · 0 benchmarks · Texts

BC2GM

Created by Smith et al. in 2008, the BioCreative II Gene Mention Recognition (BC2GM) dataset asks participants to identify gene mentions in a sentence by giving their start and end characters. The training set consists of a set of sentences, with a set of gene mentions (GENE annotations) for each sentence. The data is in English; registration is required for access.

13 papers · 2 benchmarks · Texts

NaturalProofs

The NaturalProofs Dataset is a large-scale dataset for studying mathematical reasoning in natural language. NaturalProofs consists of roughly 20,000 theorem statements and proofs, 12,500 definitions, and 1,000 additional pages (e.g. axioms, corollaries) derived from ProofWiki, an online compendium of mathematical proofs written by a community of contributors.

13 papers · 0 benchmarks · Texts

GooAQ

GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to the collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.

13 papers · 0 benchmarks · Texts

UPFD (User Preference-aware Fake News Detection)

For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.

13 papers · 0 benchmarks · Graphs, Texts

DialFact

DialFact is a test benchmark of 22,245 annotated conversational claims, each paired with pieces of evidence from Wikipedia. There are three sub-tasks in DialFact: 1) the verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) the evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) the claim verification task predicts whether a dialogue response is supported, refuted, or lacks enough information.

13 papers · 0 benchmarks · Texts

GUM (Georgetown University Multilayer corpus)

GUM is an open-source multilayer English corpus of richly annotated texts from twelve text types.

13 papers · 1 benchmark · Speech, Texts

ProsocialDialog

ProsocialDialog is a multi-turn dialogue dataset designed to teach conversational agents to respond to problematic content in line with social norms, addressing the fact that most existing dialogue systems respond improperly to potentially unsafe user utterances by either ignoring or passively agreeing with them.

13 papers · 2 benchmarks · Dialog, Texts

ViQuAE

ViQuAE is a dataset for KVQAE (Knowledge-based Visual Question Answering about named Entities), a task that consists in answering questions about named entities grounded in a visual context using a knowledge base. It is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and products). We argue that KVQAE is a clear, well-defined task that can be evaluated easily, making it suitable for tracking progress on the quality of multimodal entity representations. Multimodal entity representation is a central issue in making human-machine interactions more natural: for example, while watching a movie, one might wonder "Where did I already see this actress?" or "Did she ever win an Oscar?"

13 papers · 0 benchmarks · Images, Texts

Robust04

The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic, ad hoc retrieval task in TREC that provides a natural home for new participants. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.

13 papers · 0 benchmarks · Texts

ROSCOE

ROSCOE is a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.

13 papers · 0 benchmarks · Texts
Page 40 of 158