NewsQA

Texts | Custom | Introduced 2017-01-01

The NewsQA dataset is a crowd-sourced machine reading comprehension dataset of 120,000 question-answer pairs.

Documents are CNN news articles.
Questions are written by human users in natural language.
Answers may be multiword passages of the source text.
Questions may be unanswerable.
NewsQA is collected using a 3-stage, siloed process.
Questioners see only an article’s headline and highlights.
Answerers see the question and the full article, then select an answer passage.
Validators see the article, the question, and a set of answers that they rank.
The dataset's creators describe NewsQA as more natural and more challenging than previous reading comprehension datasets.
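The description above (span answers, possibly unanswerable questions, and a validation stage) can be illustrated with a small Python sketch. It is not the official NewsQA schema or collection code: the `QAExample` class, its field names, and the majority-vote rule in `pick_answer` are assumptions made for illustration, and the real validation stage has validators rank answers rather than cast a single vote.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A span is a (start, end) pair of character offsets into the article.
# None stands for "the answerer marked the question as unanswerable".
Span = Optional[Tuple[int, int]]


@dataclass
class QAExample:
    """Hypothetical record layout; field names are illustrative only."""
    article: str                 # full article text, seen by answerers and validators
    question: str                # written from the headline and highlights only
    candidate_spans: List[Span]  # one entry per answerer


def span_text(article: str, span: Span) -> Optional[str]:
    """Return the answer passage for a span, or None if unanswerable."""
    if span is None:
        return None
    start, end = span
    return article[start:end]


def pick_answer(example: QAExample, validator_votes: List[int]) -> Optional[str]:
    """Simplified validation: each vote is an index into candidate_spans,
    and the most frequently chosen candidate wins (assumed rule)."""
    winner, _ = Counter(validator_votes).most_common(1)[0]
    return span_text(example.article, example.candidate_spans[winner])


# Usage with a made-up article, not taken from the dataset.
article = "The mayor of Springfield resigned on Monday."
start = article.find("Springfield")
example = QAExample(
    article=article,
    question="Which city's mayor resigned?",
    candidate_spans=[(start, start + len("Springfield")), None],
)
print(pick_answer(example, validator_votes=[0, 0, 1]))
```

Storing answers as offsets rather than strings keeps every answer tied to a passage of the source text, which matches the statement that answers are passages of the article.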

Source: https://www.microsoft.com/en-us/research/project/newsqa-dataset/
Image Source: Trischler et al

Benchmarks

Question Answering: EM, F1
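EM (exact match) and F1 are the two metrics listed for this benchmark. The sketch below shows how they are commonly computed for span-based question answering. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the exact normalization used by any particular NewsQA evaluation may differ, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Usage on made-up strings.
print(exact_match("The Mayor", "mayor"))
print(f1_score("mayor of Springfield", "Springfield"))
```

EM rewards only a fully correct span, while F1 gives partial credit when the predicted passage overlaps the reference, which matters because answers may be multiword passages.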