AmbigNQ

Introduced 2020-04-22

The AmbigNQ dataset is a resource for studying ambiguity in open-domain question answering.

  1. Task Description:

    • Ambiguity is inherent in open-domain question answering, especially when dealing with new topics. It can be challenging to formulate questions that have a single, unambiguous answer.
    • The AmbigQA task involves predicting a set of question-answer pairs, where each plausible answer is paired with a disambiguated rewrite of the original question.
  2. Dataset Construction:

    • To study this task, the researchers constructed the AmbigNQ dataset.
    • AmbigNQ covers 14,042 questions from NQ-open, an existing open-domain QA benchmark.
    • Surprisingly, over half of the questions in NQ-open exhibit ambiguity.
    • The types of ambiguity are diverse and sometimes subtle, often becoming apparent only after examining evidence provided by a very large text corpus.
  3. Dataset Versions:

    • There are three versions of the AmbigNQ dataset:
      • Light Version: Contains only inputs and outputs.
      • Full Version: Includes all annotation metadata.
      • Evidence Version: Provides semi-oracle evidence articles along with questions and answers.
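Concretely, each AmbigNQ example pairs a (potentially ambiguous) question with annotations that hold either a single answer or a set of disambiguated question-answer pairs. The sketch below illustrates this structure in Python; the field names (`annotations`, `type`, `qaPairs`) follow the published annotation format, while the id and the sample question-answer content are invented for illustration:

```python
# A minimal sketch of an AmbigNQ-style record. Field names follow the
# published annotation format; the id and QA content are illustrative.
example = {
    "id": "example-0",  # hypothetical id
    "question": "When did the Simpsons first air on television?",
    "annotations": [
        {
            # "multipleQAs" marks an ambiguous question: each plausible
            # answer is paired with a disambiguated rewrite of the question.
            "type": "multipleQAs",
            "qaPairs": [
                {
                    "question": "When did the Simpsons first air as shorts on The Tracey Ullman Show?",
                    "answer": ["April 19, 1987"],
                },
                {
                    "question": "When did the Simpsons first air as a half-hour prime-time show?",
                    "answer": ["December 17, 1989"],
                },
            ],
        }
    ],
}

def flatten_qa_pairs(ex):
    """Collect the (disambiguated question, answers) pairs a system must predict."""
    pairs = []
    for ann in ex["annotations"]:
        if ann["type"] == "multipleQAs":
            pairs.extend((qa["question"], qa["answer"]) for qa in ann["qaPairs"])
        elif ann["type"] == "singleAnswer":
            # Unambiguous questions keep the original wording.
            pairs.append((ex["question"], ann["answer"]))
    return pairs

for question, answers in flatten_qa_pairs(example):
    print(question, "->", answers)
```

A system tackling the AmbigQA task would be evaluated on how well its predicted set of question-answer pairs matches the annotated `qaPairs` for each example.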
