Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UnifiedQA: Crossing Format Boundaries With a Single QA System

Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, Hannaneh Hajishirzi

2020-05-02 · Findings of the Association for Computational Linguistics 2020

Tasks: Question Answering · Multi-task Language Understanding · Common Sense Reasoning · Multi-Task Learning · Language Modelling · Multiple-choice

Paper · PDF · Code (official) · Code

Abstract

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems.

Results

Task | Dataset | Metric | Value | Model
--- | --- | --- | --- | ---
Question Answering | SIQA | Accuracy | 79.8 | UnifiedQA 3B
Question Answering | PIQA | Accuracy | 85.3 | UnifiedQA 3B
Question Answering | OpenBookQA | Accuracy | 87.2 | UnifiedQA 11B
Common Sense Reasoning | WinoGrande | Accuracy | 89.4 | UnifiedQA 11B (fine-tuned)
Common Sense Reasoning | WinoGrande | Accuracy | 73.3 | UnifiedQA 406M (fine-tuned)
Common Sense Reasoning | CommonsenseQA | Accuracy | 79.1 | UnifiedQA 11B (fine-tuned)
Common Sense Reasoning | CommonsenseQA | Accuracy | 78.1 | T5-XXL 11B (fine-tuned)
Common Sense Reasoning | CommonsenseQA | Accuracy | 76.2 | UnifiedQA 11B (zero-shot)
Common Sense Reasoning | CommonsenseQA | Accuracy | 64.0 | UnifiedQA 440M (fine-tuned)
Common Sense Reasoning | CommonsenseQA | Accuracy | 62.5 | BART-large 440M (fine-tuned)

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)