Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova

2019-05-24 · NAACL 2019 · Reading Comprehension · Question Answering · Transfer Learning

Abstract

In this paper we study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy, compared to 90% accuracy for human annotators (and a 62% majority baseline), leaving a significant gap for future work.
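The majority baseline cited in the abstract is straightforward to reproduce: always predict the dataset's most frequent answer (yes or no). A minimal sketch in Python, using a hypothetical toy label list in place of the real BoolQ training labels:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

# Toy stand-in for BoolQ's yes/no answer distribution (illustrative only;
# on the real data the majority class "yes" yields roughly 62% accuracy).
toy_labels = [True] * 62 + [False] * 38
print(majority_baseline_accuracy(toy_labels))  # → 0.62
```

This is why a skewed yes/no split alone puts the floor well above 50%, and why the paper reports the majority baseline rather than chance.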

Results

Task                 Dataset   Metric     Value   Model
Question Answering   BoolQ     Accuracy   80.4    BERT-MultiNLI 340M (fine-tuned)
Question Answering   BoolQ     Accuracy   75.57   BiDAF-MultiNLI (fine-tuned)
Question Answering   BoolQ     Accuracy   72.87   GPT-1 117M (fine-tuned)
Question Answering   BoolQ     Accuracy   71.41   BiDAF + ELMo (fine-tuned)
Question Answering   BoolQ     Accuracy   62.17   Majority baseline
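The gap to human performance that the paper highlights can be read off the table directly; a quick sketch using the accuracy values reported above (the 90% human figure comes from the abstract):

```python
# Accuracies (%) from the results table above.
human_accuracy = 90.0
results = {
    "BERT-MultiNLI 340M (fine-tuned)": 80.4,
    "BiDAF-MultiNLI (fine-tuned)": 75.57,
    "GPT-1 117M (fine-tuned)": 72.87,
    "BiDAF + ELMo (fine-tuned)": 71.41,
    "Majority baseline": 62.17,
}

# Percentage-point gap between each system and human annotators.
gaps = {model: round(human_accuracy - acc, 2) for model, acc in results.items()}
print(gaps["BERT-MultiNLI 340M (fine-tuned)"])  # → 9.6
```

Even the best entailment-transfer model trails human annotators by nearly ten accuracy points, which is the "significant gap for future work" the abstract refers to.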
