How Reasonable are Common-Sense Reasoning Tasks: A Case-Study on the Winograd Schema Challenge and SWAG

Paul Trichelair, Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung

2018-11-05 | IJCNLP 2019 11 | Coreference Resolution | Common Sense Reasoning

Abstract

Recent studies have significantly improved the state-of-the-art on common-sense reasoning (CSR) benchmarks like the Winograd Schema Challenge (WSC) and SWAG. The question we ask in this paper is whether improved performance on these benchmarks represents genuine progress towards common-sense-enabled systems. We make case studies of both benchmarks and design protocols that clarify and qualify the results of previous work by analyzing threats to the validity of previous experimental designs. Our protocols account for several properties prevalent in common-sense benchmarks including size limitations, structural regularities, and variable instance difficulty.

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 69.2 | GPT-2 Medium 774M (partial scoring) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 64.5 | GPT-2 Medium 774M (full scoring) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 61.5 | GPT-2 Small 117M (partial scoring) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 55.7 | GPT-2 Small 117M (full scoring) |
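
The results distinguish "partial scoring" from "full scoring", but this page does not define either term. The sketch below assumes the usual language-model setup for Winograd schemas: each candidate antecedent is substituted for the pronoun, and the model then scores either the whole sentence (full) or only the words after the substitution, conditioned on everything before them (partial). The bigram table, its probabilities, and the names `log_prob`, `full_score`, `partial_score`, and `resolve` are hypothetical and invented for illustration; they are not taken from the paper or from GPT-2.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical interface: log P(token | preceding tokens).
LogProbFn = Callable[[str, Sequence[str]], float]

# Toy bigram probabilities (invented numbers, not from any real model).
BIGRAMS = {
    ("the", "trophy"): 0.01,    # "trophy" is a rare word after "the"
    ("the", "suitcase"): 0.2,   # "suitcase" is more frequent after "the"
    ("trophy", "is"): 0.5,
    ("suitcase", "is"): 0.2,
}
DEFAULT_PROB = 0.1  # fallback for unseen bigrams and the first token


def log_prob(token: str, context: Sequence[str]) -> float:
    """Toy bigram model: depends only on the previous token."""
    if not context:
        return math.log(DEFAULT_PROB)
    return math.log(BIGRAMS.get((context[-1], token), DEFAULT_PROB))


def full_score(tokens: Sequence[str], lp: LogProbFn) -> float:
    """Log-probability of the entire substituted sentence."""
    return sum(lp(tok, tokens[:i]) for i, tok in enumerate(tokens))


def partial_score(tokens: Sequence[str], start: int, lp: LogProbFn) -> float:
    """Log-probability of tokens[start:] given tokens[:start]."""
    return sum(lp(tokens[i], tokens[:i]) for i in range(start, len(tokens)))


def resolve(prefix: List[str], candidates: List[List[str]],
            suffix: List[str], lp: LogProbFn, mode: str = "full") -> List[str]:
    """Return the candidate whose substitution scores highest."""
    def score(cand: List[str]) -> float:
        tokens = prefix + cand + suffix
        if mode == "full":
            return full_score(tokens, lp)
        # Partial: score only the words that follow the substituted candidate.
        return partial_score(tokens, len(prefix) + len(cand), lp)
    return max(candidates, key=score)


# "The trophy does not fit in the suitcase because [it] is too big."
prefix = "the trophy does not fit in the suitcase because".split()
suffix = "is too big".split()
candidates = [["the", "trophy"], ["the", "suitcase"]]

full_choice = resolve(prefix, candidates, suffix, log_prob, mode="full")
partial_choice = resolve(prefix, candidates, suffix, log_prob, mode="partial")
```

With these toy numbers the two modes disagree: full scoring includes the probability of the candidate's own tokens and so favors the more frequent "the suitcase", while partial scoring looks only at the continuation and picks "the trophy". This illustrates one way the two modes can diverge; it is not a claim about why the accuracies in the table differ.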