Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Back to Square One: Artifact Detection, Training and Commonsense Disentanglement in the Winograd Schema

Yanai Elazar, Hongming Zhang, Yoav Goldberg, Dan Roth

2021-04-16 · EMNLP 2021

Tasks: Artifact Detection · Coreference Resolution · Common Sense Reasoning · Disentanglement · Bias Detection · Language Modelling

Paper · PDF

Abstract

The Winograd Schema (WS) has been proposed as a test for measuring the commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks, but the source of this improvement is still unclear. This paper suggests that the apparent progress on WS may not necessarily reflect progress in commonsense reasoning. To support this claim, we first show that the current evaluation method of WS is sub-optimal and propose a modification that uses twin sentences for evaluation. We also propose two new baselines that indicate the existence of artifacts in WS benchmarks. We then develop a method for evaluating WS-like sentences in a zero-shot setting, to account for the commonsense reasoning abilities acquired during pretraining, and observe that popular language models perform randomly in this setting under our stricter evaluation. We conclude that the observed progress is mostly due to the use of supervision in training WS models, which is not likely to successfully support all the required commonsense reasoning skills and knowledge.
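To make the two evaluation ideas concrete, here is a minimal sketch, not the authors' released code: it resolves a WS sentence zero-shot by substituting each candidate referent for the pronoun and keeping the one a pretrained causal LM scores as more likely, then reports both plain per-sentence accuracy and the stricter twin-pair accuracy, which credits a model only when it resolves both sentences of a twin pair. The model choice (gpt2) and the example twin pair are illustrative assumptions.

```python
# Sketch of zero-shot WS resolution plus twin-pair scoring.
# Assumptions: gpt2 as the scorer, a hand-written twin pair as data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def log_likelihood(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, labels=ids)
    # out.loss is the mean per-token NLL; negate and scale by the
    # number of predicted tokens to get a total log-probability.
    return -out.loss.item() * (ids.shape[1] - 1)

def resolve(template: str, candidates: tuple) -> str:
    """Zero-shot choice: substitute each candidate into the pronoun
    slot and keep the one the LM finds more likely."""
    return max(candidates, key=lambda c: log_likelihood(template.format(c)))

# A twin pair differs in one "special" word, which flips the answer.
twins = [
    ("The trophy doesn't fit in the suitcase because {} is too big.",
     ("the trophy", "the suitcase"), "the trophy"),
    ("The trophy doesn't fit in the suitcase because {} is too small.",
     ("the trophy", "the suitcase"), "the suitcase"),
]

preds = [resolve(t, cands) for t, cands, _ in twins]
golds = [g for _, _, g in twins]
# Per-sentence accuracy vs. the stricter paired score, which gives
# credit only if *both* twins are resolved correctly.
per_sentence = sum(p == g for p, g in zip(preds, golds)) / len(golds)
paired = float(all(p == g for p, g in zip(preds, golds)))
print(f"per-sentence: {per_sentence:.2f}, paired: {paired:.2f}")
```

Note that raw sentence log-likelihoods favor shorter candidate strings; this simplification is fine for illustration, but a real evaluation would normalize scores, which is one reason paired accuracy is a more robust measure of whether the model actually uses the special word.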

Results

Task                   | Dataset                   | Metric   | Value | Model
-----------------------|---------------------------|----------|-------|-----------------------
Question Answering     | COPA                      | Accuracy | 50    | Random chance baseline
Question Answering     | PIQA                      | Accuracy | 50    | Random chance baseline
Common Sense Reasoning | WinoGrande                | Accuracy | 58.7  | ALBERT-xxlarge (235M)
Common Sense Reasoning | WinoGrande                | Accuracy | 56.3  | RoBERTa-base (125M)
Common Sense Reasoning | WinoGrande                | Accuracy | 55.6  | BERT-large (340M)
Common Sense Reasoning | WinoGrande                | Accuracy | 54.9  | RoBERTa-large (355M)
Common Sense Reasoning | WinoGrande                | Accuracy | 53.1  | BERT-base (110M)
Common Sense Reasoning | WinoGrande                | Accuracy | 52.8  | ALBERT-base (11M)
Common Sense Reasoning | WinoGrande                | Accuracy | 50    | Random baseline
Coreference Resolution | Winograd Schema Challenge | Accuracy | 78.8  | ALBERT-xxlarge (235M)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 73.9  | RoBERTa-large (355M)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 63.0  | RoBERTa-base (125M)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 61.4  | BERT-large (340M)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 56.5  | BERT-base (110M)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 55.4  | ALBERT-base (11M)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 50    | Random chance baseline

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models (2025-07-18)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)