
Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

Tatyana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov

2021-05-03 · Reading Comprehension · Question Answering · Natural Language Inference · Common Sense Reasoning · Natural Language Understanding · Word Sense Disambiguation

Abstract

Leader-boards like SuperGLUE are seen as important incentives for active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams, along with their resources, to collaborate on solving a set of tasks for general language understanding. The resulting performance scores are often claimed to be close to or even above human performance. These claims encouraged closer analysis of whether the benchmark datasets contain statistical cues that machine learning based language models can exploit. For English datasets, it was shown that they often contain annotation artifacts, which makes it possible to solve certain tasks with very simple rules while achieving competitive rankings. In this paper, we perform a similar analysis for the Russian SuperGLUE (RSG), a recently published benchmark set and leader-board for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics: approaches based on simple rules often outperform or come close to the results of pre-trained language models like GPT-3 or BERT. The simplest explanation is that a significant part of the SOTA models' performance on the RSG leader-board comes from exploiting these shallow heuristics and has nothing in common with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leader-board more representative of the real progress in Russian NLU.
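The baselines reported in the Results section below are simple label-frequency and surface-rule predictors. A minimal sketch of what such baselines can look like, using toy labels and a hypothetical lexical-overlap rule rather than the authors' exact heuristics:

```python
import random
from collections import Counter

# Toy binary labels standing in for an RSG-style yes/no task (e.g. DaNetQA).
# Illustrative only; the real datasets are not reproduced here.
train_labels = ["yes"] * 60 + ["no"] * 40
test_labels = ["yes"] * 30 + ["no"] * 20

def majority_class(train, test):
    """Predict the most frequent training label for every test example."""
    label = Counter(train).most_common(1)[0][0]
    return [label] * len(test)

def random_weighted(train, test, seed=0):
    """Sample predictions from the training label distribution."""
    labels, counts = zip(*Counter(train).items())
    return random.Random(seed).choices(labels, weights=counts, k=len(test))

def overlap_heuristic(premise, hypothesis, threshold=0.7):
    """Hypothetical shallow rule for an NLI pair: call it entailment
    when most hypothesis tokens already appear in the premise."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return "entailment" if len(h & p) / max(len(h), 1) >= threshold else "not_entailment"

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

print("majority_class :", accuracy(majority_class(train_labels, test_labels), test_labels))
print("random weighted:", accuracy(random_weighted(train_labels, test_labels), test_labels))
print(overlap_heuristic("кошка спит на диване", "кошка спит"))
```

Baselines of this kind carry no linguistic knowledge; whenever they approach a leader-board score, that score says more about class imbalance and annotation artifacts than about language understanding.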

Results

Task | Dataset | Metric | Value | Model
Reading Comprehension | MuSeRC | Average F1 | 0.671 | heuristic majority
Reading Comprehension | MuSeRC | EM | 0.237 | heuristic majority
Reading Comprehension | MuSeRC | Average F1 | 0.45 | Random weighted
Reading Comprehension | MuSeRC | EM | 0.071 | Random weighted
Question Answering | DaNetQA | Accuracy | 0.642 | heuristic majority
Question Answering | DaNetQA | Accuracy | 0.52 | Random weighted
Question Answering | DaNetQA | Accuracy | 0.503 | majority_class
Common Sense Reasoning | RWSD | Accuracy | 0.597 | Random weighted
Common Sense Reasoning | RWSD | Accuracy | 0.669 | heuristic majority
Common Sense Reasoning | RWSD | Accuracy | 0.669 | majority_class
Common Sense Reasoning | PARus | Accuracy | 0.498 | majority_class
Common Sense Reasoning | PARus | Accuracy | 0.48 | Random weighted
Common Sense Reasoning | PARus | Accuracy | 0.478 | heuristic majority
Common Sense Reasoning | RuCoS | Average F1 | 0.26 | heuristic majority
Common Sense Reasoning | RuCoS | EM | 0.257 | heuristic majority
Common Sense Reasoning | RuCoS | Average F1 | 0.25 | Random weighted
Common Sense Reasoning | RuCoS | EM | 0.247 | Random weighted
Common Sense Reasoning | RuCoS | Average F1 | 0.25 | majority_class
Common Sense Reasoning | RuCoS | EM | 0.247 | majority_class
Word Sense Disambiguation | RUSSE | Accuracy | 0.595 | heuristic majority
Word Sense Disambiguation | RUSSE | Accuracy | 0.587 | majority_class
Word Sense Disambiguation | RUSSE | Accuracy | 0.528 | Random weighted
Natural Language Inference | RCB | Accuracy | 0.438 | heuristic majority
Natural Language Inference | RCB | Average F1 | 0.4 | heuristic majority
Natural Language Inference | RCB | Accuracy | 0.374 | Random weighted
Natural Language Inference | RCB | Average F1 | 0.319 | Random weighted
Natural Language Inference | RCB | Accuracy | 0.484 | majority_class
Natural Language Inference | RCB | Average F1 | 0.217 | majority_class
Natural Language Inference | LiDiRus | MCC | 0.147 | heuristic majority
Natural Language Inference | TERRa | Accuracy | 0.549 | heuristic majority
Natural Language Inference | TERRa | Accuracy | 0.513 | majority_class
Natural Language Inference | TERRa | Accuracy | 0.483 | Random weighted
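For reference, the EM and Average F1 values above are exact-match and token-overlap scores over predicted answer strings. A rough, SQuAD-style sketch of these two metrics follows; the exact per-dataset formulations in RSG may differ (MuSeRC, for instance, averages F1 per question set):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("в 1961 году", "1961"), round(token_f1("в 1961 году", "1961"), 3))
```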
