
Unreasonable Effectiveness of Rule-Based Heuristics in Solving Russian SuperGLUE Tasks

Tatyana Iazykova, Denis Kapelyushnik, Olga Bystrova, Andrey Kutuzov

2021-05-03 · Reading Comprehension · Question Answering · Natural Language Inference · Common Sense Reasoning · Natural Language Understanding · Word Sense Disambiguation

Abstract

Leader-boards like SuperGLUE are seen as important incentives for active development of NLP, since they provide standard benchmarks for fair comparison of modern language models. They have driven the world's best engineering teams, along with their resources, to collaborate on solving a set of tasks for general language understanding. The resulting performance scores are often claimed to be close to or even above human performance. These claims encouraged closer analysis of whether the benchmark datasets contain statistical cues that machine learning based language models can exploit. For English datasets, it was shown that they often contain annotation artifacts, which makes it possible to solve certain tasks with very simple rules while achieving competitive rankings. In this paper, we perform a similar analysis for the Russian SuperGLUE (RSG), a recently published benchmark set and leader-board for Russian natural language understanding. We show that its test datasets are vulnerable to shallow heuristics: approaches based on simple rules often outperform or come close to the results of pre-trained language models like GPT-3 or BERT. The simplest explanation is that a significant part of the SOTA models' performance on the RSG leader-board comes from exploiting these shallow heuristics and has nothing in common with real language understanding. We provide a set of recommendations on how to improve these datasets, making the RSG leader-board more representative of the real progress in Russian NLU.
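The baselines reported in the Results section below are simple label-frequency and surface-rule predictors. A minimal sketch of what such baselines can look like, using toy labels and a hypothetical lexical-overlap rule rather than the authors' exact heuristics:

```python
import random
from collections import Counter

# Toy binary labels standing in for an RSG-style yes/no task (e.g. DaNetQA).
# Illustrative only; the real datasets are not reproduced here.
train_labels = ["yes"] * 60 + ["no"] * 40
test_labels = ["yes"] * 30 + ["no"] * 20

def majority_class(train, test):
    """Predict the most frequent training label for every test example."""
    label = Counter(train).most_common(1)[0][0]
    return [label] * len(test)

def random_weighted(train, test, seed=0):
    """Sample predictions from the training label distribution."""
    labels, counts = zip(*Counter(train).items())
    return random.Random(seed).choices(labels, weights=counts, k=len(test))

def overlap_heuristic(premise, hypothesis, threshold=0.7):
    """Hypothetical shallow rule for an NLI pair: call it entailment
    when most hypothesis tokens already appear in the premise."""
    p, h = set(premise.lower().split()), set(hypothesis.lower().split())
    return "entailment" if len(h & p) / max(len(h), 1) >= threshold else "not_entailment"

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

print("majority_class :", accuracy(majority_class(train_labels, test_labels), test_labels))
print("random weighted:", accuracy(random_weighted(train_labels, test_labels), test_labels))
print(overlap_heuristic("кошка спит на диване", "кошка спит"))
```

Baselines of this kind carry no linguistic knowledge; whenever they approach a leader-board score, that score says more about class imbalance and annotation artifacts than about language understanding.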

Results

Task | Dataset | Metric | Value | Model
Reading Comprehension | MuSeRC | Average F1 | 0.671 | heuristic majority
Reading Comprehension | MuSeRC | EM | 0.237 | heuristic majority
Reading Comprehension | MuSeRC | Average F1 | 0.45 | Random weighted
Reading Comprehension | MuSeRC | EM | 0.071 | Random weighted
Question Answering | DaNetQA | Accuracy | 0.642 | heuristic majority
Question Answering | DaNetQA | Accuracy | 0.52 | Random weighted
Question Answering | DaNetQA | Accuracy | 0.503 | majority_class
Common Sense Reasoning | RWSD | Accuracy | 0.597 | Random weighted
Common Sense Reasoning | RWSD | Accuracy | 0.669 | heuristic majority
Common Sense Reasoning | RWSD | Accuracy | 0.669 | majority_class
Common Sense Reasoning | PARus | Accuracy | 0.498 | majority_class
Common Sense Reasoning | PARus | Accuracy | 0.48 | Random weighted
Common Sense Reasoning | PARus | Accuracy | 0.478 | heuristic majority
Common Sense Reasoning | RuCoS | Average F1 | 0.26 | heuristic majority
Common Sense Reasoning | RuCoS | EM | 0.257 | heuristic majority
Common Sense Reasoning | RuCoS | Average F1 | 0.25 | Random weighted
Common Sense Reasoning | RuCoS | EM | 0.247 | Random weighted
Common Sense Reasoning | RuCoS | Average F1 | 0.25 | majority_class
Common Sense Reasoning | RuCoS | EM | 0.247 | majority_class
Word Sense Disambiguation | RUSSE | Accuracy | 0.595 | heuristic majority
Word Sense Disambiguation | RUSSE | Accuracy | 0.587 | majority_class
Word Sense Disambiguation | RUSSE | Accuracy | 0.528 | Random weighted
Natural Language Inference | RCB | Accuracy | 0.438 | heuristic majority
Natural Language Inference | RCB | Average F1 | 0.4 | heuristic majority
Natural Language Inference | RCB | Accuracy | 0.374 | Random weighted
Natural Language Inference | RCB | Average F1 | 0.319 | Random weighted
Natural Language Inference | RCB | Accuracy | 0.484 | majority_class
Natural Language Inference | RCB | Average F1 | 0.217 | majority_class
Natural Language Inference | LiDiRus | MCC | 0.147 | heuristic majority
Natural Language Inference | TERRa | Accuracy | 0.549 | heuristic majority
Natural Language Inference | TERRa | Accuracy | 0.513 | majority_class
Natural Language Inference | TERRa | Accuracy | 0.483 | Random weighted
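For reference, the EM and Average F1 values above are exact-match and token-overlap scores over predicted answer strings. A rough, SQuAD-style sketch of these two metrics follows; the exact per-dataset formulations in RSG may differ (MuSeRC, for instance, averages F1 per question set):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("в 1961 году", "1961"), round(token_f1("в 1961 году", "1961"), 3))
```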
