RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Andrey Evlampiev

2020-10-29, EMNLP 2020

Tasks: Reading Comprehension, Lexical Entailment, Question Answering, Natural Language Inference, Common Sense Reasoning, Natural Language Understanding, Logical Reasoning, Diagnostic, Word Sense Disambiguation


Abstract

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark, RussianGLUE. Recent advances in universal language models and transformers call for a methodology to diagnose these models broadly and to test them for general intellectual skills: detecting natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, a human-level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set and offer first steps toward expanding the benchmark or assessing state-of-the-art models independently of language.

Results

Task	Dataset	Metric	Value	Model
Reading Comprehension	MuSeRC	Average F1	0.806	Human Benchmark
Reading Comprehension	MuSeRC	EM	0.42	Human Benchmark
Reading Comprehension	MuSeRC	Average F1	0.587	Baseline TF-IDF1.1
Reading Comprehension	MuSeRC	EM	0.242	Baseline TF-IDF1.1
Question Answering	DaNetQA	Accuracy	0.915	Human Benchmark
Question Answering	DaNetQA	Accuracy	0.621	Baseline TF-IDF1.1
Common Sense Reasoning	RWSD	Accuracy	0.662	Baseline TF-IDF1.1
Common Sense Reasoning	RWSD	Accuracy	0.84	Human Benchmark
Common Sense Reasoning	PARus	Accuracy	0.982	Human Benchmark
Common Sense Reasoning	PARus	Accuracy	0.486	Baseline TF-IDF1.1
Common Sense Reasoning	RuCoS	Average F1	0.93	Human Benchmark
Common Sense Reasoning	RuCoS	EM	0.89	Human Benchmark
Common Sense Reasoning	RuCoS	Average F1	0.26	Baseline TF-IDF1.1
Common Sense Reasoning	RuCoS	EM	0.252	Baseline TF-IDF1.1
Word Sense Disambiguation	RUSSE	Accuracy	0.805	Human Benchmark
Word Sense Disambiguation	RUSSE	Accuracy	0.57	Baseline TF-IDF1.1
Natural Language Inference	RCB	Accuracy	0.702	Human Benchmark
Natural Language Inference	RCB	Average F1	0.68	Human Benchmark
Natural Language Inference	RCB	Accuracy	0.441	Baseline TF-IDF1.1
Natural Language Inference	RCB	Average F1	0.301	Baseline TF-IDF1.1
Natural Language Inference	LiDiRus	MCC	0.626	Human Benchmark
Natural Language Inference	LiDiRus	MCC	0.06	Baseline TF-IDF1.1
Natural Language Inference	TERRa	Accuracy	0.92	Human Benchmark
Natural Language Inference	TERRa	Accuracy	0.471	Baseline TF-IDF1.1
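
The table pairs a human score with a TF-IDF baseline score for each dataset and metric, so the gap between them is easy to compute. The sketch below is only an illustration and is not part of the RussianSuperGLUE framework: the values are copied from the rows above, and the names `ROWS` and `human_baseline_gaps` are hypothetical.

```python
# Rows copied from the Results table: (dataset, metric, model, value).
ROWS = [
    ("MuSeRC", "Average F1", "Human Benchmark", 0.806),
    ("MuSeRC", "EM", "Human Benchmark", 0.42),
    ("MuSeRC", "Average F1", "Baseline TF-IDF1.1", 0.587),
    ("MuSeRC", "EM", "Baseline TF-IDF1.1", 0.242),
    ("DaNetQA", "Accuracy", "Human Benchmark", 0.915),
    ("DaNetQA", "Accuracy", "Baseline TF-IDF1.1", 0.621),
    ("RWSD", "Accuracy", "Baseline TF-IDF1.1", 0.662),
    ("RWSD", "Accuracy", "Human Benchmark", 0.84),
    ("PARus", "Accuracy", "Human Benchmark", 0.982),
    ("PARus", "Accuracy", "Baseline TF-IDF1.1", 0.486),
    ("RuCoS", "Average F1", "Human Benchmark", 0.93),
    ("RuCoS", "EM", "Human Benchmark", 0.89),
    ("RuCoS", "Average F1", "Baseline TF-IDF1.1", 0.26),
    ("RuCoS", "EM", "Baseline TF-IDF1.1", 0.252),
    ("RUSSE", "Accuracy", "Human Benchmark", 0.805),
    ("RUSSE", "Accuracy", "Baseline TF-IDF1.1", 0.57),
    ("RCB", "Accuracy", "Human Benchmark", 0.702),
    ("RCB", "Average F1", "Human Benchmark", 0.68),
    ("RCB", "Accuracy", "Baseline TF-IDF1.1", 0.441),
    ("RCB", "Average F1", "Baseline TF-IDF1.1", 0.301),
    ("LiDiRus", "MCC", "Human Benchmark", 0.626),
    ("LiDiRus", "MCC", "Baseline TF-IDF1.1", 0.06),
    ("TERRa", "Accuracy", "Human Benchmark", 0.92),
    ("TERRa", "Accuracy", "Baseline TF-IDF1.1", 0.471),
]

HUMAN = "Human Benchmark"
BASELINE = "Baseline TF-IDF1.1"


def human_baseline_gaps(rows):
    """Return {(dataset, metric): human score minus baseline score}."""
    scores = {}
    for dataset, metric, model, value in rows:
        scores.setdefault((dataset, metric), {})[model] = value
    gaps = {}
    for key, by_model in scores.items():
        # Only report pairs where both scores are present in the table.
        if HUMAN in by_model and BASELINE in by_model:
            gaps[key] = round(by_model[HUMAN] - by_model[BASELINE], 3)
    return gaps


gaps = human_baseline_gaps(ROWS)
# Print the largest gaps first.
for (dataset, metric), gap in sorted(gaps.items(), key=lambda kv: -kv[1]):
    print(f"{dataset:8s} {metric:10s} {gap:+.3f}")
```

On these numbers the widest gap is on RuCoS (Average F1) and the narrowest are on RWSD and MuSeRC (EM). The gaps mix different metrics (accuracy, F1, EM, MCC), so they are best compared within a metric rather than across metrics.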
