Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark

Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov, Denis Shevelev, Ekaterina Artemova, Valentin Malykh, Vladislav Mikhailov, Maria Tikhonova, Andrey Chertok, Andrey Evlampiev

Published: 2020-10-29 · EMNLP 2020
Tasks: Reading Comprehension, Lexical Entailment, Question Answering, Natural Language Inference, Common Sense Reasoning, Natural Language Understanding, Logical Reasoning Question Answering, Diagnostic, Word Sense Disambiguation

Abstract

In this paper, we introduce an advanced Russian general language understanding evaluation benchmark -- RussianSuperGLUE. Recent advances in the field of universal language models and transformers require the development of a methodology for their broad diagnostics and testing of general intellectual skills: detection of natural language inference, commonsense reasoning, and the ability to perform simple logical operations regardless of text subject or lexicon. For the first time, a benchmark of nine tasks, collected and organized analogously to the SuperGLUE methodology, was developed from scratch for the Russian language. We provide baselines, human-level evaluation, an open-source framework for evaluating models (https://github.com/RussianNLP/RussianSuperGLUE), and an overall leaderboard of transformer models for the Russian language. In addition, we present the first results of comparing multilingual models on the adapted diagnostic test set and offer first steps toward further expanding or assessing state-of-the-art models independently of language.

Results

Task                         Dataset   Metric      Value   Model
Reading Comprehension        MuSeRC    Average F1  0.806   Human Benchmark
Reading Comprehension        MuSeRC    EM          0.42    Human Benchmark
Reading Comprehension        MuSeRC    Average F1  0.587   Baseline TF-IDF1.1
Reading Comprehension        MuSeRC    EM          0.242   Baseline TF-IDF1.1
Question Answering           DaNetQA   Accuracy    0.915   Human Benchmark
Question Answering           DaNetQA   Accuracy    0.621   Baseline TF-IDF1.1
Common Sense Reasoning       RWSD      Accuracy    0.662   Baseline TF-IDF1.1
Common Sense Reasoning       RWSD      Accuracy    0.84    Human Benchmark
Common Sense Reasoning       PARus     Accuracy    0.982   Human Benchmark
Common Sense Reasoning       PARus     Accuracy    0.486   Baseline TF-IDF1.1
Common Sense Reasoning       RuCoS     Average F1  0.93    Human Benchmark
Common Sense Reasoning       RuCoS     EM          0.89    Human Benchmark
Common Sense Reasoning       RuCoS     Average F1  0.26    Baseline TF-IDF1.1
Common Sense Reasoning       RuCoS     EM          0.252   Baseline TF-IDF1.1
Word Sense Disambiguation    RUSSE     Accuracy    0.805   Human Benchmark
Word Sense Disambiguation    RUSSE     Accuracy    0.57    Baseline TF-IDF1.1
Natural Language Inference   RCB       Accuracy    0.702   Human Benchmark
Natural Language Inference   RCB       Average F1  0.68    Human Benchmark
Natural Language Inference   RCB       Accuracy    0.441   Baseline TF-IDF1.1
Natural Language Inference   RCB       Average F1  0.301   Baseline TF-IDF1.1
Natural Language Inference   LiDiRus   MCC         0.626   Human Benchmark
Natural Language Inference   LiDiRus   MCC         0.06    Baseline TF-IDF1.1
Natural Language Inference   TERRa     Accuracy    0.92    Human Benchmark
Natural Language Inference   TERRa     Accuracy    0.471   Baseline TF-IDF1.1
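
The metrics reported above (EM, accuracy, MCC) are standard evaluation measures. As an illustration only, and not the benchmark's official scoring code, a minimal pure-Python sketch of exact match and the Matthews correlation coefficient (the metric used for the LiDiRus diagnostic set) over binary labels might look like:

```python
from math import sqrt

def exact_match(preds, golds):
    """Fraction of predictions identical to the gold labels (EM)."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def matthews_corrcoef(preds, golds):
    """Matthews correlation coefficient for binary (0/1) labels."""
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, golds))
    tn = sum(p == 0 and g == 0 for p, g in zip(preds, golds))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, golds))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, golds))
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally 0 when any confusion-matrix margin is empty.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

golds = [1, 1, 0, 0, 1, 0]
preds = [1, 0, 0, 0, 1, 1]
print(round(exact_match(preds, golds), 3))        # 4 of 6 correct -> 0.667
print(round(matthews_corrcoef(preds, golds), 3))  # 0.333
```

Note that "Average F1" for MuSeRC and RuCoS is computed per question/query and averaged, so the sketch above does not cover those tasks.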

Related Papers

Smart fault detection in satellite electrical power system (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)