TAPE: Assessing Few-shot Russian Language Understanding

Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov

2022-10-23 · Question Answering · Few-Shot Learning · Adversarial Text · Logical Reasoning · Adversarial Attack · Ethics · Zero-Shot Learning

Paper · PDF · Code (official)

Abstract

Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark of six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a far smaller effect. At the same time, the results demonstrate a significant gap between the neural and human baselines on most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
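
To give a concrete sense of the spelling-based perturbations the abstract highlights, here is a minimal sketch of a character-level "butter-fingers"-style typo attack. The function name, the perturbation rate, and the Cyrillic keyboard-neighborhood map are illustrative assumptions, not TAPE's actual implementation:

```python
import random

# Hypothetical keyboard-neighborhood map (small Cyrillic excerpt).
# TAPE's real perturbation engine is more elaborate than this.
NEIGHBORS = {
    "а": "вп",
    "о": "лр",
    "е": "нк",
    "и": "мт",
}

def butter_fingers(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly replace characters with keyboard neighbors to simulate typos."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in NEIGHBORS and rng.random() < rate:
            out.append(rng.choice(NEIGHBORS[ch]))
        else:
            out.append(ch)
    return "".join(out)

print(butter_fingers("машинное обучение", rate=0.3))
```

Robustness is then measured by comparing a model's score on the clean test set against its score on the perturbed copy; the paper's finding is that this kind of surface noise degrades the autoregressive baselines more than meaning-preserving paraphrases do.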

Results

Task | Dataset | Metric | Value | Model
Question Answering | CheGeKa | Accuracy | 64.5 | Human benchmark
Question Answering | MultiQ | Accuracy | 91 | Human benchmark
Question Answering | RuOpenBookQA | Accuracy | 86.5 | Human benchmark
Question Answering | RuOpenBookQA | Accuracy | 57.9 | RuGPT-3 Small
Question Answering | RuOpenBookQA | Accuracy | 57.2 | RuGPT-3 Medium
Question Answering | RuOpenBookQA | Accuracy | 55.5 | RuGPT-3 Large
Ethics | Ethics | Accuracy | 68.6 | RuGPT-3 Large
Ethics | Ethics | Accuracy | 68.3 | RuGPT-3 Medium
Ethics | Ethics | Accuracy | 55.5 | RuGPT-3 Small
Ethics | Ethics | Accuracy | 52.9 | Human benchmark
Ethics | Ethics (per-ethics) | Accuracy | 67.6 | Human benchmark
Ethics | Ethics (per-ethics) | Accuracy | 60.9 | RuGPT-3 Small
Ethics | Ethics (per-ethics) | Accuracy | 44.9 | RuGPT-3 Large
Ethics | Ethics (per-ethics) | Accuracy | 44.1 | RuGPT-3 Medium
Logical Reasoning | Winograd Automatic | Accuracy | 87 | Human benchmark
Logical Reasoning | Winograd Automatic | Accuracy | 57.9 | RuGPT-3 Small
Logical Reasoning | Winograd Automatic | Accuracy | 57.2 | RuGPT-3 Medium
Logical Reasoning | Winograd Automatic | Accuracy | 55.5 | RuGPT-3 Large
Logical Reasoning | RuWorldTree | Accuracy | 83.7 | Human benchmark
Logical Reasoning | RuWorldTree | Accuracy | 40.7 | RuGPT-3 Large
Logical Reasoning | RuWorldTree | Accuracy | 38 | RuGPT-3 Medium
Logical Reasoning | RuWorldTree | Accuracy | 34 | RuGPT-3 Small
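
The RuGPT-3 rows above come from few-shot evaluation of autoregressive LMs, which for multiple-choice tasks is commonly done by ranking answer options by language-model likelihood. A minimal sketch of that scoring loop follows, assuming a HuggingFace causal LM; the checkpoint choice and the sum-of-log-probabilities heuristic are illustrative assumptions, not the paper's exact protocol:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative RuGPT-3 checkpoint; the paper evaluates Small/Medium/Large variants.
name = "sberbank-ai/rugpt3small_based_on_gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization
    of `prompt + option`, which holds for typical BPE tokenizers here.
    """
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its preceding context.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the option's tokens and sum their log-probs.
    return token_lp[0, n_prompt - 1:].sum().item()

def predict(prompt: str, options: list[str]) -> str:
    """Pick the answer option the model assigns the highest likelihood."""
    return max(options, key=lambda o: option_logprob(prompt, o))
```

Accuracy on a task is then the fraction of examples where `predict` returns the gold option; the table shows this still trails the human benchmark on most TAPE tasks.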

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)