TAPE: Assessing Few-shot Russian Language Understanding

Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov

2022-10-23 · Question Answering · Few-Shot Learning · Adversarial Text · Logical Reasoning · Adversarial Attack · Ethics · Zero-Shot Learning

Paper · PDF · Code (official)

Abstract

Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark of six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic, and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistic-oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a far smaller effect. At the same time, the results demonstrate a significant gap between the neural and human baselines on most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
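
To give a concrete sense of the spelling-based perturbations the abstract highlights, here is a minimal sketch of a character-level "butter-fingers"-style typo attack. The function name, the perturbation rate, and the Cyrillic keyboard-neighborhood map are illustrative assumptions, not TAPE's actual implementation:

```python
import random

# Hypothetical keyboard-neighborhood map (small Cyrillic excerpt).
# TAPE's real perturbation engine is more elaborate than this.
NEIGHBORS = {
    "а": "вп",
    "о": "лр",
    "е": "нк",
    "и": "мт",
}

def butter_fingers(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly replace characters with keyboard neighbors to simulate typos."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in NEIGHBORS and rng.random() < rate:
            out.append(rng.choice(NEIGHBORS[ch]))
        else:
            out.append(ch)
    return "".join(out)

print(butter_fingers("машинное обучение", rate=0.3))
```

Robustness is then measured by comparing a model's score on the clean test set against its score on the perturbed copy; the paper's finding is that this kind of surface noise degrades the autoregressive baselines more than meaning-preserving paraphrases do.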

Results

Task | Dataset | Metric | Value | Model
Question Answering | CheGeKa | Accuracy | 64.5 | Human benchmark
Question Answering | MultiQ | Accuracy | 91 | Human benchmark
Question Answering | RuOpenBookQA | Accuracy | 86.5 | Human benchmark
Question Answering | RuOpenBookQA | Accuracy | 57.9 | RuGPT-3 Small
Question Answering | RuOpenBookQA | Accuracy | 57.2 | RuGPT-3 Medium
Question Answering | RuOpenBookQA | Accuracy | 55.5 | RuGPT-3 Large
Ethics | Ethics | Accuracy | 68.6 | RuGPT-3 Large
Ethics | Ethics | Accuracy | 68.3 | RuGPT-3 Medium
Ethics | Ethics | Accuracy | 55.5 | RuGPT-3 Small
Ethics | Ethics | Accuracy | 52.9 | Human benchmark
Ethics | Ethics (per-ethics) | Accuracy | 67.6 | Human benchmark
Ethics | Ethics (per-ethics) | Accuracy | 60.9 | RuGPT-3 Small
Ethics | Ethics (per-ethics) | Accuracy | 44.9 | RuGPT-3 Large
Ethics | Ethics (per-ethics) | Accuracy | 44.1 | RuGPT-3 Medium
Logical Reasoning | Winograd Automatic | Accuracy | 87 | Human benchmark
Logical Reasoning | Winograd Automatic | Accuracy | 57.9 | RuGPT-3 Small
Logical Reasoning | Winograd Automatic | Accuracy | 57.2 | RuGPT-3 Medium
Logical Reasoning | Winograd Automatic | Accuracy | 55.5 | RuGPT-3 Large
Logical Reasoning | RuWorldTree | Accuracy | 83.7 | Human benchmark
Logical Reasoning | RuWorldTree | Accuracy | 40.7 | RuGPT-3 Large
Logical Reasoning | RuWorldTree | Accuracy | 38 | RuGPT-3 Medium
Logical Reasoning | RuWorldTree | Accuracy | 34 | RuGPT-3 Small
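
The RuGPT-3 rows above come from few-shot evaluation of autoregressive LMs, which for multiple-choice tasks is commonly done by ranking answer options by language-model likelihood. A minimal sketch of that scoring loop follows, assuming a HuggingFace causal LM; the checkpoint choice and the sum-of-log-probabilities heuristic are illustrative assumptions, not the paper's exact protocol:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative RuGPT-3 checkpoint; the paper evaluates Small/Medium/Large variants.
name = "sberbank-ai/rugpt3small_based_on_gpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

def option_logprob(prompt: str, option: str) -> float:
    """Sum of token log-probabilities of `option` conditioned on `prompt`.

    Assumes the tokenization of `prompt` is a prefix of the tokenization
    of `prompt + option`, which holds for typical BPE tokenizers here.
    """
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given its preceding context.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the option's tokens and sum their log-probs.
    return token_lp[0, n_prompt - 1:].sum().item()

def predict(prompt: str, options: list[str]) -> str:
    """Pick the answer option the model assigns the highest likelihood."""
    return max(options, key=lambda o: option_logprob(prompt, o))
```

Accuracy on a task is then the fraction of examples where `predict` returns the gold option; the table shows this still trails the human benchmark on most TAPE tasks.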

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
Warehouse Spatial Question Answering with LLM Agent (2025-07-14)