Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

2018-04-20 · WS 2018 · Natural Language Inference · Natural Language Understanding · Transfer Learning · QQP · Diagnostic
Paper · PDF · Code

Abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
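The abstract compares multi-task baselines against "the aggregate performance of training a separate model per task." As an illustrative sketch (not code from the paper), aggregation can be done by macro-averaging per-task metrics, averaging within a task first when it reports multiple metrics (e.g. MultiNLI matched/mismatched accuracy in the Results table below). The task names and the SST-2 score here are hypothetical placeholders.

```python
def aggregate_score(task_scores):
    """Macro-average per-task scores into a single benchmark score.

    task_scores maps a task name to either one metric value or a list
    of metric values; multi-metric tasks are averaged internally first.
    """
    per_task = []
    for task, scores in task_scores.items():
        if isinstance(scores, (list, tuple)):
            per_task.append(sum(scores) / len(scores))
        else:
            per_task.append(scores)
    return sum(per_task) / len(per_task)

# MultiNLI values are from the Results table; SST-2 is a made-up example.
scores = {
    "MultiNLI": [72.2, 72.1],  # matched / mismatched accuracy
    "SST-2": 85.0,             # hypothetical
}
print(aggregate_score(scores))
```

This mirrors the common leaderboard convention of weighting each task equally regardless of how many metrics it reports.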

Results

Task | Dataset | Metric | Value | Model
Natural Language Inference | MultiNLI | Matched | 72.2 | Multi-task BiLSTM + Attn
Natural Language Inference | MultiNLI | Mismatched | 72.1 | Multi-task BiLSTM + Attn

Related Papers

RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
Smart fault detection in satellite electrical power system (2025-07-18)
Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)
Demographic-aware fine-grained classification of pediatric wrist fractures (2025-07-17)
Best Practices for Large-Scale, Pixel-Wise Crop Mapping and Transfer Learning Workflows (2025-07-16)
Trustworthy Tree-based Machine Learning by $MoS_2$ Flash-based Analog CAM with Inherent Soft Boundaries (2025-07-16)
LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification (2025-07-15)
Robust-Multi-Task Gradient Boosting (2025-07-15)