Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


UL2: Unifying Language Learning Paradigms

Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Siamak Shakeri, Dara Bahri, Tal Schuster, Huaixiu Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler

Published: 2022-05-10

Tasks: Text Classification, Question Answering, Multi-task Language Understanding, Text Generation, Coreference Resolution, Natural Language Inference, Common Sense Reasoning, Long-range Modeling, Information Retrieval, Arithmetic Reasoning, Retrieval, MMLU, Word Sense Disambiguation

Abstract

Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized & unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 & GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised finetuning based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for the UL2 20B & Flan-UL2 20B.
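The abstract's two key ideas -- Mixture-of-Denoisers (sampling among denoising objectives with different span lengths and corruption rates) and mode switching (prepending a paradigm token to each example) -- can be sketched roughly as follows. This is an illustrative sketch only, not the paper's official Flax/T5X implementation: the mode-token names, span parameters, and function names here are assumptions chosen to mirror the paper's R/S/X denoiser characterization.

```python
import random

# Hypothetical sketch of UL2-style Mixture-of-Denoisers (MoD).
# Each denoiser is T5-style span corruption with its own settings;
# parameter values below are illustrative, not the paper's exact configs.
DENOISERS = {
    "[R]": (3, 0.15),    # R-denoiser: regular spans, low corruption (T5-like)
    "[X]": (32, 0.5),    # X-denoiser: extreme spans / high corruption rate
    "[S]": (None, None), # S-denoiser: sequential prefix-LM-style denoising
}

def span_corrupt(tokens, mean_span_len, corruption_rate, rng):
    """Mask random spans with sentinels; return (inputs, target).

    Uses corruption_rate as a per-token span-start probability --
    a simplification of the real length/budget-based sampling.
    """
    n = len(tokens)
    budget = max(1, int(n * corruption_rate))  # tokens left to corrupt
    inputs, target = [], []
    i, sentinel = 0, 0
    while i < n:
        if budget > 0 and rng.random() < corruption_rate:
            # Sample a span length around the denoiser's mean.
            span = min(max(1, int(rng.expovariate(1 / mean_span_len))),
                       n - i, budget)
            inputs.append(f"<extra_id_{sentinel}>")
            target.append(f"<extra_id_{sentinel}>")
            target.extend(tokens[i:i + span])
            i += span
            budget -= span
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, target

def mod_example(tokens, rng):
    """Sample a denoiser and prepend its mode token (mode switching)."""
    mode = rng.choice(list(DENOISERS))
    mean_len, rate = DENOISERS[mode]
    if mean_len is None:
        # S-denoiser: split into an observed prefix and a continuation target.
        cut = rng.randrange(1, len(tokens))
        return [mode] + tokens[:cut], tokens[cut:]
    inputs, target = span_corrupt(tokens, mean_len, rate, rng)
    return [mode] + inputs, target

rng = random.Random(0)
toks = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = mod_example(toks, rng)
print(inp, tgt)
```

At fine-tuning or inference time, the same mode tokens would be prepended to downstream inputs to invoke the matching pre-training behavior, which is the "mode switching" the abstract describes.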

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Transfer Learning | MMLU | Average (%) | 39.2 | UL2 20B (5-shot) |
| Question Answering | COPA | Accuracy | 99 | UL2 20B (fine-tuned) |
| Question Answering | COPA | Accuracy | 85 | UL2 20B (0-shot) |
| Question Answering | BoolQ | Accuracy | 90.8 | UL2 20B (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 63.1 | UL2 20B (0-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 49.5 | UL2 20B (chain-of-thought + self-consistency) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 42.9 | UL2 20B (chain-of-thought) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 29.8 | UL2 20B (0-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 69.8 | UL2 20B (chain-of-thought + self-consistency) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 38.4 | UL2 20B (chain-of-thought) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 32.2 | UL2 20B (0-shot) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 55.7 | UL2 20B (chain-of-thought + self-consistency) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 51.4 | UL2 20B (chain-of-thought) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 34.2 | UL2 20B (0-shot) |
| Word Sense Disambiguation | Words in Context | Accuracy | 77.3 | UL2 20B (fine-tuned) |
| Word Sense Disambiguation | Words in Context | Accuracy | 49.8 | UL2 20B (0-shot) |
| Language Modelling | SCROLLS | Avg. | 37.87 | UL2 |
| Language Modelling | SCROLLS | Nrtv | 24.2 | UL2 |
| Language Modelling | SCROLLS | Qspr | 37.6 | UL2 |
| Language Modelling | SCROLLS | CNLI | 88.7 | UL2 20B |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 98.1 | UL2 20B (fine-tuned) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 79.9 | UL2 20B (0-shot) |
| Multi-Task Learning | MMLU | Average (%) | 39.2 | UL2 20B (5-shot) |
| Arithmetic Reasoning | GSM8K | Accuracy | 4.4 | UL2 20B (chain-of-thought) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 20 | UL2 20B (chain-of-thought) |
| Arithmetic Reasoning | GSM8K | Accuracy | 4.1 | UL2 20B (0-shot) |
| Arithmetic Reasoning | GSM8K | Parameters (Billion) | 20 | UL2 20B (0-shot) |

Related Papers

Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)