Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, Jason Wei

2022-10-20 · Tasks: Question Answering, Multi-task Language Understanding, Paraphrase Identification, Coreference Resolution, Cross-Lingual Question Answering, MMLU

Abstract

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
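The evaluations above cover zero-shot, few-shot, and CoT prompting setups, where few-shot prompts concatenate k solved exemplars before the target item. A minimal sketch of building a five-shot, MMLU-style multiple-choice prompt; the helper names and the example questions are hypothetical placeholders, not actual MMLU items or code from the paper:

```python
# Sketch: building a few-shot, multiple-choice prompt of the kind used to
# evaluate instruction-finetuned models on MMLU. Hypothetical helpers.

def format_question(question, choices, answer=None):
    """Render one multiple-choice item; include the answer for solved exemplars."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"({letter}) {choice}")
    # Solved exemplars end with the answer; the target item ends with a cue.
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(shots, target_question, target_choices):
    """Concatenate k solved exemplars followed by the unsolved target item."""
    blocks = [format_question(q, c, a) for q, c, a in shots]
    blocks.append(format_question(target_question, target_choices))
    return "\n\n".join(blocks)

# Placeholder exemplar, not a real MMLU question.
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_few_shot_prompt(shots, "What is 3 + 3?", ["5", "6", "7", "8"])
print(prompt)
```

The model's completion after the trailing "Answer:" cue is then scored against the gold letter.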

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Transfer / Multi-Task Learning | BBH-alg | Average (%) | 66.5 | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) |
| Transfer / Multi-Task Learning | BBH-alg | Average (%) | 62.2 | PaLM 540B (CoT + self-consistency) |
| Transfer / Multi-Task Learning | BBH-alg | Average (%) | 61.3 | Flan-PaLM 540B (3-shot, fine-tuned, CoT) |
| Transfer / Multi-Task Learning | BBH-alg | Average (%) | 57.6 | PaLM 540B (CoT) |
| Transfer / Multi-Task Learning | BBH-alg | Average (%) | 48.2 | Flan-PaLM 540B (3-shot, fine-tuned) |
| Transfer / Multi-Task Learning | BBH-alg | Average (%) | 38.3 | PaLM 540B |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 73.5 | LLaMA 2 (65B) |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 59.5 | GPT-3 Davinci 175B (CoT) |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 45.5 | Flan-T5-XL 3B (CoT) |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 45.1 | Flan-T5-Large 780M |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 40.5 | Flan-T5-Large 780M (CoT) |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 39.7 | GPT-3 Davinci 175B (5-shot) |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 35.9 | Flan-T5-Base 250M |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 33.7 | Flan-T5-Base 250M (CoT) |
| Transfer / Multi-Task Learning | MMLU | Average (%) | 28.7 | Flan-T5-Small 80M |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 72 | Flan-PaLM 540B (8-shot, fine-tuned, CoT + SC) |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 60.4 | Flan-U-PaLM 540B (CoT) |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 57 | Flan-PaLM 540B (8-shot, fine-tuned, CoT) |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 36 | text-davinci-003 |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 35 | code-davinci-002 |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 23.7 | text-davinci-002 |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 21.2 | Flan-PaLM 540B (8-shot, fine-tuned) |
| Transfer / Multi-Task Learning | MGSM | Average (%) | 5.7 | GPT-3 Davinci 175B |
| Transfer / Multi-Task Learning | BBH-nlp | Average (%) | 78.4 | Flan-PaLM 540B (3-shot, fine-tuned, CoT + SC) |
| Transfer / Multi-Task Learning | BBH-nlp | Average (%) | 78.2 | PaLM 540B (CoT + self-consistency) |
| Transfer / Multi-Task Learning | BBH-nlp | Average (%) | 72.4 | Flan-PaLM 540B (3-shot, fine-tuned, CoT) |
| Transfer / Multi-Task Learning | BBH-nlp | Average (%) | 71.2 | PaLM 540B (CoT) |
| Transfer / Multi-Task Learning | BBH-nlp | Average (%) | 70 | Flan-PaLM 540B (5-shot, fine-tuned) |
| Transfer / Multi-Task Learning | BBH-nlp | Average (%) | 62.7 | PaLM 540B |
| Question Answering | TyDiQA-Gold | EM | 68.3 | Flan-U-PaLM 540B (direct prompting) |
| Question Answering | TyDiQA-Gold | EM | 67.8 | Flan-PaLM 540B (direct prompting) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 89.82 | Flan-T5 XXL (zero-shot) |
