Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Finetuned Language Models Are Zero-Shot Learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

Published: 2021-09-03 · ICLR 2022
Tasks: Machine Translation, Question Answering, Sentence Completion, Sentiment Analysis, Coreference Resolution, Natural Language Inference, Common Sense Reasoning, RTE, Zero-Shot Learning, Language Modelling
Links: Paper · PDF · Code (official and community implementations)

Abstract

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
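The abstract's core idea is "verbalizing" existing NLP datasets with natural-language instruction templates, then finetuning on the resulting (instruction, target) pairs. A minimal sketch of that data-preparation step is below; the template wording and field names are illustrative assumptions, not FLAN's actual templates.

```python
# Illustrative sketch of instruction-tuning data preparation: one labeled
# example is rendered through several instruction templates, producing
# multiple (input, target) training pairs. Templates here are hypothetical.

def verbalize(example: dict, template: str) -> dict:
    """Render a labeled example into an (instruction input, target) pair."""
    return {
        "input": template.format(**example),  # extra keys are ignored by format
        "target": example["label"],
    }

# Two hypothetical templates for a natural language inference task.
nli_templates = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"?",
]

example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": "yes",
}

# Each template yields a distinct training pair from the same example.
pairs = [verbalize(example, t) for t in nli_templates]
```

Using several templates per task, as the paper describes, exposes the model to varied phrasings of the same instruction rather than a single fixed format.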

Results

Task | Dataset | Metric | Value | Model
Machine Translation | WMT2016 Romanian-English | BLEU score | 38.1 | FLAN 137B (few-shot, k=9)
Machine Translation | WMT2016 Romanian-English | BLEU score | 37.3 | FLAN 137B (zero-shot)
Machine Translation | WMT2014 French-English | BLEU score | 37.9 | FLAN 137B (few-shot, k=9)
Machine Translation | WMT2014 French-English | BLEU score | 35.9 | FLAN 137B (zero-shot)
Machine Translation | WMT2016 English-German | BLEU score | 27 | FLAN 137B (zero-shot)
Machine Translation | WMT2016 English-German | BLEU score | 26.1 | FLAN 137B (few-shot, k=11)
Machine Translation | WMT2016 German-English | BLEU score | 40.7 | FLAN 137B (few-shot, k=11)
Machine Translation | WMT2016 German-English | BLEU score | 38.9 | FLAN 137B (zero-shot)
Machine Translation | WMT2016 English-Romanian | BLEU score | 20.5 | FLAN 137B (few-shot, k=9)
Machine Translation | WMT2016 English-Romanian | BLEU score | 18.9 | FLAN 137B (zero-shot)
Machine Translation | WMT2014 English-French | BLEU score | 33.9 | FLAN 137B (zero-shot)
Machine Translation | WMT2014 English-French | BLEU score | 33.8 | FLAN 137B (few-shot, k=9)
Question Answering | COPA | Accuracy | 94 | FLAN 137B (prompt-tuned)
Question Answering | COPA | Accuracy | 91 | FLAN 137B (zero-shot)
Question Answering | COPA | Accuracy | 87 | FLAN 137B (few-shot, k=16)
Question Answering | OBQA | Accuracy | 78.4 | FLAN 137B (zero-shot)
Question Answering | OBQA | Accuracy | 78.2 | FLAN 137B (few-shot, k=16)
Question Answering | MultiRC | F1 | 83.4 | FLAN 137B (prompt-tuned)
Question Answering | MultiRC | F1 | 77.5 | FLAN 137B (zero-shot)
Question Answering | MultiRC | F1 | 72.1 | FLAN 137B (one-shot)
Question Answering | PIQA | Accuracy | 81.7 | FLAN 137B (few-shot, k=10)
Question Answering | PIQA | Accuracy | 80.5 | FLAN 137B (zero-shot)
Question Answering | StoryCloze | Accuracy | 94.7 | FLAN 137B (few-shot, k=10)
Question Answering | StoryCloze | Accuracy | 93.4 | FLAN 137B (zero-shot)
Question Answering | BoolQ | Accuracy | 86.3 | FLAN 137B (prompt-tuned)
Question Answering | BoolQ | Accuracy | 84.6 | FLAN 137B (few-shot, k=4)
Question Answering | BoolQ | Accuracy | 82.9 | FLAN 137B (zero-shot)
Question Answering | NaturalQA | EM | 20.7 | FLAN 137B (zero-shot)
Question Answering | TriviaQA | EM | 56.7 | FLAN 137B (zero-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 72.8 | FLAN 137B (few-shot, k=16)
Common Sense Reasoning | WinoGrande | Accuracy | 71.2 | FLAN 137B (zero-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 63.8 | FLAN 137B (few-shot, k=13)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 63.1 | FLAN 137B (zero-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 80.7 | FLAN 137B (few-shot, k=14)
Common Sense Reasoning | ARC (Easy) | Accuracy | 79.6 | FLAN 137B (zero-shot)
Common Sense Reasoning | ReCoRD | EM | 85.1 | FLAN 137B (prompt-tuned)
Common Sense Reasoning | ReCoRD | EM | 72.5 | FLAN 137B (zero-shot)
Natural Language Inference | WNLI | Accuracy | 74.6 | FLAN 137B (zero-shot)
Natural Language Inference | WNLI | Accuracy | 70.4 | FLAN 137B (few-shot, k=4)
Sentiment Analysis | IMDb | Accuracy | 95 | FLAN 137B (few-shot, k=2)
Sentiment Analysis | IMDb | Accuracy | 94.3 | FLAN 137B (zero-shot)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 86.5 | FLAN 137B (prompt-tuned)
Coreference Resolution | Winograd Schema Challenge | Accuracy | 80.8 | FLAN 137B (zero-shot)
Sentence Completion | HellaSwag | Accuracy | 59.2 | FLAN 137B (few-shot, k=3)
Sentence Completion | HellaSwag | Accuracy | 56.7 | FLAN 137B (zero-shot)
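The "zero-shot" versus "few-shot, k=N" labels in the table refer to how the evaluation prompt is built: few-shot prompts prepend k solved exemplars before the test input, while zero-shot prompts contain only the instruction and the test input. A minimal sketch of that distinction, with an illustrative (not FLAN's) instruction and exemplars:

```python
# Sketch of zero-shot vs. few-shot prompt construction. The instruction
# wording and exemplar pairs below are hypothetical, for illustration only.

def build_prompt(instruction: str, exemplars: list, test_input: str, k: int = 0) -> str:
    """Join an instruction, up to k solved exemplars, and the test input."""
    parts = [instruction]
    for source, target in exemplars[:k]:  # k=0 skips all exemplars (zero-shot)
        parts.append(f"{source}\n{target}")
    parts.append(test_input)
    return "\n\n".join(parts)

exemplars = [("cheese", "fromage"), ("dog", "chien")]
zero_shot = build_prompt("Translate English to French:", exemplars, "bread", k=0)
few_shot = build_prompt("Translate English to French:", exemplars, "bread", k=2)
```

In the zero-shot rows above, the instruction-tuned model must generalize from finetuning alone, which is the setting the paper's headline comparisons against GPT-3 use.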

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)