Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of datasets described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
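The core recipe is to rewrite each training example from an existing NLP dataset into a natural-language instruction, drawing from several templates per dataset. Below is a minimal sketch of that verbalization step, assuming hypothetical template wording, task keys, and field names; it illustrates the idea rather than reproducing the paper's actual templates or code.

```python
# Illustrative sketch of instruction-template verbalization (not the authors' code).
# Task keys, template wording, and field names below are hypothetical examples.

import random

# Each dataset gets several natural-language instruction templates; {premise},
# {hypothesis}, {text}, and {options} are placeholders filled from the raw example.
TEMPLATES = {
    "nli": [
        "Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis?\n{options}",
        "Read the following and determine if the hypothesis can be inferred "
        "from the premise.\nPremise: {premise}\nHypothesis: {hypothesis}\n{options}",
    ],
    "sentiment": [
        "Review: {text}\nIs this review positive or negative?\n{options}",
        "What is the sentiment of the following review?\n{text}\n{options}",
    ],
}

def verbalize(task: str, example: dict, options: list[str]) -> str:
    """Turn a raw labeled example into an instruction-following prompt."""
    template = random.choice(TEMPLATES[task])  # mix templates for diversity
    options_str = "OPTIONS:\n- " + "\n- ".join(options)
    return template.format(**example, options=options_str)

# Example usage: one NLI example rendered with a randomly chosen template.
prompt = verbalize(
    "nli",
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A musician is performing."},
    options=["yes", "it is not possible to tell", "no"],
)
print(prompt)
```

Instruction tuning then finetunes the pretrained model on these verbalized prompts and their target answers across many datasets, holding out entire task types for zero-shot evaluation, as in the results listed below.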
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | WMT2016 Romanian-English | BLEU score | 38.1 | FLAN 137B (few-shot, k=9) |
| Machine Translation | WMT2016 Romanian-English | BLEU score | 37.3 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2014 French-English | BLEU score | 37.9 | FLAN 137B (few-shot, k=9) |
| Machine Translation | WMT2014 French-English | BLEU score | 35.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2016 English-German | BLEU score | 27 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2016 English-German | BLEU score | 26.1 | FLAN 137B (few-shot, k=11) |
| Machine Translation | WMT2016 German-English | BLEU score | 40.7 | FLAN 137B (few-shot, k=11) |
| Machine Translation | WMT2016 German-English | BLEU score | 38.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2016 English-Romanian | BLEU score | 20.5 | FLAN 137B (few-shot, k=9) |
| Machine Translation | WMT2016 English-Romanian | BLEU score | 18.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2014 English-French | BLEU score | 33.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2014 English-French | BLEU score | 33.8 | FLAN 137B (few-shot, k=9) |
| Question Answering | COPA | Accuracy | 94 | FLAN 137B (prompt-tuned) |
| Question Answering | COPA | Accuracy | 91 | FLAN 137B (zero-shot) |
| Question Answering | COPA | Accuracy | 87 | FLAN 137B (few-shot, k=16) |
| Question Answering | OBQA | Accuracy | 78.4 | FLAN 137B (zero-shot) |
| Question Answering | OBQA | Accuracy | 78.2 | FLAN 137B (few-shot, k=16) |
| Question Answering | MultiRC | F1 | 83.4 | FLAN 137B (prompt-tuned) |
| Question Answering | MultiRC | F1 | 77.5 | FLAN 137B (zero-shot) |
| Question Answering | MultiRC | F1 | 72.1 | FLAN 137B (few-shot, k=1) |
| Question Answering | PIQA | Accuracy | 81.7 | FLAN 137B (few-shot, k=10) |
| Question Answering | PIQA | Accuracy | 80.5 | FLAN 137B (zero-shot) |
| Question Answering | StoryCloze | Accuracy | 94.7 | FLAN 137B (few-shot, k=10) |
| Question Answering | StoryCloze | Accuracy | 93.4 | FLAN 137B (zero-shot) |
| Question Answering | BoolQ | Accuracy | 86.3 | FLAN 137B (prompt-tuned) |
| Question Answering | BoolQ | Accuracy | 84.6 | FLAN 137B (few-shot, k=4) |
| Question Answering | BoolQ | Accuracy | 82.9 | FLAN 137B (zero-shot) |
| Question Answering | Natural Questions | EM | 20.7 | FLAN 137B (zero-shot) |
| Question Answering | TriviaQA | EM | 56.7 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 72.8 | FLAN 137B (few-shot, k=16) |
| Common Sense Reasoning | WinoGrande | Accuracy | 71.2 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 63.8 | FLAN 137B (few-shot, k=13) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 63.1 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 80.7 | FLAN 137B (few-shot, k=14) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 79.6 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | ReCoRD | EM | 85.1 | FLAN 137B (prompt-tuned) |
| Common Sense Reasoning | ReCoRD | EM | 72.5 | FLAN 137B (zero-shot) |
| Natural Language Inference | WNLI | Accuracy | 74.6 | FLAN 137B (zero-shot) |
| Natural Language Inference | WNLI | Accuracy | 70.4 | FLAN 137B (few-shot, k=4) |
| Sentiment Analysis | IMDb | Accuracy | 95 | FLAN 137B (few-shot, k=2) |
| Sentiment Analysis | IMDb | Accuracy | 94.3 | FLAN 137B (zero-shot) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 86.5 | FLAN 137B (prompt-tuned) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 80.8 | FLAN 137B (zero-shot) |
| Sentence Completion | HellaSwag | Accuracy | 59.2 | FLAN 137B (few-shot, k=3) |
| Sentence Completion | HellaSwag | Accuracy | 56.7 | FLAN 137B (zero-shot) |
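For context on the zero-shot and few-shot (k exemplars) rows above: zero-shot evaluation presents only the instruction and the test instance, while few-shot evaluation prepends k labeled exemplars in the same format. The sketch below shows this prompt assembly under assumed formatting choices; the concatenation scheme, instruction text, and example questions are hypothetical, not the paper's evaluation harness.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction for evaluation
# (an illustration of the general recipe, not the authors' evaluation code).

def build_prompt(instruction: str, exemplars: list[tuple[str, str]], query: str) -> str:
    """Concatenate k labeled exemplars (few-shot) or none (zero-shot) before the query."""
    parts = []
    for x, y in exemplars:                      # k = len(exemplars); k = 0 gives zero-shot
        parts.append(f"{instruction}\n{x}\n{y}")
    parts.append(f"{instruction}\n{query}")     # the instance to be answered
    return "\n\n".join(parts)

# Zero-shot (k=0) and few-shot (k=2) prompts for a hypothetical yes/no question.
instruction = "Answer the question with yes or no."
exemplars = [("Is the sky blue on a clear day?", "yes"),
             ("Do fish breathe air?", "no")]
zero_shot = build_prompt(instruction, [], "Is ice colder than steam?")
few_shot = build_prompt(instruction, exemplars, "Is ice colder than steam?")
print(zero_shot)
print("---")
print(few_shot)
```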