Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of datasets described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
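The core recipe is to rewrite each training example from an existing NLP dataset into a natural-language instruction, drawing from several templates per dataset. Below is a minimal sketch of that verbalization step, assuming hypothetical template wording, task keys, and field names; it illustrates the idea rather than reproducing the paper's actual templates or code.

```python
# Illustrative sketch of instruction-template verbalization (not the authors' code).
# Task keys, template wording, and field names below are hypothetical examples.

import random

# Each dataset gets several natural-language instruction templates; {premise},
# {hypothesis}, {text}, and {options} are placeholders filled from the raw example.
TEMPLATES = {
    "nli": [
        "Premise: {premise}\nHypothesis: {hypothesis}\n"
        "Does the premise entail the hypothesis?\n{options}",
        "Read the following and determine if the hypothesis can be inferred "
        "from the premise.\nPremise: {premise}\nHypothesis: {hypothesis}\n{options}",
    ],
    "sentiment": [
        "Review: {text}\nIs this review positive or negative?\n{options}",
        "What is the sentiment of the following review?\n{text}\n{options}",
    ],
}

def verbalize(task: str, example: dict, options: list[str]) -> str:
    """Turn a raw labeled example into an instruction-following prompt."""
    template = random.choice(TEMPLATES[task])  # mix templates for diversity
    options_str = "OPTIONS:\n- " + "\n- ".join(options)
    return template.format(**example, options=options_str)

# Example usage: one NLI example rendered with a randomly chosen template.
prompt = verbalize(
    "nli",
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A musician is performing."},
    options=["yes", "it is not possible to tell", "no"],
)
print(prompt)
```

Instruction tuning then finetunes the pretrained model on these verbalized prompts and their target answers across many datasets, holding out entire task types for zero-shot evaluation, as in the results listed below.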
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | WMT2016 Romanian-English | BLEU score | 38.1 | FLAN 137B (few-shot, k=9) |
| Machine Translation | WMT2016 Romanian-English | BLEU score | 37.3 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2014 French-English | BLEU score | 37.9 | FLAN 137B (few-shot, k=9) |
| Machine Translation | WMT2014 French-English | BLEU score | 35.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2016 English-German | BLEU score | 27 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2016 English-German | BLEU score | 26.1 | FLAN 137B (few-shot, k=11) |
| Machine Translation | WMT2016 German-English | BLEU score | 40.7 | FLAN 137B (few-shot, k=11) |
| Machine Translation | WMT2016 German-English | BLEU score | 38.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2016 English-Romanian | BLEU score | 20.5 | FLAN 137B (few-shot, k=9) |
| Machine Translation | WMT2016 English-Romanian | BLEU score | 18.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2014 English-French | BLEU score | 33.9 | FLAN 137B (zero-shot) |
| Machine Translation | WMT2014 English-French | BLEU score | 33.8 | FLAN 137B (few-shot, k=9) |
| Question Answering | COPA | Accuracy | 94 | FLAN 137B (prompt-tuned) |
| Question Answering | COPA | Accuracy | 91 | FLAN 137B (zero-shot) |
| Question Answering | COPA | Accuracy | 87 | FLAN 137B (few-shot, k=16) |
| Question Answering | OBQA | Accuracy | 78.4 | FLAN 137B (zero-shot) |
| Question Answering | OBQA | Accuracy | 78.2 | FLAN 137B (few-shot, k=16) |
| Question Answering | MultiRC | F1 | 83.4 | FLAN 137B (prompt-tuned) |
| Question Answering | MultiRC | F1 | 77.5 | FLAN 137B (zero-shot) |
| Question Answering | MultiRC | F1 | 72.1 | FLAN 137B (few-shot, k=1) |
| Question Answering | PIQA | Accuracy | 81.7 | FLAN 137B (few-shot, k=10) |
| Question Answering | PIQA | Accuracy | 80.5 | FLAN 137B (zero-shot) |
| Question Answering | StoryCloze | Accuracy | 94.7 | FLAN 137B (few-shot, k=10) |
| Question Answering | StoryCloze | Accuracy | 93.4 | FLAN 137B (zero-shot) |
| Question Answering | BoolQ | Accuracy | 86.3 | FLAN 137B (prompt-tuned) |
| Question Answering | BoolQ | Accuracy | 84.6 | FLAN 137B (few-shot, k=4) |
| Question Answering | BoolQ | Accuracy | 82.9 | FLAN 137B (zero-shot) |
| Question Answering | Natural Questions | EM | 20.7 | FLAN 137B (zero-shot) |
| Question Answering | TriviaQA | EM | 56.7 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 72.8 | FLAN 137B (few-shot, k=16) |
| Common Sense Reasoning | WinoGrande | Accuracy | 71.2 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 63.8 | FLAN 137B (few-shot, k=13) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 63.1 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 80.7 | FLAN 137B (few-shot, k=14) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 79.6 | FLAN 137B (zero-shot) |
| Common Sense Reasoning | ReCoRD | EM | 85.1 | FLAN 137B (prompt-tuned) |
| Common Sense Reasoning | ReCoRD | EM | 72.5 | FLAN 137B (zero-shot) |
| Natural Language Inference | WNLI | Accuracy | 74.6 | FLAN 137B (zero-shot) |
| Natural Language Inference | WNLI | Accuracy | 70.4 | FLAN 137B (few-shot, k=4) |
| Sentiment Analysis | IMDb | Accuracy | 95 | FLAN 137B (few-shot, k=2) |
| Sentiment Analysis | IMDb | Accuracy | 94.3 | FLAN 137B (zero-shot) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 86.5 | FLAN 137B (prompt-tuned) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 80.8 | FLAN 137B (zero-shot) |
| Sentence Completion | HellaSwag | Accuracy | 59.2 | FLAN 137B (few-shot, k=3) |
| Sentence Completion | HellaSwag | Accuracy | 56.7 | FLAN 137B (zero-shot) |
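For context on the zero-shot and few-shot (k exemplars) rows above: zero-shot evaluation presents only the instruction and the test instance, while few-shot evaluation prepends k labeled exemplars in the same format. The sketch below shows this prompt assembly under assumed formatting choices; the concatenation scheme, instruction text, and example questions are hypothetical, not the paper's evaluation harness.

```python
# Minimal sketch of zero-shot vs. few-shot prompt construction for evaluation
# (an illustration of the general recipe, not the authors' evaluation code).

def build_prompt(instruction: str, exemplars: list[tuple[str, str]], query: str) -> str:
    """Concatenate k labeled exemplars (few-shot) or none (zero-shot) before the query."""
    parts = []
    for x, y in exemplars:                      # k = len(exemplars); k = 0 gives zero-shot
        parts.append(f"{instruction}\n{x}\n{y}")
    parts.append(f"{instruction}\n{query}")     # the instance to be answered
    return "\n\n".join(parts)

# Zero-shot (k=0) and few-shot (k=2) prompts for a hypothetical yes/no question.
instruction = "Answer the question with yes or no."
exemplars = [("Is the sky blue on a clear day?", "yes"),
             ("Do fish breathe air?", "no")]
zero_shot = build_prompt(instruction, [], "Is ice colder than steam?")
few_shot = build_prompt(instruction, exemplars, "Is ice colder than steam?")
print(zero_shot)
print("---")
print(few_shot)
```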