Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
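To make the text-to-text format concrete: every task is cast as feeding the model a plain-text input (typically with a task prefix) and training it to emit the answer as plain text, so one encoder-decoder model with one objective covers translation, classification, regression, and summarization alike. The sketch below is illustrative only; the prefixes follow the conventions described in the paper, but the helper functions themselves are hypothetical names, not part of the released code:

```python
def to_text_to_text(task: str, **fields) -> str:
    """Cast a task instance to a text-to-text input string.

    Each branch prepends a task prefix so a single model can tell
    the tasks apart; inputs and targets are both plain text.
    """
    if task == "translate_en_de":
        return "translate English to German: " + fields["text"]
    if task == "summarize":
        return "summarize: " + fields["text"]
    if task == "cola":
        # Single-sentence acceptability classification.
        return "cola sentence: " + fields["sentence"]
    if task == "stsb":
        # Sentence-pair similarity regression.
        return f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}"
    raise ValueError(f"unknown task: {task}")


def stsb_target(score: float) -> str:
    """Render an STS-B similarity score as a text target.

    Regression is handled by rounding the score to the nearest 0.2
    and emitting it as a string, e.g. 3.8 -> "3.8".
    """
    return str(round(score * 5) / 5)
```

Targets are strings as well, e.g. the class name for classification or the rounded score for STS-B, so no task-specific output head is needed.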
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | WMT2014 English-German | BLEU score | 32.1 | T5-11B |
| Machine Translation | WMT2014 English-French | BLEU score | 43.4 | T5-11B |
| Question Answering | COPA | Accuracy | 94.8 | T5-XXL 11B (fine-tuned) |
| Question Answering | COPA | Accuracy | 92.0 | T5-XL 3B (fine-tuned) |
| Question Answering | COPA | Accuracy | 83.4 | T5-Large 770M (fine-tuned) |
| Question Answering | COPA | Accuracy | 71.2 | T5-Base 220M (fine-tuned) |
| Question Answering | SQuAD1.1 dev | EM | 90.06 | T5-11B |
| Question Answering | SQuAD1.1 dev | F1 | 95.64 | T5-11B |
| Question Answering | SQuAD1.1 dev | EM | 88.53 | T5-3B |
| Question Answering | SQuAD1.1 dev | F1 | 94.95 | T5-3B |
| Question Answering | SQuAD1.1 dev | EM | 86.66 | T5-Large 770M |
| Question Answering | SQuAD1.1 dev | F1 | 93.79 | T5-Large 770M |
| Question Answering | SQuAD1.1 dev | EM | 85.44 | T5-Base |
| Question Answering | SQuAD1.1 dev | F1 | 92.08 | T5-Base |
| Question Answering | SQuAD1.1 dev | EM | 79.1 | T5-Small |
| Question Answering | SQuAD1.1 dev | F1 | 87.24 | T5-Small |
| Question Answering | MultiRC | F1 | 88.1 | T5-XXL 11B (fine-tuned) |
| Question Answering | MultiRC | EM | 63.3 | T5-11B |
| Question Answering | WebQuestions | EM | 42.8 | T5.1.1-XXL+SSM |
| Question Answering | BoolQ | Accuracy | 91.2 | T5-XXL 11B (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 85.4 | T5-Large 770M (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 81.4 | T5-Base 220M (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 76.4 | T5-Small 60M (fine-tuned) |
| Common Sense Reasoning | ReCoRD | EM | 93.4 | T5-XXL 11B (fine-tuned) |
| Common Sense Reasoning | ReCoRD | F1 | 94.1 | T5-11B |
| Word Sense Disambiguation | Words in Context | Accuracy | 76.9 | T5-XXL 11B |
| Natural Language Inference | WNLI | Accuracy | 93.2 | T5-XXL 11B |
| Natural Language Inference | WNLI | Accuracy | 89.7 | T5-XL 3B |
| Natural Language Inference | WNLI | Accuracy | 85.6 | T5-Large 770M |
| Natural Language Inference | WNLI | Accuracy | 78.8 | T5-Base 220M |
| Natural Language Inference | WNLI | Accuracy | 69.2 | T5-Small 60M |
| Natural Language Inference | CommitmentBank | Accuracy | 96.8 | T5-XXL 11B (fine-tuned) |
| Natural Language Inference | CommitmentBank | F1 | 93.9 | T5-XXL 11B (fine-tuned) |
| Natural Language Inference | CommitmentBank | Accuracy | 94.4 | T5-Large 770M (fine-tuned) |
| Natural Language Inference | CommitmentBank | F1 | 90.3 | T5-Large 770M (fine-tuned) |
| Natural Language Inference | CommitmentBank | Accuracy | 94.0 | T5-Base 220M (fine-tuned) |
| Natural Language Inference | CommitmentBank | F1 | 86.2 | T5-Base 220M (fine-tuned) |
| Natural Language Inference | MultiNLI | Matched | 92.0 | T5-XXL 11B (fine-tuned) |
| Natural Language Inference | MultiNLI | Mismatched | 91.7 | T5-11B |
| Natural Language Inference | MultiNLI | Matched | 91.4 | T5-3B |
| Natural Language Inference | MultiNLI | Mismatched | 91.2 | T5-3B |
| Natural Language Inference | MultiNLI | Matched | 89.9 | T5-Large |
| Natural Language Inference | MultiNLI | Mismatched | 89.6 | T5-Large 770M |
| Natural Language Inference | MultiNLI | Matched | 87.1 | T5-Base |
| Natural Language Inference | MultiNLI | Mismatched | 86.2 | T5-Base |
| Natural Language Inference | MultiNLI | Matched | 82.4 | T5-Small |
| Natural Language Inference | MultiNLI | Mismatched | 82.3 | T5-Small |
| Question Generation | WeiboPolls | BLEU-1 | 37.77 | T5 |
| Question Generation | WeiboPolls | BLEU-3 | 25.86 | T5 |
| Question Generation | WeiboPolls | ROUGE-1 | 46.2 | T5 |
| Question Generation | WeiboPolls | ROUGE-L | 43.32 | T5 |
| Semantic Textual Similarity | MRPC | F1 | 91.9 | T5-11B |
| Semantic Textual Similarity | MRPC | F1 | 92.5 | T5-3B |
| Semantic Textual Similarity | MRPC | F1 | 92.4 | T5-Large |
| Semantic Textual Similarity | MRPC | F1 | 90.7 | T5-Base |
| Semantic Textual Similarity | MRPC | F1 | 89.7 | T5-Small |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.925 | T5-11B |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.921 | T5-11B |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.906 | T5-3B |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.898 | T5-3B |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.899 | T5-Large |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.886 | T5-Large 770M |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.894 | T5-Base |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.856 | T5-Small |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.850 | T5-Small |
| Semantic Parsing | WebQuestionsSP | Accuracy | 56.5 | T5-11B (Raffel et al., 2020) |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 97.5 | T5-11B |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 97.4 | T5-3B |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 96.3 | T5-Large 770M |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 95.2 | T5-Base |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 91.8 | T5-Small |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 93.8 | T5-XXL 11B (fine-tuned) |
| Abstractive Text Summarization | CNN / Daily Mail | ROUGE-1 | 43.52 | T5-11B |
| Abstractive Text Summarization | CNN / Daily Mail | ROUGE-2 | 21.55 | T5-11B |
| Abstractive Text Summarization | CNN / Daily Mail | ROUGE-L | 40.69 | T5-11B |
| Question Generation | WeiboPolls | BLEU-1 | 36.91 | T5 |
| Question Generation | WeiboPolls | BLEU-3 | 16.26 | T5 |
| Question Generation | WeiboPolls | ROUGE-1 | 44.46 | T5 |
| Question Generation | WeiboPolls | ROUGE-L | 42.06 | T5 |
| Question Generation | WeiboPolls | BLEU-1 | 37.34 | T5 |
| Question Generation | WeiboPolls | BLEU-3 | 21.06 | T5 |
| Question Generation | WeiboPolls | ROUGE-1 | 45.33 | T5 |
| Question Generation | WeiboPolls | ROUGE-L | 42.69 | T5 |
| Intent Recognition | PhotoChat | F1 | 58.9 | T5-3B |
| Intent Recognition | PhotoChat | Precision | 54.1 | T5-3B |
| Intent Recognition | PhotoChat | Recall | 64.6 | T5-3B |
| Intent Recognition | PhotoChat | F1 | 58.1 | T5-Base |
| Intent Recognition | PhotoChat | Precision | 58.2 | T5-Base |
| Intent Recognition | PhotoChat | Recall | 57.9 | T5-Base |