Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Published 2019-10-23 · arXiv, October 2019
Tasks: Machine Translation, Semantic Parsing, Question Answering, Sentiment Analysis, Coreference Resolution, Natural Language Inference, Common Sense Reasoning, Multimodal Intent Recognition, Transfer Learning, Semantic Textual Similarity, Linguistic Acceptability, Question Generation, Answer Generation, Word Sense Disambiguation, Poll Generation
Links: Paper · PDF · Code implementations

Abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
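The unified framework described in the abstract casts every task as string-to-string, with a task prefix prepended to the input and the label rendered as a literal target string. A minimal sketch of that idea (the prefixes follow the paper's examples; the helper function itself is illustrative, not the authors' code):

```python
# Illustrative sketch of T5's text-to-text framing: one input format,
# one output format, for every task. Classification labels become
# target strings rather than class indices.

def to_text_to_text(task_prefix: str, input_text: str) -> str:
    """Build a T5-style model input: '<task prefix>: <raw text>'."""
    return f"{task_prefix}: {input_text}"

# Translation: the target is the translated sentence.
translation_input = to_text_to_text("translate English to German", "That is good.")
translation_target = "Das ist gut."

# Sentiment classification: the target is the label word itself.
sentiment_input = to_text_to_text("sst2 sentence", "it confirms fincher's status as a film maker")
sentiment_target = "positive"
```

Because inputs and outputs are always plain text, the same model, loss, and decoding procedure can be shared across translation, classification, summarization, and question answering without task-specific heads.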

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Machine Translation | WMT2014 English-German | BLEU score | 32.1 | T5-11B |
| Machine Translation | WMT2014 English-French | BLEU score | 43.4 | T5 |
| Reading Comprehension | PhotoChat | F1 | 58.9 | T5-3B |
| Reading Comprehension | PhotoChat | Precision | 54.1 | T5-3B |
| Reading Comprehension | PhotoChat | Recall | 64.6 | T5-3B |
| Reading Comprehension | PhotoChat | F1 | 58.1 | T5-base |
| Reading Comprehension | PhotoChat | Precision | 58.2 | T5-base |
| Reading Comprehension | PhotoChat | Recall | 57.9 | T5-base |
| Question Answering | COPA | Accuracy | 94.8 | T5-XXL 11B (fine-tuned) |
| Question Answering | COPA | Accuracy | 92 | T5-XL 3B (fine-tuned) |
| Question Answering | COPA | Accuracy | 83.4 | T5-Large 770M (fine-tuned) |
| Question Answering | COPA | Accuracy | 71.2 | T5-Base 220M (fine-tuned) |
| Question Answering | SQuAD1.1 dev | EM | 90.06 | T5-11B |
| Question Answering | SQuAD1.1 dev | F1 | 95.64 | T5-11B |
| Question Answering | SQuAD1.1 dev | EM | 88.53 | T5-3B |
| Question Answering | SQuAD1.1 dev | F1 | 94.95 | T5-3B |
| Question Answering | SQuAD1.1 dev | EM | 86.66 | T5-Large 770M |
| Question Answering | SQuAD1.1 dev | F1 | 93.79 | T5-Large 770M |
| Question Answering | SQuAD1.1 dev | EM | 85.44 | T5-Base |
| Question Answering | SQuAD1.1 dev | F1 | 92.08 | T5-Base |
| Question Answering | SQuAD1.1 dev | EM | 79.1 | T5-Small |
| Question Answering | SQuAD1.1 dev | F1 | 87.24 | T5-Small |
| Question Answering | MultiRC | F1 | 88.1 | T5-XXL 11B (fine-tuned) |
| Question Answering | MultiRC | EM | 63.3 | T5-11B |
| Question Answering | WebQuestions | EM | 42.8 | T5.1.1-XXL+SSM |
| Question Answering | BoolQ | Accuracy | 91.2 | T5-XXL 11B (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 85.4 | T5-Large 770M (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 81.4 | T5-Base 220M (fine-tuned) |
| Question Answering | BoolQ | Accuracy | 76.4 | T5-Small 60M (fine-tuned) |
| Common Sense Reasoning | ReCoRD | EM | 93.4 | T5-XXL 11B (fine-tuned) |
| Common Sense Reasoning | ReCoRD | F1 | 94.1 | T5-11B |
| Word Sense Disambiguation | Words in Context | Accuracy | 76.9 | T5-XXL 11B |
| Natural Language Inference | WNLI | Accuracy | 93.2 | T5-XXL 11B |
| Natural Language Inference | WNLI | Accuracy | 89.7 | T5-XL 3B |
| Natural Language Inference | WNLI | Accuracy | 85.6 | T5-Large 770M |
| Natural Language Inference | WNLI | Accuracy | 78.8 | T5-Base 220M |
| Natural Language Inference | WNLI | Accuracy | 69.2 | T5-Small 60M |
| Natural Language Inference | CommitmentBank | Accuracy | 96.8 | T5-XXL 11B (fine-tuned) |
| Natural Language Inference | CommitmentBank | F1 | 93.9 | T5-XXL 11B (fine-tuned) |
| Natural Language Inference | CommitmentBank | Accuracy | 94.4 | T5-Large 770M (fine-tuned) |
| Natural Language Inference | CommitmentBank | F1 | 90.3 | T5-Large 770M (fine-tuned) |
| Natural Language Inference | CommitmentBank | Accuracy | 94 | T5-Base 220M (fine-tuned) |
| Natural Language Inference | CommitmentBank | F1 | 86.2 | T5-Base 220M (fine-tuned) |
| Natural Language Inference | MultiNLI | Matched | 92 | T5-XXL 11B (fine-tuned) |
| Natural Language Inference | MultiNLI | Matched | 91.4 | T5-3B |
| Natural Language Inference | MultiNLI | Mismatched | 91.2 | T5-3B |
| Natural Language Inference | MultiNLI | Matched | 89.9 | T5-Large |
| Natural Language Inference | MultiNLI | Matched | 87.1 | T5-Base |
| Natural Language Inference | MultiNLI | Mismatched | 86.2 | T5-Base |
| Natural Language Inference | MultiNLI | Matched | 82.4 | T5-Small |
| Natural Language Inference | MultiNLI | Mismatched | 82.3 | T5-Small |
| Natural Language Inference | MultiNLI | Mismatched | 91.7 | T5-11B |
| Natural Language Inference | MultiNLI | Mismatched | 89.6 | T5-Large 770M |
| Natural Language Inference | WeiboPolls | BLEU-1 | 37.77 | T5 |
| Natural Language Inference | WeiboPolls | BLEU-3 | 25.86 | T5 |
| Natural Language Inference | WeiboPolls | ROUGE-1 | 46.2 | T5 |
| Natural Language Inference | WeiboPolls | ROUGE-L | 43.32 | T5 |
| Semantic Textual Similarity | MRPC | F1 | 91.9 | T5-11B |
| Semantic Textual Similarity | MRPC | F1 | 92.4 | T5-Large |
| Semantic Textual Similarity | MRPC | F1 | 92.5 | T5-3B |
| Semantic Textual Similarity | MRPC | F1 | 90.7 | T5-Base |
| Semantic Textual Similarity | MRPC | F1 | 89.7 | T5-Small |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.925 | T5-11B |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.921 | T5-11B |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.906 | T5-3B |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.898 | T5-3B |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.899 | T5-Large |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.894 | T5-Base |
| Semantic Textual Similarity | STS Benchmark | Pearson Correlation | 0.856 | T5-Small |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.85 | T5-Small |
| Semantic Textual Similarity | STS Benchmark | Spearman Correlation | 0.886 | T5-Large 770M |
| Semantic Parsing | WebQuestionsSP | Accuracy | 56.5 | T5-11B (Raffel et al., 2020) |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 97.5 | T5-11B |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 97.4 | T5-3B |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 96.3 | T5-Large 770M |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 95.2 | T5-Base |
| Sentiment Analysis | SST-2 Binary classification | Accuracy | 91.8 | T5-Small |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 93.8 | T5-XXL 11B (fine-tuned) |
| Text Summarization | CNN / Daily Mail | ROUGE-1 | 43.52 | T5 |
| Text Summarization | CNN / Daily Mail | ROUGE-2 | 21.55 | T5 |
| Text Summarization | CNN / Daily Mail | ROUGE-L | 40.69 | T5 |
| Text Summarization | CNN / Daily Mail | ROUGE-1 | 43.52 | T5-11B |
| Text Summarization | CNN / Daily Mail | ROUGE-2 | 21.55 | T5-11B |
| Text Summarization | CNN / Daily Mail | ROUGE-L | 40.69 | T5-11B |
| Abstractive Text Summarization | CNN / Daily Mail | ROUGE-1 | 43.52 | T5 |
| Abstractive Text Summarization | CNN / Daily Mail | ROUGE-2 | 21.55 | T5 |
| Abstractive Text Summarization | CNN / Daily Mail | ROUGE-L | 40.69 | T5 |
| Question Generation | WeiboPolls | BLEU-1 | 36.91 | T5 |
| Question Generation | WeiboPolls | BLEU-3 | 16.26 | T5 |
| Question Generation | WeiboPolls | ROUGE-1 | 44.46 | T5 |
| Question Generation | WeiboPolls | ROUGE-L | 42.06 | T5 |
| Question Generation | WeiboPolls | BLEU-1 | 37.34 | T5 |
| Question Generation | WeiboPolls | BLEU-3 | 21.06 | T5 |
| Question Generation | WeiboPolls | ROUGE-1 | 45.33 | T5 |
| Question Generation | WeiboPolls | ROUGE-L | 42.69 | T5 |
| Document Summarization | CNN / Daily Mail | ROUGE-1 | 43.52 | T5-11B |
| Document Summarization | CNN / Daily Mail | ROUGE-2 | 21.55 | T5-11B |
| Document Summarization | CNN / Daily Mail | ROUGE-L | 40.69 | T5-11B |
| Intent Recognition | PhotoChat | F1 | 58.9 | T5-3B |
| Intent Recognition | PhotoChat | Precision | 54.1 | T5-3B |
| Intent Recognition | PhotoChat | Recall | 64.6 | T5-3B |
| Intent Recognition | PhotoChat | F1 | 58.1 | T5-base |
| Intent Recognition | PhotoChat | Precision | 58.2 | T5-base |
| Intent Recognition | PhotoChat | Recall | 57.9 | T5-base |
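Several of the Question Answering rows above report SQuAD-style EM (exact match) and token-level F1. A minimal sketch of how those metrics are typically computed, assuming standard SQuAD answer normalization (this is illustrative, not the official evaluation script):

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style answer normalization: lowercase, drop punctuation,
    remove articles (a/an/the), and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """EM: 1 if the normalized strings are identical, else 0."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 over the bag of normalized tokens."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Count overlapping tokens with multiplicity.
    gold_counts: dict[str, int] = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "The Eiffel Tower!" exactly matches the gold answer "eiffel tower" after normalization, while a partially overlapping prediction receives partial F1 credit.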

Related Papers

- RaMen: Multi-Strategy Multi-Modal Learning for Bundle Construction (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- Disentangling coincident cell events using deep transfer learning and compressive sensing (2025-07-17)