
PaLM: Scaling Language Modeling with Pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, Noah Fiedel

2022-04-05 · Google Research 2022

Tasks: Reading Comprehension, Question Answering, Few-Shot Learning, Math, Multi-task Language Understanding, Sentence Completion, Coreference Resolution, Natural Language Inference, Common Sense Reasoning, Auto Debugging, Logical Reasoning, Cross-Lingual Question Answering, Code Generation, Memorization, Language Modelling, Multiple Choice Question Answering (MCQA)

Abstract

Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model (PaLM). We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods. We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks. On a number of these tasks, PaLM 540B achieves breakthrough performance, outperforming the finetuned state-of-the-art on a suite of multi-step reasoning tasks, and outperforming average human performance on the recently released BIG-bench benchmark. A significant number of BIG-bench tasks showed discontinuous improvements from model scale, meaning that performance steeply increased as we scaled to our largest model. PaLM also has strong capabilities in multilingual tasks and source code generation, which we demonstrate on a wide array of benchmarks. We additionally provide a comprehensive analysis on bias and toxicity, and study the extent of training data memorization with respect to model scale. Finally, we discuss the ethical considerations related to large language models and discuss potential mitigation strategies.
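In this setting, "few-shot" means conditioning the frozen model on a handful of solved exemplars placed directly in the prompt, with no gradient updates. The sketch below illustrates the prompt construction; the `generate` function is a hypothetical stand-in for a text-completion model, since no public PaLM endpoint is assumed here:

```python
# Minimal sketch of k-shot prompting as described in the abstract.
# `generate` is a hypothetical stand-in for a text-completion model;
# no public PaLM endpoint is assumed.

def build_few_shot_prompt(examples: list[tuple[str, str]], query: str) -> str:
    """Concatenate k solved exemplars followed by the unsolved query."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\n\nQ: {query}\nA:"

def generate(prompt: str) -> str:
    raise NotImplementedError  # hypothetical completion call

# k=2 exemplars adapt the model to the task format without any training.
exemplars = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
prompt = build_few_shot_prompt(exemplars, "What is the capital of Peru?")
# answer = generate(prompt)  # expected continuation: "Lima"
```

The zero-shot, one-shot, and few-shot (k=5, k=64) entries in the results below differ only in how many such exemplars are packed into the prompt.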

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Reading Comprehension | RACE | Accuracy (High) | 49.1 | PaLM 540B (zero-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 68.1 | PaLM 540B (zero-shot) |
| Reading Comprehension | RACE | Accuracy (High) | 47.5 | PaLM 62B (zero-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 64.3 | PaLM 62B (zero-shot) |
| Reading Comprehension | RACE | Accuracy (High) | 42.3 | PaLM 8B (zero-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 57.9 | PaLM 8B (zero-shot) |
| Transfer Learning | MGSM | Average (%) | 55 | PaLM 540B |
| Question Answering | COPA | Accuracy | 100 | PaLM 540B (finetuned) |
| Question Answering | Natural Questions | EM | 39.6 | PaLM-540B (Few-Shot, k=64) |
| Question Answering | Natural Questions | EM | 29.3 | PaLM-540B (One-Shot) |
| Question Answering | Natural Questions | EM | 21.2 | PaLM-540B (Zero-Shot) |
| Question Answering | OBQA | Accuracy | 53.4 | PaLM 540B (zero-shot) |
| Question Answering | OBQA | Accuracy | 50.4 | PaLM 62B (zero-shot) |
| Question Answering | MultiRC | EM | 69.2 | PaLM 540B (finetuned) |
| Question Answering | MultiRC | F1 | 90.1 | PaLM 540B (finetuned) |
| Question Answering | WebQuestions | EM | 43.5 | PaLM-540B (Few-Shot) |
| Question Answering | WebQuestions | EM | 22.6 | PaLM-540B (One-Shot) |
| Question Answering | WebQuestions | EM | 10.6 | PaLM-540B (Zero-Shot) |
| Question Answering | BoolQ | Accuracy | 92.2 | PaLM 540B (finetuned) |
| Question Answering | TriviaQA | EM | 81.4 | PaLM-540B (Few-Shot) |
| Question Answering | TriviaQA | EM | 81.4 | PaLM-540B (One-Shot) |
| Question Answering | TriviaQA | EM | 76.9 | PaLM-540B (Zero-Shot) |
| Question Answering | BIG-bench (Novel Concepts) | Accuracy | 71.9 | PaLM-540B (few-shot, k=5) |
| Question Answering | BIG-bench (Novel Concepts) | Accuracy | 59.4 | PaLM-62B (few-shot, k=5) |
| Question Answering | TyDiQA-GoldP | EM | 52.9 | PaLM-540B (CoT) |
| Code Generation | MBPP | Accuracy | 47 | PaLM Coder 540B |
| Code Generation | MBPP | Accuracy | 36.8 | PaLM 540B |
| Common Sense Reasoning | WinoGrande | Accuracy | 81.1 | PaLM 540B (0-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 77 | PaLM 62B (0-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 77 | PaLM-cont 62B (0-shot) |
| Common Sense Reasoning | BIG-bench (Winowhy) | Accuracy | 65.9 | PaLM-540B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Winowhy) | Accuracy | 61 | PaLM-62B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Known Unknowns) | Accuracy | 73.9 | PaLM-540B (few-shot, k=5) |
| Common Sense Reasoning | ReCoRD | EM | 94 | PaLM 540B (finetuned) |
| Common Sense Reasoning | ReCoRD | F1 | 94.6 | PaLM 540B (finetuned) |
| Word Sense Disambiguation | Words in Context | Accuracy | 78.8 | PaLM 540B (finetuned) |
| Natural Language Inference | CommitmentBank | Accuracy | 100 | PaLM 540B (finetuned) |
| Natural Language Inference | CommitmentBank | F1 | 100 | PaLM 540B (finetuned) |
| Language Modelling | LAMBADA | Accuracy | 89.7 | PaLM-540B (Few-Shot) |
| Language Modelling | LAMBADA | Accuracy | 81.8 | PaLM-540B (One-Shot) |
| Language Modelling | LAMBADA | Accuracy | 77.9 | PaLM-540B (Zero-Shot) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 100 | PaLM 540B (finetuned) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 89.5 | PaLM 540B (5-shot) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 89.1 | PaLM 540B (0-shot) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 86.3 | PaLM 540B (1-shot) |
| Multi-Task Learning | MGSM | Average (%) | 55 | PaLM 540B |
| Extreme Summarization | GEM-XSum | ROUGE-2 | 21.2 | PaLM-540B (finetuned) |
| Extreme Summarization | GEM-XSum | ROUGE-2 | 21 | T5-XXL |
| Extreme Summarization | GEM-XSum | ROUGE-2 | 18.5 | PaLM-62B (finetuned) |
| Sentence Completion | HellaSwag | Accuracy | 83.8 | PaLM-540B (Few-Shot) |
| Sentence Completion | HellaSwag | Accuracy | 83.6 | PaLM-540B (1-shot) |
| Sentence Completion | HellaSwag | Accuracy | 83.4 | PaLM-540B (0-shot) |
| Auto Debugging | BIG-bench Lite | Exact string match | 38.2 | PaLM 62B (few-shot, k=5) |
| Auto Debugging | BIG-bench Lite | Exact string match | 38.2 | PaLM 540B (few-shot, k=5) |
| Auto Debugging | BIG-bench Lite | Exact string match | 14.7 | PaLM 8B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (StrategyQA) | Accuracy | 73.9 | PaLM-540B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (StrategyQA) | Accuracy | 65.4 | PaLM-62B (few-shot, k=5) |
| Memorization | BIG-bench (Hindu Knowledge) | Accuracy | 95.4 | PaLM-540B (few-shot, k=5) |
| Memorization | BIG-bench (Hindu Knowledge) | Accuracy | 77.7 | PaLM-62B (few-shot, k=5) |
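A note on the metrics above: for the open-domain QA rows, EM is conventionally computed as exact string match after light answer normalization, and F1 as token-level overlap between prediction and reference. Below is a minimal sketch of the standard SQuAD-style definitions; the paper's own evaluation scripts may differ in details such as handling multiple reference answers:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over the bag of normalized tokens."""
    pred, ref = normalize(prediction).split(), normalize(reference).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0
print(round(token_f1("in Paris, France", "Paris"), 2))   # 0.5
```

When a question has multiple reference answers, the usual convention is to take the maximum EM and F1 over all references.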

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GLAD: Generalizable Tuning for Vision-Language Models (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)