Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Transcending Scaling Laws with 0.1% Extra Compute

Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani

2022-10-20 | Tags: Question Answering, Multi-task Language Understanding, GSM8K, Large Language Model, Arithmetic Reasoning, Cross-Lingual Question Answering, MMLU, Language Modelling

Abstract

Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving ~4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.
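The core of UL2R is reusing UL2's mixture-of-denoisers objective for a short continued-training phase on an existing checkpoint. The snippet below is a minimal, hypothetical sketch of how such training examples could be constructed in Python; the mode tokens ([R], [S], [X]), corruption rates, span lengths, and sentinel format are simplified placeholders and not the paper's exact configuration.

```python
import random

# Hypothetical sketch of UL2-style "mixture-of-denoisers" example construction,
# the objective that UL2R applies during a short continued-training phase.
# Rates, span lengths, mode tokens, and sentinels are illustrative only.

def span_corrupt(tokens, corruption_rate, mean_span_len, rng):
    """T5-style span corruption: mask random spans, emit sentinel-delimited targets."""
    n = len(tokens)
    budget = max(1, round(n * corruption_rate))  # total number of tokens to mask
    inputs, targets = [], []
    i, sid = 0, 0
    while i < n:
        if budget > 0 and rng.random() < corruption_rate:
            span = min(budget, n - i, max(1, round(rng.expovariate(1.0 / mean_span_len))))
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            targets.extend(tokens[i:i + span])
            budget -= span
            sid += 1
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

def prefix_lm(tokens, rng):
    """S-denoising: condition on a random prefix and predict the remaining suffix."""
    split = rng.randrange(1, len(tokens))
    return tokens[:split], tokens[split:]

# One denoiser is sampled per example; its mode token prefixes the model input.
DENOISERS = [
    ("[R]", lambda t, rng: span_corrupt(t, 0.15, 3, rng)),   # regular span corruption
    ("[X]", lambda t, rng: span_corrupt(t, 0.50, 32, rng)),  # extreme corruption
    ("[S]", lambda t, rng: prefix_lm(t, rng)),                # prefix-LM denoising
]

def make_ul2_example(tokens, rng=random):
    mode, build = rng.choice(DENOISERS)
    inputs, targets = build(list(tokens), rng)
    return [mode] + inputs, targets

if __name__ == "__main__":
    text = "scaling language models improves performance but costs compute".split()
    print(make_ul2_example(text))
```

Per the abstract, the checkpoint is then trained on such (input, target) pairs for only a few additional steps, roughly 0.1% of the original pretraining compute, which is where the improved scaling curves come from.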

Results

Task                 | Dataset      | Metric               | Value | Model
Transfer Learning    | MGSM         | Average (%)          | 49.9  | U-PaLM 540B (CoT)
Question Answering   | StrategyQA   | Accuracy             | 76.6  | U-PaLM 540B
Question Answering   | StrategyQA   | Accuracy             | 76.4  | PaLM 540B
Question Answering   | StrategyQA   | Accuracy             | 61.9  | Minerva 540B
Question Answering   | TyDiQA-GoldP | EM                   | 78.4  | U-PaLM 62B (fine-tuned)
Question Answering   | TyDiQA-GoldP | F1                   | 88.5  | U-PaLM 62B (fine-tuned)
Question Answering   | TyDiQA-GoldP | EM                   | 54.6  | U-PaLM 540B (CoT)
Multi-Task Learning  | MGSM         | Average (%)          | 49.9  | U-PaLM 540B (CoT)
Arithmetic Reasoning | GSM8K        | Accuracy             | 58.5  | U-PaLM
Arithmetic Reasoning | GSM8K        | Parameters (Billion) | 540   | U-PaLM

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)