Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre

2022-03-29

Tasks: Figure Of Speech Detection, Causal Judgment, Question Answering, Entailed Polarity, Mathematical Reasoning, Multi-task Language Understanding, Movie Dialog Same Or Different, Phrase Relatedness, Logical Args, Presuppositions As NLI, Sentence Completion, Mathematical Induction, GRE Reading Comprehension, Common Sense Reasoning, Similarities Abstraction, Dark Humor Detection, Crass AI, Navigate, Sentence Ambiguity, Metaphor Boolean, Odd One Out, Logical Reasoning, Analytic Entailment, Empirical Judgments, Understanding Fables, Question Selection, Irony Identification, Movie Recommendation, Moral Permissibility, Nonsense Words Grammar, Timedial, Physics MC, Evaluating Information Essentiality, English Proverbs, Implicatures, Riddle Sense, Sports Understanding, Fantasy Reasoning, Discourse Marker Prediction, Analogical Similarity, MMLU, Intent Recognition, Crash Blossom, Identify Odd Metaphor, Human Organs Senses Multiple Choice, Word Sense Disambiguation, General Knowledge, Physical Intuition, Language Modelling, LAMBADA, Temporal Sequences, Multiple Choice Question Answering (MCQA), Sarcasm Detection, Epistemic Reasoning, Implicit Relations, Misconceptions

Abstract

We investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. We find that current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, we find that for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.
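The equal-scaling rule in the abstract can be sketched numerically. A minimal sketch, assuming the standard C ≈ 6·N·D approximation for training FLOPs (6 × parameters × tokens) and the paper's finding that parameters and tokens both scale as roughly C^0.5, calibrated at the Chinchilla point (70B parameters, ~1.4T tokens, per the paper; the helper name `compute_optimal` is ours, not from the paper):

```python
def compute_optimal(c_flops: float, n_ref: float = 70e9, d_ref: float = 1.4e12):
    """Given a compute budget in FLOPs, return (parameters, tokens) under the
    equal-scaling rule N_opt ~ C^0.5, D_opt ~ C^0.5, calibrated so the
    reference point (n_ref, d_ref) maps back to itself."""
    c_ref = 6 * n_ref * d_ref          # FLOPs at the calibration point, C ~ 6*N*D
    scale = (c_flops / c_ref) ** 0.5   # equal split of extra compute
    return n_ref * scale, d_ref * scale
```

Note how the rule reproduces the abstract's statement: quadrupling compute doubles both the model size and the token count, so a doubled model is always matched by a doubled dataset.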

Results

| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Reading Comprehension | BIG-bench | Accuracy | 92.8 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 49.4 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 52.6 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 77.4 | Chinchilla-70B (zero-shot) |
| Reading Comprehension | BIG-bench | Accuracy | 75 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 82.4 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 69 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 63.3 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 53.1 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 54.5 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 78 | Chinchilla-70B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 94 | Chinchilla-70B (few-shot, k=5) |
| Transfer Learning | MMLU | Average (%) | 67.5 | chatgpt/gpt3.5(20B) |
| Question Answering | SIQA | Accuracy | 51.3 | Chinchilla (zero-shot) |
| Question Answering | Natural Questions | EM | 35.5 | Chinchilla (few-shot, k=64) |
| Question Answering | PIQA | Accuracy | 81.8 | Chinchilla 70B (0-shot) |
| Question Answering | BoolQ | Accuracy | 83.7 | Chinchilla 70B (0-shot) |
| Question Answering | BIG-bench (Novel Concepts) | Accuracy | 65.6 | Chinchilla-70B (few-shot, k=5) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 75.6 | Chinchilla-70B (few-shot, k=5) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 52.6 | Chinchilla-70B (few-shot, k=5) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 47.1 | Chinchilla-70B (few-shot, k=5) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 54.2 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 57.4 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 54.7 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | WinoGrande | Accuracy | 74.9 | Chinchilla 70B (0-shot) |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 71 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Winowhy) | Accuracy | 62.5 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Known Unknowns) | Accuracy | 65.2 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 52.3 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Logical Sequence) | Accuracy | 64.1 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 85.7 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 13.1 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 67.7 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 68.8 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 47.6 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 75 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 73 | Chinchilla-70B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 60.3 | Chinchilla-70B (few-shot, k=5) |
| Word Sense Disambiguation | BIG-bench (Anachronisms) | Accuracy | 69.1 | Chinchilla-70B (few-shot, k=5) |
| Language Modelling | LAMBADA | Accuracy | 77.7 | Chinchilla (zero-shot) |
| Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 58.6 | Chinchilla-70B (few-shot, k=5) |
| Multi-Task Learning | MMLU | Average (%) | 67.5 | chatgpt/gpt3.5(20B) |
| Mathematical Reasoning | BIG-bench | Accuracy | 47.3 | Chinchilla-70B (few-shot, k=5) |
| Analogical Similarity | BIG-bench | Accuracy | 38.1 | Chinchilla-70B (few-shot, k=5) |
| Identify Odd Metaphor | BIG-bench | Accuracy | 68.8 | Chinchilla-70B (few-shot, k=5) |
| Odd One Out | BIG-bench | Accuracy | 70.9 | Chinchilla-70B (few-shot, k=5) |
| Sentence Completion | HellaSwag | Accuracy | 80.8 | Chinchilla 70B (0-shot) |
| Emotional Intelligence | BIG-bench | Accuracy | 66.2 | Chinchilla-70B (few-shot, k=5) |
| Ethics | BIG-bench | Accuracy | 57.3 | Chinchilla-70B (few-shot, k=5) |
| Fact Checking | BIG-bench | Accuracy | 65.3 | Chinchilla-70B (few-shot, k=5) |
| Fact Checking | BIG-bench | Accuracy | 71.7 | Chinchilla-70B (few-shot, k=5) |
| General Knowledge | BIG-bench | Accuracy | 94.3 | Chinchilla-70B (few-shot, k=5) |
| General Knowledge | BIG-bench | Accuracy | 87 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 48.7 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Logic Grid Puzzle) | Accuracy | 44 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 32 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 52.1 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 59.7 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Logical Fallacy Detection) | Accuracy | 72.1 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (StrategyQA) | Accuracy | 68.3 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 79 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 60.6 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 93.1 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 67.1 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 94 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 17.6 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 56.2 | Chinchilla-70B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 49.9 | Chinchilla-70B (few-shot, k=5) |
| Human Organs Senses Multiple Choice | BIG-bench | Accuracy | 85.7 | Chinchilla-70B (few-shot, k=5) |
| Intent Recognition | BIG-bench | Accuracy | 92.8 | Chinchilla-70B (few-shot, k=5) |

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
- Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
- Making Language Model a Hierarchical Classifier and Generator (2025-07-17)