
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving

2021-12-08 NA 2021 12

Tasks: Figure Of Speech Detection, Reading Comprehension, Clinical Knowledge, High School Chemistry, Winogrande, Causal Judgment, RACE-m, Moral Scenarios, Question Answering, Marketing, Entailed Polarity, Mathematical Reasoning, Anatomy, Intelligent Communication, High School World History, Multi-task Language Understanding, Professional Accounting, Moral Disputes, Global Facts, College Medicine, Movie Dialog Same Or Different, Phrase Relatedness, Electrical Engineering, Logical Args, Public Relations, Presuppositions As NLI, Sentence Completion, Jurisprudence, Mathematical Induction, GRE Reading Comprehension, High School Physics, High School Psychology, Common Sense Reasoning, College Computer Science, Conceptual Physics, Human Aging, Similarities Abstraction, Dark Humor Detection, High School Microeconomics, Crass AI, Navigate, Natural Questions, Fact Checking, Philosophy, Sentence Ambiguity, Metaphor Boolean, High School Government and Politics, College Chemistry, Formal Logic, Odd One Out, Logical Reasoning, High School Computer Science, Analytic Entailment, Empirical Judgments, Understanding Fables, High School Statistics, Question Selection, Prehistory, High School Geography, Irony Identification, High School US History, TriviaQA, Movie Recommendation, Miscellaneous, College Biology, College Physics, Professional Medicine, Abstract Algebra, Emotional Intelligence, Moral Permissibility, Elementary Mathematics, Nonsense Words Grammar, High School Biology, Computer Security, World Religions, Timedial, Ethics, Physics MC, Evaluating Information Essentiality, English Proverbs, Implicatures, Management, Human Sexuality, Riddle Sense, Security Studies, Professional Law, Sports Understanding, Professional Psychology, Fantasy Reasoning, Discourse Marker Prediction, Medical Genetics, Analogical Similarity, High School Mathematics, RACE-h, Intent Recognition, Crash Blossom, BIG-bench Machine Learning, Identify Odd Metapor, Virology, High School Macroeconomics, Astronomy, Human Organs Senses Multiple Choice, Nutrition, FEVER (3-way), Word Sense Disambiguation, Logical Fallacies, Memorization, General Knowledge, FEVER (2-way), US Foreign Policy, Physical Intuition, High School European History, Language Modelling, LAMBADA, Sociology, Econometrics, Temporal Sequences, Multiple Choice Question Answering (MCQA), Business Ethics, Sarcasm Detection, Epistemic Reasoning, Implicit Relations, Misconceptions, College Mathematics, International Law

Abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

Results

Task | Dataset | Metric | Value | Model
Reading Comprehension | BIG-bench | Accuracy | 88.7 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 36.4 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 41.4 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 74.5 | Gopher-280B (zero-shot)
Reading Comprehension | BIG-bench | Accuracy | 62 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 57.6 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 64.1 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 52.7 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 27.3 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 50.7 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 61.4 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 81.8 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 71.6 | Gopher-280B (few-shot, k=5)
Reading Comprehension | BIG-bench | Accuracy | 75.1 | Gopher-280B (few-shot, k=5)
Question Answering | SIQA | Accuracy | 50.6 | Gopher (zero-shot)
Question Answering | Natural Questions | EM | 28.2 | Gopher (few-shot, k=64)
Question Answering | TruthfulQA | MC1 | 0.295 | Gopher 280B (zero-shot, Our Prompt + Choices)
Question Answering | TruthfulQA | MC1 | 0.25 | Gopher 7.1 (zero-shot, QA prompts)
Question Answering | TruthfulQA | MC1 | 0.23 | Gopher 7.1B (zero-shot, Our Prompt + Choices)
Question Answering | TruthfulQA | MC1 | 0.23 | Gopher 1.4 (zero-shot, QA prompts)
Question Answering | TruthfulQA | MC1 | 0.217 | Gopher 1.4B (zero-shot, Our Prompt + Choices)
Question Answering | PIQA | Accuracy | 81.8 | Gopher 280B (0-shot)
Question Answering | BoolQ | Accuracy | 79.3 | Gopher (zero-shot)
Question Answering | BIG-bench (Novel Concepts) | Accuracy | 59.1 | Gopher-280B (few-shot, k=5)
Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 50.5 | Gopher-280B (few-shot, k=5)
Question Answering | BIG-bench (Navigate) | Accuracy | 51.1 | Gopher-280B (few-shot, k=5)
Question Answering | BIG-bench (Ruin Names) | Accuracy | 38.6 | Gopher-280B (few-shot, k=5)
Question Answering | BIG-bench (Hyperbaton) | Accuracy | 51.7 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 50.8 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 45.5 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | WinoGrande | Accuracy | 70.1 | Gopher 280B (0-shot)
Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 54.9 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench (Winowhy) | Accuracy | 56.7 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench (Known Unknowns) | Accuracy | 63.6 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 44.1 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench (Logical Sequence) | Accuracy | 36.4 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 68.2 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 11.7 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 52.5 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 50.9 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 63.6 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 56.8 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 69.7 | Gopher-280B (few-shot, k=5)
Common Sense Reasoning | BIG-bench | Accuracy | 39.6 | Gopher-280B (few-shot, k=5)
Word Sense Disambiguation | BIG-bench (Anachronisms) | Accuracy | 56.4 | Gopher-280B (few-shot, k=5)
Language Modelling | USPTO Backgrounds | BPB | 0.546 | Gopher
Language Modelling | StackExchange | BPB | 0.641 | Gopher
Language Modelling | FreeLaw | BPB | 0.513 | Gopher
Language Modelling | PhilPapers | BPB | 0.695 | Gopher
Language Modelling | Arxiv HEP-TH citation graph | BPB | 0.662 | Gopher
Language Modelling | Curation Corpus | BPB | 0.475 | Gopher
Language Modelling | OpenWebtext2 | BPB | 0.677 | Gopher
Language Modelling | Gutenberg PG-19 | BPB | 0.656 | Gopher
Language Modelling | Bookcorpus2 | BPB | 0.741 | Gopher
Language Modelling | DM Mathematics | BPB | 1.14 | Gopher
Language Modelling | Books3 | BPB | 0.712 | Gopher
Language Modelling | HackerNews | BPB | 0.89 | Gopher
Language Modelling | Pile CC | BPB | 0.691 | Gopher
Language Modelling | GitHub | BPB | 0.377 | Gopher
Language Modelling | PubMed Central | BPB | 0.525 | Gopher
Language Modelling | NIH ExPorter | BPB | 0.59 | Gopher
Language Modelling | PubMed Cognitive Control Abstracts | BPB | 0.577 | Gopher
Language Modelling | OpenSubtitles | BPB | 0.899 | Gopher
Language Modelling | Ubuntu IRC | BPB | 1.09 | Gopher
Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 48.3 | Gopher-280B (few-shot, k=5)
Mathematical Reasoning | BIG-bench | Accuracy | 35.7 | Gopher-280B (few-shot, k=5)
Mathematical Reasoning | BIG-bench | Accuracy | 25 | Gopher-280B (few-shot, k=5)
Mathematical Reasoning | BIG-bench | Accuracy | 57.6 | Gopher-280B (few-shot, k=5)
Mathematical Reasoning | BIG-bench | Accuracy | 23.7 | Gopher-280B (few-shot, k=5)
Mathematical Reasoning | BIG-bench | Accuracy | 44.3 | Gopher-280B (few-shot, k=5)
Analogical Similarity | BIG-bench | Accuracy | 17.2 | Gopher-280B (few-shot, k=5)
Identify Odd Metapor | BIG-bench | Accuracy | 38.6 | Gopher-280B (few-shot, k=5)
Odd One Out | BIG-bench | Accuracy | 32.5 | Gopher-280B (few-shot, k=5)
Sentence Completion | HellaSwag | Accuracy | 79.2 | Gopher 280B (0-shot)
Emotional Intelligence | BIG-bench | Accuracy | 83.1 | Gopher-280B (few-shot, k=5)
Ethics | BIG-bench | Accuracy | 40.2 | Gopher-280B (few-shot, k=5)
Ethics | BIG-bench | Accuracy | 55.1 | Gopher-280B (few-shot, k=5)
Ethics | BIG-bench | Accuracy | 70 | Gopher-280B (few-shot, k=5)
Ethics | BIG-bench | Accuracy | 66.8 | Gopher-280B (few-shot, k=5)
Fact Checking | BIG-bench | Accuracy | 61.7 | Gopher-280B (few-shot, k=5)
Fact Checking | BIG-bench | Accuracy | 69.1 | Gopher-280B (few-shot, k=5)
Fact Checking | BIG-bench | Accuracy | 77.5 | Gopher-280B (few-shot, k=10)
Fact Checking | BIG-bench | Accuracy | 77.5 | Gopher-280B (few-shot, k=15)
General Knowledge | BIG-bench | Accuracy | 93.9 | Gopher-280B (few-shot, k=5)
General Knowledge | BIG-bench | Accuracy | 28.2 | Gopher-280B (few-shot, k=64)
General Knowledge | BIG-bench | Accuracy | 57.1 | Gopher-280B (few-shot, k=64)
General Knowledge | BIG-bench | Accuracy | 75.7 | Gopher-280B (few-shot, k=5)
General Knowledge | BIG-bench | Accuracy | 81.8 | Gopher-280B (few-shot, k=5)
General Knowledge | BIG-bench | Accuracy | 38 | Gopher-280B (few-shot, k=5)
High School European History | BIG-bench | Accuracy | 72.1 | Gopher-280B (few-shot, k=5)
High School US History | BIG-bench | Accuracy | 78.9 | Gopher-280B (few-shot, k=5)
High School World History | BIG-bench | Accuracy | 75.1 | Gopher-280B (few-shot, k=5)
International Law | BIG-bench | Accuracy | 77.7 | Gopher-280B (few-shot, k=5)
Jurisprudence | BIG-bench | Accuracy | 71.3 | Gopher-280B (few-shot, k=5)
Logical Fallacies | BIG-bench | Accuracy | 72.4 | Gopher-280B (few-shot, k=5)
Management | BIG-bench | Accuracy | 77.7 | Gopher-280B (few-shot, k=5)
Marketing | BIG-bench | Accuracy | 83.3 | Gopher-280B (few-shot, k=5)
Philosophy | BIG-bench | Accuracy | 68.8 | Gopher-280B (few-shot, k=5)
Prehistory | BIG-bench | Accuracy | 67.6 | Gopher-280B (few-shot, k=5)
Professional Law | BIG-bench | Accuracy | 44.5 | Gopher-280B (few-shot, k=5)
World Religions | BIG-bench | Accuracy | 84.2 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 40.6 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (Logic Grid Puzzle) | Accuracy | 35.1 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 19 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 50.7 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 49.2 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (Logical Fallacy Detection) | Accuracy | 58.9 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench (StrategyQA) | Accuracy | 61 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 59.7 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 56.4 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 33.6 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 59.3 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 53 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 89.5 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 16.7 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 59.1 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 34 | Gopher-280B (few-shot, k=5)
Logical Reasoning | BIG-bench | Accuracy | 37 | Gopher-280B (few-shot, k=5)
Anatomy | BIG-bench | Accuracy | 56.3 | Gopher-280B (few-shot, k=5)
Clinical Knowledge | BIG-bench | Accuracy | 67.2 | Gopher-280B (few-shot, k=5)
College Medicine | BIG-bench | Accuracy | 60.1 | Gopher-280B (few-shot, k=5)
Human Aging | BIG-bench | Accuracy | 66.4 | Gopher-280B (few-shot, k=5)
Human Organs Senses Multiple Choice | BIG-bench | Accuracy | 84.8 | Gopher-280B (few-shot, k=5)
Medical Genetics | BIG-bench | Accuracy | 69 | Gopher-280B (few-shot, k=5)
Nutrition | BIG-bench | Accuracy | 69.9 | Gopher-280B (few-shot, k=5)
Professional Medicine | BIG-bench | Accuracy | 64 | Gopher-280B (few-shot, k=5)
Virology | BIG-bench | Accuracy | 47 | Gopher-280B (few-shot, k=5)
Econometrics | BIG-bench | Accuracy | 43 | Gopher-280B (few-shot, k=5)
High School Geography | BIG-bench | Accuracy | 76.8 | Gopher-280B (few-shot, k=5)
High School Government and Politics | BIG-bench | Accuracy | 83.9 | Gopher-280B (few-shot, k=5)
High School Macroeconomics | BIG-bench | Accuracy | 65.1 | Gopher-280B (few-shot, k=5)
High School Microeconomics | BIG-bench | Accuracy | 66.4 | Gopher-280B (few-shot, k=5)
High School Psychology | BIG-bench | Accuracy | 81.8 | Gopher-280B (few-shot, k=5)
Human Sexuality | BIG-bench | Accuracy | 67.2 | Gopher-280B (few-shot, k=5)
Professional Psychology | BIG-bench | Accuracy | 68.1 | Gopher-280B (few-shot, k=5)
Public Relations | BIG-bench | Accuracy | 71.8 | Gopher-280B (few-shot, k=5)
Security Studies | BIG-bench | Accuracy | 64.9 | Gopher-280B (few-shot, k=5)
Sociology | BIG-bench | Accuracy | 84.1 | Gopher-280B (few-shot, k=5)
US Foreign Policy | BIG-bench | Accuracy | 81 | Gopher-280B (few-shot, k=5)
Intent Recognition | BIG-bench | Accuracy | 88.7 | Gopher-280B (few-shot, k=5)
Memorization | BIG-bench (Hindu Knowledge) | Accuracy | 80 | Gopher-280B (few-shot, k=5)
BIG-bench Machine Learning | BIG-bench | Accuracy | 41.1 | Gopher-280B (few-shot, k=5)
Astronomy | BIG-bench | Accuracy | 65.8 | Gopher-280B (few-shot, k=5)
Computer Security | BIG-bench | Accuracy | 65 | Gopher-280B (few-shot, k=5)
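
The Language Modelling rows above report BPB (bits per byte), where lower is better. As a rough illustration of how such a figure is commonly derived from a model's token-level loss, here is a minimal sketch. It shows the generic unit conversion only and is not taken from the Gopher evaluation code; the function names are hypothetical, and it assumes the loss is a natural-log cross-entropy and that bytes are counted over the UTF-8 encoding of the text.

```python
import math


def bits_per_byte(total_nll_nats: float, num_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    normalises by text length independently of the tokenizer.
    """
    if num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    return total_nll_nats / (num_bytes * math.log(2))


def bits_per_byte_from_token_loss(
    mean_token_loss_nats: float, num_tokens: int, text: str
) -> float:
    """Same conversion, starting from a mean per-token loss (hypothetical helper)."""
    # Assumption: bytes are counted over the UTF-8 encoding of the scored text.
    num_bytes = len(text.encode("utf-8"))
    return bits_per_byte(mean_token_loss_nats * num_tokens, num_bytes)
```

Because the normalisation is by bytes rather than tokens, values computed this way can be compared across models with different vocabularies, which is one reason byte-level metrics are used for corpora like those listed above.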

Related Papers

PiMRef: Detecting and Explaining Ever-evolving Spear Phishing Emails with Knowledge Base Invariants (2025-07-21)
Leveraging Context for Multimodal Fallacy Classification in Political Debates (2025-07-21)
Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)