Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving

2021-12-08

Tasks: Figure Of Speech Detection, Reading Comprehension, Clinical Knowledge, High School Chemistry, Winogrande, Causal Judgment, RACE-m, Moral Scenarios, Question Answering, Marketing, Entailed Polarity, Mathematical Reasoning, Anatomy, Intelligent Communication, High School World History, Multi-task Language Understanding, Professional Accounting, Moral Disputes, Global Facts, College Medicine, Movie Dialog Same Or Different, Phrase Relatedness, Electrical Engineering, Logical Args, Public Relations, Presuppositions As NLI, Sentence Completion, Jurisprudence, Mathematical Induction, GRE Reading Comprehension, High School Physics, High School Psychology, Common Sense Reasoning, College Computer Science, Conceptual Physics, Human Aging, Similarities Abstraction, Dark Humor Detection, High School Microeconomics, Crass AI, Navigate, Natural Questions, Fact Checking, Philosophy, Sentence Ambiguity, Metaphor Boolean, High School Government and Politics, College Chemistry, Formal Logic, Odd One Out, Logical Reasoning, High School Computer Science, Analytic Entailment, Empirical Judgments, Understanding Fables, High School Statistics, Question Selection, Prehistory, High School Geography, Irony Identification, High School US History, TriviaQA, Movie Recommendation, Miscellaneous, College Biology, College Physics, Professional Medicine, Abstract Algebra, Emotional Intelligence, Moral Permissibility, Elementary Mathematics, Nonsense Words Grammar, High School Biology, Computer Security, World Religions, Timedial, Ethics, Physics MC, Evaluating Information Essentiality, English Proverbs, Implicatures, Management, Human Sexuality, Riddle Sense, Security Studies, Professional Law, Sports Understanding, Professional Psychology, Fantasy Reasoning, Discourse Marker Prediction, Medical Genetics, Analogical Similarity, High School Mathematics, RACE-h, Intent Recognition, Crash Blossom, BIG-bench Machine Learning, Identify Odd Metaphor, Virology, High School Macroeconomics, Astronomy, Human Organs Senses Multiple Choice, Nutrition, FEVER (3-way), Word Sense Disambiguation, Logical Fallacies, Memorization, General Knowledge, FEVER (2-way), US Foreign Policy, Physical Intuition, High School European History, Language Modelling, LAMBADA, Sociology, Econometrics, Temporal Sequences, Multiple Choice Question Answering (MCQA), Business Ethics, Sarcasm Detection, Epistemic Reasoning, Implicit Relations, Misconceptions, College Mathematics, International Law


Abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

Results

Task	Dataset	Metric	Value	Model
Reading Comprehension	BIG-bench	Accuracy	88.7	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	36.4	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	41.4	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	74.5	Gopher-280B (zero-shot)
Reading Comprehension	BIG-bench	Accuracy	62	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	57.6	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	64.1	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	52.7	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	27.3	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	50.7	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	61.4	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	81.8	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	71.6	Gopher-280B (few-shot, k=5)
Reading Comprehension	BIG-bench	Accuracy	75.1	Gopher-280B (few-shot, k=5)
Question Answering	SIQA	Accuracy	50.6	Gopher (zero-shot)
Question Answering	Natural Questions	EM	28.2	Gopher (few-shot, k=64)
Question Answering	TruthfulQA	MC1	0.295	Gopher 280B (zero-shot, Our Prompt + Choices)
Question Answering	TruthfulQA	MC1	0.25	Gopher 7.1B (zero-shot, QA prompts)
Question Answering	TruthfulQA	MC1	0.23	Gopher 7.1B (zero-shot, Our Prompt + Choices)
Question Answering	TruthfulQA	MC1	0.23	Gopher 1.4B (zero-shot, QA prompts)
Question Answering	TruthfulQA	MC1	0.217	Gopher 1.4B (zero-shot, Our Prompt + Choices)
Question Answering	PIQA	Accuracy	81.8	Gopher 280B (0-shot)
Question Answering	BoolQ	Accuracy	79.3	Gopher (zero-shot)
Question Answering	BIG-bench (Novel Concepts)	Accuracy	59.1	Gopher-280B (few-shot, k=5)
Question Answering	BIG-bench (Movie Recommendation)	Accuracy	50.5	Gopher-280B (few-shot, k=5)
Question Answering	BIG-bench (Navigate)	Accuracy	51.1	Gopher-280B (few-shot, k=5)
Question Answering	BIG-bench (Ruin Names)	Accuracy	38.6	Gopher-280B (few-shot, k=5)
Question Answering	BIG-bench (Hyperbaton)	Accuracy	51.7	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench (Causal Judgment)	Accuracy	50.8	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench (Disambiguation QA)	Accuracy	45.5	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	WinoGrande	Accuracy	70.1	Gopher 280B (0-shot)
Common Sense Reasoning	BIG-bench (Sports Understanding)	Accuracy	54.9	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench (Winowhy)	Accuracy	56.7	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench (Known Unknowns)	Accuracy	63.6	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench (Date Understanding)	Accuracy	44.1	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench (Logical Sequence)	Accuracy	36.4	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	68.2	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	11.7	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	52.5	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	50.9	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	63.6	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	56.8	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	69.7	Gopher-280B (few-shot, k=5)
Common Sense Reasoning	BIG-bench	Accuracy	39.6	Gopher-280B (few-shot, k=5)
Word Sense Disambiguation	BIG-bench (Anachronisms)	Accuracy	56.4	Gopher-280B (few-shot, k=5)
Language Modelling	USPTO Backgrounds	BPB	0.546	Gopher
Language Modelling	StackExchange	BPB	0.641	Gopher
Language Modelling	FreeLaw	BPB	0.513	Gopher
Language Modelling	PhilPapers	BPB	0.695	Gopher
Language Modelling	Arxiv HEP-TH citation graph	BPB	0.662	Gopher
Language Modelling	Curation Corpus	BPB	0.475	Gopher
Language Modelling	OpenWebtext2	BPB	0.677	Gopher
Language Modelling	Gutenberg PG-19	BPB	0.656	Gopher
Language Modelling	Bookcorpus2	BPB	0.741	Gopher
Language Modelling	DM Mathematics	BPB	1.14	Gopher
Language Modelling	Books3	BPB	0.712	Gopher
Language Modelling	HackerNews	BPB	0.89	Gopher
Language Modelling	Pile CC	BPB	0.691	Gopher
Language Modelling	GitHub	BPB	0.377	Gopher
Language Modelling	PubMed Central	BPB	0.525	Gopher
Language Modelling	NIH ExPorter	BPB	0.59	Gopher
Language Modelling	PubMed Cognitive Control Abstracts	BPB	0.577	Gopher
Language Modelling	OpenSubtitles	BPB	0.899	Gopher
Language Modelling	Ubuntu IRC	BPB	1.09	Gopher
Sarcasm Detection	BIG-bench (SNARKS)	Accuracy	48.3	Gopher-280B (few-shot, k=5)
Mathematical Reasoning	BIG-bench	Accuracy	35.7	Gopher-280B (few-shot, k=5)
Mathematical Reasoning	BIG-bench	Accuracy	25	Gopher-280B (few-shot, k=5)
Mathematical Reasoning	BIG-bench	Accuracy	57.6	Gopher-280B (few-shot, k=5)
Mathematical Reasoning	BIG-bench	Accuracy	23.7	Gopher-280B (few-shot, k=5)
Mathematical Reasoning	BIG-bench	Accuracy	44.3	Gopher-280B (few-shot, k=5)
Analogical Similarity	BIG-bench	Accuracy	17.2	Gopher-280B (few-shot, k=5)
Identify Odd Metaphor	BIG-bench	Accuracy	38.6	Gopher-280B (few-shot, k=5)
Odd One Out	BIG-bench	Accuracy	32.5	Gopher-280B (few-shot, k=5)
Sentence Completion	HellaSwag	Accuracy	79.2	Gopher 280B (0-shot)
Emotional Intelligence	BIG-bench	Accuracy	83.1	Gopher-280B (few-shot, k=5)
Ethics	BIG-bench	Accuracy	40.2	Gopher-280B (few-shot, k=5)
Ethics	BIG-bench	Accuracy	55.1	Gopher-280B (few-shot, k=5)
Ethics	BIG-bench	Accuracy	70	Gopher-280B (few-shot, k=5)
Ethics	BIG-bench	Accuracy	66.8	Gopher-280B (few-shot, k=5)
Fact Checking	BIG-bench	Accuracy	61.7	Gopher-280B (few-shot, k=5)
Fact Checking	BIG-bench	Accuracy	69.1	Gopher-280B (few-shot, k=5)
Fact Checking	BIG-bench	Accuracy	77.5	Gopher-280B (few-shot, k=10)
Fact Checking	BIG-bench	Accuracy	77.5	Gopher-280B (few-shot, k=15)
General Knowledge	BIG-bench	Accuracy	93.9	Gopher-280B (few-shot, k=5)
General Knowledge	BIG-bench	Accuracy	28.2	Gopher-280B (few-shot, k=64)
General Knowledge	BIG-bench	Accuracy	57.1	Gopher-280B (few-shot, k=64)
General Knowledge	BIG-bench	Accuracy	75.7	Gopher-280B (few-shot, k=5)
General Knowledge	BIG-bench	Accuracy	81.8	Gopher-280B (few-shot, k=5)
General Knowledge	BIG-bench	Accuracy	38	Gopher-280B (few-shot, k=5)
High School European History	BIG-bench	Accuracy	72.1	Gopher-280B (few-shot, k=5)
High School US History	BIG-bench	Accuracy	78.9	Gopher-280B (few-shot, k=5)
High School World History	BIG-bench	Accuracy	75.1	Gopher-280B (few-shot, k=5)
International Law	BIG-bench	Accuracy	77.7	Gopher-280B (few-shot, k=5)
Jurisprudence	BIG-bench	Accuracy	71.3	Gopher-280B (few-shot, k=5)
Logical Fallacies	BIG-bench	Accuracy	72.4	Gopher-280B (few-shot, k=5)
Management	BIG-bench	Accuracy	77.7	Gopher-280B (few-shot, k=5)
Marketing	BIG-bench	Accuracy	83.3	Gopher-280B (few-shot, k=5)
Philosophy	BIG-bench	Accuracy	68.8	Gopher-280B (few-shot, k=5)
Prehistory	BIG-bench	Accuracy	67.6	Gopher-280B (few-shot, k=5)
Professional Law	BIG-bench	Accuracy	44.5	Gopher-280B (few-shot, k=5)
World Religions	BIG-bench	Accuracy	84.2	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (Penguins In A Table)	Accuracy	40.6	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (Logic Grid Puzzle)	Accuracy	35.1	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (Temporal Sequences)	Accuracy	19	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (Formal Fallacies Syllogisms Negation)	Accuracy	50.7	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (Reasoning About Colored Objects)	Accuracy	49.2	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (Logical Fallacy Detection)	Accuracy	58.9	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench (StrategyQA)	Accuracy	61	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	59.7	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	56.4	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	33.6	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	59.3	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	53	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	89.5	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	16.7	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	59.1	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	34	Gopher-280B (few-shot, k=5)
Logical Reasoning	BIG-bench	Accuracy	37	Gopher-280B (few-shot, k=5)
Anatomy	BIG-bench	Accuracy	56.3	Gopher-280B (few-shot, k=5)
Clinical Knowledge	BIG-bench	Accuracy	67.2	Gopher-280B (few-shot, k=5)
College Medicine	BIG-bench	Accuracy	60.1	Gopher-280B (few-shot, k=5)
Human Aging	BIG-bench	Accuracy	66.4	Gopher-280B (few-shot, k=5)
Human Organs Senses Multiple Choice	BIG-bench	Accuracy	84.8	Gopher-280B (few-shot, k=5)
Medical Genetics	BIG-bench	Accuracy	69	Gopher-280B (few-shot, k=5)
Nutrition	BIG-bench	Accuracy	69.9	Gopher-280B (few-shot, k=5)
Professional Medicine	BIG-bench	Accuracy	64	Gopher-280B (few-shot, k=5)
Virology	BIG-bench	Accuracy	47	Gopher-280B (few-shot, k=5)
Econometrics	BIG-bench	Accuracy	43	Gopher-280B (few-shot, k=5)
High School Geography	BIG-bench	Accuracy	76.8	Gopher-280B (few-shot, k=5)
High School Government and Politics	BIG-bench	Accuracy	83.9	Gopher-280B (few-shot, k=5)
High School Macroeconomics	BIG-bench	Accuracy	65.1	Gopher-280B (few-shot, k=5)
High School Microeconomics	BIG-bench	Accuracy	66.4	Gopher-280B (few-shot, k=5)
High School Psychology	BIG-bench	Accuracy	81.8	Gopher-280B (few-shot, k=5)
Human Sexuality	BIG-bench	Accuracy	67.2	Gopher-280B (few-shot, k=5)
Professional Psychology	BIG-bench	Accuracy	68.1	Gopher-280B (few-shot, k=5)
Public Relations	BIG-bench	Accuracy	71.8	Gopher-280B (few-shot, k=5)
Security Studies	BIG-bench	Accuracy	64.9	Gopher-280B (few-shot, k=5)
Sociology	BIG-bench	Accuracy	84.1	Gopher-280B (few-shot, k=5)
US Foreign Policy	BIG-bench	Accuracy	81	Gopher-280B (few-shot, k=5)
Intent Recognition	BIG-bench	Accuracy	88.7	Gopher-280B (few-shot, k=5)
Memorization	BIG-bench (Hindu Knowledge)	Accuracy	80	Gopher-280B (few-shot, k=5)
BIG-bench Machine Learning	BIG-bench	Accuracy	41.1	Gopher-280B (few-shot, k=5)
Astronomy	BIG-bench	Accuracy	65.8	Gopher-280B (few-shot, k=5)
Computer Security	BIG-bench	Accuracy	65	Gopher-280B (few-shot, k=5)
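
The results table is tab-separated and mixes metrics with different scales and directions: Accuracy as a percentage, MC1 as a fraction, EM, and BPB (where lower is better). Any summary should therefore group rows by metric as well as by task. The following is a minimal Python sketch of one way to do that, using a handful of rows copied from the table. The names `SAMPLE_TSV` and `mean_by_task_and_metric` are hypothetical, and the unweighted mean is purely illustrative; it is not an aggregate reported by the authors.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# A few rows copied from the results table; the full table has the same columns.
SAMPLE_TSV = """Task\tDataset\tMetric\tValue\tModel
Fact Checking\tBIG-bench\tAccuracy\t61.7\tGopher-280B (few-shot, k=5)
Fact Checking\tBIG-bench\tAccuracy\t69.1\tGopher-280B (few-shot, k=5)
Fact Checking\tBIG-bench\tAccuracy\t77.5\tGopher-280B (few-shot, k=10)
Fact Checking\tBIG-bench\tAccuracy\t77.5\tGopher-280B (few-shot, k=15)
Mathematical Reasoning\tBIG-bench\tAccuracy\t35.7\tGopher-280B (few-shot, k=5)
Mathematical Reasoning\tBIG-bench\tAccuracy\t25\tGopher-280B (few-shot, k=5)
Mathematical Reasoning\tBIG-bench\tAccuracy\t57.6\tGopher-280B (few-shot, k=5)
Mathematical Reasoning\tBIG-bench\tAccuracy\t23.7\tGopher-280B (few-shot, k=5)
Mathematical Reasoning\tBIG-bench\tAccuracy\t44.3\tGopher-280B (few-shot, k=5)
Language Modelling\tGitHub\tBPB\t0.377\tGopher
Language Modelling\tDM Mathematics\tBPB\t1.14\tGopher
"""


def mean_by_task_and_metric(tsv_text):
    """Return {(task, metric): unweighted mean of Value} for a TSV results table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = defaultdict(list)
    for row in reader:
        # Key on the metric too, so accuracies are never averaged with BPB or MC1.
        groups[(row["Task"], row["Metric"])].append(float(row["Value"]))
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    for (task, metric), value in sorted(mean_by_task_and_metric(SAMPLE_TSV).items()):
        print(f"{task} [{metric}]: {value:.3f}")
```

Because many rows share the generic dataset label "BIG-bench" without naming the subtask, a per-task mean of this kind blends heterogeneous subtasks and should be read only as a rough orientation.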
