Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and the model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Reading Comprehension | BIG-bench | Accuracy | 88.7 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 36.4 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 41.4 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 74.5 | Gopher-280B (zero-shot) |
| Reading Comprehension | BIG-bench | Accuracy | 62 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 57.6 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 64.1 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 52.7 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 27.3 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 50.7 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 61.4 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 81.8 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 71.6 | Gopher-280B (few-shot, k=5) |
| Reading Comprehension | BIG-bench | Accuracy | 75.1 | Gopher-280B (few-shot, k=5) |
| Question Answering | SIQA | Accuracy | 50.6 | Gopher (zero-shot) |
| Question Answering | Natural Questions | EM | 28.2 | Gopher (few-shot, k=64) |
| Question Answering | TruthfulQA | MC1 | 0.295 | Gopher 280B (zero-shot, Our Prompt + Choices) |
| Question Answering | TruthfulQA | MC1 | 0.25 | Gopher 7.1B (zero-shot, QA prompts) |
| Question Answering | TruthfulQA | MC1 | 0.23 | Gopher 7.1B (zero-shot, Our Prompt + Choices) |
| Question Answering | TruthfulQA | MC1 | 0.23 | Gopher 1.4B (zero-shot, QA prompts) |
| Question Answering | TruthfulQA | MC1 | 0.217 | Gopher 1.4B (zero-shot, Our Prompt + Choices) |
| Question Answering | PIQA | Accuracy | 81.8 | Gopher 280B (zero-shot) |
| Question Answering | BoolQ | Accuracy | 79.3 | Gopher (zero-shot) |
| Question Answering | BIG-bench (Novel Concepts) | Accuracy | 59.1 | Gopher-280B (few-shot, k=5) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 50.5 | Gopher-280B (few-shot, k=5) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 51.1 | Gopher-280B (few-shot, k=5) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 38.6 | Gopher-280B (few-shot, k=5) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 51.7 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 50.8 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 45.5 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | WinoGrande | Accuracy | 70.1 | Gopher 280B (zero-shot) |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 54.9 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Winowhy) | Accuracy | 56.7 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Known Unknowns) | Accuracy | 63.6 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 44.1 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench (Logical Sequence) | Accuracy | 36.4 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 68.2 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 11.7 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 52.5 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 50.9 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 63.6 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 56.8 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 69.7 | Gopher-280B (few-shot, k=5) |
| Common Sense Reasoning | BIG-bench | Accuracy | 39.6 | Gopher-280B (few-shot, k=5) |
| Word Sense Disambiguation | BIG-bench (Anachronisms) | Accuracy | 56.4 | Gopher-280B (few-shot, k=5) |
| Language Modelling | USPTO Backgrounds | BPB | 0.546 | Gopher |
| Language Modelling | StackExchange | BPB | 0.641 | Gopher |
| Language Modelling | FreeLaw | BPB | 0.513 | Gopher |
| Language Modelling | PhilPapers | BPB | 0.695 | Gopher |
| Language Modelling | Arxiv HEP-TH citation graph | BPB | 0.662 | Gopher |
| Language Modelling | Curation Corpus | BPB | 0.475 | Gopher |
| Language Modelling | OpenWebtext2 | BPB | 0.677 | Gopher |
| Language Modelling | Gutenberg PG-19 | BPB | 0.656 | Gopher |
| Language Modelling | Bookcorpus2 | BPB | 0.741 | Gopher |
| Language Modelling | DM Mathematics | BPB | 1.14 | Gopher |
| Language Modelling | Books3 | BPB | 0.712 | Gopher |
| Language Modelling | HackerNews | BPB | 0.89 | Gopher |
| Language Modelling | Pile CC | BPB | 0.691 | Gopher |
| Language Modelling | GitHub | BPB | 0.377 | Gopher |
| Language Modelling | PubMed Central | BPB | 0.525 | Gopher |
| Language Modelling | NIH ExPorter | BPB | 0.59 | Gopher |
| Language Modelling | PubMed Cognitive Control Abstracts | BPB | 0.577 | Gopher |
| Language Modelling | OpenSubtitles | BPB | 0.899 | Gopher |
| Language Modelling | Ubuntu IRC | BPB | 1.09 | Gopher |
| Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 48.3 | Gopher-280B (few-shot, k=5) |
| Mathematical Reasoning | BIG-bench | Accuracy | 35.7 | Gopher-280B (few-shot, k=5) |
| Mathematical Reasoning | BIG-bench | Accuracy | 25 | Gopher-280B (few-shot, k=5) |
| Mathematical Reasoning | BIG-bench | Accuracy | 57.6 | Gopher-280B (few-shot, k=5) |
| Mathematical Reasoning | BIG-bench | Accuracy | 23.7 | Gopher-280B (few-shot, k=5) |
| Mathematical Reasoning | BIG-bench | Accuracy | 44.3 | Gopher-280B (few-shot, k=5) |
| Analogical Similarity | BIG-bench | Accuracy | 17.2 | Gopher-280B (few-shot, k=5) |
| Identify Odd Metaphor | BIG-bench | Accuracy | 38.6 | Gopher-280B (few-shot, k=5) |
| Odd One Out | BIG-bench | Accuracy | 32.5 | Gopher-280B (few-shot, k=5) |
| Sentence Completion | HellaSwag | Accuracy | 79.2 | Gopher 280B (zero-shot) |
| Emotional Intelligence | BIG-bench | Accuracy | 83.1 | Gopher-280B (few-shot, k=5) |
| Ethics | BIG-bench | Accuracy | 40.2 | Gopher-280B (few-shot, k=5) |
| Ethics | BIG-bench | Accuracy | 55.1 | Gopher-280B (few-shot, k=5) |
| Ethics | BIG-bench | Accuracy | 70 | Gopher-280B (few-shot, k=5) |
| Ethics | BIG-bench | Accuracy | 66.8 | Gopher-280B (few-shot, k=5) |
| Fact Checking | BIG-bench | Accuracy | 61.7 | Gopher-280B (few-shot, k=5) |
| Fact Checking | BIG-bench | Accuracy | 69.1 | Gopher-280B (few-shot, k=5) |
| Fact Checking | BIG-bench | Accuracy | 77.5 | Gopher-280B (few-shot, k=10) |
| Fact Checking | BIG-bench | Accuracy | 77.5 | Gopher-280B (few-shot, k=15) |
| General Knowledge | BIG-bench | Accuracy | 93.9 | Gopher-280B (few-shot, k=5) |
| General Knowledge | BIG-bench | Accuracy | 28.2 | Gopher-280B (few-shot, k=64) |
| General Knowledge | BIG-bench | Accuracy | 57.1 | Gopher-280B (few-shot, k=64) |
| General Knowledge | BIG-bench | Accuracy | 75.7 | Gopher-280B (few-shot, k=5) |
| General Knowledge | BIG-bench | Accuracy | 81.8 | Gopher-280B (few-shot, k=5) |
| General Knowledge | BIG-bench | Accuracy | 38 | Gopher-280B (few-shot, k=5) |
| High School European History | BIG-bench | Accuracy | 72.1 | Gopher-280B (few-shot, k=5) |
| High School US History | BIG-bench | Accuracy | 78.9 | Gopher-280B (few-shot, k=5) |
| High School World History | BIG-bench | Accuracy | 75.1 | Gopher-280B (few-shot, k=5) |
| International Law | BIG-bench | Accuracy | 77.7 | Gopher-280B (few-shot, k=5) |
| Jurisprudence | BIG-bench | Accuracy | 71.3 | Gopher-280B (few-shot, k=5) |
| Logical Fallacies | BIG-bench | Accuracy | 72.4 | Gopher-280B (few-shot, k=5) |
| Management | BIG-bench | Accuracy | 77.7 | Gopher-280B (few-shot, k=5) |
| Marketing | BIG-bench | Accuracy | 83.3 | Gopher-280B (few-shot, k=5) |
| Philosophy | BIG-bench | Accuracy | 68.8 | Gopher-280B (few-shot, k=5) |
| Prehistory | BIG-bench | Accuracy | 67.6 | Gopher-280B (few-shot, k=5) |
| Professional Law | BIG-bench | Accuracy | 44.5 | Gopher-280B (few-shot, k=5) |
| World Religions | BIG-bench | Accuracy | 84.2 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 40.6 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Logic Grid Puzzle) | Accuracy | 35.1 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 19 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 50.7 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 49.2 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (Logical Fallacy Detection) | Accuracy | 58.9 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench (StrategyQA) | Accuracy | 61 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 59.7 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 56.4 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 33.6 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 59.3 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 53 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 89.5 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 16.7 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 59.1 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 34 | Gopher-280B (few-shot, k=5) |
| Logical Reasoning | BIG-bench | Accuracy | 37 | Gopher-280B (few-shot, k=5) |
| Anatomy | BIG-bench | Accuracy | 56.3 | Gopher-280B (few-shot, k=5) |
| Clinical Knowledge | BIG-bench | Accuracy | 67.2 | Gopher-280B (few-shot, k=5) |
| College Medicine | BIG-bench | Accuracy | 60.1 | Gopher-280B (few-shot, k=5) |
| Human Aging | BIG-bench | Accuracy | 66.4 | Gopher-280B (few-shot, k=5) |
| Human Organs Senses Multiple Choice | BIG-bench | Accuracy | 84.8 | Gopher-280B (few-shot, k=5) |
| Medical Genetics | BIG-bench | Accuracy | 69 | Gopher-280B (few-shot, k=5) |
| Nutrition | BIG-bench | Accuracy | 69.9 | Gopher-280B (few-shot, k=5) |
| Professional Medicine | BIG-bench | Accuracy | 64 | Gopher-280B (few-shot, k=5) |
| Virology | BIG-bench | Accuracy | 47 | Gopher-280B (few-shot, k=5) |
| Econometrics | BIG-bench | Accuracy | 43 | Gopher-280B (few-shot, k=5) |
| High School Geography | BIG-bench | Accuracy | 76.8 | Gopher-280B (few-shot, k=5) |
| High School Government and Politics | BIG-bench | Accuracy | 83.9 | Gopher-280B (few-shot, k=5) |
| High School Macroeconomics | BIG-bench | Accuracy | 65.1 | Gopher-280B (few-shot, k=5) |
| High School Microeconomics | BIG-bench | Accuracy | 66.4 | Gopher-280B (few-shot, k=5) |
| High School Psychology | BIG-bench | Accuracy | 81.8 | Gopher-280B (few-shot, k=5) |
| Human Sexuality | BIG-bench | Accuracy | 67.2 | Gopher-280B (few-shot, k=5) |
| Professional Psychology | BIG-bench | Accuracy | 68.1 | Gopher-280B (few-shot, k=5) |
| Public Relations | BIG-bench | Accuracy | 71.8 | Gopher-280B (few-shot, k=5) |
| Security Studies | BIG-bench | Accuracy | 64.9 | Gopher-280B (few-shot, k=5) |
| Sociology | BIG-bench | Accuracy | 84.1 | Gopher-280B (few-shot, k=5) |
| US Foreign Policy | BIG-bench | Accuracy | 81 | Gopher-280B (few-shot, k=5) |
| Intent Recognition | BIG-bench | Accuracy | 88.7 | Gopher-280B (few-shot, k=5) |
| Memorization | BIG-bench (Hindu Knowledge) | Accuracy | 80 | Gopher-280B (few-shot, k=5) |
| BIG-bench Machine Learning | BIG-bench | Accuracy | 41.1 | Gopher-280B (few-shot, k=5) |
| Astronomy | BIG-bench | Accuracy | 65.8 | Gopher-280B (few-shot, k=5) |
| Computer Security | BIG-bench | Accuracy | 65 | Gopher-280B (few-shot, k=5) |