Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions, something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
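The phrase "specified purely via text interaction" refers to the paper's in-context learning setup: a prompt is assembled from an optional natural-language instruction, k solved demonstrations, and the unsolved query, and the model's continuation is taken as its answer. Below is a minimal sketch of that prompt format for the English-French translation setting reported in the results table; the format is illustrative and `build_few_shot_prompt` is a hypothetical helper, not code from the paper.

```python
# Few-shot prompting in the style of GPT-3's in-context learning:
# no gradient updates -- the task is specified entirely in the prompt text.

def build_few_shot_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, k solved examples, and the unsolved query."""
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"English: {source}")
        lines.append(f"French: {target}")
        lines.append("")
    lines.append(f"English: {query}")
    lines.append("French:")  # the model continues from here
    return "\n".join(lines)

# k = 2 demonstrations for brevity (the paper's few-shot results in the
# table below typically use k = 32 or k = 64).
demos = [
    ("The cat sits on the mat.", "Le chat est assis sur le tapis."),
    ("I like to read books.", "J'aime lire des livres."),
]
prompt = build_few_shot_prompt(
    "Translate English to French:", demos, "Where is the library?"
)
print(prompt)
# The prompt is then fed to the language model and its completion is scored;
# zero-shot and one-shot differ only in using k = 0 or k = 1 demonstrations.
```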
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Machine Translation | WMT2014 English-French | BLEU | 32.6 | GPT-3 175B (few-shot) |
| Machine Translation | WMT2014 French-English | BLEU | 39.2 | GPT-3 175B (few-shot) |
| Machine Translation | WMT2016 English-German | BLEU | 29.7 | GPT-3 175B (few-shot) |
| Machine Translation | WMT2016 Romanian-English | BLEU | 39.5 | GPT-3 175B (few-shot) |
| Machine Translation | WMT2016 German-English | BLEU | 40.6 | GPT-3 175B (few-shot) |
| Machine Translation | WMT2016 English-Romanian | BLEU | 21.0 | GPT-3 175B (few-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 58.4 | GPT-3 175B (zero-shot) |
| Reading Comprehension | RACE | Accuracy (High) | 45.5 | GPT-3 175B (zero-shot) |
| Few-Shot Learning | MedConceptsQA | Accuracy | 41.476 | gpt-3.5-turbo |
| Zero-Shot Learning | MedConceptsQA | Accuracy | 37.058 | gpt-3.5-turbo |
| Question Answering | PeerQA | AlignScore | 0.1378 | GPT-3.5-Turbo-0613-16k |
| Question Answering | PeerQA | Prometheus-2 Answer Correctness | 3.0408 | GPT-3.5-Turbo-0613-16k |
| Question Answering | PeerQA | Rouge-L | 0.2414 | GPT-3.5-Turbo-0613-16k |
| Question Answering | COPA | Accuracy | 92 | GPT-3 175B (few-shot, k=32) |
| Question Answering | COPA | Accuracy | 91 | GPT-3 175B (zero-shot) |
| Question Answering | COPA | Accuracy | 87 | GPT-3 175B (one-shot) |
| Question Answering | COPA | Accuracy | 86 | GPT-3 13B (few-shot, k=32) |
| Question Answering | COPA | Accuracy | 73 | GPT-3 Large 760M (zero-shot) |
| Question Answering | CoQA | Overall F1 | 85 | GPT-3 175B (few-shot, k=32) |
| Question Answering | Natural Questions | EM | 29.9 | GPT-3 175B (few-shot, k=64) |
| Question Answering | StoryCloze | Accuracy | 87.7 | GPT-3 175B (few-shot) |
| Question Answering | OpenBookQA | Accuracy | 57.6 | GPT-3 175B (zero-shot) |
| Question Answering | MultiRC | F1 | 75.4 | GPT-3 175B (few-shot) |
| Question Answering | WebQuestions | EM | 41.5 | GPT-3 175B (few-shot) |
| Question Answering | WebQuestions | EM | 25.3 | GPT-3 175B (one-shot) |
| Question Answering | WebQuestions | EM | 14.4 | GPT-3 175B (zero-shot) |
| Question Answering | QuAC | F1 | 44.3 | GPT-3 175B (few-shot, k=32) |
| Question Answering | PIQA | Accuracy | 81 | GPT-3 175B (zero-shot) |
| Question Answering | PIQA | Accuracy | 72.9 | GPT-3 Large 760M (zero-shot) |
| Question Answering | RACE | Accuracy (Middle) | 58.1 | GPT-3 175B (few-shot, k=32) |
| Question Answering | RACE | Accuracy (High) | 46.8 | GPT-3 175B (few-shot, k=32) |
| Question Answering | StoryCloze | Accuracy | 72.4 | GPT-3 Large 760M (zero-shot) |
| Question Answering | BoolQ | Accuracy | 76.4 | GPT-3 175B (few-shot, k=32) |
| Question Answering | BoolQ | Accuracy | 60.5 | GPT-3 175B (zero-shot) |
| Question Answering | DROP Test | F1 | 36.5 | GPT-3 175B (few-shot, k=32) |
| Question Answering | TriviaQA | EM | 71.2 | GPT-3 175B (few-shot) |
| Question Answering | OpenBookQA | Accuracy | 65.4 | GPT-3 175B (few-shot, k=32) |
| Common Sense Reasoning | WinoGrande | Accuracy | 70.2 | GPT-3 175B (zero-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 57.4 | GPT-3 Large 760M (zero-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 53.2 | GPT-3 175B (one-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 51.4 | GPT-3 175B (zero-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 71.2 | GPT-3 175B (one-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 68.8 | GPT-3 175B (zero-shot) |
| Common Sense Reasoning | ReCoRD | EM | 82.1 | GPT-3 Large 760M (zero-shot) |
| Word Sense Disambiguation | Words in Context | Accuracy | 49.4 | GPT-3 175B (few-shot, k=32) |
| Natural Language Inference | ANLI test | Accuracy (Round 1) | 36.8 | GPT-3 175B (few-shot) |
| Natural Language Inference | ANLI test | Accuracy (Round 2) | 34.0 | GPT-3 175B (few-shot) |
| Natural Language Inference | ANLI test | Accuracy (Round 3) | 40.2 | GPT-3 175B (few-shot) |
| Natural Language Inference | CommitmentBank | Accuracy | 75.6 | GPT-3 175B (few-shot, k=32) |
| Natural Language Inference | CommitmentBank | F1 | 52.0 | GPT-3 175B (few-shot, k=32) |
| Language Modelling | Penn Treebank (Word Level) | Test perplexity | 20.5 | GPT-3 175B (zero-shot) |
| Language Modelling | LAMBADA | Accuracy | 86.4 | GPT-3 175B (few-shot) |
| Language Modelling | LAMBADA | Perplexity | 1.92 | GPT-3 175B (few-shot) |
| Language Modelling | LAMBADA | Accuracy | 76.2 | GPT-3 175B (zero-shot) |
| Language Modelling | LAMBADA | Perplexity | 3.00 | GPT-3 175B (zero-shot) |
| Language Modelling | LAMBADA | Accuracy | 72.5 | GPT-3 13B (zero-shot) |
| Language Modelling | LAMBADA | Perplexity | 3.56 | GPT-3 13B (zero-shot) |
| Language Modelling | LAMBADA | Accuracy | 70.3 | GPT-3 6.7B (zero-shot) |
| Language Modelling | LAMBADA | Perplexity | 4.00 | GPT-3 6.7B (zero-shot) |
| Language Modelling | LAMBADA | Accuracy | 67.1 | GPT-3 2.7B (zero-shot) |
| Language Modelling | LAMBADA | Perplexity | 4.60 | GPT-3 2.7B (zero-shot) |
| Coreference Resolution | Winograd Schema Challenge | Accuracy | 80.1 | GPT-3 175B (few-shot) |
| Meta-Learning | MedConceptsQA | Accuracy | 41.476 | gpt-3.5-turbo |
| Sentence Completion | HellaSwag | Accuracy | 79.3 | GPT-3 175B (few-shot, k=32) |
| Sentence Completion | HellaSwag | Accuracy | 78.9 | GPT-3 175B (zero-shot) |
| Sentence Completion | HellaSwag | Accuracy | 51 | GPT-3 Large 760M (zero-shot) |
| Answerability Prediction | PeerQA | Macro F1 | 0.3304 | GPT-3.5-Turbo-0613-16k |
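The LAMBADA rows above report accuracy and perplexity side by side. As a reminder of how the perplexity figures relate to per-token probabilities, here is a small worked sketch; the numbers are toy values, not data from the paper: perplexity is the exponential of the average negative log-likelihood per token, i.e. the inverse geometric mean of the probabilities the model assigns.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the average negative log-likelihood per token."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Toy example: natural-log probabilities for a 4-token continuation.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.4)]
print(round(perplexity(log_probs), 2))  # ~2.24 on these toy numbers

# Lower perplexity means the model assigns higher probability to the text;
# e.g. GPT-3 175B's few-shot LAMBADA perplexity of 1.92 corresponds to a
# geometric-mean per-token probability of roughly 1/1.92 on the targets.
```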