Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann
The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in the literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.
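
The dataset mix stated in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the two token counts given above (363 billion financial tokens, 345 billion general-purpose tokens); the variable names are illustrative and do not come from the paper.

```python
# Token counts as stated in the abstract, in billions of tokens.
financial_tokens_b = 363
general_tokens_b = 345

# Size of the combined training corpus.
total_tokens_b = financial_tokens_b + general_tokens_b

# Fraction of the corpus contributed by each source.
financial_share = financial_tokens_b / total_tokens_b
general_share = general_tokens_b / total_tokens_b

print(f"total: {total_tokens_b}B tokens")                    # total: 708B tokens
print(f"financial share: {financial_share:.1%}")             # financial share: 51.3%
print(f"general-purpose share: {general_share:.1%}")         # general-purpose share: 48.7%
```

Under these figures the corpus is split roughly evenly between the two sources, which is what the abstract means by mixed dataset training.
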
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Reading Comprehension | RACE | Accuracy (High) | 41.74 | Bloomberg GPT (one-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 54.32 | Bloomberg GPT (one-shot) |
| Reading Comprehension | RACE | Accuracy (High) | 39.14 | BLOOM 176B (one-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 52.3 | BLOOM 176B (one-shot) |
| Reading Comprehension | RACE | Accuracy (High) | 37.02 | OPT 66B (one-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 47.42 | OPT 66B (one-shot) |
| Reading Comprehension | RACE | Accuracy (High) | 34.33 | GPT-NeoX (one-shot) |
| Reading Comprehension | RACE | Accuracy (Middle) | 41.23 | GPT-NeoX (one-shot) |
| Transfer Learning | MMLU | Average (%) | 39.2 | Bloomberg GPT 50B (5-shot) |
| Transfer Learning | MMLU | Average (%) | 39.1 | BLOOM 176B (5-shot) |
| Transfer Learning | MMLU | Average (%) | 36 | OPT 66B (5-shot) |
| Question Answering | COPA | Accuracy | 88 | GPT-NeoX (one-shot) |
| Question Answering | COPA | Accuracy | 86 | Bloomberg GPT (one-shot) |
| Question Answering | COPA | Accuracy | 86 | OPT 66B (one-shot) |
| Question Answering | COPA | Accuracy | 84 | BLOOM 176B (one-shot) |
| Question Answering | MultiRC | F1 | 62.3 | Bloomberg GPT 50B (1-shot) |
| Question Answering | MultiRC | F1 | 26.7 | BLOOM 176B (1-shot) |
| Question Answering | MultiRC | F1 | 22.9 | GPT-NeoX 20B (1-shot) |
| Question Answering | MultiRC | F1 | 18.8 | OPT 66B (1-shot) |
| Question Answering | PIQA | Accuracy | 77.9 | Bloomberg GPT 50B (1-shot) |
| Question Answering | PIQA | Accuracy | 77.6 | OPT 66B (1-shot) |
| Question Answering | PIQA | Accuracy | 77 | BLOOM 176B (1-shot) |
| Question Answering | PIQA | Accuracy | 75.8 | GPT-NeoX 20B (1-shot) |
| Question Answering | BoolQ | Accuracy | 74.6 | Bloomberg GPT 50B (1-shot) |
| Question Answering | BoolQ | Accuracy | 57.5 | OPT 66B (1-shot) |
| Question Answering | BoolQ | Accuracy | 52.9 | BLOOM 176B (1-shot) |
| Question Answering | BoolQ | Accuracy | 46.4 | GPT-NeoX 20B (1-shot) |
| Question Answering | OpenBookQA | Accuracy | 58 | OPT 66B (one-shot) |
| Question Answering | OpenBookQA | Accuracy | 51.6 | Bloomberg GPT 50B (1-shot) |
| Question Answering | OpenBookQA | Accuracy | 47.2 | BLOOM 176B (2-shot) |
| Question Answering | OpenBookQA | Accuracy | 44.2 | GPT-NeoX 20B (2-shot) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 91.2 | BLOOM 176B (few-shot, k=3) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 91.2 | OPT 66B (few-shot, k=3) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 90.4 | Bloomberg GPT (few-shot, k=3) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 87.2 | PaLM 540B (few-shot, k=3) |
| Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 86.4 | GPT-NeoX (few-shot, k=3) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 62.4 | PaLM 540B (few-shot, k=3) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 50 | BLOOM 176B (few-shot, k=3) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 45.2 | GPT-NeoX (few-shot, k=3) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 42 | Bloomberg GPT (few-shot, k=3) |
| Question Answering | BIG-bench (Navigate) | Accuracy | 42 | OPT 66B (few-shot, k=3) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 76 | PaLM 540B (few-shot, k=3) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 56 | Bloomberg GPT (few-shot, k=3) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 54.8 | BLOOM 176B (few-shot, k=3) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 54 | GPT-NeoX (few-shot, k=3) |
| Question Answering | BIG-bench (Ruin Names) | Accuracy | 52.8 | OPT 66B (few-shot, k=3) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 92 | Bloomberg GPT (few-shot, k=3) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 92 | GPT-NeoX (few-shot, k=3) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 92 | BLOOM 176B (few-shot, k=3) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 91.6 | OPT 66B (few-shot, k=3) |
| Question Answering | BIG-bench (Hyperbaton) | Accuracy | 70.8 | PaLM 540B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 61 | PaLM 540B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 52.41 | GPT-NeoX 20B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 51.87 | BLOOM 176B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 51.87 | OPT 66B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 49.73 | BloombergGPT 50B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 60.8 | PaLM 540B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 40.8 | GPT-NeoX 20B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 40.4 | OPT 66B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 40.4 | BLOOM 176B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 34 | Bloomberg GPT 50B (few-shot, k=3) |
| Common Sense Reasoning | WinoGrande | Accuracy | 67 | BLOOM 176B (1-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 66.1 | OPT 66B (1-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 64.1 | Bloomberg GPT (one-shot) |
| Common Sense Reasoning | WinoGrande | Accuracy | 60.6 | GPT-NeoX (one-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 50.85 | BLOOM 176B (1-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 48.63 | Bloomberg GPT 50B (1-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 45.39 | GPT-NeoX 20B (1-shot) |
| Common Sense Reasoning | ARC (Challenge) | Accuracy | 44.54 | OPT 66B (one-shot) |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 80.4 | PaLM 540B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 62.8 | Bloomberg GPT (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 54.4 | OPT 66B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 53.2 | GPT-NeoX (few-shot, k=3) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 75.93 | BLOOM 176B (1-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 73.99 | Bloomberg GPT 50B (1-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 71.25 | OPT 66B (1-shot) |
| Common Sense Reasoning | ARC (Easy) | Accuracy | 70.79 | GPT-NeoX 20B (1-shot) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 54.8 | Bloomberg GPT 50B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 53.6 | PaLM 540B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 50 | BLOOM 176B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 49.6 | OPT 66B (few-shot, k=3) |
| Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 45.6 | GPT-NeoX 20B (few-shot, k=3) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 66.4 | OPT 66B (1-shot) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 65.5 | Bloomberg GPT 50B (1-shot) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 64.2 | BLOOM 176B (1-shot) |
| Common Sense Reasoning | CommonsenseQA | Accuracy | 60.4 | GPT-NeoX 20B (1-shot) |
| Common Sense Reasoning | ReCoRD | F1 | 82.8 | Bloomberg GPT 50B (1-shot) |
| Common Sense Reasoning | ReCoRD | F1 | 82.5 | OPT 66B (1-shot) |
| Common Sense Reasoning | ReCoRD | F1 | 78 | BLOOM 176B (1-shot) |
| Common Sense Reasoning | ReCoRD | F1 | 67.9 | GPT-NeoX 20B (1-shot) |
| Natural Language Inference | ANLI test | Accuracy (A1) | 33.6 | BLOOM 176B (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A2) | 33.8 | BLOOM 176B (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A3) | 35.17 | BLOOM 176B (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A1) | 33.1 | OPT 66B (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A2) | 34.2 | OPT 66B (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A3) | 34.92 | OPT 66B (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A1) | 32.9 | Bloomberg GPT (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A2) | 34.4 | Bloomberg GPT (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A3) | 37.33 | Bloomberg GPT (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A1) | 32.6 | GPT-NeoX (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A2) | 33.8 | GPT-NeoX (one-shot) |
| Natural Language Inference | ANLI test | Accuracy (A3) | 36.17 | GPT-NeoX (one-shot) |
| Natural Language Inference | CommitmentBank | Accuracy | 53.57 | Bloomberg GPT (one-shot) |
| Natural Language Inference | CommitmentBank | Accuracy | 48.21 | GPT-NeoX (one-shot) |
| Natural Language Inference | CommitmentBank | Accuracy | 48.21 | BLOOM 176B (one-shot) |
| Natural Language Inference | CommitmentBank | Accuracy | 44.64 | OPT 66B (one-shot) |
| Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 78.1 | PaLM 540B (few-shot, k=3) |
| Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 72.47 | BLOOM 176B (few-shot, k=3) |
| Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 69.66 | Bloomberg GPT (few-shot, k=3) |
| Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 62.36 | GPT-NeoX (few-shot, k=3) |
| Multi-Task Learning | MMLU | Average (%) | 39.2 | Bloomberg GPT 50B (5-shot) |
| Multi-Task Learning | MMLU | Average (%) | 39.1 | BLOOM 176B (5-shot) |
| Multi-Task Learning | MMLU | Average (%) | 36 | OPT 66B (5-shot) |
| Sentence Completion | HellaSwag | Accuracy | 73.9 | Bloomberg GPT 50B (1-shot) |
| Sentence Completion | HellaSwag | Accuracy | 73.5 | OPT 66B (1-shot) |
| Sentence Completion | HellaSwag | Accuracy | 73.2 | BLOOM 176B (1-shot) |
| Sentence Completion | HellaSwag | Accuracy | 68.4 | GPT-NeoX 20B (1-shot) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 44.5 | PaLM 540B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 40.41 | BLOOM 176B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 37.67 | Bloomberg GPT (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 33.56 | GPT-NeoX (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 28.08 | OPT 66B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 39.6 | PaLM 540B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 36.8 | BLOOM 176B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 29.2 | Bloomberg GPT (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 23.6 | OPT 66B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 21.2 | GPT-NeoX (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 54 | OPT 66B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 53.6 | PaLM 540B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 52.8 | BLOOM 176B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 52.8 | GPT-NeoX 20B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 50.8 | Bloomberg GPT 50B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 38 | PaLM 540B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 36.8 | BLOOM 176B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 34.8 | Bloomberg GPT (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 31.2 | OPT 66B (few-shot, k=3) |
| Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 26 | GPT-NeoX (few-shot, k=3) |
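
Because the table lists one row per model and dataset, comparing models on a given benchmark means grouping rows and sorting by value. The following is a minimal sketch of that, using four BoolQ rows copied from the table; the helper names `parse_rows` and `rank` are hypothetical, and the code assumes that a higher value is better, which holds for the accuracy and F1 metrics reported here.

```python
# A few rows copied verbatim from the results table (BoolQ, 1-shot).
TABLE_ROWS = """\
| Question Answering | BoolQ | Accuracy | 74.6 | Bloomberg GPT 50B (1-shot) |
| Question Answering | BoolQ | Accuracy | 57.5 | OPT 66B (1-shot) |
| Question Answering | BoolQ | Accuracy | 52.9 | BLOOM 176B (1-shot) |
| Question Answering | BoolQ | Accuracy | 46.4 | GPT-NeoX 20B (1-shot) |
"""


def parse_rows(text):
    """Turn Markdown table rows into dicts with a numeric 'value' field."""
    records = []
    for line in text.strip().splitlines():
        # Drop the outer pipes, then split the remaining cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        task, dataset, metric, value, model = cells
        records.append(
            {
                "task": task,
                "dataset": dataset,
                "metric": metric,
                "value": float(value),
                "model": model,
            }
        )
    return records


def rank(records, dataset, metric):
    """Return the records for one dataset and metric, best score first."""
    selected = [
        r for r in records if r["dataset"] == dataset and r["metric"] == metric
    ]
    return sorted(selected, key=lambda r: r["value"], reverse=True)


records = parse_rows(TABLE_ROWS)
ranking = rank(records, "BoolQ", "Accuracy")
best, runner_up = ranking[0], ranking[1]
margin = round(best["value"] - runner_up["value"], 1)
print(f"{best['model']}: {best['value']} (+{margin} over {runner_up['model']})")
```

The same two helpers work on any subset of rows from the table, provided the rows keep the five-column layout shown in the header.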