Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

BloombergGPT: A Large Language Model for Finance

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, Gideon Mann

2023-03-30
Reading Comprehension, Causal Judgment, Question Answering, Multi-task Language Understanding, Sentence Completion, Sentiment Analysis, Natural Language Inference, Common Sense Reasoning, Named Entity Recognition, Navigate, Logical Reasoning, Movie Recommendation, Large Language Model, Sports Understanding, Language Modelling, Temporal Sequences, Multiple Choice Question Answering (MCQA), Sarcasm Detection

Abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. We release Training Chronicles (Appendix C) detailing our experience in training BloombergGPT.
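
The corpus composition described in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the two token counts stated there (363 billion financial, 345 billion general purpose); the function and constant names are illustrative, and the result describes the size of the assembled dataset, not how many tokens were consumed during training.

```python
# Token counts as stated in the abstract, in billions of tokens.
FINANCIAL_TOKENS_B = 363  # Bloomberg's financial data sources
GENERAL_TOKENS_B = 345    # general purpose datasets


def mix_shares(financial_b, general_b):
    """Return (total, financial share, general share) for a two-part corpus."""
    total = financial_b + general_b
    return total, financial_b / total, general_b / total


total_b, fin_share, gen_share = mix_shares(FINANCIAL_TOKENS_B, GENERAL_TOKENS_B)
print(f"total: {total_b}B tokens")
print(f"financial share: {fin_share:.1%}")
print(f"general share: {gen_share:.1%}")
```

The combined dataset comes to 708 billion tokens, split almost evenly between domain-specific and general text, which is what the abstract means by "mixed dataset training".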

Results

Task | Dataset | Metric | Value | Model
Reading Comprehension | RACE | Accuracy (High) | 41.74 | BloombergGPT (one-shot)
Reading Comprehension | RACE | Accuracy (Middle) | 54.32 | BloombergGPT (one-shot)
Reading Comprehension | RACE | Accuracy (High) | 39.14 | BLOOM 176B (one-shot)
Reading Comprehension | RACE | Accuracy (Middle) | 52.3 | BLOOM 176B (one-shot)
Reading Comprehension | RACE | Accuracy (High) | 37.02 | OPT 66B (one-shot)
Reading Comprehension | RACE | Accuracy (Middle) | 47.42 | OPT 66B (one-shot)
Reading Comprehension | RACE | Accuracy (High) | 34.33 | GPT-NeoX (one-shot)
Reading Comprehension | RACE | Accuracy (Middle) | 41.23 | GPT-NeoX (one-shot)
Transfer Learning | MMLU | Average (%) | 39.2 | BloombergGPT 50B (5-shot)
Transfer Learning | MMLU | Average (%) | 39.1 | BLOOM 176B (5-shot)
Transfer Learning | MMLU | Average (%) | 36 | OPT 66B (5-shot)
Question Answering | COPA | Accuracy | 88 | GPT-NeoX (one-shot)
Question Answering | COPA | Accuracy | 86 | BloombergGPT (one-shot)
Question Answering | COPA | Accuracy | 86 | OPT 66B (one-shot)
Question Answering | COPA | Accuracy | 84 | BLOOM 176B (one-shot)
Question Answering | MultiRC | F1 | 62.3 | BloombergGPT 50B (1-shot)
Question Answering | MultiRC | F1 | 26.7 | BLOOM 176B (1-shot)
Question Answering | MultiRC | F1 | 22.9 | GPT-NeoX 20B (1-shot)
Question Answering | MultiRC | F1 | 18.8 | OPT 66B (1-shot)
Question Answering | PIQA | Accuracy | 77.9 | BloombergGPT 50B (1-shot)
Question Answering | PIQA | Accuracy | 77.6 | OPT 66B (1-shot)
Question Answering | PIQA | Accuracy | 77 | BLOOM 176B (1-shot)
Question Answering | PIQA | Accuracy | 75.8 | GPT-NeoX 20B (1-shot)
Question Answering | BoolQ | Accuracy | 74.6 | BloombergGPT 50B (1-shot)
Question Answering | BoolQ | Accuracy | 57.5 | OPT 66B (1-shot)
Question Answering | BoolQ | Accuracy | 52.9 | BLOOM 176B (1-shot)
Question Answering | BoolQ | Accuracy | 46.4 | GPT-NeoX 20B (1-shot)
Question Answering | OpenBookQA | Accuracy | 58 | OPT 66B (one-shot)
Question Answering | OpenBookQA | Accuracy | 51.6 | BloombergGPT 50B (1-shot)
Question Answering | OpenBookQA | Accuracy | 47.2 | BLOOM 176B (2-shot)
Question Answering | OpenBookQA | Accuracy | 44.2 | GPT-NeoX 20B (2-shot)
Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 91.2 | BLOOM 176B (few-shot, k=3)
Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 91.2 | OPT 66B (few-shot, k=3)
Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 90.4 | BloombergGPT (few-shot, k=3)
Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 87.2 | PaLM 540B (few-shot, k=3)
Question Answering | BIG-bench (Movie Recommendation) | Accuracy | 86.4 | GPT-NeoX (few-shot, k=3)
Question Answering | BIG-bench (Navigate) | Accuracy | 62.4 | PaLM 540B (few-shot, k=3)
Question Answering | BIG-bench (Navigate) | Accuracy | 50 | BLOOM 176B (few-shot, k=3)
Question Answering | BIG-bench (Navigate) | Accuracy | 45.2 | GPT-NeoX (few-shot, k=3)
Question Answering | BIG-bench (Navigate) | Accuracy | 42 | BloombergGPT (few-shot, k=3)
Question Answering | BIG-bench (Navigate) | Accuracy | 42 | OPT 66B (few-shot, k=3)
Question Answering | BIG-bench (Ruin Names) | Accuracy | 76 | PaLM 540B (few-shot, k=3)
Question Answering | BIG-bench (Ruin Names) | Accuracy | 56 | BloombergGPT (few-shot, k=3)
Question Answering | BIG-bench (Ruin Names) | Accuracy | 54.8 | BLOOM 176B (few-shot, k=3)
Question Answering | BIG-bench (Ruin Names) | Accuracy | 54 | GPT-NeoX (few-shot, k=3)
Question Answering | BIG-bench (Ruin Names) | Accuracy | 52.8 | OPT 66B (few-shot, k=3)
Question Answering | BIG-bench (Hyperbaton) | Accuracy | 92 | BloombergGPT (few-shot, k=3)
Question Answering | BIG-bench (Hyperbaton) | Accuracy | 92 | GPT-NeoX (few-shot, k=3)
Question Answering | BIG-bench (Hyperbaton) | Accuracy | 92 | BLOOM 176B (few-shot, k=3)
Question Answering | BIG-bench (Hyperbaton) | Accuracy | 91.6 | OPT 66B (few-shot, k=3)
Question Answering | BIG-bench (Hyperbaton) | Accuracy | 70.8 | PaLM 540B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 61 | PaLM 540B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 52.41 | GPT-NeoX 20B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 51.87 | BLOOM 176B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 51.87 | OPT 66B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Causal Judgment) | Accuracy | 49.73 | BloombergGPT 50B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 60.8 | PaLM 540B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 40.8 | GPT-NeoX 20B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 40.4 | OPT 66B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 40.4 | BLOOM 176B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Disambiguation QA) | Accuracy | 34 | BloombergGPT 50B (few-shot, k=3)
Common Sense Reasoning | WinoGrande | Accuracy | 67 | BLOOM 176B (1-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 66.1 | OPT 66B (1-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 64.1 | BloombergGPT (one-shot)
Common Sense Reasoning | WinoGrande | Accuracy | 60.6 | GPT-NeoX (one-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 50.85 | BLOOM 176B (1-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 48.63 | BloombergGPT 50B (1-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 45.39 | GPT-NeoX 20B (1-shot)
Common Sense Reasoning | ARC (Challenge) | Accuracy | 44.54 | OPT 66B (one-shot)
Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 80.4 | PaLM 540B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 62.8 | BloombergGPT (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 54.4 | OPT 66B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Sports Understanding) | Accuracy | 53.2 | GPT-NeoX (few-shot, k=3)
Common Sense Reasoning | ARC (Easy) | Accuracy | 75.93 | BLOOM 176B (1-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 73.99 | BloombergGPT 50B (1-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 71.25 | OPT 66B (1-shot)
Common Sense Reasoning | ARC (Easy) | Accuracy | 70.79 | GPT-NeoX 20B (1-shot)
Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 54.8 | BloombergGPT 50B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 53.6 | PaLM 540B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 50 | BLOOM 176B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 49.6 | OPT 66B (few-shot, k=3)
Common Sense Reasoning | BIG-bench (Date Understanding) | Accuracy | 45.6 | GPT-NeoX 20B (few-shot, k=3)
Common Sense Reasoning | CommonsenseQA | Accuracy | 66.4 | OPT 66B (1-shot)
Common Sense Reasoning | CommonsenseQA | Accuracy | 65.5 | BloombergGPT 50B (1-shot)
Common Sense Reasoning | CommonsenseQA | Accuracy | 64.2 | BLOOM 176B (1-shot)
Common Sense Reasoning | CommonsenseQA | Accuracy | 60.4 | GPT-NeoX 20B (1-shot)
Common Sense Reasoning | ReCoRD | F1 | 82.8 | BloombergGPT 50B (1-shot)
Common Sense Reasoning | ReCoRD | F1 | 82.5 | OPT 66B (1-shot)
Common Sense Reasoning | ReCoRD | F1 | 78 | BLOOM 176B (1-shot)
Common Sense Reasoning | ReCoRD | F1 | 67.9 | GPT-NeoX 20B (1-shot)
Natural Language Inference | ANLI test | A1 | 33.6 | BLOOM 176B (one-shot)
Natural Language Inference | ANLI test | A2 | 33.8 | BLOOM 176B (one-shot)
Natural Language Inference | ANLI test | A3 | 35.17 | BLOOM 176B (one-shot)
Natural Language Inference | ANLI test | A1 | 33.1 | OPT 66B (one-shot)
Natural Language Inference | ANLI test | A2 | 34.2 | OPT 66B (one-shot)
Natural Language Inference | ANLI test | A3 | 34.92 | OPT 66B (one-shot)
Natural Language Inference | ANLI test | A1 | 32.9 | BloombergGPT (one-shot)
Natural Language Inference | ANLI test | A2 | 34.4 | BloombergGPT (one-shot)
Natural Language Inference | ANLI test | A3 | 37.33 | BloombergGPT (one-shot)
Natural Language Inference | ANLI test | A1 | 32.6 | GPT-NeoX (one-shot)
Natural Language Inference | ANLI test | A2 | 33.8 | GPT-NeoX (one-shot)
Natural Language Inference | ANLI test | A3 | 36.17 | GPT-NeoX (one-shot)
Natural Language Inference | CommitmentBank | Accuracy | 53.57 | BloombergGPT (one-shot)
Natural Language Inference | CommitmentBank | Accuracy | 48.21 | GPT-NeoX (one-shot)
Natural Language Inference | CommitmentBank | Accuracy | 48.21 | BLOOM 176B (one-shot)
Natural Language Inference | CommitmentBank | Accuracy | 44.64 | OPT 66B (one-shot)
Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 78.1 | PaLM 540B (few-shot, k=3)
Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 72.47 | BLOOM 176B (few-shot, k=3)
Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 69.66 | BloombergGPT (few-shot, k=3)
Sarcasm Detection | BIG-bench (SNARKS) | Accuracy | 62.36 | GPT-NeoX (few-shot, k=3)
Multi-Task Learning | MMLU | Average (%) | 39.2 | BloombergGPT 50B (5-shot)
Multi-Task Learning | MMLU | Average (%) | 39.1 | BLOOM 176B (5-shot)
Multi-Task Learning | MMLU | Average (%) | 36 | OPT 66B (5-shot)
Sentence Completion | HellaSwag | Accuracy | 73.9 | BloombergGPT 50B (1-shot)
Sentence Completion | HellaSwag | Accuracy | 73.5 | OPT 66B (1-shot)
Sentence Completion | HellaSwag | Accuracy | 73.2 | BLOOM 176B (1-shot)
Sentence Completion | HellaSwag | Accuracy | 68.4 | GPT-NeoX 20B (1-shot)
Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 44.5 | PaLM 540B (few-shot, k=3)
Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 40.41 | BLOOM 176B (few-shot, k=3)
Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 37.67 | BloombergGPT (few-shot, k=3)
Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 33.56 | GPT-NeoX (few-shot, k=3)
Logical Reasoning | BIG-bench (Penguins In A Table) | Accuracy | 28.08 | OPT 66B (few-shot, k=3)
Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 39.6 | PaLM 540B (few-shot, k=3)
Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 36.8 | BLOOM 176B (few-shot, k=3)
Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 29.2 | BloombergGPT (few-shot, k=3)
Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 23.6 | OPT 66B (few-shot, k=3)
Logical Reasoning | BIG-bench (Temporal Sequences) | Accuracy | 21.2 | GPT-NeoX (few-shot, k=3)
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 54 | OPT 66B (few-shot, k=3)
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 53.6 | PaLM 540B (few-shot, k=3)
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 52.8 | BLOOM 176B (few-shot, k=3)
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 52.8 | GPT-NeoX 20B (few-shot, k=3)
Logical Reasoning | BIG-bench (Formal Fallacies Syllogisms Negation) | Accuracy | 50.8 | BloombergGPT 50B (few-shot, k=3)
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 38 | PaLM 540B (few-shot, k=3)
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 36.8 | BLOOM 176B (few-shot, k=3)
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 34.8 | BloombergGPT (few-shot, k=3)
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 31.2 | OPT 66B (few-shot, k=3)
Logical Reasoning | BIG-bench (Reasoning About Colored Objects) | Accuracy | 26 | GPT-NeoX (few-shot, k=3)
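
One way to read a leaderboard like the one above is to group rows by dataset and metric and pick the top value in each group. The following is a minimal sketch of that idea, not part of the paper or the site: the `leaders` helper is hypothetical, the sample rows are copied from the results above with model labels shortened, and it assumes every metric is higher-is-better (true for the accuracy and F1 columns shown here).

```python
from collections import defaultdict

# (dataset, metric, value, model) tuples copied from the results above.
# Model labels are shortened for readability; shot settings are omitted.
ROWS = [
    ("BoolQ", "Accuracy", 74.6, "BloombergGPT"),
    ("BoolQ", "Accuracy", 57.5, "OPT 66B"),
    ("BoolQ", "Accuracy", 52.9, "BLOOM 176B"),
    ("BoolQ", "Accuracy", 46.4, "GPT-NeoX"),
    ("MultiRC", "F1", 62.3, "BloombergGPT"),
    ("MultiRC", "F1", 26.7, "BLOOM 176B"),
    ("MultiRC", "F1", 22.9, "GPT-NeoX"),
    ("MultiRC", "F1", 18.8, "OPT 66B"),
    ("BIG-bench (Hyperbaton)", "Accuracy", 92.0, "BloombergGPT"),
    ("BIG-bench (Hyperbaton)", "Accuracy", 92.0, "GPT-NeoX"),
    ("BIG-bench (Hyperbaton)", "Accuracy", 92.0, "BLOOM 176B"),
    ("BIG-bench (Hyperbaton)", "Accuracy", 91.6, "OPT 66B"),
    ("BIG-bench (Hyperbaton)", "Accuracy", 70.8, "PaLM 540B"),
]


def leaders(rows):
    """Map (dataset, metric) to (best value, sorted list of models tied at it).

    Assumes higher values are better for every metric.
    """
    grouped = defaultdict(list)
    for dataset, metric, value, model in rows:
        grouped[(dataset, metric)].append((value, model))

    result = {}
    for key, entries in grouped.items():
        best = max(value for value, _ in entries)
        tied = sorted(model for value, model in entries if value == best)
        result[key] = (best, tied)
    return result


for (dataset, metric), (best, models) in leaders(ROWS).items():
    print(f"{dataset} [{metric}]: {best} by {', '.join(models)}")
```

Ties are reported as a list rather than broken arbitrarily, since several rows above (Hyperbaton, for instance) share the top score. Rows with different shot settings are not strictly comparable, so a fuller version would also group by that label.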
