Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


EconLogicQA: A Question-Answering Benchmark for Evaluating Large Language Models in Economic Sequential Reasoning

Yinzhu Quan, Zefang Liu

Published: 2024-05-13
Tasks: Question Answering · Sentence Ordering · Management · Multiple Choice Question Answering (MCQA)
Links: Paper · PDF · Code (official)

Abstract

In this paper, we introduce EconLogicQA, a rigorous benchmark designed to assess the sequential reasoning capabilities of large language models (LLMs) within the intricate realms of economics, business, and supply chain management. Diverging from traditional benchmarks that predict subsequent events individually, EconLogicQA poses a more challenging task: it requires models to discern and sequence multiple interconnected events, capturing the complexity of economic logic. EconLogicQA comprises an array of multi-event scenarios derived from economic articles, which necessitate an insightful understanding of both temporal and logical event relationships. Through comprehensive evaluations, we show that EconLogicQA effectively gauges an LLM's proficiency in navigating the sequential complexities inherent in economic contexts. We provide a detailed description of the EconLogicQA dataset and report the outcomes of evaluating the benchmark across various leading-edge LLMs, thereby offering a thorough perspective on their sequential reasoning potential in economic contexts. Our benchmark dataset is available at https://huggingface.co/datasets/yinzhu-quan/econ_logic_qa.
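The benchmark scores models by whether they recover the correct ordering of interconnected events. A minimal sketch of that evaluation is exact-match accuracy over predicted orderings; the field layout below (permutations of event labels) is an illustrative assumption, not the actual EconLogicQA schema.

```python
def ordering_accuracy(predictions, references):
    """Fraction of examples whose predicted event order exactly matches the gold order."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return correct / len(references)

# Toy usage: each answer is a permutation of event labels A-D.
preds = [["A", "C", "B", "D"], ["B", "A", "D", "C"], ["A", "B", "C", "D"]]
golds = [["A", "C", "B", "D"], ["A", "B", "D", "C"], ["A", "B", "C", "D"]]
print(ordering_accuracy(preds, golds))  # 2 of 3 exact matches -> 0.666...
```

Exact match is a strict criterion: a single transposed pair of events scores zero for that example, which is consistent with the low accuracies that small base models achieve on the leaderboard below.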

Results

Task: Sentence Ordering · Dataset: EconLogicQA · Metric: Accuracy

Model                         Accuracy
GPT-4-Turbo                   0.5692
GPT-4                         0.5538
GPT-3.5-Turbo                 0.3769
Llama-3-8B-Instruct           0.3462
Mistral-7B-Instruct-v0.2      0.3154
Mistral-7B-v0.1               0.2615
Mistral-7B-v0.2               0.2615
Llama-3-8B                    0.2385
Zephyr-7B-Alpha               0.2308
Yi-6B-Chat                    0.2077
Zephyr-7B-Beta                0.1769
Mistral-7B-Instruct-v0.1      0.1538
Llama-2-13B-Chat              0.1462
Llama-2-7B-Chat               0.0923
Gemma-2B-IT                   0.0846
Yi-6B                         0.0385
Gemma-7B-IT                   0.0231
Llama-2-7B                    0.0077

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
Autonomous Resource Management in Microservice Systems via Reinforcement Learning (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)