Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


TheoremQA: A Theorem-driven Question Answering dataset

Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia

2023-05-21 · Question Answering · Math
Paper · PDF · Code (official)

Abstract

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math benchmarks such as GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts (CoT) and Program-of-Thoughts (PoT). We find that GPT-4's ability to solve these problems is unparalleled, reaching 51% accuracy with Program-of-Thoughts prompting, while all existing open-source models score below 15%, barely surpassing the random-guess baseline. Given its diversity and broad coverage, we believe TheoremQA can serve as a better benchmark for evaluating LLMs' ability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
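Program-of-Thoughts prompting asks the model to emit executable code rather than free-form reasoning; the final answer is obtained by running that code. A minimal sketch of what a PoT-style completion for a Taylor's-theorem question might look like (the question, function name, and numbers are illustrative assumptions, not drawn from the dataset):

```python
import math

# Hypothetical TheoremQA-style question:
# "Using Taylor's theorem, approximate sin(0.5) with the degree-5 Maclaurin
#  polynomial. What error bound does the Lagrange remainder give?"

def maclaurin_sin(x: float, degree: int) -> float:
    """Sum the Maclaurin series of sin(x) up to the given polynomial degree."""
    total = 0.0
    k = 0
    while 2 * k + 1 <= degree:
        total += (-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
        k += 1
    return total

x, degree = 0.5, 5
approx = maclaurin_sin(x, degree)
# Lagrange remainder: |R_n(x)| <= |x|^(n+1) / (n+1)!  since |sin^(n+1)| <= 1
error_bound = x ** (degree + 1) / math.factorial(degree + 1)
print(approx, error_bound)
```

Running the program yields the numeric answer; the true error here is indeed below the Lagrange bound, which is the kind of check a theorem-driven question exercises.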

Results

Task              | Dataset   | Metric   | Value | Model
General Knowledge | TheoremQA | Accuracy | 52.4  | GPT-4 (PoT)
General Knowledge | TheoremQA | Accuracy | 43.8  | GPT-4 (CoT)
General Knowledge | TheoremQA | Accuracy | 35.6  | GPT-3.5-turbo (PoT)
General Knowledge | TheoremQA | Accuracy | 31.8  | PaLM-2-unicorn (CoT)
General Knowledge | TheoremQA | Accuracy | 30.2  | GPT-3.5-turbo (CoT)
General Knowledge | TheoremQA | Accuracy | 25.9  | Claude-v1 (PoT)
General Knowledge | TheoremQA | Accuracy | 24.9  | Claude-v1 (CoT)
General Knowledge | TheoremQA | Accuracy | 23.9  | code-davinci-002
General Knowledge | TheoremQA | Accuracy | 23.6  | Claude-instant (CoT)
General Knowledge | TheoremQA | Accuracy | 22.8  | text-davinci-003
General Knowledge | TheoremQA | Accuracy | 21.0  | PaLM-2-bison (CoT)

Related Papers

- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
- QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)