Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li

2023-08-15 · Mathematical Reasoning · Math · Math Word Problem Solving · Arithmetic Reasoning

Abstract

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model automatically amends its solution, analogous to rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
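The abstract notes that verification states can serve as confidence signals for majority voting. A minimal sketch of that idea is below; the weight values, function names, and fallback behavior are illustrative assumptions, not taken from the paper's code.

```python
from collections import defaultdict

# Hypothetical weights for self-verification states (assumed values,
# not from the paper): a "True" verification counts fully, an
# uncertain one partially, and a "False" one not at all.
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.0}

def weighted_majority_vote(samples):
    """Pick an answer from (answer, verification_state) pairs,
    weighting each vote by its verification state."""
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += STATE_WEIGHTS.get(state, 0.0)
    # If every sample failed verification, fall back to a plain
    # unweighted majority vote so some answer is still returned.
    if not any(scores.values()):
        for answer, _ in samples:
            scores[answer] += 1.0
    return max(scores, key=scores.get)

samples = [("84", "True"), ("84", "True"), ("42", "False"), ("42", "Uncertain")]
print(weighted_majority_vote(samples))  # "84"
```

Here the two verified "84" votes (weight 2.0) outscore the unverified and uncertain "42" votes (weight 0.5), matching the intuition that verification state should modulate the vote, not replace it.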

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Question Answering | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Question Answering | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Question Answering | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Question Answering | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)
Math Word Problem Solving | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Math Word Problem Solving | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Math Word Problem Solving | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Math Word Problem Solving | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Math Word Problem Solving | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)
Mathematical Question Answering | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Mathematical Question Answering | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Mathematical Question Answering | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Mathematical Question Answering | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Mathematical Question Answering | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)
Mathematical Reasoning | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Mathematical Reasoning | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Mathematical Reasoning | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Mathematical Reasoning | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Mathematical Reasoning | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing (2025-07-15)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)