Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, Hongsheng Li

2023-08-15 · Mathematical Reasoning · Math · Math Word Problem Solving · Arithmetic Reasoning

Abstract

Recent progress in large language models (LLMs) like GPT-4 and PaLM-2 has brought significant advancements in addressing math reasoning problems. In particular, OpenAI's latest version of GPT-4, known as GPT-4 Code Interpreter, shows remarkable performance on challenging math datasets. In this paper, we explore the effect of code on enhancing LLMs' reasoning capability by introducing different constraints on the Code Usage Frequency of GPT-4 Code Interpreter. We found that its success can be largely attributed to its powerful skills in generating and executing code, evaluating the output of code execution, and rectifying its solution when receiving unreasonable outputs. Based on this insight, we propose a novel and effective prompting method, explicit code-based self-verification (CSV), to further boost the mathematical reasoning potential of GPT-4 Code Interpreter. This method employs a zero-shot prompt on GPT-4 Code Interpreter to encourage it to use code to self-verify its answers. In instances where the verification state registers as "False", the model automatically amends its solution, analogous to rectifying errors during a mathematics examination. Furthermore, we recognize that the states of the verification result indicate the confidence of a solution, which can improve the effectiveness of majority voting. With GPT-4 Code Interpreter and CSV, we achieve an impressive zero-shot accuracy on the MATH dataset (53.9% → 84.3%).
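The abstract notes that verification states can serve as confidence signals for majority voting. A minimal sketch of that idea is below; the weight values, function names, and fallback behavior are illustrative assumptions, not taken from the paper's code.

```python
from collections import defaultdict

# Hypothetical weights for self-verification states (assumed values,
# not from the paper): a "True" verification counts fully, an
# uncertain one partially, and a "False" one not at all.
STATE_WEIGHTS = {"True": 1.0, "Uncertain": 0.5, "False": 0.0}

def weighted_majority_vote(samples):
    """Pick an answer from (answer, verification_state) pairs,
    weighting each vote by its verification state."""
    scores = defaultdict(float)
    for answer, state in samples:
        scores[answer] += STATE_WEIGHTS.get(state, 0.0)
    # If every sample failed verification, fall back to a plain
    # unweighted majority vote so some answer is still returned.
    if not any(scores.values()):
        for answer, _ in samples:
            scores[answer] += 1.0
    return max(scores, key=scores.get)

samples = [("84", "True"), ("84", "True"), ("42", "False"), ("42", "Uncertain")]
print(weighted_majority_vote(samples))  # "84"
```

Here the two verified "84" votes (weight 2.0) outscore the unverified and uncertain "42" votes (weight 0.5), matching the intuition that verification state should modulate the vote, not replace it.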

Results

Task | Dataset | Metric | Value | Model
Question Answering | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Question Answering | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Question Answering | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Question Answering | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Question Answering | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)
Math Word Problem Solving | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Math Word Problem Solving | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Math Word Problem Solving | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Math Word Problem Solving | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Math Word Problem Solving | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)
Mathematical Question Answering | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Mathematical Question Answering | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Mathematical Question Answering | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Mathematical Question Answering | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Mathematical Question Answering | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)
Mathematical Reasoning | MATH | Accuracy | 84.3 | GPT-4-code model (CSV, w/ code, SC, k=16)
Mathematical Reasoning | MATH | Accuracy | 73.5 | GPT-4-code model (CSV, w/ code)
Mathematical Reasoning | MATH | Accuracy | 71.2 | LogicNet (with code interpreter)
Mathematical Reasoning | MATH | Accuracy | 69.7 | GPT-4-code model (w/ code)
Mathematical Reasoning | MATH | Accuracy | 60.8 | GPT-4-code model (w/o code)

Related Papers

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
A Survey of Deep Learning for Geometry Problem Solving (2025-07-16)
Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training (2025-07-16)
KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? (2025-07-15)
Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding (2025-07-15)
Personalized Exercise Recommendation with Semantically-Grounded Knowledge Tracing (2025-07-15)
DCR: Quantifying Data Contamination in LLMs Evaluation (2025-07-15)