MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning

Ke Wang, Houxing Ren, Aojun Zhou, Zimu Lu, Sichun Luo, Weikang Shi, Renrui Zhang, Linqi Song, Mingjie Zhan, Hongsheng Li

2023-10-05 | Mathematical Reasoning | Math | Math Word Problem Solving | GSM8K | Arithmetic Reasoning

Abstract

The recently released GPT-4 Code Interpreter has demonstrated remarkable proficiency in solving challenging math problems, primarily attributed to its ability to seamlessly reason with natural language, generate code, execute code, and continue reasoning based on the execution output. In this paper, we present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations and, consequently, enhancing their mathematical reasoning abilities. We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions, referred to as MathCodeInstruct. Each solution interleaves natural language, code, and execution results. We also introduce a customized supervised fine-tuning and inference approach. This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems. Impressively, the MathCoder models achieve state-of-the-art scores among open-source LLMs on the MATH (45.2%) and GSM8K (83.9%) datasets, substantially outperforming other open-source alternatives. Notably, the MathCoder model not only surpasses ChatGPT-3.5 and PaLM-2 on GSM8K and MATH but also outperforms GPT-4 on the competition-level MATH dataset. The dataset and models will be released at https://github.com/mathllm/MathCoder.
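The interleaving of natural language, code, and execution results described in the abstract can be pictured as a simple generate-and-execute loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `<code>` and `<output>` delimiters, the `solve` and `run_code` helpers, and the scripted `toy_generate` function (a stand-in for a fine-tuned model) are all hypothetical names chosen for illustration.

```python
import contextlib
import io
import re

# Hypothetical delimiter for code the model wants executed (an assumption,
# not the format used by MathCodeInstruct).
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_code(source, namespace):
    """Execute a code string and return what it printed, or the error text.

    Note: exec on untrusted model output is unsafe; a real system would sandbox it.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, namespace)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(problem, generate, max_rounds=4):
    """Alternate between model generation and code execution."""
    transcript = problem
    namespace = {}  # shared across rounds so later code can reuse variables
    for _ in range(max_rounds):
        step = generate(transcript)
        transcript += "\n" + step
        match = CODE_BLOCK.search(step)
        if match is None:
            break  # no code to run: treat this step as the final answer
        result = run_code(match.group(1), namespace)
        # Feed the execution result back so the next step can reason on it.
        transcript += f"\n<output>{result}</output>"
    return transcript


def toy_generate(transcript):
    """Scripted stand-in for a model: one code step, then a final answer."""
    if "<output>" not in transcript:
        return "Let me compute the sum.\n<code>print(sum(range(1, 11)))</code>"
    return "The sum of the integers from 1 to 10 is 55."


solution = solve("What is 1 + 2 + ... + 10?", toy_generate)
print(solution)
```

In this sketch the execution output is appended to the transcript before the next generation step, which is what lets the reasoning continue from the computed value rather than from a guess.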

Results

MATH (tasks: Question Answering, Math Word Problem Solving, Mathematical Question Answering, Mathematical Reasoning)

Model             Parameters (Billions)  Accuracy (%)
MathCoder-CL-34B  34                     45.2
MathCoder-L-34B   34                     45.1
MathCoder-CL-13B  13                     35.9
MathCoder-CL-7B   7                      30.2
MathCoder-L-13B   13                     29.9
MathCoder-L-7B    7                      23.3

SVAMP (same four tasks)

Model             Execution Accuracy
MathCoder-L-70B   84.9

GSM8K (task: Arithmetic Reasoning)

Model             Parameters (Billions)  Accuracy (%)
MathCoder-L-70B   70                     83.9
MathCoder-CL-34B  34                     81.7
MathCoder-CL-13B  13                     74.1
MathCoder-L-13B   13                     72.6
MathCoder-CL-7B   7                      67.8
MathCoder-L-7B    7                      64.2
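
To make the listing easier to compare, the short sketch below computes the accuracy gap between the MathCoder-CL and MathCoder-L models of the same size. The accuracy values are copied from the results listing; the grouping by name prefix and the `cl_minus_l` helper are assumptions introduced here for illustration only, and sizes that appear for only one family are skipped.

```python
# Accuracy values copied from the results listing (MATH and GSM8K).
accuracy = {
    "MATH": {
        "CL": {"34B": 45.2, "13B": 35.9, "7B": 30.2},
        "L": {"34B": 45.1, "13B": 29.9, "7B": 23.3},
    },
    "GSM8K": {
        "CL": {"34B": 81.7, "13B": 74.1, "7B": 67.8},
        "L": {"70B": 83.9, "13B": 72.6, "7B": 64.2},
    },
}


def cl_minus_l(scores):
    """Return the CL-minus-L accuracy gap for sizes present in both families."""
    shared = scores["CL"].keys() & scores["L"].keys()
    ordered = sorted(shared, key=lambda size: int(size[:-1]))  # "13B" -> 13
    return {size: round(scores["CL"][size] - scores["L"][size], 1) for size in ordered}


gaps = {dataset: cl_minus_l(scores) for dataset, scores in accuracy.items()}
for dataset, by_size in gaps.items():
    print(dataset, by_size)
```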
