MACM: Utilizing a Multi-Agent System for Condition Mining in Solving Complex Mathematical Problems

Bin Lei, Yi Zhang, Shan Zuo, Ali Payani, Caiwen Ding

2024-04-06 Math, Math Word Problem Solving, Logical Reasoning

Abstract

Recent large language models, such as GPT-4, have demonstrated remarkable capabilities in processing standard queries. Their performance nevertheless declines substantially on advanced mathematical problems that require complex, multi-step logical reasoning. To enhance their inferential capabilities, current research has explored prompt engineering, exemplified by methods such as Tree of Thought and Graph of Thought. These approaches have two significant limitations. First, their effectiveness on complex mathematical problems is limited. Second, the need to design distinct prompts for individual problems hampers their generalizability. In response, this paper introduces the Multi-Agent System for Condition Mining (MACM) prompting method. It not only solves intricate mathematical problems but also generalizes well across various mathematical contexts. With MACM, the accuracy of GPT-4 Turbo on the most challenging level-5 problems in the MATH dataset increases from 54.68% to 76.73%. The code is available at https://github.com/bin123apple/MACM.
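To put the reported level-5 improvement in perspective, the short sketch below computes the absolute gain, the relative gain, and the reduction in error rate from the two accuracy figures quoted in the abstract. The function name `accuracy_gain` and the choice of derived metrics are illustrative assumptions, not something taken from the paper.

```python
def accuracy_gain(baseline: float, improved: float) -> dict:
    """Summarize an accuracy improvement; both inputs are percentages (0-100)."""
    absolute = improved - baseline                 # gain in percentage points
    relative = absolute / baseline * 100           # gain relative to the baseline accuracy
    error_cut = absolute / (100 - baseline) * 100  # share of baseline errors removed
    return {
        "absolute_points": round(absolute, 2),
        "relative_percent": round(relative, 1),
        "error_reduction_percent": round(error_cut, 1),
    }


# Figures quoted in the abstract: GPT-4 Turbo on MATH level-5 problems.
gain = accuracy_gain(54.68, 76.73)
print(gain)
```

Under these definitions the gain is 22.05 percentage points, roughly a 40% relative improvement, and it removes a little under half of the baseline's errors.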

Results

Task	Dataset	Metric	Value	Model
Question Answering	MATH	Accuracy	87.92	GPT-4 Turbo (MACM, w/code, voting)
Math Word Problem Solving	MATH	Accuracy	87.92	GPT-4 Turbo (MACM, w/code, voting)
Mathematical Question Answering	MATH	Accuracy	87.92	GPT-4 Turbo (MACM, w/code, voting)
Mathematical Reasoning	MATH	Accuracy	87.92	GPT-4 Turbo (MACM, w/code, voting)
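The "voting" label in the model name is not explained on this page. Assuming it refers to majority voting over several independently generated final answers, a minimal sketch might look like the following. The function `majority_vote` and the sample answers are hypothetical and are not drawn from the MACM codebase.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("at least one candidate answer is required")
    # Normalize whitespace so that "42" and " 42 " count as the same answer.
    counts = Counter(answer.strip() for answer in answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers from three independent runs on one problem.
print(majority_vote(["42", "41", " 42 "]))
```

Voting of this kind trades extra model calls for robustness: a single run that goes wrong is outvoted as long as most runs agree on the correct answer.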
