
Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu

Published: 2023-09-21
Tags: Mathematical Reasoning, Math, Math Word Problem Solving, Natural Language Understanding, GSM8K, Arithmetic Reasoning, Language Modelling

Abstract

Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory at solving mathematical problems because of the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. We then fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks for mathematical reasoning (GSM8K and MATH) demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. In particular, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the full MetaMathQA dataset, the MetaMath models at different sizes, and the training code for public use.
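The margins quoted in the abstract can be turned back into the baseline scores they imply. The following is a minimal sketch, assuming the 11.5% and 8.7% figures are absolute percentage-point differences (the abstract does not state this explicitly). The function name `implied_baseline` is hypothetical and used only for illustration.

```python
def implied_baseline(score: float, margin: float) -> float:
    """Return the baseline accuracy implied by a score and an absolute margin.

    Assumption: the margin is a percentage-point difference, not a relative gain.
    """
    return round(score - margin, 1)


# (accuracy, margin over same-size models) for MetaMath-7B, as quoted in the abstract.
metamath_7b = {"GSM8K": (66.4, 11.5), "MATH": (19.4, 8.7)}

baselines = {
    benchmark: implied_baseline(score, margin)
    for benchmark, (score, margin) in metamath_7b.items()
}
print(baselines)  # {'GSM8K': 54.9, 'MATH': 10.7}
```

Under that assumption, the best same-size models before MetaMath would have scored about 54.9% on GSM8K and 10.7% on MATH.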

Results

| Task | Dataset | Model | Parameters (billions) | Accuracy (%) |
| --- | --- | --- | --- | --- |
| Question Answering | MATH | MetaMath 70B | 70 | 26 |
| Question Answering | MATH | MetaMath 13B | 13 | 22.5 |
| Question Answering | MATH | MetaMath 7B | 7 | 19.4 |
| Math Word Problem Solving | MATH | MetaMath 70B | 70 | 26 |
| Math Word Problem Solving | MATH | MetaMath 13B | 13 | 22.5 |
| Math Word Problem Solving | MATH | MetaMath 7B | 7 | 19.4 |
| Mathematical Question Answering | MATH | MetaMath 70B | 70 | 26 |
| Mathematical Question Answering | MATH | MetaMath 13B | 13 | 22.5 |
| Mathematical Question Answering | MATH | MetaMath 7B | 7 | 19.4 |
| Mathematical Reasoning | MATH | MetaMath 70B | 70 | 26 |
| Mathematical Reasoning | MATH | MetaMath 13B | 13 | 22.5 |
| Mathematical Reasoning | MATH | MetaMath 7B | 7 | 19.4 |
| Arithmetic Reasoning | GSM8K | MetaMath 70B | 70 | 82.3 |
| Arithmetic Reasoning | GSM8K | MetaMath-Mistral-7B | 7 | 77.7 |
| Arithmetic Reasoning | GSM8K | MetaMath 13B | 13 | 71 |
| Arithmetic Reasoning | GSM8K | MetaMath 7B | 7 | 66.4 |

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks (2025-07-17)
QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation (2025-07-17)
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)