Llemma: An Open Language Model For Mathematics

Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, Sean Welleck

2023-10-16

Tasks: Math, Automated Theorem Proving, Large Language Model, Arithmetic Reasoning, Language Modelling

Abstract

We present Llemma, a large language model for mathematics. We continue pretraining Code Llama on the Proof-Pile-2, a mixture of scientific papers, web data containing mathematics, and mathematical code, yielding Llemma. On the MATH benchmark Llemma outperforms all known open base models, as well as the unreleased Minerva model suite on an equi-parameter basis. Moreover, Llemma is capable of tool use and formal theorem proving without any further finetuning. We openly release all artifacts, including 7 billion and 34 billion parameter models, the Proof-Pile-2, and code to replicate our experiments.

Results

Task	Dataset	Metric	Value	Model
Automated Theorem Proving	miniF2F-test	Pass@32	26.2	LLEMMA-7b
Automated Theorem Proving	miniF2F-test	cumulative	26.2	LLEMMA-7b
Automated Theorem Proving	miniF2F-test	Pass@32	25.8	LLEMMA-34b
Automated Theorem Proving	miniF2F-test	cumulative	25.8	LLEMMA-34b
Mathematical Proofs	miniF2F-test	Pass@32	26.2	LLEMMA-7b
Mathematical Proofs	miniF2F-test	cumulative	26.2	LLEMMA-7b
Mathematical Proofs	miniF2F-test	Pass@32	25.8	LLEMMA-34b
Mathematical Proofs	miniF2F-test	cumulative	25.8	LLEMMA-34b
Arithmetic Reasoning	GSM8K	Accuracy	51.5	Llemma 34B
Arithmetic Reasoning	GSM8K	Parameters (Billion)	34	Llemma 34B
Arithmetic Reasoning	GSM8K	Accuracy	36.4	Llemma 7B
Arithmetic Reasoning	GSM8K	Parameters (Billion)	7	Llemma 7B
