Metric: Accuracy (higher is better)
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet (HPT) | 97.72 | No | Hierarchical Prompting Taxonomy: A Universal Eva... | 2024-06-18 | Code |
| 2 | DUP prompt upon GPT-4 | 97.1 | No | Achieving >97% on GSM8K: Deeply Understanding th... | 2024-04-23 | Code |
| 3 | Qwen2-Math-72B-Instruct (greedy) | 96.7 | Yes | Qwen2 Technical Report | 2024-07-15 | Code |
| 4 | SFT-Mistral-7B (MetaMath, OVM, Smart Ensemble) | 96.4 | Yes | - | - | - |
| 5 | OpenMath2-Llama3.1-70B (majority@256) | 96 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 6 | Jiutian Large Model | 95.2 | No | - | - | - |
| 7 | DAMOMath-7B (MetaMath, OVM, BS, Ensemble) | 95.1 | Yes | - | - | - |
| 8 | Claude 3 Opus (0-shot chain-of-thought) | 95 | No | - | - | - |
| 9 | OpenMath2-Llama3.1-70B | 94.9 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 10 | GPT-4 (Teaching-Inspired) | 94.8 | No | Teaching-Inspired Integrated Prompting Framework... | 2024-10-10 | Code |
| 11 | SFT-Mistral-7B (MetaMath, OVM, Ensemble) | 94.13 | Yes | - | - | - |
| 12 | OpenMath2-Llama3.1-8B (majority@256) | 94.1 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 13 | Qwen2-72B-Instruct-Step-DPO (0-shot CoT) | 94 | Yes | Step-DPO: Step-wise Preference Optimization for ... | 2024-06-26 | Code |
| 14 | DAMOMath-7B (MetaMath, OVM, Ensemble) | 93.2 | Yes | - | - | - |
| 15 | Claude 3 Sonnet (0-shot chain-of-thought) | 92.3 | No | - | - | - |
| 16 | AlphaLLM (with MCTS) | 92 | No | Toward Self-Improvement of LLMs via Imagination,... | 2024-04-18 | Code |
| 17 | OpenMath2-Llama3.1-8B | 91.7 | Yes | OpenMathInstruct-2: Accelerating AI for Math wit... | 2024-10-02 | Code |
| 18 | PaLM 2 (few-shot, k=8, SC) | 91 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 19 | GaC(Qwen2-72B-Instruct + Llama-3-70B-Instruct) | 90.91 | No | Breaking the Ceiling of the LLM Community by Tre... | 2024-06-18 | Code |
| 20 | OpenMath-CodeLlama-70B (w/ code, SC, k=50) | 90.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 21 | DART-Math-Llama3-70B-Uniform (0-shot CoT, w/o code) | 90.4 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 22 | OpenMath-Llama2-70B (w/ code, SC, k=50) | 90.1 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 23 | DART-Math-Llama3-70B-Prop2Diff (0-shot CoT, w/o code) | 89.6 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 24 | Shepherd+Mistral-7B (SFT on MetaMATH + PRM RL+ PRM rerank, k=256) | 89.1 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code |
| 25 | Llama SFT (MetaMath, ToRA, Ensemble) | 89 | Yes | - | - | - |
| 26 | Minerva 62B (maj5@100) | 89 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 27 | Claude 3 Haiku (0-shot chain-of-thought) | 88.9 | No | - | - | - |
| 28 | ToRA-70B (SC, k=50) | 88.3 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 29 | DeepSeekMATH-RL-7B | 88.2 | Yes | DeepSeekMath: Pushing the Limits of Mathematical... | 2024-02-05 | Code |
| 30 | DART-Math-DSMath-7B-Uniform (0-shot CoT, w/o code) | 88.2 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 31 | OpenMath-CodeLlama-34B (w/ code, SC, k=50) | 88 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 32 | Claude 2 (0-shot chain-of-thought) | 88 | No | - | - | - |
| 33 | Shivaay-4B (8-shot chain-of-thought) | 87.41 | No | - | - | - |
| 34 | DeepMind 70B Model (SFT+ORM-RL, ORM reranking) | 87.3 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 35 | MMOS-DeepSeekMath-7B (0-shot, k=50) | 87.2 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 36 | DeepMind 70B Model (SFT+PRM-RL, PRM reranking) | 87.1 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 37 | GPT-4 | 87.1 | No | Sparks of Artificial General Intelligence: Early... | 2023-03-22 | Code |
| 38 | OpenMath-Mistral-7B (w/ code, SC, k=50) | 86.9 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 39 | Orca-Math 7B (fine-tuned) | 86.8 | Yes | Orca-Math: Unlocking the potential of SLMs in Gr... | 2024-02-16 | - |
| 40 | DART-Math-DSMath-7B-Prop2Diff (0-shot CoT, w/o code) | 86.8 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 41 | OpenMath-CodeLlama-13B (w/ code, SC, k=50) | 86.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 42 | Gemini Pro (maj1@32) | 86.5 | No | Gemini: A Family of Highly Capable Multimodal Mo... | 2023-12-19 | Code |
| 43 | Codex (Self-Evaluation Guided Decoding, PAL, multiple reasoning chains, 9-shot gen, 5-shot eval) | 85.5 | No | - | - | - |
| 44 | Claude 1.3 (0-shot chain-of-thought) | 85.2 | No | - | - | - |
| 45 | ToRA-Code-34B (SC, k=50) | 85.1 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 46 | OpenMath-CodeLlama-7B (w/ code, SC, k=50) | 84.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 47 | OVM-Mistral-7B (verify100@1) | 84.7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 48 | OpenMath-Llama2-70B (w/ code) | 84.7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 49 | OpenMath-CodeLlama-70B (w/ code) | 84.6 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 50 | code-davinci-002 175B (LEVER, 8-shot) | 84.5 | No | LEVER: Learning to Verify Language-to-Code Gener... | 2023-02-16 | Code |
| 51 | ToRA 70B | 84.3 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 52 | Shepherd + Mistral-7B (SFT on MetaMATH + PRM RL) | 84.1 | Yes | Math-Shepherd: Verify and Reinforce LLMs Step-by... | 2023-12-14 | Code |
| 53 | MathCoder-L-70B | 83.9 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 54 | WizardMath-7B-V1.1 | 83.2 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 55 | DIVERSE 175B (8-shot) | 83.2 | No | Making Large Language Models Better Reasoners wi... | 2022-06-06 | - |
| 56 | OVM-Mistral-7B (verify20@1) | 82.6 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 57 | DART-Math-Mistral-7B-Uniform (0-shot CoT, w/o code) | 82.6 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 58 | ChatGPT (Ask, Refine, Trust) | 82.6 | No | The ART of LLM Refinement: Ask, Refine, and Trust | 2023-11-14 | - |
| 59 | DART-Math-Llama3-8B-Uniform (0-shot CoT, w/o code) | 82.5 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 60 | MetaMath 70B | 82.3 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 61 | MuggleMATH 70B | 82.3 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 62 | PaLM 540B (Self Improvement, Self Consistency) | 82.1 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 63 | MathCoder-CL-34B | 81.7 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 64 | WizardMath-70B-V1.0 | 81.6 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 65 | Phi-GSM+V 1.3B+1.3B (verify48@1) | 81.5 | No | TinyGSM: achieving >80% on GSM8k with small lang... | 2023-12-14 | - |
| 66 | DART-Math-Mistral-7B-Prop2Diff (0-shot CoT, w/o code) | 81.1 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 67 | DART-Math-Llama3-8B-Prop2Diff (0-shot CoT, w/o code) | 81.1 | Yes | DART-Math: Difficulty-Aware Rejection Tuning for... | 2024-06-18 | Code |
| 68 | Claude Instant 1.1 (0-shot chain-of-thought) | 80.9 | No | - | - | - |
| 69 | ToRA-Code 34B | 80.7 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 70 | OpenMath-CodeLlama-34B (w/ code) | 80.7 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 71 | PaLM 2 (few-shot, k=8, CoT) | 80.7 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 72 | MMOS-DeepSeekMath-7B (0-shot) | 80.5 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 73 | MMOS-CODE-34B (0-shot) | 80.4 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 74 | OpenMath-Mistral-7B (w/ code) | 80.2 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 75 | Self-Evaluation Guided Decoding (Codex, PAL, single reasoning chain, 9-shot gen, 5-shot eval) | 80.2 | No | - | - | - |
| 76 | OpenMath-CodeLlama-13B (w/ code) | 78.8 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 77 | Minerva 540B (CoT) | 78.5 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 78 | Camelidae-8×34B (5-shot) | 78.3 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code |
| 79 | Qwen2idae-16x14B (5-shot) | 77.8 | No | Parameter-Efficient Sparsity Crafting from Dense... | 2024-01-05 | Code |
| 80 | MetaMath-Mistral-7B | 77.7 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 81 | OpenChat-3.5 7B | 77.3 | No | OpenChat: Advancing Open-source Language Models ... | 2023-09-20 | Code |
| 82 | DeepMind 70B Model (STaR, maj1@96) | 76.5 | Yes | Solving math word problems with process- and out... | 2022-11-25 | - |
| 83 | Arithmo2-Mistral-7B | 76.4 | No | - | - | - |
| 84 | OpenMath-CodeLlama-7B (w/ code) | 75.9 | Yes | OpenMathInstruct-1: A 1.8 Million Math Instructi... | 2024-02-15 | Code |
| 85 | ToRA-Code 13B | 75.8 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 86 | Arithmo-Mistral-7B | 74.7 | No | - | - | - |
| 87 | PaLM 540B maj1@40 (8-shot) | 74.4 | Yes | Self-Consistency Improves Chain of Thought Reaso... | 2022-03-21 | Code |
| 88 | PaLM 540B (Self Consistency) | 74.4 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 89 | Phi-GSM 2.7B (fine-tuned) | 74.3 | No | TinyGSM: achieving >80% on GSM8k with small lang... | 2023-12-14 | - |
| 90 | MathCoder-CL-13B | 74.1 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 91 | MuggleMATH 13B | 74 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 92 | MMOS-CODE-7B (0-shot) | 73.9 | Yes | An Empirical Study of Data Ability Boundary in L... | 2024-02-23 | Code |
| 93 | CodeT5+ | 73.8 | No | CodeT5+: Open Code Large Language Models for Cod... | 2023-05-13 | Code |
| 94 | Llama-3.3-70B + CAPO | 73.73 | No | CAPO: Cost-Aware Prompt Optimization | 2025-04-22 | Code |
| 95 | OVM-Llama2-7B (verify100@1) | 73.7 | No | OVM, Outcome-supervised Value Models for Plannin... | 2023-11-16 | Code |
| 96 | PaLM 540B (Self Improvement, CoT Prompting) | 73.5 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 97 | KwaiYiiMath 13B | 73.3 | Yes | KwaiYiiMath: Technical Report | 2023-10-11 | - |
| 98 | ToRA-Code 7B | 72.6 | Yes | ToRA: A Tool-Integrated Reasoning Agent for Math... | 2023-09-29 | Code |
| 99 | MathCoder-L-13B | 72.6 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 100 | DBRX Base 132B | 72.3 | No | - | - | - |
| 101 | Self-Evaluation Guided Decoding (Codex, CoT, single reasoning chain, 9-shot gen, 5-shot eval) | 71.9 | No | - | - | - |
| 102 | MetaMath 13B | 71 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 103 | MuggleMATH 7B | 69.8 | Yes | MuggleMath: Assessing the Impact of Query and Re... | 2023-10-09 | Code |
| 104 | LLaMA 65B (maj1@k) | 69.7 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 105 | Minerva 62B (maj1@100) | 68.5 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 106 | code-davinci-002 (Least-to-Most Prompting) | 68.01 | No | Least-to-Most Prompting Enables Complex Reasonin... | 2022-05-21 | Code |
| 107 | MathCoder-CL-7B | 67.8 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 108 | DBRX Instruct 132B | 66.9 | No | - | - | - |
| 109 | MetaMath 7B | 66.4 | Yes | MetaMath: Bootstrap Your Own Mathematical Questi... | 2023-09-21 | Code |
| 110 | Mistral-Small-24B + CAPO | 65.07 | No | CAPO: Cost-Aware Prompt Optimization | 2025-04-22 | Code |
| 111 | RFT 70B | 64.8 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 112 | MathCoder-L-7B | 64.2 | Yes | MathCoder: Seamless Code Integration in LLMs for... | 2023-10-05 | Code |
| 113 | WizardMath-13B-V1.0 | 63.9 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 114 | GPT-J (CoRe) | 63.2 | No | Solving Math Word Problems via Cooperative Reaso... | 2022-10-28 | Code |
| 115 | Llama-2 70B (on 100 first questions, 4-shot, auto-optimized prompting) | 61 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 116 | Qwen2.5-32B + CAPO | 60.2 | No | CAPO: Cost-Aware Prompt Optimization | 2025-04-22 | Code |
| 117 | LLaMA 2 70B (CoT-Influx) | 59.59 | No | Fewer is More: Boosting LLM Reasoning with Reinf... | 2023-12-14 | - |
| 118 | Orca 2 13B | 59.14 | No | Orca 2: Teaching Small Language Models How to Re... | 2023-11-18 | - |
| 119 | U-PaLM | 58.5 | No | Transcending Scaling Laws with 0.1% Extra Compute | 2022-10-20 | - |
| 120 | PaLM 540B (few-shot CoT) | 58.1 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 121 | GPT-3.5 (few-shot, k=5) | 57.1 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 122 | Minerva 8B (maj5@100) | 56.8 | No | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 123 | LLaMA 2 70B (one-shot) | 56.8 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 124 | PaLM 540B (8-shot) | 56.5 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 125 | PaLM 540B (CoT Prompting) | 56.5 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 126 | RFT 13B | 55.3 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 127 | Finetuned GPT-3 175B + verifier | 55 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 128 | WizardMath-7B-V1.0 | 54.9 | Yes | WizardMath: Empowering Mathematical Reasoning fo... | 2023-08-18 | Code |
| 129 | LLaMA 33B (maj1@k) | 53.1 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 130 | Minerva 62B (8-shot) | 52.4 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 131 | Mistral 7B (maj@8) | 52.2 | No | Mistral 7B | 2023-10-10 | Code |
| 132 | Llemma 34B | 51.5 | No | Llemma: An Open Language Model For Mathematics | 2023-10-16 | Code |
| 133 | text-davinci-002 175B (zero-plus-few-shot CoT, 8 samples) | 51.5 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 134 | RFT 7B | 51.2 | Yes | Scaling Relationship on Learning Mathematical Re... | 2023-08-03 | Code |
| 135 | LLaMA 65B | 50.9 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 136 | Orca 2 7B | 47.23 | No | Orca 2: Teaching Small Language Models How to Re... | 2023-11-18 | - |
| 137 | Llama-2 13B (on 100 first questions, 4-shot, auto-optimized prompting) | 43 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 138 | text-davinci-002 175B (2-shot, CoT) | 41.3 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 139 | Mistral 7B (on 100 first questions, 4-shot, auto-optimized prompting) | 41 | No | The Unreasonable Effectiveness of Eccentric Auto... | 2024-02-09 | - |
| 140 | text-davinci-002 175B (0-shot, CoT) | 40.7 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 141 | Branch-Train-MiX 4x7B (sampling top-2 experts) | 37.1 | No | Branch-Train-MiX: Mixing Expert LLMs into a Mixt... | 2024-03-12 | Code |
| 142 | Llemma 7B | 36.4 | No | Llemma: An Open Language Model For Mathematics | 2023-10-16 | Code |
| 143 | LLaMA 33B | 35.6 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 144 | Vicuna (SYRELM) | 35.2 | Yes | Frugal LMs Trained to Invoke Symbolic Solvers Ac... | 2023-12-09 | Code |
| 145 | PaLM 62B (8-shot) | 33 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 146 | PaLM 540B (Self Improvement, Standard-Prompting) | 32.2 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 147 | LLaMA 13B (maj1@k) | 29.3 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 148 | Minerva 8B (maj1@k, 8-shot) | 28.4 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 149 | GPT-2-Medium 355M + question-solution classifier (BS=5) | 20.8 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 150 | GPT-Neo-2.7B + Self-Sampling | 19.5 | No | Learning Math Reasoning from Self-Sampled Correc... | 2022-05-28 | Code |
| 151 | GPT-2-Medium 355M (fine-tuned, BS=5) | 18.3 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 152 | LLaMA 7B (maj1@k) | 18.1 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 153 | PaLM 540B (few-shot) | 17.9 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 154 | PaLM 540B (Standard-Prompting) | 17.9 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 155 | LLaMA 13B | 17.8 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 156 | GPT-2-Medium 355M + question-solution classifier (BS=1) | 16.8 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 157 | Minerva 8B (8-shot) | 16.2 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 158 | GPT-2-Medium 355M (BS=5) | 12.2 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 159 | LLaMA 7B | 11 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 160 | text-davinci-002 175B (0-shot) | 10.4 | Yes | Large Language Models are Zero-Shot Reasoners | 2022-05-24 | Code |
| 161 | GPT-Neo 125M + Self-Sampling | 7.5 | No | Learning Math Reasoning from Self-Sampled Correc... | 2022-05-28 | Code |
| 162 | UL2 20B (chain-of-thought) | 4.4 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 163 | PaLM 8B (8-shot) | 4.1 | Yes | Solving Quantitative Reasoning Problems with Lan... | 2022-06-29 | Code |
| 164 | UL2 20B (0-shot) | 4.1 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
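Many entries above report accuracy under self-consistency / majority voting (annotations like SC, maj1@32, maj1@k, k=50): sample k solutions per problem and score only the most common final answer. A minimal sketch of that scoring in Python — the function names and toy data below are illustrative, not taken from any listed paper:

```python
from collections import Counter

def majority_vote(samples):
    """maj1@k / self-consistency: return the most frequent final answer
    among k sampled solutions for a single problem."""
    return Counter(samples).most_common(1)[0][0]

def accuracy(predictions, answers):
    """Exact-match accuracy as a percentage, the metric in this table."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# Toy data (hypothetical): 3 problems, k=5 sampled final answers each.
sampled = [["18", "18", "20", "18", "18"],
           ["7", "9", "9", "9", "7"],
           ["42", "41", "42", "42", "40"]]
gold = ["18", "9", "41"]

preds = [majority_vote(s) for s in sampled]
print(round(accuracy(preds, gold), 1))  # prints 66.7
```

Entries marked verify@1 or with reranking (OVM, PRM) replace the vote with a learned scorer that picks one candidate from the k samples.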