Metric: Accuracy (higher is better)
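Accuracy here is simply the percentage of benchmark questions a model answers correctly. A minimal sketch of the computation (illustrative code, not taken from any of the papers listed below):

```python
def accuracy(predictions, answers):
    """Fraction of questions answered correctly, as a percentage."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# e.g. 3 of 4 multiple-choice answers correct -> 75.0
print(accuracy(["B", "C", "A", "D"], ["B", "C", "A", "A"]))
```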
| # | Model | Accuracy (%) | Extra Training Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | GPT-4 (few-shot, k=25) | 96.4 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 2 | PaLM 2 (few-shot, CoT, self-consistency) | 95.1 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 3 | Shivaay (4B, few-shot, k=8) | 91.04 | No | - | - | - |
| 4 | Claude 2 (few-shot, k=5) | 91.0 | No | - | - | - |
| 5 | Claude 1.3 (few-shot, k=5) | 90.0 | No | - | - | - |
| 6 | PaLM 540B (Self-Improvement, Self-Consistency) | 89.8 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 7 | PaLM 540B (Self-Consistency) | 88.7 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 8 | PaLM 540B (Self-Improvement, CoT Prompting) | 88.3 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 9 | PaLM 540B (Self-Improvement, Standard Prompting) | 87.2 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 10 | PaLM 540B (Standard Prompting) | 87.1 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 11 | ST-MoE-32B 269B (fine-tuned) | 86.5 | No | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 2022-02-17 | Code |
| 12 | Claude Instant 1.1 (few-shot, k=5) | 85.7 | No | - | - | - |
| 13 | GPT-3.5 (few-shot, k=25) | 85.2 | No | GPT-4 Technical Report | 2023-03-15 | Code |
| 14 | PaLM 540B (CoT Prompting) | 85.2 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 15 | LLaMA-3 8B + MoSLoRA (fine-tuned) | 81.5 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 16 | LLaMA-3 8B + MixLoRA | 79.9 | No | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Code |
| 17 | LLaMA-2 13B + MixLoRA | 69.9 | No | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Code |
| 18 | PaLM 2-L (1-shot) | 69.2 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 19 | GAL 120B (zero-shot) | 67.9 | Yes | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 20 | Camelidae-8×34B | 65.2 | No | Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | 2024-01-05 | Code |
| 21 | PaLM 2-M (1-shot) | 64.9 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 22 | FLAN 137B (few-shot, k=13) | 63.8 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 23 | FLAN 137B (zero-shot) | 63.1 | No | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Code |
| 24 | PaLM 2-S (1-shot) | 59.6 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 25 | LLaMA-2 7B + MixLoRA | 58.1 | No | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Code |
| 26 | LLaMA 33B (zero-shot) | 57.8 | No | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Code |
| 27 | ST-MoE-L 4.1B (fine-tuned) | 56.9 | No | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 2022-02-17 | Code |
| 28 | LLaMA 65B (zero-shot) | 56.0 | Yes | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Code |
| 29 | Mistral 7B (zero-shot) | 55.5 | No | Mistral 7B | 2023-10-10 | Code |
| 30 | GPT-3 175B (1-shot) | 53.2 | Yes | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 31 | LLaMA 13B (zero-shot) | 52.7 | No | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Code |
| 32 | GPT-3 (zero-shot) | 51.4 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 33 | GPT-3 175B (zero-shot) | 51.4 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 34 | BLOOM 176B (1-shot) | 50.85 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 35 | GLaM 64B/64E (zero-shot) | 50.3 | Yes | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 2021-12-13 | - |
| 36 | UL2 20B (chain-of-thought + self-consistency) | 49.5 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 37 | BloombergGPT 50B (1-shot) | 48.63 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 38 | GLaM 64B/64E (1-shot) | 48.2 | Yes | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 2021-12-13 | - |
| 39 | LLaMA 7B (zero-shot) | 47.6 | No | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Code |
| 40 | GPT-NeoX 20B (1-shot) | 45.39 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 41 | phi-1.5-web 1.3B (zero-shot) | 44.9 | No | Textbooks Are All You Need II: phi-1.5 technical report | 2023-09-11 | Code |
| 42 | OPT 66B (1-shot) | 44.54 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 43 | OPT-175B | 43.94 | No | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 44 | UL2 20B (chain-of-thought) | 42.9 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 45 | SparseGPT (175B, 50% Sparsity) | 41.3 | No | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 46 | SparseGPT (175B, 4:8 Sparsity) | 39.85 | No | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 47 | SparseGPT (175B, 2:4 Sparsity) | 38.99 | No | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
| 48 | Pythia 12B (5-shot) | 36.8 | No | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Code |
| 49 | BLOOM (few-shot, k=5) | 32.9 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 50 | Pythia 12B (zero-shot) | 31.8 | No | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Code |
| 51 | OPT (few-shot, k=5) | 31.1 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 52 | UL2 20B (zero-shot) | 29.8 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 53 | OPT-175B (50% Sparsity) | 25.6 | No | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Code |
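Several of the strongest entries above (PaLM 2 and the PaLM 540B rows) pair chain-of-thought (CoT) prompting with self-consistency: sample many reasoning paths at nonzero temperature, then take a majority vote over the final answers. A minimal sketch, assuming hypothetical `sample_completion` and `extract_answer` helpers (neither is from the papers above):

```python
from collections import Counter

def self_consistency(prompt, sample_completion, extract_answer, n_samples=40):
    """Chain-of-thought self-consistency: sample n reasoning paths,
    then return the most frequent final answer (majority vote)."""
    votes = Counter()
    for _ in range(n_samples):
        reasoning = sample_completion(prompt)   # one stochastic CoT sample
        answer = extract_answer(reasoning)      # e.g. text after "The answer is"
        if answer is not None:
            votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None
```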
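The SparseGPT rows report structured N:M sparsity: "2:4" means at most 2 nonzero weights in every contiguous group of 4 (and "4:8" at most 4 in every 8). A toy magnitude-pruning sketch of that constraint; note SparseGPT itself selects and reconstructs weights with an approximate second-order solver, not plain magnitude pruning:

```python
import numpy as np

def prune_2_to_4(weights):
    """Enforce 2:4 structured sparsity: in each contiguous group of 4
    weights, zero out the 2 with the smallest magnitude.
    (Illustrative only; not SparseGPT's actual pruning algorithm.)"""
    w = weights.reshape(-1, 4).copy()
    idx = np.argsort(np.abs(w), axis=1)[:, :2]   # 2 smallest per group of 4
    np.put_along_axis(w, idx, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(2, 8)
print(prune_2_to_4(w))   # exactly 2 nonzeros remain per group of 4
```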