| 1 | GPT-4 + knowledge base | 95.9 | No | - | - | - |
| 2 | MVP-Tuning (ensemble) | 95.2 | No | - | - | - |
| 3 | PaLM 540B (Self Improvement, Self Consistency) | 94.4 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 4 | X-Reasoner | 94.2 | No | - | - | - |
| 5 | PaLM 540B (Self Improvement, CoT Prompting) | 93 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 6 | PaLM 540B (Self Improvement, Standard-Prompting) | 92 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 7 | DeBERTa-xxlarge 1.5B + MVP-Tuning | 91.3 | No | - | - | - |
| 8 | PaLM 540B (Self Consistency) | 90 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 9 | GrapeQA: PEGA+CANP | 90 | No | GrapeQA: GRaph Augmentation and Pruning to Enhan... | 2023-03-22 | - |
| 10 | GenMC 11B | 89.8 | No | Clues Before Answers: Generation-Enhanced Multip... | 2022-04-30 | Code |
| 11 | AristoRoBERTa + MVP-Tuning | 87.6 | No | - | - | - |
| 12 | AristoRoBERTa + Graph Soft Counter | 87.4 | No | GNN is a Counter? Revisiting GNN for Question An... | 2021-10-07 | - |
| 13 | UnifiedQA 11B | 87.2 | No | UnifiedQA: Crossing Format Boundaries With a Sin... | 2020-05-02 | Code |
| 14 | LLaMA-3 8B + MoSLoRA | 86.8 | No | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Code |
| 15 | PaLM 540B (CoT Prompting) | 86.4 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 16 | LLaMA-3 8B + MixLoRA | 84.8 | No | MixLoRA: Enhancing Large Language Models Fine-Tu... | 2024-04-22 | Code |
| 17 | PaLM 540B (Standard-Prompting) | 84.4 | No | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 18 | TTTTT 3B | 83.2 | No | Fusing Context Into Knowledge Graph for Commonse... | 2020-12-09 | Code |
| 19 | LLaMA-2 13B + MixLoRA | 83 | No | MixLoRA: Enhancing Large Language Models Fine-Tu... | 2024-04-22 | Code |
| 20 | AristoRoBERTa + QA-GNN | 82.8 | No | QA-GNN: Reasoning with Language Models and Knowl... | 2021-04-13 | Code |
| 21 | QA-GNN | 82.8 | No | QA-GNN: Reasoning with Language Models and Knowl... | 2021-04-13 | Code |
| 22 | DEKCOR | 82.4 | No | Fusing Context Into Knowledge Graph for Commonse... | 2020-12-09 | Code |
| 23 | GrapeQA: PEGA | 82 | No | GrapeQA: GRaph Augmentation and Pruning to Enhan... | 2023-03-22 | - |
| 24 | LLaMA-2 7B + MixLoRA | 81.6 | No | MixLoRA: Enhancing Large Language Models Fine-Tu... | 2024-04-22 | Code |
| 25 | AristoRoBERTa | 77.8 | No | QA-GNN: Reasoning with Language Models and Knowl... | 2021-04-13 | Code |
| 26 | BiLSTM max-out question-match (science fact + common knowledge fact) | 76.9 | No | Can a Suit of Armor Conduct Electricity? A New D... | 2018-09-08 | Code |
| 27 | Careful Selection | 72 | No | Careful Selection of Knowledge to solve Open Boo... | 2019-07-24 | - |
| 28 | GrapeQA: CANP | 66.2 | No | GrapeQA: GRaph Augmentation and Pruning to Enhan... | 2023-03-22 | - |
| 29 | GPT-3 175B (few-shot, k=32) | 65.4 | No | Language Models are Few-Shot Learners | 2020-05-28 | Code |
| 30 | PaLM 2-L (1-shot) | 58.5 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 31 | OPT 66B (1-shot) | 58 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 32 | PaLM 2-S (1-shot) | 57.4 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 33 | BiLSTM max-out question-match (WordNet + science fact) | 56.3 | No | Can a Suit of Armor Conduct Electricity? A New D... | 2018-09-08 | Code |
| 34 | PaLM 2-M (1-shot) | 56.2 | No | PaLM 2 Technical Report | 2023-05-17 | Code |
| 35 | BiLSTM max-out question-match (with a science fact) | 55.8 | No | Can a Suit of Armor Conduct Electricity? A New D... | 2018-09-08 | Code |
| 36 | BloombergGPT 50B (1-shot) | 51.6 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 37 | BLOOM 176B (2-shot) | 47.2 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 38 | GPT-NeoX 20B (2-shot) | 44.2 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 39 | LaMini-GPT 1.5B | 39.8 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 40 | LaMini-T5 738M | 36 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 41 | LaMini-F-T5 783M | 34 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 42 | T5-Large 738M | 32.8 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 43 | GPT-2-XL 1.5B | 32 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 44 | FLAN-T5-Large 783M | 31.2 | No | LaMini-LM: A Diverse Herd of Distilled Models fr... | 2023-04-27 | Code |
| 45 | Random chance baseline | 25 | No | HellaSwag: Can a Machine Really Finish Your Sent... | 2019-05-19 | Code |