Common Sense Reasoning on ARC (Challenge)
Metric: Accuracy (higher is better)
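As a rough illustration of the metric, the sketch below assumes accuracy is the percentage of questions whose predicted answer label exactly matches the gold label. The function name `accuracy_percent` and the sample labels are hypothetical, and individual submissions may handle ties or unanswered questions differently.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Return the percentage of predictions that exactly match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted label equals the gold label.
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Hypothetical answer labels for four multiple-choice questions.
preds = ["A", "C", "B", "D"]
gold = ["A", "C", "D", "D"]
print(accuracy_percent(preds, gold))  # 3 of 4 correct -> 75.0
```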
Results
| # | Model | Accuracy | Extra Data | SOTA | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | GPT-4 (few-shot, k=25) | 96.4 | No | SOTA | GPT-4 Technical Report | 2023-03-15 | Yes |
| 2 | PaLM 2 (few-shot, CoT, SC) | 95.1 | No | | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 3 | Shivaay (4B, few-shot, k=8) | 91.04 | No | | - | - | - |
| 4 | StupidLLM | 91.03 | No | | - | - | - |
| 5 | Claude 2 (few-shot, k=5) | 91 | No | | - | - | - |
| 6 | Claude 1.3 (few-shot, k=5) | 90 | No | | - | - | - |
| 7 | PaLM 540B (Self Improvement, Self Consistency) | 89.8 | No | SOTA | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 8 | PaLM 540B (Self Consistency) | 88.7 | No | | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 9 | PaLM 540B (Self Improvement, CoT Prompting) | 88.3 | No | | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 10 | PaLM 540B (Self Improvement, Standard-Prompting) | 87.2 | No | | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 11 | PaLM 540B (Standard-Prompting) | 87.1 | No | | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 12 | ST-MoE-32B 269B (fine-tuned) | 86.5 | No | SOTA | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 2022-02-17 | Yes |
| 13 | Claude Instant 1.1 (few-shot, k=5) | 85.7 | No | | - | - | - |
| 14 | GPT-3.5 (few-shot, k=25) | 85.2 | No | | GPT-4 Technical Report | 2023-03-15 | Yes |
| 15 | PaLM 540B (CoT Prompting) | 85.2 | No | | Large Language Models Can Self-Improve | 2022-10-20 | - |
| 16 | LLaMA 3 8B + MoSLoRA (fine-tuned) | 81.5 | No | | Mixture-of-Subspaces in Low-Rank Adaptation | 2024-06-16 | Yes |
| 17 | LLaMA-3 8B + MixLoRA | 79.9 | No | | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Yes |
| 18 | LLaMA-2 13B + MixLoRA | 69.9 | No | | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Yes |
| 19 | PaLM 2-L (1-shot) | 69.2 | No | | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 20 | GAL 120B (zero-shot) | 67.9 | Yes | | Galactica: A Large Language Model for Science | 2022-11-16 | Yes |
| 21 | Camelidae-8×34B | 65.2 | No | | Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts for Instruction Tuning on General Tasks | 2024-01-05 | Yes |
| 22 | PaLM 2-M (1-shot) | 64.9 | No | | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 23 | FLAN 137B (few-shot, k=13) | 63.8 | No | SOTA | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Yes |
| 24 | FLAN 137B (zero-shot) | 63.1 | No | | Finetuned Language Models Are Zero-Shot Learners | 2021-09-03 | Yes |
| 25 | PaLM 2-S (1-shot) | 59.6 | No | | PaLM 2 Technical Report | 2023-05-17 | Yes |
| 26 | LLaMA-2 7B + MixLoRA | 58.1 | No | | MixLoRA: Enhancing Large Language Models Fine-Tuning with LoRA-based Mixture of Experts | 2024-04-22 | Yes |
| 27 | LLaMA 33B (zero-shot) | 57.8 | No | | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Yes |
| 28 | ST-MoE-L 4.1B (fine-tuned) | 56.9 | No | | ST-MoE: Designing Stable and Transferable Sparse Expert Models | 2022-02-17 | Yes |
| 29 | LLaMA 65B (zero-shot) | 56 | Yes | | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Yes |
| 30 | Mistral 7B (0-shot) | 55.5 | No | | Mistral 7B | 2023-10-10 | Yes |
| 31 | GPT-3 175B (1 shot) | 53.2 | Yes | SOTA | Language Models are Few-Shot Learners | 2020-05-28 | Yes |
| 32 | LLaMA 13B (zero-shot) | 52.7 | No | | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Yes |
| 33 | GPT-3 (zero-shot) | 51.4 | No | | Galactica: A Large Language Model for Science | 2022-11-16 | Yes |
| 34 | GPT-3 175B (0-shot) | 51.4 | No | | Language Models are Few-Shot Learners | 2020-05-28 | Yes |
| 35 | BLOOM 176B (1-shot) | 50.85 | No | | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Yes |
| 36 | GLaM 64B/64E (0 shot) | 50.3 | Yes | | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 2021-12-13 | - |
| 37 | UL2 20B (chain-of-thought + self-consistency) | 49.5 | No | | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Yes |
| 38 | Bloomberg GPT 50B (1-shot) | 48.63 | No | | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Yes |
| 39 | GLaM 64B/64E (1 shot) | 48.2 | Yes | | GLaM: Efficient Scaling of Language Models with Mixture-of-Experts | 2021-12-13 | - |
| 40 | LLaMA 7B (zero-shot) | 47.6 | No | | LLaMA: Open and Efficient Foundation Language Models | 2023-02-27 | Yes |
| 41 | GPT-NeoX 20B (1-shot) | 45.39 | No | | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Yes |
| 42 | phi-1.5-web 1.3B (zero-shot) | 44.9 | No | | Textbooks Are All You Need II: phi-1.5 technical report | 2023-09-11 | Yes |
| 43 | OPT 66B (one-shot) | 44.54 | No | | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Yes |
| 44 | OPT-175B | 43.94 | No | | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Yes |
| 45 | UL2 20B (chain-of-thought) | 42.9 | No | | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Yes |
| 46 | SparseGPT (175B, 50% Sparsity) | 41.3 | No | | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Yes |
| 47 | SparseGPT (175B, 4:8 Sparsity) | 39.85 | No | | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Yes |
| 48 | SparseGPT (175B, 2:4 Sparsity) | 38.99 | No | | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Yes |
| 49 | Pythia 12B (5-shot) | 36.8 | No | | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Yes |
| 50 | BLOOM (few-shot, k=5) | 32.9 | No | | Galactica: A Large Language Model for Science | 2022-11-16 | Yes |
| 51 | Pythia 12B (0-shot) | 31.8 | No | | Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling | 2023-04-03 | Yes |
| 52 | OPT (few-shot, k=5) | 31.1 | No | | Galactica: A Large Language Model for Science | 2022-11-16 | Yes |
| 53 | UL2 20B (zero-shot) | 29.8 | No | | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Yes |
| 54 | OPT-175B (50% Sparsity) | 25.6 | No | | SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot | 2023-01-02 | Yes |