Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Multi-Task Learning on MML

Metric: Average (%) (higher is better)
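
The leaderboard metric is an average of per-task scores, expressed as a percentage. A minimal sketch of how such a macro-average is computed, assuming equal weighting across tasks (the function name and task names below are illustrative, not from the source):

```python
def macro_average(task_scores: dict[str, float]) -> float:
    """Equal-weight average of per-task accuracy scores (%)."""
    return sum(task_scores.values()) / len(task_scores)

# Hypothetical per-task accuracies for one model:
scores = {"abstract_algebra": 30.0, "anatomy": 48.0, "astronomy": 62.5}
print(round(macro_average(scores), 1))  # prints 46.8
```

Because every task counts equally regardless of how many questions it contains, a model's ranking can differ from what a question-weighted (micro) average would give.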


Results

| # | Model | Average (%) | Augmentations | Paper | Date | Code |
|---|-------|-------------|---------------|-------|------|------|
| 1 | GPT-4 o1 (300B) | 87 | Yes | GPT-4o as the Gold Standard: A Scalable and Gene... | 2024-10-03 | - |
| 2 | Llama 3.1 (405B) | 86.6 | Yes | Llama 3 Meets MoE: Efficient Upcycling | 2024-12-13 | Code |
| 3 | Llama 3.1 (70B) | 86 | Yes | Llama 3 Meets MoE: Efficient Upcycling | 2024-12-13 | Code |
| 4 | Gemini Ultra (5-shot) | 83.7 | No | - | - | - |
| 5 | Claude 3 Sonnet (5-shot) | 79 | No | - | - | - |
| 6 | Qwen1.5 72B (5-shot) | 77.5 | No | - | - | - |
| 7 | Claude 3 Haiku (5-shot) | 75.2 | No | - | - | - |
| 8 | DBRX Instruct 132B (5-shot) | 73.7 | No | The Llama 3 Herd of Models | 2024-07-31 | Code |
| 9 | Llama 2 (65B) | 73.5 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 10 | Llama 3.1 8B (CoT) | 73 | Yes | The Llama 3 Herd of Models | 2024-07-31 | Code |
| 11 | Mixtral 8x7B (5-shot) | 70.6 | No | Mixtral of Experts | 2024-01-08 | Code |
| 12 | GPT-3.5 Turbo | 70 | Yes | GPT-4 Technical Report | 2023-03-15 | Code |
| 13 | LLaMA 65B (fine-tuned) | 68.9 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 14 | ChatGPT/GPT-3.5 (20B) | 67.5 | No | Training Compute-Optimal Large Language Models | 2022-03-29 | Code |
| 15 | LLaMA 65B (5-shot) | 63.4 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 16 | LLaMA 2 34B (5-shot) | 62.6 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 17 | Mistral 7B (5-shot) | 62.5 | Yes | Mixtral of Experts | 2024-01-08 | Code |
| 18 | Mistral 7B (5-shot) | 60.1 | No | Mistral 7B | 2023-10-10 | Code |
| 19 | GPT-3 Davinci 175B (CoT) | 59.5 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 20 | LLaMA 33B (5-shot) | 57.8 | No | LLaMA: Open and Efficient Foundation Language Mo... | 2023-02-27 | Code |
| 21 | Falcon 40B | 57 | No | The Falcon Series of Open Language Models | 2023-11-28 | - |
| 22 | Qwen 7B (5-shot) | 56.7 | No | - | - | - |
| 23 | LLaMA 2 13B (5-shot) | 54.8 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 24 | Branch-Train-MiX 4x7B (sampling top-1 experts) | 53.2 | No | Branch-Train-MiX: Mixing Expert LLMs into a Mixt... | 2024-03-12 | Code |
| 25 | GAL 120B (zero-shot) | 52.6 | No | Galactica: A Large Language Model for Science | 2022-11-16 | Code |
| 26 | Atlas (5-shot) | 47.9 | No | Atlas: Few-shot Learning with Retrieval Augmente... | 2022-08-05 | Code |
| 27 | Flan-T5-XL 3B (CoT) | 45.5 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 28 | LLaMA 2 7B (5-shot) | 45.3 | No | Llama 2: Open Foundation and Fine-Tuned Chat Mod... | 2023-07-18 | Code |
| 29 | Flan-T5-Large 780M | 45.1 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 30 | GLM-130B | 44.8 | No | GLM-130B: An Open Bilingual Pre-trained Model | 2022-10-05 | Code |
| 31 | Flan-T5-Large 780M (CoT) | 40.5 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 32 | GPT-3 Davinci 175B (5-shot) | 39.7 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 33 | Bloomberg GPT 50B (5-shot) | 39.2 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 34 | UL2 20B (5-shot) | 39.2 | No | UL2: Unifying Language Learning Paradigms | 2022-05-10 | Code |
| 35 | BLOOM 176B (5-shot) | 39.1 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 36 | phi-1.5-web 1.3B | 37.9 | No | Textbooks Are All You Need II: phi-1.5 technical... | 2023-09-11 | Code |
| 37 | OPT 66B (5-shot) | 36 | No | BloombergGPT: A Large Language Model for Finance | 2023-03-30 | Code |
| 38 | Flan-T5-Base 250M | 35.9 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 39 | Flan-T5-Base 250M (CoT) | 33.7 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 40 | GPT-NeoX 20B (5-shot) | 33.6 | No | GPT-NeoX-20B: An Open-Source Autoregressive Lang... | 2022-04-14 | Code |
| 41 | RWKV v5 Eagle 7B | 31 | No | - | - | - |
| 42 | LLaMA 7B MiLe-Loss (5-shot) | 29.68 | No | MiLe Loss: a New Loss for Mitigating the Bias of... | 2023-10-30 | Code |
| 43 | Flan-T5-Small 80M | 28.7 | No | Scaling Instruction-Finetuned Language Models | 2022-10-20 | Code |
| 44 | Falcon 7B (5-shot) | 28 | No | The Falcon Series of Open Language Models | 2023-11-28 | - |