Video Question Answering on EgoSchema (subset)

Metric: Accuracy (higher is better)
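
As a minimal sketch of how an accuracy figure like the ones reported here can be computed: the percentage of multiple-choice questions answered correctly. The function name `accuracy`, the toy predictions, and the use of five option indices (0-4) are assumptions for illustration; the five-option format is consistent with the Random baseline scoring 20, but this is not the official evaluation script.

```python
def accuracy(predictions, answers):
    """Percentage of questions whose predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count exact matches between predicted and reference option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices (0-4) for five questions.
preds = [0, 3, 2, 4, 1]
gold = [0, 3, 1, 4, 2]
print(accuracy(preds, gold))  # 3 of 5 correct -> 60.0
```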

Results

#	Model	Accuracy	Extra Data	Paper	Date	SOTA at release	Code
1	Tarsier (34B)	68.6	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Yes	Code
2	VideoChat-T (7B)	68.4	No	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	2024-10-25	No	Code
3	LangRepo (12B)	66.2	No	Language Repository for Long Video Understanding	2024-03-21	Yes	Code
4	VideoTree (GPT4)	66.2	No	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	2024-05-29	No	Code
5	LVNet	66	No	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA	2024-06-13	No	Code
6	VideoChat2_HD_mistral	65.6	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Yes	Code
7	VideoChat2_mistral	63.6	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	No	Code
8	MVU (13B)	60.3	No	Understanding Long Videos with Multimodal Language Models	2024-03-25	No	Code
9	TS-LLaVA-34B	57.8	No	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	No	Code
10	LLoVi (GPT-3.5)	57.6	No	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28	No	Code
11	LLoVi (7B)	50.8	No	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28	No	Code
12	SlowFast-LLaVA-34B	47.2	No	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	No	Code
13	SeViLA (4B)	25.7	No	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11	Yes	Code
14	Random	20	No	CREPE: Can Vision-Language Foundation Models Reason Compositionally?	2022-12-13	Yes	Code