Question Answering on EgoSchema (fullset)

Metric: Accuracy (higher is better)
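
As a minimal sketch of how an accuracy figure like the ones in the table can be computed for multiple-choice question answering: the function below counts exact matches between predicted option indices and the answer key and reports a percentage. The function name, the integer option indices, and the toy data are all hypothetical, and this is not the benchmark's official scoring code. The Random baseline of 20 in the results is consistent with five answer options per question, but that reading is an assumption.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions whose predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count positions where the predicted option index equals the correct one.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices (0-4) for five questions.
preds = [0, 3, 2, 4, 1]
gold = [0, 1, 2, 4, 4]
print(accuracy(preds, gold))  # 3 of 5 match
```

Scores on the leaderboard are on this same 0 to 100 scale, where higher is better.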

Results

#	Model	Accuracy	Extra Data	SOTA	Paper	Date
1	BIMBA-LLaVA-Qwen2-7B	71.14	No	Yes	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering	2025-03-12
2	LinVT-Qwen2-VL(7B)	69.5	No	Yes	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06
3	Qwen2.5-Omni	68.6	Yes	No	Qwen2.5-Omni Technical Report	2025-03-26
4	LongVU (7B)	67.6	No	Yes	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding	2024-10-22
5	Video-RAG (Based on LLaVA-Video)	66.7	No	No	Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension	2024-11-20
6	VideoLLaMA2 (72B)	63.9	No	Yes	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11
7	Tarsier (34B)	61.7	No	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30
8	LVNet	61.1	No	No	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA	2024-06-13
9	VideoTree (GPT4)	61.1	No	Yes	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	2024-05-29
10	InternVideo2-6B	60.2	No	Yes	InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	2024-03-22
11	VideoChat-T (7B)	60	No	No	TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning	2024-10-25
12	VideoChat2_phi3	56.7	No	Yes	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28
13	VideoChat2_HD_mistral	55.8	No	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28
14	VideoChat2_mistral	54.4	No	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28
15	Vamos (GPT-4o)	53.6	No	Yes	Vamos: Versatile Action Models for Video Understanding	2023-11-22
16	TraveLER	53.3	No	No	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering	2024-04-01
17	LLoVi (GPT-3.5)	50.3	No	No	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28
18	Video ReCap	50.23	No	No	Video ReCap: Recursive Captioning of Hour-Long Videos	2024-02-20
19	Vamos (GPT-4)	48.3	No	No	Vamos: Versatile Action Models for Video Understanding	2023-11-22
20	LangRepo (12B)	41.2	No	No	Language Repository for Long Video Understanding	2024-03-21
21	MVU (13B)	37.6	No	No	Understanding Long Videos with Multimodal Language Models	2024-03-25
22	Vamos (13B)	36.7	No	No	Vamos: Versatile Action Models for Video Understanding	2023-11-22
23	LLoVi (7B)	33.5	No	No	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28
24	TimeChat (7B)	33	No	No	TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding	2023-12-04
25	InternVideo	32.1	No	Yes	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06
26	mPLUG-Owl (7B)	31.1	No	No	mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	2023-04-27
27	FrozenBiLM	26.9	No	Yes	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022-06-16
28	SeViLA (4B)	22.7	No	No	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11
29	Random	20	No	No	CREPE: Can Vision-Language Foundation Models Reason Compositionally?	2022-12-13