Question Answering on Video-MME

Metric: Accuracy (%) (higher is better)
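
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of questions whose predicted option letter matches the reference answer. It assumes multiple-choice questions with single-letter answers; the function name `accuracy_percent` and the sample data are hypothetical and are not taken from any official evaluation script.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return 100.0 * correct / len(answers)


# Hypothetical predictions and answer key, for illustration only.
preds = ["A", "c", "B", "D"]
gold = ["A", "C", "D", "D"]
print(f"{accuracy_percent(preds, gold):.1f}")  # 3 of 4 correct, prints 75.0
```

Submitted results may be produced with different frame counts, subtitle settings, or answer-parsing rules, so this sketch only shows the arithmetic behind the reported percentage.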

Results

#	Model	Accuracy (%)	Extra Data	Paper	Date	Code
1	Gemini 1.5 Pro	81.3	No	Gemini 1.5: Unlocking multimodal understanding a...	2024-03-08	Code
2	Video-RAG (Based on LLaVA-Video)	77.4	No	Video-RAG: Visually-aligned Retrieval-Augmented ...	2024-11-20	Code
3	GPT-4o	77.2	No	GPT-4o: Visual perception performance of multimo...	2024-06-14	-
4	Gemini 1.5 Flash	75	No	Gemini 1.5: Unlocking multimodal understanding a...	2024-03-08	Code
5	GPT-4o mini	68.9	No	GPT-4o: Visual perception performance of multimo...	2024-06-14	-
6	BIMBA-LLaVA-Qwen2-7B	64.67	No	BIMBA: Selective-Scan Compression for Long-Range...	2025-03-12	Code
7	VILA-1.5 (34B)	64.1	No	VILA: On Pre-training for Visual Language Models	2023-12-12	Code
8	MiniCPM-V 2.6 (8B)	63.7	No	MiniCPM-V: A GPT-4V Level MLLM on Your Phone	2024-08-03	Code
9	VideoLLaMA2 (72B)	63.1	No	VideoLLaMA 2: Advancing Spatial-Temporal Modelin...	2024-06-11	Code
10	LongVU (7B)	60.6	No	LongVU: Spatiotemporal Adaptive Compression for ...	2024-10-22	Code
11	VideoChat-T (7B)	55.8	No	TimeSuite: Improving MLLMs for Long Video Unders...	2024-10-25	Code

#1 Gemini 1.5 Pro (SOTA)
81.3 Accuracy (%) · 2024-03-08
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context · Code

#2 Video-RAG (Based on LLaVA-Video)
77.4 Accuracy (%) · 2024-11-20
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension · Code

#3 GPT-4o
77.2 Accuracy (%) · 2024-06-14
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

#4 Gemini 1.5 Flash
75 Accuracy (%) · 2024-03-08
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context · Code

#5 GPT-4o mini
68.9 Accuracy (%) · 2024-06-14
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

#6 BIMBA-LLaVA-Qwen2-7B
64.67 Accuracy (%) · 2025-03-12
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering · Code

#7 VILA-1.5 (34B) (SOTA)
64.1 Accuracy (%) · 2023-12-12
VILA: On Pre-training for Visual Language Models · Code

#8 MiniCPM-V 2.6 (8B)
63.7 Accuracy (%) · 2024-08-03
MiniCPM-V: A GPT-4V Level MLLM on Your Phone · Code

#9 VideoLLaMA2 (72B)
63.1 Accuracy (%) · 2024-06-11
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs · Code

#10 LongVU (7B)
60.6 Accuracy (%) · 2024-10-22
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding · Code

#11 VideoChat-T (7B)
55.8 Accuracy (%) · 2024-10-25
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning · Code