Question Answering on Video-MME (w/o subs)

Metric: Accuracy (%) (higher is better)

Results

#	Model	Accuracy (%)	Extra Data	Paper	Date	Code
1	Video-RAG (based on LLaVA-Video)	77.4	No	Video-RAG: Visually-aligned Retrieval-Augmented ...	2024-11-20	Code
2	Gemini 1.5 Pro	71.9	No	Gemini 1.5: Unlocking multimodal understanding a...	2024-03-08	Code
3	GPT-4o	70.3	No	GPT-4o: Visual perception performance of multimo...	2024-06-14	-
4	Gemini 1.5 Flash	66.3	No	Gemini 1.5: Unlocking multimodal understanding a...	2024-03-08	Code
5	LLaVA-OneVision (72B)	64.8	No	-	-	-
6	GPT-4o mini	62.3	No	GPT-4o: Visual perception performance of multimo...	2024-06-14	-
7	VILA-1.5 (34B)	61.4	No	VILA: On Pre-training for Visual Language Models	2023-12-12	Code
8	VideoLLaMA2 (72B)	60.9	No	VideoLLaMA 2: Advancing Spatial-Temporal Modelin...	2024-06-11	Code
9	VideoChat-T (7B)	46.3	No	TimeSuite: Improving MLLMs for Long Video Unders...	2024-10-25	Code

#1 Video-RAG (based on LLaVA-Video) [SOTA]
77.4
Accuracy (%) · 2024-11-20
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension [Code]

#2 Gemini 1.5 Pro [SOTA]
71.9
Accuracy (%) · 2024-03-08
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context [Code]

#3 GPT-4o
70.3
Accuracy (%) · 2024-06-14
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

#4 Gemini 1.5 Flash
66.3
Accuracy (%) · 2024-03-08
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context [Code]

#5 LLaVA-OneVision (72B)
64.8
Accuracy (%)
No paper

#6 GPT-4o mini
62.3
Accuracy (%) · 2024-06-14
GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

#7 VILA-1.5 (34B) [SOTA]
61.4
Accuracy (%) · 2023-12-12
VILA: On Pre-training for Visual Language Models [Code]

#8 VideoLLaMA2 (72B)
60.9
Accuracy (%) · 2024-06-11
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs [Code]

#9 VideoChat-T (7B)
46.3
Accuracy (%) · 2024-10-25
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [Code]