Video Question Answering on VNBench

Metric: Accuracy (higher is better)

Results

#	Model	Accuracy (%)	Extra Data	Paper	Date	Code
1	BIMBA-LLaVA-Qwen2-7B	77.88	No	BIMBA: Selective-Scan Compression for Long-Range...	2025-03-12	Code
2	Gemini	66.7	No	Gemini 1.5: Unlocking multimodal understanding a...	2024-03-08	Code
3	LLaVA-OneVision-72B	58.7	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
4	LLaVA-OneVision-7B	51.8	No	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
5	Qwen2-VL-7B	33.9	No	Qwen2-VL: Enhancing Vision-Language Model's Perc...	2024-09-18	Code
6	LLaVA-NeXT-Video-7B	20.1	No	LLaVA-NeXT-Interleave: Tackling Multi-image, Vid...	2024-07-10	Code
7	VideoChat2	12.4	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
8	VideoLLaMA2	4.5	No	VideoLLaMA 2: Advancing Spatial-Temporal Modelin...	2024-06-11	Code
9	VideoChatGPT	4.1	No	Video-ChatGPT: Towards Detailed Video Understand...	2023-06-08	Code

#1 BIMBA-LLaVA-Qwen2-7B (SOTA)
Accuracy: 77.88 · 2025-03-12
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

#2 Gemini (SOTA)
Accuracy: 66.7 · 2024-03-08
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

#3 LLaVA-OneVision-72B
Accuracy: 58.7 · 2024-08-06
LLaVA-OneVision: Easy Visual Task Transfer

#4 LLaVA-OneVision-7B
Accuracy: 51.8 · 2024-08-06
LLaVA-OneVision: Easy Visual Task Transfer

#5 Qwen2-VL-7B
Accuracy: 33.9 · 2024-09-18
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

#6 LLaVA-NeXT-Video-7B
Accuracy: 20.1 · 2024-07-10
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

#7 VideoChat2 (SOTA)
Accuracy: 12.4 · 2023-05-10
VideoChat: Chat-Centric Video Understanding

#8 VideoLLaMA2
Accuracy: 4.5 · 2024-06-11
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

#9 VideoChatGPT
Accuracy: 4.1 · 2023-06-08
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models