Question Answering on MSVD-QA

Metric: Accuracy (higher is better)
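
As a rough illustration of the metric, accuracy is the percentage of questions answered correctly. The page does not say how a prediction is judged correct, so the case-insensitive exact match below is an assumption, and the function name and toy data are hypothetical. This is a minimal sketch, not the evaluation protocol used by any listed model.

```python
def accuracy(predictions, references):
    """Return the percentage of predictions that match their reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumption: a prediction counts as correct on a case-insensitive,
    # whitespace-trimmed exact match with the reference answer.
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical toy data, not taken from MSVD-QA.
preds = ["man", "Guitar", "two", "dog"]
refs = ["man", "guitar", "three", "cat"]
score = accuracy(preds, refs)
print(round(score, 1))  # 2 of 4 match, so this prints 50.0
```

Scores on the leaderboard are reported on the same 0 to 100 scale.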

Results

#	Model	Accuracy	Extra Data	Paper	Date	SOTA Badge
1	Tarsier (34B)	80.3	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	No
2	Flash-VStream	80.3	No	Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	2024-06-12	Yes
3	LinVT-Qwen2-VL (7B)	80.2	No	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06	No
4	VILA1.5-40B	80.1	No	VILA: On Pre-training for Visual Language Models	2023-12-12	Yes
5	PLLaVA (34B)	79.9	No	PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	No
6	SlowFast-LLaVA-34B	79.9	No	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	No
7	IG-VLM-34B	79.6	No	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	No
8	TS-LLaVA-34B	79.4	No	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	No
9	PPLLaVA-7B	77.1	No	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	2024-11-04	No
10	Elysium	75.8	No	Elysium: Exploring Object-level Perception in Videos via MLLM	2024-03-25	No
11	MovieChat	75.2	No	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023-07-31	Yes
12	ST-LLM	74.6	No	ST-LLM: Large Language Models Are Effective Temporal Learners	2024-03-30	No
13	MiniGPT4-video-7B	73.92	No	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	2024-04-04	No
14	Video-LaVIT	73.2	No	Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	2024-02-05	No
15	VideoGPT+	72.4	No	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	No
16	LLaVA-Mini	70.9	No	LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token	2025-01-07	No
17	Video-LLaVA-7B	70.7	No	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	No
18	VideoChat2	70	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	No
19	LLaMA-VID-13B (2 Token)	70	No	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	No
20	LLaMA-VID-7B (2 Token)	69.7	No	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	No
21	Chat-UniVi-7B	69.3	No	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	No
22	BT-Adapter (zero-shot)	67	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	No
23	BT-Adapter (zero-shot)	67	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	No
24	Video-ChatGPT-7B	64.9	No	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Yes
25	Video Chat-7B	56.3	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Yes
26	LLaMA Adapter-7B	54.9	No	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Yes
27	Video LLaMA-7B	51.6	No	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	2023-06-05	No
28	FrozenBiLM	33.8	No	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022-06-16	Yes