Question Answering on MSRVTT-QA

Metric: Accuracy (higher is better)

Results

#	Model	Accuracy	Extra Data	SOTA	Paper	Date	Code
1	Flash-VStream	72.4	No	Yes	Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	2024-06-12	Code
2	PLLaVA (34B)	68.7	No	Yes	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
3	Elysium	67.5	No	Yes	Elysium: Exploring Object-level Perception in Videos via MLLM	2024-03-25	Code
4	SlowFast-LLaVA-34B	67.4	No	No	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	Code
5	Tarsier (34B)	66.4	No	No	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
6	LinVT-Qwen2-VL (7B)	66.2	No	No	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06	Code
7	TS-LLaVA-34B	66.2	No	No	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	Code
8	PPLLaVA-7B	64.3	No	No	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	2024-11-04	Code
9	IG-VLM	63.8	No	No	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	Code
10	ST-LLM	63.2	No	No	ST-LLM: Large Language Models Are Effective Temporal Learners	2024-03-30	Code
11	CAT-7B	62.1	No	Yes	CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios	2024-03-07	Code
12	VideoGPT+	60.6	No	No	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	Code
13	Vista-LLaMA-7B	60.5	No	Yes	Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens	2023-12-12	-
14	MiniGPT4-video-7B	59.73	No	No	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	2024-04-04	Code
15	LLaVA-Mini	59.5	No	No	LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token	2025-01-07	Code
16	Video-LaVIT	59.3	No	No	Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	2024-02-05	Code
17	Video-LLaVA-7B	59.2	Yes	Yes	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
18	LLaMA-VID-13B (2 Token)	58.9	No	No	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
19	LLaMA-VID-7B (2 Token)	57.7	No	No	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
20	SUM-shot+Vicuna	56.8	No	No	Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos	2023-12-16	Code
21	Omni-VideoAssistant	55.3	No	Yes	OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation	2023-08-08	Code
22	Chat-UniVi-7B	55	No	No	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
23	VideoChat2	54.1	No	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
24	MovieChat	52.7	No	Yes	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023-07-31	Code
25	BT-Adapter (zero-shot)	51.2	No	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
26	BT-Adapter (zero-shot)	51.2	No	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
27	Video-ChatGPT-7B	49.3	No	Yes	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
28	Video Chat-7B	45	No	Yes	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
29	LLaMA Adapter-7B	43.8	No	Yes	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Code
30	Video LLaMA-7B	29.6	No	No	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	2023-06-05	Code