Video Question Answering on MSRVTT-QA

Metric: Confidence Score (lower is better)

Results

#	Model	Confidence Score	Extra Data	Paper	Date	Code
1	Video LLaMA-7B	1.8	No	Video-LLaMA: An Instruction-tuned Audio-Visual L...	2023-06-05	Code
2	Video Chat-7B	2.5	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
3	MovieChat	2.6	No	MovieChat: From Dense Token to Sparse Memory for...	2023-07-31	Code
4	LLaMA Adapter-7B	2.7	No	LLaMA-Adapter V2: Parameter-Efficient Visual Ins...	2023-04-28	Code
5	Video-ChatGPT-7B	2.8	No	Video-ChatGPT: Towards Detailed Video Understand...	2023-06-08	Code
6	BT-Adapter (zero-shot)	2.9	No	BT-Adapter: Video Conversation is Feasible Witho...	2023-09-27	Code
7	BT-Adapter (zero-shot)	2.9	No	BT-Adapter: Video Conversation is Feasible Witho...	2023-09-27	Code
8	Chat-UniVi-7B	3.1	No	Chat-UniVi: Unified Visual Representation Empowe...	2023-11-14	Code
9	Elysium	3.2	No	Elysium: Exploring Object-level Perception in Vi...	2024-03-25	Code
10	LLaMA-VID-7B (2 Token)	3.2	No	LLaMA-VID: An Image is Worth 2 Tokens in Large L...	2023-11-28	Code
11	Vista-LLaMA-7B	3.3	No	Vista-LLaMA: Reliable Video Narrator via Equal D...	2023-12-12	-
12	Video-LaVIT	3.3	No	Video-LaVIT: Unified Video-Language Pre-training...	2024-02-05	Code
13	LLaMA-VID-13B (2 Token)	3.3	No	LLaMA-VID: An Image is Worth 2 Tokens in Large L...	2023-11-28	Code
14	Omni-VideoAssistant	3.3	No	OmniDataComposer: A Unified Data Structure for M...	2023-08-08	Code
15	VideoChat2	3.3	No	MVBench: A Comprehensive Multi-modal Video Under...	2023-11-28	Code
16	Flash-VStream	3.4	No	Flash-VStream: Memory-Based Real-Time Understand...	2024-06-12	Code
17	ST-LLM	3.4	No	ST-LLM: Large Language Models Are Effective Temp...	2024-03-30	Code
18	PPLLaVA-7B	3.5	No	PPLLaVA: Varied Video Sequence Understanding Wit...	2024-11-04	Code
19	IG-VLM	3.5	No	An Image Grid Can Be Worth a Video: Zero-shot Vi...	2024-03-27	Code
20	CAT-7B	3.5	No	CAT: Enhancing Multimodal Large Language Model t...	2024-03-07	Code
21	Video-LLaVA-7B	3.5	Yes	Video-LLaVA: Learning United Visual Representati...	2023-11-16	Code
22	PLLaVA (34B)	3.6	No	PLLaVA : Parameter-free LLaVA Extension from Ima...	2024-04-25	Code
23	TS-LLaVA-34B	3.6	No	TS-LLaVA: Constructing Visual Tokens through Thu...	2024-11-17	Code
24	VideoGPT+	3.6	No	VideoGPT+: Integrating Image and Video Encoders ...	2024-06-13	Code
25	LLaVA-Mini	3.6	No	LLaVA-Mini: Efficient Image and Video Large Mult...	2025-01-07	Code
26	SlowFast-LLaVA-34B	3.7	No	SlowFast-LLaVA: A Strong Training-Free Baseline ...	2024-07-22	Code
27	Tarsier (34B)	3.7	No	Tarsier: Recipes for Training and Evaluating Lar...	2024-06-30	Code
28	LinVT-Qwen2-VL (7B)	4	No	LinVT: Empower Your Image-level Large Language M...	2024-12-06	Code

#1 Video LLaMA-7B (SOTA)
1.8
Confidence Score · 2023-06-05
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (Code)

#2 Video Chat-7B (SOTA)
2.5
Confidence Score · 2023-05-10
VideoChat: Chat-Centric Video Understanding (Code)

#3 MovieChat
2.6
Confidence Score · 2023-07-31
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding (Code)

#4 LLaMA Adapter-7B (SOTA)
2.7
Confidence Score · 2023-04-28
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model (Code)

#5 Video-ChatGPT-7B
2.8
Confidence Score · 2023-06-08
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models (Code)

#6 BT-Adapter (zero-shot)
2.9
Confidence Score · 2023-09-27
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning (Code)

#7 BT-Adapter (zero-shot)
2.9
Confidence Score · 2023-09-27
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning (Code)

#8 Chat-UniVi-7B
3.1
Confidence Score · 2023-11-14
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (Code)

#9 Elysium
3.2
Confidence Score · 2024-03-25
Elysium: Exploring Object-level Perception in Videos via MLLM (Code)

#10 LLaMA-VID-7B (2 Token)
3.2
Confidence Score · 2023-11-28
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (Code)

#11 Vista-LLaMA-7B
3.3
Confidence Score · 2023-12-12
Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens

#12 Video-LaVIT
3.3
Confidence Score · 2024-02-05
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization (Code)

#13 LLaMA-VID-13B (2 Token)
3.3
Confidence Score · 2023-11-28
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (Code)

#14 Omni-VideoAssistant
3.3
Confidence Score · 2023-08-08
OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation (Code)

#15 VideoChat2
3.3
Confidence Score · 2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (Code)

#16 Flash-VStream
3.4
Confidence Score · 2024-06-12
Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams (Code)

#17 ST-LLM
3.4
Confidence Score · 2024-03-30
ST-LLM: Large Language Models Are Effective Temporal Learners (Code)

#18 PPLLaVA-7B
3.5
Confidence Score · 2024-11-04
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance (Code)

#19 IG-VLM
3.5
Confidence Score · 2024-03-27
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM (Code)

#20 CAT-7B
3.5
Confidence Score · 2024-03-07
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios (Code)

#21 Video-LLaVA-7B
3.5
Confidence Score · Extra Data · 2023-11-16
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (Code)

#22 PLLaVA (34B)
3.6
Confidence Score · 2024-04-25
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (Code)

#23 TS-LLaVA-34B
3.6
Confidence Score · 2024-11-17
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models (Code)

#24 VideoGPT+
3.6
Confidence Score · 2024-06-13
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding (Code)

#25 LLaVA-Mini
3.6
Confidence Score · 2025-01-07
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token (Code)

#26 SlowFast-LLaVA-34B
3.7
Confidence Score · 2024-07-22
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models (Code)

#27 Tarsier (34B)
3.7
Confidence Score · 2024-06-30
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Code)

#28 LinVT-Qwen2-VL (7B)
4
Confidence Score · 2024-12-06
LinVT: Empower Your Image-level Large Language Model to Understand Videos (Code)