Video Question Answering on MSRVTT-QA

Metric: Accuracy (higher is better)
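
The page does not say how each submission computes accuracy, and protocols may differ between papers (for example, exact match against an answer vocabulary versus correctness judged by another model). As a minimal sketch, assuming case-insensitive exact-match scoring against one reference answer per question, the percentage could be computed as below. The function name `exact_match_accuracy` and the normalization step are illustrative assumptions, not taken from any listed paper.

```python
def exact_match_accuracy(predictions, references):
    """Return the percentage of predictions equal to their reference answer.

    Assumption: answers are short strings compared after trimming
    whitespace and lowercasing. Real evaluation scripts may differ.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")

    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Hypothetical answers, not drawn from the MSRVTT-QA data.
    preds = ["dog", "cat", "run"]
    refs = ["dog", "bird", "run"]
    print(f"{exact_match_accuracy(preds, refs):.1f}")
```

Scores produced under different protocols are not directly comparable, which is worth keeping in mind when reading the table.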

Results

#	Model	Accuracy	Extra Data	Paper	Date	Code
1	Flash-VStream	72.4	No	Flash-VStream: Memory-Based Real-Time Understand...	2024-06-12	Code
2	PLLaVA (34B)	68.7	No	PLLaVA : Parameter-free LLaVA Extension from Ima...	2024-04-25	Code
3	Elysium	67.5	No	Elysium: Exploring Object-level Perception in Vi...	2024-03-25	Code
4	SlowFast-LLaVA-34B	67.4	No	SlowFast-LLaVA: A Strong Training-Free Baseline ...	2024-07-22	Code
5	Tarsier (34B)	66.4	No	Tarsier: Recipes for Training and Evaluating Lar...	2024-06-30	Code
6	LinVT-Qwen2-VL (7B)	66.2	No	LinVT: Empower Your Image-level Large Language M...	2024-12-06	Code
7	TS-LLaVA-34B	66.2	No	TS-LLaVA: Constructing Visual Tokens through Thu...	2024-11-17	Code
8	PPLLaVA-7B	64.3	No	PPLLaVA: Varied Video Sequence Understanding Wit...	2024-11-04	Code
9	IG-VLM	63.8	No	An Image Grid Can Be Worth a Video: Zero-shot Vi...	2024-03-27	Code
10	ST-LLM	63.2	No	ST-LLM: Large Language Models Are Effective Temp...	2024-03-30	Code
11	CAT-7B	62.1	No	CAT: Enhancing Multimodal Large Language Model t...	2024-03-07	Code
12	VideoGPT+	60.6	No	VideoGPT+: Integrating Image and Video Encoders ...	2024-06-13	Code
13	Vista-LLaMA-7B	60.5	No	Vista-LLaMA: Reliable Video Narrator via Equal D...	2023-12-12	-
14	MiniGPT4-video-7B	59.73	No	MiniGPT4-Video: Advancing Multimodal LLMs for Vi...	2024-04-04	Code
15	LLaVA-Mini	59.5	No	LLaVA-Mini: Efficient Image and Video Large Mult...	2025-01-07	Code
16	Video-LaVIT	59.3	No	Video-LaVIT: Unified Video-Language Pre-training...	2024-02-05	Code
17	Video-LLaVA-7B	59.2	Yes	Video-LLaVA: Learning United Visual Representati...	2023-11-16	Code
18	LLaMA-VID-13B (2 Token)	58.9	No	LLaMA-VID: An Image is Worth 2 Tokens in Large L...	2023-11-28	Code
19	LLaMA-VID-7B (2 Token)	57.7	No	LLaMA-VID: An Image is Worth 2 Tokens in Large L...	2023-11-28	Code
20	SUM-shot+Vicuna	56.8	No	Shot2Story20K: A New Benchmark for Comprehensive...	2023-12-16	Code
21	Omni-VideoAssistant	55.3	No	OmniDataComposer: A Unified Data Structure for M...	2023-08-08	Code
22	Chat-UniVi-7B	55	No	Chat-UniVi: Unified Visual Representation Empowe...	2023-11-14	Code
23	VideoChat2	54.1	No	MVBench: A Comprehensive Multi-modal Video Under...	2023-11-28	Code
24	MovieChat	52.7	No	MovieChat: From Dense Token to Sparse Memory for...	2023-07-31	Code
25	BT-Adapter (zero-shot)	51.2	No	BT-Adapter: Video Conversation is Feasible Witho...	2023-09-27	Code
26	BT-Adapter (zero-shot)	51.2	No	BT-Adapter: Video Conversation is Feasible Witho...	2023-09-27	Code
27	Mirasol3B	50.42	No	Mirasol3B: A Multimodal Autoregressive model for...	2023-11-09	-
28	VAST	50.1	Yes	VAST: A Vision-Audio-Subtitle-Text Omni-Modality...	2023-05-29	Code
29	Video-ChatGPT-7B	49.3	No	Video-ChatGPT: Towards Detailed Video Understand...	2023-06-08	Code
30	VALOR	49.2	Yes	VALOR: Vision-Audio-Language Omni-Perception Pre...	2023-04-17	Code
31	COSA	49.2	Yes	COSA: Concatenated Sample Pretrained Vision-Lang...	2023-06-15	Code
32	MA-LMM	48.5	No	MA-LMM: Memory-Augmented Large Multimodal Model ...	2024-04-08	Code
33	mPLUG-2	48	No	mPLUG-2: A Modularized Multi-modal Foundation Mo...	2023-02-01	Code
34	FrozenBiLM	47	Yes	Zero-Shot Video Question Answering via Frozen Bi...	2022-06-16	Code
35	HBI	46.2	No	Video-Text as Game Players: Hierarchical Banzhaf...	2023-03-25	Code
36	EMCL-Net	45.8	No	Expectation-Maximization Contrastive Learning fo...	2022-11-21	Code
37	Video Chat-7B	45	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
38	VindLU	44.6	Yes	VindLU: A Recipe for Effective Video-and-Languag...	2022-12-09	Code
39	VIOLETv2	44.5	No	An Empirical Study of End-to-End Video-Language ...	2022-09-04	Code
40	Singularity-temporal	43.9	No	Revealing Single Frame Bias for Video-and-Langua...	2022-06-07	Code
41	LLaMA Adapter-7B	43.8	No	LLaMA-Adapter V2: Parameter-Efficient Visual Ins...	2023-04-28	Code
42	Singularity	43.5	No	Revealing Single Frame Bias for Video-and-Langua...	2022-06-07	Code
43	Video LLaMA-7B	29.6	No	Video-LLaMA: An Instruction-tuned Audio-Visual L...	2023-06-05	Code
44	FrozenBiLM (0-shot)	16.7	No	Zero-Shot Video Question Answering via Frozen Bi...	2022-06-16	Code
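
The rows above are tab-separated, so they can be loaded into records for filtering or re-sorting. The following is a minimal sketch, assuming seven tab-separated fields per row in the order shown in the header; the field names in `FIELDS`, the helper `parse_rows`, and the `has_code` key are hypothetical names chosen for illustration. The sample rows are copied from the table, including the truncated paper titles.

```python
import csv
import io

# Hypothetical field names matching the table's column order.
FIELDS = ["rank", "model", "accuracy", "extra_data", "paper", "date", "code"]

SAMPLE = (
    "1\tFlash-VStream\t72.4\tNo\tFlash-VStream: Memory-Based Real-Time Understand...\t2024-06-12\tCode\n"
    "13\tVista-LLaMA-7B\t60.5\tNo\tVista-LLaMA: Reliable Video Narrator via Equal D...\t2023-12-12\t-\n"
    "17\tVideo-LLaVA-7B\t59.2\tYes\tVideo-LLaVA: Learning United Visual Representati...\t2023-11-16\tCode\n"
)


def parse_rows(text):
    """Parse tab-separated leaderboard rows into a list of dicts."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for fields in reader:
        # Skip the header row, blank lines, and anything malformed.
        if len(fields) != len(FIELDS) or not fields[0].isdigit():
            continue
        row = dict(zip(FIELDS, fields))
        row["rank"] = int(row["rank"])
        row["accuracy"] = float(row["accuracy"])
        row["extra_data"] = row["extra_data"] == "Yes"
        row["has_code"] = row.pop("code") == "Code"
        rows.append(row)
    return rows


if __name__ == "__main__":
    rows = parse_rows(SAMPLE)
    # Keep only entries trained without extra data, best accuracy first.
    no_extra = sorted(
        (r for r in rows if not r["extra_data"]),
        key=lambda r: r["accuracy"],
        reverse=True,
    )
    for r in no_extra:
        print(r["rank"], r["model"], r["accuracy"])
```

Filtering on `extra_data` mirrors the page's option to hide entries that use extra data; sorting on `date` instead of `accuracy` would reproduce the other sort order the table offers.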

#1 Flash-VStream [SOTA] · 72.4 Accuracy · 2024-06-12 · Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams · Code
#2 PLLaVA (34B) [SOTA] · 68.7 Accuracy · 2024-04-25 · PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning · Code
#3 Elysium [SOTA] · 67.5 Accuracy · 2024-03-25 · Elysium: Exploring Object-level Perception in Videos via MLLM · Code
#4 SlowFast-LLaVA-34B · 67.4 Accuracy · 2024-07-22 · SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models · Code
#5 Tarsier (34B) · 66.4 Accuracy · 2024-06-30 · Tarsier: Recipes for Training and Evaluating Large Video Description Models · Code
#6 LinVT-Qwen2-VL (7B) · 66.2 Accuracy · 2024-12-06 · LinVT: Empower Your Image-level Large Language Model to Understand Videos · Code
#7 TS-LLaVA-34B · 66.2 Accuracy · 2024-11-17 · TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models · Code
#8 PPLLaVA-7B · 64.3 Accuracy · 2024-11-04 · PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance · Code
#9 IG-VLM · 63.8 Accuracy · 2024-03-27 · An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM · Code
#10 ST-LLM · 63.2 Accuracy · 2024-03-30 · ST-LLM: Large Language Models Are Effective Temporal Learners · Code
#11 CAT-7B [SOTA] · 62.1 Accuracy · 2024-03-07 · CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios · Code
#12 VideoGPT+ · 60.6 Accuracy · 2024-06-13 · VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding · Code
#13 Vista-LLaMA-7B [SOTA] · 60.5 Accuracy · 2023-12-12 · Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
#14 MiniGPT4-video-7B · 59.73 Accuracy · 2024-04-04 · MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens · Code
#15 LLaVA-Mini · 59.5 Accuracy · 2025-01-07 · LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token · Code
#16 Video-LaVIT · 59.3 Accuracy · 2024-02-05 · Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization · Code
#17 Video-LLaVA-7B [SOTA] · 59.2 Accuracy · Extra Data · 2023-11-16 · Video-LLaVA: Learning United Visual Representation by Alignment Before Projection · Code
#18 LLaMA-VID-13B (2 Token) · 58.9 Accuracy · 2023-11-28 · LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models · Code
#19 LLaMA-VID-7B (2 Token) · 57.7 Accuracy · 2023-11-28 · LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models · Code
#20 SUM-shot+Vicuna · 56.8 Accuracy · 2023-12-16 · Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos · Code
#21 Omni-VideoAssistant [SOTA] · 55.3 Accuracy · 2023-08-08 · OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation · Code
#22 Chat-UniVi-7B · 55 Accuracy · 2023-11-14 · Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding · Code
#23 VideoChat2 · 54.1 Accuracy · 2023-11-28 · MVBench: A Comprehensive Multi-modal Video Understanding Benchmark · Code
#24 MovieChat [SOTA] · 52.7 Accuracy · 2023-07-31 · MovieChat: From Dense Token to Sparse Memory for Long Video Understanding · Code
#25 BT-Adapter (zero-shot) · 51.2 Accuracy · 2023-09-27 · BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning · Code
#26 BT-Adapter (zero-shot) · 51.2 Accuracy · 2023-09-27 · BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning · Code
#27 Mirasol3B · 50.42 Accuracy · 2023-11-09 · Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
#28 VAST [SOTA] · 50.1 Accuracy · Extra Data · 2023-05-29 · VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset · Code
#29 Video-ChatGPT-7B · 49.3 Accuracy · 2023-06-08 · Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models · Code
#30 VALOR [SOTA] · 49.2 Accuracy · Extra Data · 2023-04-17 · VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset · Code
#31 COSA · 49.2 Accuracy · Extra Data · 2023-06-15 · COSA: Concatenated Sample Pretrained Vision-Language Foundation Model · Code
#32 MA-LMM · 48.5 Accuracy · 2024-04-08 · MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding · Code
#33 mPLUG-2 [SOTA] · 48 Accuracy · 2023-02-01 · mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video · Code
#34 FrozenBiLM [SOTA] · 47 Accuracy · Extra Data · 2022-06-16 · Zero-Shot Video Question Answering via Frozen Bidirectional Language Models · Code
#35 HBI · 46.2 Accuracy · 2023-03-25 · Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning · Code
#36 EMCL-Net · 45.8 Accuracy · 2022-11-21 · Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations · Code
#37 Video Chat-7B · 45 Accuracy · 2023-05-10 · VideoChat: Chat-Centric Video Understanding · Code
#38 VindLU · 44.6 Accuracy · Extra Data · 2022-12-09 · VindLU: A Recipe for Effective Video-and-Language Pretraining · Code
#39 VIOLETv2 · 44.5 Accuracy · 2022-09-04 · An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling · Code
#40 Singularity-temporal [SOTA] · 43.9 Accuracy · 2022-06-07 · Revealing Single Frame Bias for Video-and-Language Learning · Code
#41 LLaMA Adapter-7B · 43.8 Accuracy · 2023-04-28 · LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model · Code
#42 Singularity · 43.5 Accuracy · 2022-06-07 · Revealing Single Frame Bias for Video-and-Language Learning · Code
#43 Video LLaMA-7B · 29.6 Accuracy · 2023-06-05 · Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding · Code
#44 FrozenBiLM (0-shot) · 16.7 Accuracy · 2022-06-16 · Zero-Shot Video Question Answering via Frozen Bidirectional Language Models · Code