Visual Question Answering (VQA) on VideoInstruct

Metric: mean (higher is better)

Results

#	Model	mean	Extra Data	Paper	Date	Code
1	PPLLaVA-7B-dpo	3.73	No	PPLLaVA: Varied Video Sequence Understanding Wit...	2024-11-04	Code
2	VLM-RLAIF	3.49	No	Tuning Large Multimodal Models for Videos using ...	2024-02-06	Code
3	TS-LLaVA-34B	3.38	No	TS-LLaVA: Constructing Visual Tokens through Thu...	2024-11-17	Code
4	PLLaVA-34B	3.32	No	PLLaVA : Parameter-free LLaVA Extension from Ima...	2024-04-25	Code
5	PPLLaVA-7B	3.32	No	PPLLaVA: Varied Video Sequence Understanding Wit...	2024-11-04	Code
6	SlowFast-LLaVA-34B	3.32	No	SlowFast-LLaVA: A Strong Training-Free Baseline ...	2024-07-22	Code
7	VideoGPT+	3.28	No	VideoGPT+: Integrating Image and Video Encoders ...	2024-06-13	Code
8	IG-VLM-GPT4v	3.17	No	An Image Grid Can Be Worth a Video: Zero-shot Vi...	2024-03-27	Code
9	ST-LLM-7B	3.15	No	ST-LLM: Large Language Models Are Effective Temp...	2024-03-30	Code
10	VideoChat2_HD_mistral	3.10	No	MVBench: A Comprehensive Multi-modal Video Under...	2023-11-28	Code
11	CAT-7B	3.07	No	CAT: Enhancing Multimodal Large Language Model t...	2024-03-07	Code
12	LITA-13B	3.04	No	LITA: Language Instructed Temporal-Localization ...	2024-03-27	Code
13	LLaMA-VID-13B (2 Token)	2.99	No	LLaMA-VID: An Image is Worth 2 Tokens in Large L...	2023-11-28	Code
14	Chat-UniVi	2.99	No	Chat-UniVi: Unified Visual Representation Empowe...	2023-11-14	Code
15	VideoChat2	2.98	No	MVBench: A Comprehensive Multi-modal Video Under...	2023-11-28	Code
16	LLaMA-VID-7B (2 Token)	2.89	No	LLaMA-VID: An Image is Worth 2 Tokens in Large L...	2023-11-28	Code
17	VTimeLLM	2.85	No	VTimeLLM: Empower LLM to Grasp Video Moments	2023-11-30	Code
18	BT-Adapter	2.69	No	BT-Adapter: Video Conversation is Feasible Witho...	2023-09-27	Code
19	BT-Adapter (zero-shot)	2.46	No	BT-Adapter: Video Conversation is Feasible Witho...	2023-09-27	Code
20	Video-ChatGPT	2.38	No	Video-ChatGPT: Towards Detailed Video Understand...	2023-06-08	Code
21	Video Chat	2.29	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
22	LLaMA Adapter	2.16	No	LLaMA-Adapter V2: Parameter-Efficient Visual Ins...	2023-04-28	Code
23	Video LLaMA	1.98	No	Video-LLaMA: An Instruction-tuned Audio-Visual L...	2023-06-05	Code

#1 PPLLaVA-7B-dpo (SOTA)
3.73 mean · 2024-11-04
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

#2 VLM-RLAIF (SOTA)
3.49 mean · 2024-02-06
Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback

#3 TS-LLaVA-34B
3.38 mean · 2024-11-17
TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

#4 PLLaVA-34B
3.32 mean · 2024-04-25
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

#5 PPLLaVA-7B
3.32 mean · 2024-11-04
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

#6 SlowFast-LLaVA-34B
3.32 mean · 2024-07-22
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

#7 VideoGPT+
3.28 mean · 2024-06-13
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

#8 IG-VLM-GPT4v
3.17 mean · 2024-03-27
An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

#9 ST-LLM-7B
3.15 mean · 2024-03-30
ST-LLM: Large Language Models Are Effective Temporal Learners

#10 VideoChat2_HD_mistral (SOTA)
3.10 mean · 2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

#11 CAT-7B
3.07 mean · 2024-03-07
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

#12 LITA-13B
3.04 mean · 2024-03-27
LITA: Language Instructed Temporal-Localization Assistant

#13 LLaMA-VID-13B (2 Token)
2.99 mean · 2023-11-28
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

#14 Chat-UniVi (SOTA)
2.99 mean · 2023-11-14
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

#15 VideoChat2
2.98 mean · 2023-11-28
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

#16 LLaMA-VID-7B (2 Token)
2.89 mean · 2023-11-28
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

#17 VTimeLLM
2.85 mean · 2023-11-30
VTimeLLM: Empower LLM to Grasp Video Moments

#18 BT-Adapter (SOTA)
2.69 mean · 2023-09-27
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

#19 BT-Adapter (zero-shot)
2.46 mean · 2023-09-27
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

#20 Video-ChatGPT (SOTA)
2.38 mean · 2023-06-08
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models

#21 Video Chat (SOTA)
2.29 mean · 2023-05-10
VideoChat: Chat-Centric Video Understanding

#22 LLaMA Adapter (SOTA)
2.16 mean · 2023-04-28
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

#23 Video LLaMA
1.98 mean · 2023-06-05
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
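
The page does not say how the SOTA label is assigned. Judging from the dates and scores listed above, it seems to mark entries that set a new best mean when they were published. The sketch below tests that reading on the leaderboard rows. The function name `running_best` and the rule itself (process dates in order, with only the top score of each date able to set a record) are assumptions made for illustration, not the site's documented logic.

```python
from datetime import date

# (model, mean, publication date), copied from the leaderboard above.
ENTRIES = [
    ("PPLLaVA-7B-dpo", 3.73, date(2024, 11, 4)),
    ("VLM-RLAIF", 3.49, date(2024, 2, 6)),
    ("TS-LLaVA-34B", 3.38, date(2024, 11, 17)),
    ("PLLaVA-34B", 3.32, date(2024, 4, 25)),
    ("PPLLaVA-7B", 3.32, date(2024, 11, 4)),
    ("SlowFast-LLaVA-34B", 3.32, date(2024, 7, 22)),
    ("VideoGPT+", 3.28, date(2024, 6, 13)),
    ("IG-VLM-GPT4v", 3.17, date(2024, 3, 27)),
    ("ST-LLM-7B", 3.15, date(2024, 3, 30)),
    ("VideoChat2_HD_mistral", 3.10, date(2023, 11, 28)),
    ("CAT-7B", 3.07, date(2024, 3, 7)),
    ("LITA-13B", 3.04, date(2024, 3, 27)),
    ("LLaMA-VID-13B (2 Token)", 2.99, date(2023, 11, 28)),
    ("Chat-UniVi", 2.99, date(2023, 11, 14)),
    ("VideoChat2", 2.98, date(2023, 11, 28)),
    ("LLaMA-VID-7B (2 Token)", 2.89, date(2023, 11, 28)),
    ("VTimeLLM", 2.85, date(2023, 11, 30)),
    ("BT-Adapter", 2.69, date(2023, 9, 27)),
    ("BT-Adapter (zero-shot)", 2.46, date(2023, 9, 27)),
    ("Video-ChatGPT", 2.38, date(2023, 6, 8)),
    ("Video Chat", 2.29, date(2023, 5, 10)),
    ("LLaMA Adapter", 2.16, date(2023, 4, 28)),
    ("Video LLaMA", 1.98, date(2023, 6, 5)),
]


def running_best(entries):
    """Return, in date order, the models that set a new best mean.

    Assumed rule (hypothetical): on each publication date only the
    highest-scoring entry can set a record, and it must strictly beat
    every score published on earlier dates.
    """
    # Group entries by publication date.
    by_date = {}
    for model, mean, published in entries:
        by_date.setdefault(published, []).append((mean, model))

    best_so_far = float("-inf")
    leaders = []
    for published in sorted(by_date):
        top_mean, top_model = max(by_date[published], key=lambda pair: pair[0])
        if top_mean > best_so_far:
            best_so_far = top_mean
            leaders.append(top_model)
    return leaders


if __name__ == "__main__":
    for name in running_best(ENTRIES):
        print(name)
```

Under this assumed rule the function returns eight models, the same eight that carry the SOTA label in the entries above. That supports the reading, although the site's actual criterion could still differ.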