Video-based Generative Performance Benchmarking (Correctness of Information) on VideoInstruct

Metric: gpt-score (higher is better)

Results

#	Model	gpt-score	SOTA	Extra Data	Paper	Date	Code
1	PPLLaVA-7B	3.85	Yes	No	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	2024-11-04	Code
2	PLLaVA-34B	3.6	Yes	No	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
3	TS-LLaVA-34B	3.55	No	No	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	Code
4	SlowFast-LLaVA-34B	3.48	No	No	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	Code
5	VideoChat2_HD_mistral	3.4	Yes	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
6	VideoGPT+	3.27	No	No	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	Code
7	ST-LLM	3.23	No	No	ST-LLM: Large Language Models Are Effective Temporal Learners	2024-03-30	Code
8	MiniGPT4-video-7B	3.08	No	No	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	2024-04-04	Code
9	VideoChat2	3.02	No	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
10	Chat-UniVi	2.89	Yes	No	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
11	VTimeLLM	2.78	No	No	VTimeLLM: Empower LLM to Grasp Video Moments	2023-11-30	Code
12	MovieChat	2.76	Yes	No	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023-07-31	Code
13	BT-Adapter	2.68	No	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
14	Video-ChatGPT	2.4	Yes	No	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
15	Video Chat	2.32	Yes	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
16	BT-Adapter (zero-shot)	2.16	No	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
17	LLaMA Adapter	2.03	Yes	No	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Code
18	Video LLaMA	1.96	No	No	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	2023-06-05	Code
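
The page does not define the SOTA marker that some entries carry. One reading that appears consistent with the listed dates and scores is that an entry is marked when its gpt-score exceeds that of every entry with an earlier date, and that only the best entry is marked when several share a date. The Python sketch below checks that reading against the rows above. The function name `sota_at_release` and the tie-breaking rule are assumptions made for illustration, not something the leaderboard documents.

```python
from datetime import date

# (model, gpt-score, date) copied from the leaderboard rows above.
ENTRIES = [
    ("PPLLaVA-7B", 3.85, "2024-11-04"),
    ("PLLaVA-34B", 3.6, "2024-04-25"),
    ("TS-LLaVA-34B", 3.55, "2024-11-17"),
    ("SlowFast-LLaVA-34B", 3.48, "2024-07-22"),
    ("VideoChat2_HD_mistral", 3.4, "2023-11-28"),
    ("VideoGPT+", 3.27, "2024-06-13"),
    ("ST-LLM", 3.23, "2024-03-30"),
    ("MiniGPT4-video-7B", 3.08, "2024-04-04"),
    ("VideoChat2", 3.02, "2023-11-28"),
    ("Chat-UniVi", 2.89, "2023-11-14"),
    ("VTimeLLM", 2.78, "2023-11-30"),
    ("MovieChat", 2.76, "2023-07-31"),
    ("BT-Adapter", 2.68, "2023-09-27"),
    ("Video-ChatGPT", 2.4, "2023-06-08"),
    ("Video Chat", 2.32, "2023-05-10"),
    ("BT-Adapter (zero-shot)", 2.16, "2023-09-27"),
    ("LLaMA Adapter", 2.03, "2023-04-28"),
    ("Video LLaMA", 1.96, "2023-06-05"),
]


def sota_at_release(entries):
    """Return models whose score beats every earlier-dated entry (assumed rule)."""
    # Oldest first; within one date, highest score first so that only
    # the best entry of that day can set a new record.
    ordered = sorted(entries, key=lambda e: (date.fromisoformat(e[2]), -e[1]))
    best = float("-inf")
    flagged = []
    for model, score, _ in ordered:
        if score > best:
            flagged.append(model)
            best = score
    return flagged


if __name__ == "__main__":
    for model in sota_at_release(ENTRIES):
        print(model)
```

Under these assumptions the function returns eight models, in date order from LLaMA Adapter to PPLLaVA-7B, matching the entries marked SOTA on this page.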