Generative Visual Question Answering on VideoInstruct
Metric: gpt-score (higher is better)
Results
| # | Model | gpt-score | Extra Data | Paper | Date | SOTA |
|---|-------|-----------|------------|-------|------|------|
| 1 | PPLLaVA-7B | 4.21 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | SOTA |
| 2 | PLLaVA-34B | 3.9 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | SOTA |
| 3 | TS-LLaVA-34B | 3.86 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | |
| 4 | PPLLaVA-7B | 3.85 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | |
| 5 | SlowFast-LLaVA-34B | 3.84 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | |
| 6 | PPLLaVA-7B | 3.81 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | |
| 7 | ST-LLM | 3.74 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | SOTA |
| 8 | VideoGPT+ | 3.74 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | |
| 9 | TS-LLaVA-34B | 3.69 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | |
| 10 | VideoChat2_HD_mistral | 3.64 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | SOTA |
| 11 | PLLaVA-34B | 3.6 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | |
| 12 | MiniGPT4-video-7B | 3.57 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | |
| 13 | SlowFast-LLaVA-34B | 3.57 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | |
| 14 | PPLLaVA-7B | 3.56 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | |
| 15 | TS-LLaVA-34B | 3.55 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | |
| 16 | VideoChat2 | 3.51 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 17 | SlowFast-LLaVA-34B | 3.48 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | |
| 18 | Chat-UniVi | 3.46 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | SOTA |
| 19 | VTimeLLM | 3.4 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | |
| 20 | VideoChat2_HD_mistral | 3.4 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 21 | VideoGPT+ | 3.39 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | |
| 22 | BT-Adapter | 3.27 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | SOTA |
| 23 | VideoGPT+ | 3.27 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | |
| 24 | PLLaVA-34B | 3.25 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | |
| 25 | ST-LLM | 3.23 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | |
| 26 | PPLLaVA-7B | 3.21 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | |
| 27 | PLLaVA-34B | 3.2 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | |
| 28 | VideoGPT+ | 3.18 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | |
| 29 | VTimeLLM | 3.1 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | |
| 30 | MiniGPT4-video-7B | 3.08 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | |
| 31 | ST-LLM | 3.05 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | |
| 32 | TS-LLaVA-34B | 3.03 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | |
| 33 | VideoChat2 | 3.02 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 34 | MiniGPT4-video-7B | 3.02 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | |
| 35 | MovieChat | 3.01 | No | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | SOTA |
| 36 | SlowFast-LLaVA-34B | 2.96 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | |
| 37 | MovieChat | 2.93 | No | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | |
| 38 | ST-LLM | 2.93 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | |
| 39 | Chat-UniVi | 2.91 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | |
| 40 | BT-Adapter (zero-shot) | 2.89 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 41 | Chat-UniVi | 2.89 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | |
| 42 | VideoChat2 | 2.88 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 43 | VideoChat2_HD_mistral | 2.86 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 44 | VideoGPT+ | 2.83 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | |
| 45 | Chat-UniVi | 2.81 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | |
| 46 | VideoChat2 | 2.81 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 47 | ST-LLM | 2.81 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | |
| 48 | VTimeLLM | 2.78 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | |
| 49 | SlowFast-LLaVA-34B | 2.77 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | |
| 50 | TS-LLaVA-34B | 2.77 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | |
| 51 | MovieChat | 2.76 | No | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | |
| 52 | BT-Adapter | 2.69 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 53 | BT-Adapter | 2.68 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 54 | PLLaVA-34B | 2.67 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | |
| 55 | MiniGPT4-video-7B | 2.67 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | |
| 56 | VideoChat2 | 2.66 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 57 | MiniGPT4-video-7B | 2.65 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | |
| 58 | VideoChat2_HD_mistral | 2.65 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 59 | Video-ChatGPT | 2.62 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | SOTA |
| 60 | VideoChat2_HD_mistral | 2.62 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 61 | Video Chat | 2.53 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | SOTA |
| 62 | Video-ChatGPT | 2.52 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | |
| 63 | Video Chat | 2.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | |
| 64 | VTimeLLM | 2.49 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | |
| 65 | VTimeLLM | 2.47 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | |
| 66 | BT-Adapter (zero-shot) | 2.46 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 67 | BT-Adapter | 2.46 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 68 | MovieChat | 2.42 | No | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | |
| 69 | Video-ChatGPT | 2.4 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | |
| 70 | Chat-UniVi | 2.39 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | |
| 71 | Video-ChatGPT | 2.37 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | |
| 72 | BT-Adapter | 2.34 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 73 | Video Chat | 2.32 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | |
| 74 | LLaMA Adapter | 2.32 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | SOTA |
| 75 | LLaMA Adapter | 2.3 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | |
| 76 | MovieChat | 2.24 | No | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | |
| 77 | Video Chat | 2.24 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | |
| 78 | BT-Adapter (zero-shot) | 2.2 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 79 | Video LLaMA | 2.18 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | |
| 80 | Video LLaMA | 2.16 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | |
| 81 | BT-Adapter (zero-shot) | 2.16 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 82 | LLaMA Adapter | 2.15 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | |
| 83 | BT-Adapter (zero-shot) | 2.13 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 84 | LLaMA Adapter | 2.03 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | |
| 85 | Video-ChatGPT | 1.98 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | |
| 86 | LLaMA Adapter | 1.98 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | |
| 87 | Video LLaMA | 1.96 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | |
| 88 | Video Chat | 1.94 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | |
| 89 | Video LLaMA | 1.82 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | |
| 90 | Video LLaMA | 1.79 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | |