Generative Visual Question Answering on VideoInstruct
Metric: Contextual Understanding (higher is better)
Results
| # | Model | Contextual Understanding | Extra Data | Paper | Date | SOTA |
|---|---|---|---|---|---|---|
| 1 | PPLLaVA-7B-dpo | 4.21 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | SOTA |
| 2 | VLM-RLAIF | 4 | No | Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback | 2024-02-06 | SOTA |
| 3 | PLLaVA-34B | 3.9 | No | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | |
| 4 | PPLLaVA-7B | 3.88 | No | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | |
| 5 | VideoGPT+ | 3.74 | No | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | |
| 6 | ST-LLM-7B | 3.74 | No | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | |
| 7 | VideoChat2_HD_mistral | 3.72 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | SOTA |
| 8 | IG-VLM-GPT4v | 3.61 | No | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | |
| 9 | LLaMA-VID-13B (2 Token) | 3.6 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | |
| 10 | LLaMA-VID-7B (2 Token) | 3.53 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | |
| 11 | VideoChat2 | 3.51 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 12 | CAT-7B | 3.49 | No | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | 2024-03-07 | |
| 13 | Chat-UniVi | 3.46 | No | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | SOTA |
| 14 | LITA-13B | 3.43 | No | LITA: Language Instructed Temporal-Localization Assistant | 2024-03-27 | |
| 15 | VTimeLLM | 3.4 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | |
| 16 | BT-Adapter | 3.27 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | SOTA |
| 17 | BT-Adapter (zero-shot) | 2.89 | No | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | |
| 18 | Video-ChatGPT | 2.62 | No | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | SOTA |
| 19 | Video Chat | 2.53 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | SOTA |
| 20 | LLaMA Adapter | 2.3 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | SOTA |
| 21 | Video LLaMA | 2.16 | No | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | |