Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Generative Visual Question Answering on VideoInstruct

Metric: gpt-score (higher is better)

Results

| # | Model | gpt-score | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | PPLLaVA-7B | 4.21 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 2 | PLLaVA-34B | 3.9 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 3 | TS-LLaVA-34B | 3.86 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 4 | PPLLaVA-7B | 3.85 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 5 | SlowFast-LLaVA-34B | 3.84 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 6 | PPLLaVA-7B | 3.81 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 7 | ST-LLM | 3.74 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 8 | VideoGPT+ | 3.74 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 9 | TS-LLaVA-34B | 3.69 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 10 | VideoChat2_HD_mistral | 3.64 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 11 | PLLaVA-34B | 3.6 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 12 | MiniGPT4-video-7B | 3.57 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 13 | SlowFast-LLaVA-34B | 3.57 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 14 | PPLLaVA-7B | 3.56 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 15 | TS-LLaVA-34B | 3.55 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 16 | VideoChat2 | 3.51 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 17 | SlowFast-LLaVA-34B | 3.48 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 18 | Chat-UniVi | 3.46 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 19 | VTimeLLM | 3.4 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 20 | VideoChat2_HD_mistral | 3.4 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 21 | VideoGPT+ | 3.39 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 22 | BT-Adapter | 3.27 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 23 | VideoGPT+ | 3.27 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 24 | PLLaVA-34B | 3.25 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 25 | ST-LLM | 3.23 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 26 | PPLLaVA-7B | 3.21 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 27 | PLLaVA-34B | 3.2 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 28 | VideoGPT+ | 3.18 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 29 | VTimeLLM | 3.1 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 30 | MiniGPT4-video-7B | 3.08 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 31 | ST-LLM | 3.05 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 32 | TS-LLaVA-34B | 3.03 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 33 | VideoChat2 | 3.02 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 34 | MiniGPT4-video-7B | 3.02 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 35 | MovieChat | 3.01 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 36 | SlowFast-LLaVA-34B | 2.96 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 37 | MovieChat | 2.93 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 38 | ST-LLM | 2.93 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 39 | Chat-UniVi | 2.91 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 40 | BT-Adapter (zero-shot) | 2.89 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 41 | Chat-UniVi | 2.89 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 42 | VideoChat2 | 2.88 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 43 | VideoChat2_HD_mistral | 2.86 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 44 | VideoGPT+ | 2.83 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 45 | Chat-UniVi | 2.81 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 46 | VideoChat2 | 2.81 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 47 | ST-LLM | 2.81 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 48 | VTimeLLM | 2.78 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 49 | SlowFast-LLaVA-34B | 2.77 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 50 | TS-LLaVA-34B | 2.77 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 51 | MovieChat | 2.76 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 52 | BT-Adapter | 2.69 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 53 | BT-Adapter | 2.68 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 54 | PLLaVA-34B | 2.67 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 55 | MiniGPT4-video-7B | 2.67 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 56 | VideoChat2 | 2.66 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 57 | MiniGPT4-video-7B | 2.65 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 58 | VideoChat2_HD_mistral | 2.65 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 59 | Video-ChatGPT | 2.62 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 60 | VideoChat2_HD_mistral | 2.62 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 61 | Video Chat | 2.53 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 62 | Video-ChatGPT | 2.52 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 63 | Video Chat | 2.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 64 | VTimeLLM | 2.49 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 65 | VTimeLLM | 2.47 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 66 | BT-Adapter (zero-shot) | 2.46 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 67 | BT-Adapter | 2.46 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 68 | MovieChat | 2.42 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 69 | Video-ChatGPT | 2.4 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 70 | Chat-UniVi | 2.39 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 71 | Video-ChatGPT | 2.37 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 72 | BT-Adapter | 2.34 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 73 | Video Chat | 2.32 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 74 | LLaMA Adapter | 2.32 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 75 | LLaMA Adapter | 2.3 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 76 | MovieChat | 2.24 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 77 | Video Chat | 2.24 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 78 | BT-Adapter (zero-shot) | 2.2 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 79 | Video LLaMA | 2.18 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 80 | Video LLaMA | 2.16 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 81 | BT-Adapter (zero-shot) | 2.16 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 82 | LLaMA Adapter | 2.15 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 83 | BT-Adapter (zero-shot) | 2.13 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 84 | LLaMA Adapter | 2.03 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 85 | Video-ChatGPT | 1.98 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 86 | LLaMA Adapter | 1.98 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 87 | Video LLaMA | 1.96 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 88 | Video Chat | 1.94 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 89 | Video LLaMA | 1.82 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 90 | Video LLaMA | 1.79 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |