Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Generative Visual Question Answering on VideoInstruct

Metric: mean (higher is better)

Results

| # | Model | mean | Extra Data | Paper | Date | Code |
|---|-------|------|------------|-------|------|------|
| 1 | PPLLaVA-7B-dpo | 3.73 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 2 | VLM-RLAIF | 3.49 | No | Tuning Large Multimodal Models for Videos using ... | 2024-02-06 | Code |
| 3 | TS-LLaVA-34B | 3.38 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 4 | PLLaVA-34B | 3.32 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 5 | PPLLaVA-7B | 3.32 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 6 | SlowFast-LLaVA-34B | 3.32 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 7 | VideoGPT+ | 3.28 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 8 | IG-VLM-GPT4v | 3.17 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 9 | ST-LLM-7B | 3.15 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 10 | VideoChat2_HD_mistral | 3.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 11 | CAT-7B | 3.07 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 12 | LITA-13B | 3.04 | No | LITA: Language Instructed Temporal-Localization ... | 2024-03-27 | Code |
| 13 | LLaMA-VID-13B (2 Token) | 2.99 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 14 | Chat-UniVi | 2.99 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 15 | VideoChat2 | 2.98 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 16 | LLaMA-VID-7B (2 Token) | 2.89 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 17 | VTimeLLM | 2.85 | No | VTimeLLM: Empower LLM to Grasp Video Moments | 2023-11-30 | Code |
| 18 | BT-Adapter | 2.69 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 19 | BT-Adapter (zero-shot) | 2.46 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 20 | Video-ChatGPT | 2.38 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 21 | Video Chat | 2.29 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 22 | LLaMA Adapter | 2.16 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 23 | Video LLaMA | 1.98 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
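
Because the metric is "higher is better", the ranking above amounts to a descending sort on the mean score. The sketch below illustrates this on a handful of rows copied from the leaderboard. The `Entry` type and `rank` helper are hypothetical names introduced only for this example, and the tie-breaking rule (tied models keep their input order) is an assumption, since the page does not say how ties such as the three 3.32 entries are ordered.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical type for illustration)."""
    model: str
    mean: float


# A few rows copied from the leaderboard, deliberately out of order.
entries = [
    Entry("Video-ChatGPT", 2.38),
    Entry("PLLaVA-34B", 3.32),
    Entry("PPLLaVA-7B-dpo", 3.73),
    Entry("PPLLaVA-7B", 3.32),
    Entry("VLM-RLAIF", 3.49),
]


def rank(rows: List[Entry]) -> List[Tuple[int, Entry]]:
    """Rank rows by mean score, highest first.

    sorted() is stable even with reverse=True, so rows with equal
    scores keep their input order (an assumed tie-breaking rule).
    """
    ordered = sorted(rows, key=lambda e: e.mean, reverse=True)
    return list(enumerate(ordered, start=1))


for position, entry in rank(entries):
    print(f"{position:>2}  {entry.model:<16} {entry.mean:.2f}")
```

Tied models receive consecutive positions here, as they do in the table, rather than a shared rank.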