Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Question Answering on MVBench

Metric: Avg. (higher is better)

Results

| # | Model | Avg. | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | LinVT-Qwen2-VL (7B) | 69.3 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 2 | Tarsier (34B) | 67.6 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 3 | InternVideo2 | 67.2 | No | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 4 | LongVU (7B) | 66.9 | No | LongVU: Spatiotemporal Adaptive Compression for ... | 2024-10-22 | Code |
| 5 | Oryx(34B) | 64.7 | No | Oryx MLLM: On-Demand Spatial-Temporal Understand... | 2024-09-19 | Code |
| 6 | VideoLLaMA2 (72B) | 62 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 | Code |
| 7 | VideoChat-T (7B) | 59.9 | No | TimeSuite: Improving MLLMs for Long Video Unders... | 2024-10-25 | Code |
| 8 | mPLUG-Owl3(7B) | 59.5 | No | mPLUG-Owl3: Towards Long Image-Sequence Understa... | 2024-08-09 | Code |
| 9 | PPLLaVA (7b) | 59.2 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 10 | VideoGPT+ | 58.7 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 11 | PLLaVA | 58.1 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 12 | ST-LLM | 54.9 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 13 | VideoChat2 | 51.9 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 14 | HawkEye | 47.55 | No | HawkEye: Training Video-Text LLMs for Grounding ... | 2024-03-15 | Code |
| 15 | SPHINX-Plus | 39.7 | No | SPHINX-X: Scaling Data and Parameters for a Fami... | 2024-02-08 | Code |
| 16 | TimeChat | 38.5 | No | TimeChat: A Time-sensitive Multimodal Large Lang... | 2023-12-04 | Code |
| 17 | LLaVa | 36 | No | Visual Instruction Tuning | 2023-04-17 | Code |
| 18 | VideoChat | 35.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 19 | VideoLLaMA | 34.1 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 20 | Video-ChatGPT | 32.7 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 21 | InstructBLIP | 32.5 | No | InstructBLIP: Towards General-purpose Vision-Lan... | 2023-05-11 | Code |
| 22 | MiniGPT4 | 18.8 | No | MiniGPT-4: Enhancing Vision-Language Understandi... | 2023-04-20 | Code |