Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Question Answering on NExT-QA

Metric: Accuracy (higher is better)

Results

# | Model | Accuracy | Extra Data | Paper | Date | Code
--- | --- | --- | --- | --- | --- | ---
1 | LinVT-Qwen2-VL (7B) | 85.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code
2 | InternVL-2.5(8B) | 85.5 | No | Expanding Performance Boundaries of Open-Source ... | 2024-12-06 | Code
3 | VideoLLaMA3(7B) | 84.5 | No | VideoLLaMA 3: Frontier Multimodal Foundation Mod... | 2025-01-22 | Code
4 | PLM-8B | 84.1 | No | PerceptionLM: Open-Access Data and Models for De... | 2025-04-17 | Code
5 | BIMBA-LLaVA-Qwen2-7B | 83.73 | No | BIMBA: Selective-Scan Compression for Long-Range... | 2025-03-12 | Code
6 | PLM-3B | 83.4 | No | PerceptionLM: Open-Access Data and Models for De... | 2025-04-17 | Code
7 | LLaVA-Video | 83.2 | No | Video Instruction Tuning With Synthetic Data | 2024-10-03 | -
8 | NVILA(8B) | 82.2 | No | NVILA: Efficient Frontier Visual Language Models | 2024-12-05 | Code
9 | Oryx-1.5(7B) | 81.8 | No | Oryx MLLM: On-Demand Spatial-Temporal Understand... | 2024-09-19 | Code
10 | Qwen2-VL(7B) | 81.2 | No | Qwen2-VL: Enhancing Vision-Language Model's Perc... | 2024-09-18 | Code
11 | LongVILA(7B) | 80.7 | No | LongVILA: Scaling Long-Context Visual Language M... | 2024-08-19 | Code
12 | PLM-1B | 80.3 | No | PerceptionLM: Open-Access Data and Models for De... | 2025-04-17 | Code
13 | LLaVA-OV(72B) | 80.2 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code
14 | VideoMultiAgent (GPT-4o) | 79.6 | No | VideoMultiAgents: A Multi-Agent Framework for Vi... | 2025-04-25 | Code
15 | VideoChat2_HD_mistral | 79.5 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code
16 | LLaVA-OV(7B) | 79.4 | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code
17 | Tarsier (34B) | 79.2 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code
18 | LLaVA-NeXT-Interleave(14B) | 79.1 | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Vid... | 2024-07-10 | Code
19 | VideoChat2_mistral | 78.6 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code
20 | mPLUG-Owl3(8B) | 78.6 | No | mPLUG-Owl3: Towards Long Image-Sequence Understa... | 2024-08-09 | Code
21 | LLaVA-NeXT-Interleave(7B) | 78.2 | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Vid... | 2024-07-10 | Code
22 | AKEYS | 78.1 | No | Agentic Keyframe Search for Video Question Answe... | 2025-03-20 | Code
23 | LLaVA-NeXT-Interleave(DPO) | 77.9 | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Vid... | 2024-07-10 | Code
24 | Vamos | 77.3 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 | Code
25 | ViLA (3B) | 75.6 | No | ViLA: Efficient Video-Language Alignment for Vid... | 2023-12-13 | Code
26 | VideoLLaMA2.1(7B) | 75.6 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 | Code
27 | LLaMA-VQA (33B) | 75.5 | No | Large Language Models are Temporal and Causal Re... | 2023-10-24 | Code
28 | ENTER | 75.1 | No | ENTER: Event Based Interpretable Reasoning for V... | 2025-01-24 | -
29 | ViLA (3B, 4 frames) | 74.4 | No | ViLA: Efficient Video-Language Alignment for Vid... | 2023-12-13 | Code
30 | CREMA | 73.9 | No | CREMA: Generalizable and Efficient Video-Languag... | 2024-02-08 | Code
31 | SeViLA | 73.8 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code
32 | TS-LLaVA-34B | 73.6 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code
33 | TCR | 73.5 | No | Text-Conditioned Resampler For Long Form Video U... | 2023-12-19 | -
34 | VideoTree (GPT4) | 73.5 | No | VideoTree: Adaptive Tree-based Video Representat... | 2024-05-29 | Code
35 | LVNet(GPT-4o) | 72.9 | No | Too Many Frames, Not All Useful: Efficient Strat... | 2024-06-13 | Code
36 | LSTP | 72.1 | No | Efficient Temporal Extrapolation of Multimodal L... | 2024-02-25 | Code
37 | Mirasol3B | 72 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | -
38 | VideoAgent (GPT-4) | 71.3 | No | VideoAgent: Long-form Video Understanding with L... | 2024-03-15 | Code
39 | IG-VLM(LLaVA v1.6) | 70.9 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code
40 | VidCtx (7B) | 70.7 | No | VidCtx: Context-aware Video Question Answering w... | 2024-12-23 | Code
41 | MoReVQA(PaLM-2) | 69.2 | No | MoReVQA: Exploring Modular Reasoning Models for ... | 2024-04-09 | -
42 | VideoChat2 | 68.6 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code
43 | IG-VLM (GPT-4) | 68.6 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code
44 | TraveLER (GPT-4) | 68.2 | No | TraveLER: A Modular Multi-LMM Agent Framework fo... | 2024-04-01 | Code
45 | LLoVi (GPT-4) | 67.7 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code
46 | LongVA(32 frames) | 67.1 | No | Long Context Transfer from Language to Vision | 2024-06-24 | Code
47 | Q-ViD | 66.3 | No | Question-Instructed Visual Descriptions for Zero... | 2024-02-16 | Code
48 | ProViQ | 64.6 | No | Zero-Shot Video Question Answering with Procedur... | 2023-12-01 | -
49 | SlowFast-LLaVA-34B | 64.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code
50 | Sevila (4B) | 63.6 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code
51 | RTQ | 63.2 | No | RTQ: Rethinking Video-language Understanding Bas... | 2023-12-01 | Code
52 | HiTeA | 63.1 | Yes | HiTeA: Hierarchical Temporal-Aware Video-Languag... | 2022-12-30 | -
53 | VideoChat2 | 61.7 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code
54 | DeepStack-L(7B) | 61 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | -
55 | LangRepo (12B) | 60.9 | No | Language Repository for Long Video Understanding | 2024-03-21 | Code
56 | CoVGT(PT) | 60.7 | Yes | Contrastive Video Question Answering via Video G... | 2023-02-27 | Code
57 | SeViT | 60.6 | No | Semi-Parametric Video-Grounded Text Generation | 2023-01-27 | -
58 | ViperGPT(0-shot) | 60 | No | ViperGPT: Visual Inference via Python Execution ... | 2023-03-14 | Code
59 | CoVGT | 60 | No | Contrastive Video Question Answering via Video G... | 2023-02-27 | Code
60 | ViperGPT (GPT-3.5) | 60 | No | ViperGPT: Visual Inference via Python Execution ... | 2023-03-14 | Code
61 | GF | 58.83 | No | Glance and Focus: Memory Prompting for Multi-Eve... | 2024-01-03 | Code
62 | VFC | 58.6 | Yes | Verbs in Action: Improving verb understanding in... | 2023-04-13 | Code
63 | ATM | 58.3 | No | ATM: Action Temporality Modeling for Video Quest... | 2023-09-05 | -
64 | MIST | 57.2 | No | MIST: Multi-modal Iterative Spatial-Temporal Tra... | 2022-12-19 | Code
65 | VGT(PT) | 56.9 | Yes | Video Graph Transformer for Video Question Answe... | 2022-07-12 | Code
66 | PAXION | 56.9 | Yes | Paxion: Patching Action Knowledge in Video-Langu... | 2023-05-18 | Code
67 | MVU (13B) | 55.2 | No | Understanding Long Videos with Multimodal Langua... | 2024-03-25 | Code
68 | VGT | 55 | No | Video Graph Transformer for Video Question Answe... | 2022-07-12 | Code
69 | ATP | 54.3 | No | Revisiting the "Video" in Video-Language Underst... | 2022-06-03 | Code
70 | LLoVi (7B) | 54.3 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code
71 | P3D-G | 53.4 | No | (2.5+1)D Spatio-Temporal Scene Graphs for Video ... | 2022-02-18 | -
72 | VFC | 51.5 | No | Verbs in Action: Improving verb understanding in... | 2023-04-13 | Code
73 | HQGA | 51.4 | No | Video as Conditional Graph Hierarchy for Multi-G... | 2021-12-12 | Code
74 | Mistral (7B) | 51.1 | No | Mistral 7B | 2023-10-10 | Code