Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Question Answering on NExT-QA

Metric: Accuracy (higher is better)
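
As a minimal sketch of how an accuracy figure like those in the table can be computed, assume each question has one correct answer option and each model outputs one chosen option index. The function name `accuracy` and the sample data are hypothetical and illustrative only; this is not the benchmark's official evaluation script.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and reference option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical data: chosen option indices for five questions.
predicted = [0, 3, 2, 4, 1]
reference = [0, 3, 1, 4, 1]
print(f"{accuracy(predicted, reference):.1f}")
```

Reporting the value as a percentage matches the scale used on the leaderboard, where higher is better.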

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | VideoMultiAgent (GPT-4o) | 79.6 | No | VideoMultiAgents: A Multi-Agent Framework for Vi... | 2025-04-25 | Code |
| 2 | Tarsier (34B) | 79.2 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 3 | AKEYS | 78.1 | No | Agentic Keyframe Search for Video Question Answe... | 2025-03-20 | Code |
| 4 | ENTER | 75.1 | No | ENTER: Event Based Interpretable Reasoning for V... | 2025-01-24 | - |
| 5 | TS-LLaVA-34B | 73.6 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 6 | VideoTree (GPT4) | 73.5 | No | VideoTree: Adaptive Tree-based Video Representat... | 2024-05-29 | Code |
| 7 | LVNet(GPT-4o) | 72.9 | No | Too Many Frames, Not All Useful: Efficient Strat... | 2024-06-13 | Code |
| 8 | VideoAgent (GPT-4) | 71.3 | No | VideoAgent: Long-form Video Understanding with L... | 2024-03-15 | Code |
| 9 | IG-VLM(LLaVA v1.6) | 70.9 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 10 | VidCtx (7B) | 70.7 | No | VidCtx: Context-aware Video Question Answering w... | 2024-12-23 | Code |
| 11 | MoReVQA(PaLM-2) | 69.2 | No | MoReVQA: Exploring Modular Reasoning Models for ... | 2024-04-09 | - |
| 12 | IG-VLM (GPT-4) | 68.6 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 13 | TraveLER (GPT-4) | 68.2 | No | TraveLER: A Modular Multi-LMM Agent Framework fo... | 2024-04-01 | Code |
| 14 | LLoVi (GPT-4) | 67.7 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 15 | LongVA(32 frames) | 67.1 | No | Long Context Transfer from Language to Vision | 2024-06-24 | Code |
| 16 | Q-ViD | 66.3 | No | Question-Instructed Visual Descriptions for Zero... | 2024-02-16 | Code |
| 17 | ProViQ | 64.6 | No | Zero-Shot Video Question Answering with Procedur... | 2023-12-01 | - |
| 18 | SlowFast-LLaVA-34B | 64.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 19 | Sevila (4B) | 63.6 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code |
| 20 | VideoChat2 | 61.7 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 21 | DeepStack-L(7B) | 61 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | - |
| 22 | LangRepo (12B) | 60.9 | No | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 23 | ViperGPT (GPT-3.5) | 60 | No | ViperGPT: Visual Inference via Python Execution ... | 2023-03-14 | Code |
| 24 | MVU (13B) | 55.2 | No | Understanding Long Videos with Multimodal Langua... | 2024-03-25 | Code |
| 25 | LLoVi (7B) | 54.3 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 26 | VFC | 51.5 | No | Verbs in Action: Improving verb understanding in... | 2023-04-13 | Code |
| 27 | Mistral (7B) | 51.1 | No | Mistral 7B | 2023-10-10 | Code |