Question Answering on NExT-QA
Metric: Accuracy (higher is better)
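As a rough illustration of the metric, the sketch below computes accuracy as the percentage of questions answered correctly. It assumes a multiple-choice setup in which each prediction and each reference answer is an option index; the function name `accuracy` and the sample data are hypothetical and are not taken from the leaderboard or from any listed paper.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and reference option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical data: option indices chosen by a model vs. the ground truth.
preds = [0, 3, 2, 4, 1]
golds = [0, 3, 1, 4, 1]
print(round(accuracy(preds, golds), 1))  # 4 of 5 correct
```

The individual papers may apply their own evaluation splits or protocols, so this sketch shows only the arithmetic behind the reported numbers.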
| # | Model | Accuracy (%) | Extra Data | Paper | Date | Code |
|---|-------|--------------|------------|-------|------|------|
| 1 | VideoMultiAgent (GPT-4o) | 79.6 | No | VideoMultiAgents: A Multi-Agent Framework for Vi... | 2025-04-25 | Code |
| 2 | Tarsier (34B) | 79.2 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 3 | AKEYS | 78.1 | No | Agentic Keyframe Search for Video Question Answe... | 2025-03-20 | Code |
| 4 | ENTER | 75.1 | No | ENTER: Event Based Interpretable Reasoning for V... | 2025-01-24 | - |
| 5 | TS-LLaVA-34B | 73.6 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 6 | VideoTree (GPT4) | 73.5 | No | VideoTree: Adaptive Tree-based Video Representat... | 2024-05-29 | Code |
| 7 | LVNet(GPT-4o) | 72.9 | No | Too Many Frames, Not All Useful: Efficient Strat... | 2024-06-13 | Code |
| 8 | VideoAgent (GPT-4) | 71.3 | No | VideoAgent: Long-form Video Understanding with L... | 2024-03-15 | Code |
| 9 | IG-VLM(LLaVA v1.6) | 70.9 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 10 | VidCtx (7B) | 70.7 | No | VidCtx: Context-aware Video Question Answering w... | 2024-12-23 | Code |
| 11 | MoReVQA(PaLM-2) | 69.2 | No | MoReVQA: Exploring Modular Reasoning Models for ... | 2024-04-09 | - |
| 12 | IG-VLM (GPT-4) | 68.6 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 13 | TraveLER (GPT-4) | 68.2 | No | TraveLER: A Modular Multi-LMM Agent Framework fo... | 2024-04-01 | Code |
| 14 | LLoVi (GPT-4) | 67.7 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 15 | LongVA(32 frames) | 67.1 | No | Long Context Transfer from Language to Vision | 2024-06-24 | Code |
| 16 | Q-ViD | 66.3 | No | Question-Instructed Visual Descriptions for Zero... | 2024-02-16 | Code |
| 17 | ProViQ | 64.6 | No | Zero-Shot Video Question Answering with Procedur... | 2023-12-01 | - |
| 18 | SlowFast-LLaVA-34B | 64.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 19 | Sevila (4B) | 63.6 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code |
| 20 | VideoChat2 | 61.7 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 21 | DeepStack-L(7B) | 61.0 | No | DeepStack: Deeply Stacking Visual Tokens is Surp... | 2024-06-06 | - |
| 22 | LangRepo (12B) | 60.9 | No | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 23 | ViperGPT (GPT-3.5) | 60.0 | No | ViperGPT: Visual Inference via Python Execution ... | 2023-03-14 | Code |
| 24 | MVU (13B) | 55.2 | No | Understanding Long Videos with Multimodal Langua... | 2024-03-25 | Code |
| 25 | LLoVi (7B) | 54.3 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 26 | VFC | 51.5 | No | Verbs in Action: Improving verb understanding in... | 2023-04-13 | Code |
| 27 | Mistral (7B) | 51.1 | No | Mistral 7B | 2023-10-10 | Code |