Question Answering on NExT-QA
Metric: Accuracy (higher is better)
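As a rough illustration of the metric, the sketch below computes accuracy as the percentage of questions whose predicted answer matches the reference answer. It assumes a multiple-choice setting where each answer is an option index. The function name `accuracy` and the sample lists are hypothetical and are not taken from any listed paper's evaluation code.

```python
def accuracy(predictions, answers):
    """Return the percentage of predictions that equal the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and reference option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical example: five questions, each answered with an option index.
predicted = [0, 3, 1, 4, 2]
gold = [0, 3, 2, 4, 2]
print(f"{accuracy(predicted, gold):.1f}")  # 4 of 5 match, so this prints 80.0
```

Scores on the leaderboard are reported on the same 0 to 100 scale.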
Results
| # | Model | Accuracy | Extra Data | SOTA | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | VideoMultiAgent (GPT-4o) | 79.6 | No | SOTA | VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | 2025-04-25 | Code |
| 2 | Tarsier (34B) | 79.2 | No | SOTA | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Code |
| 3 | AKEYS | 78.1 | No | - | Agentic Keyframe Search for Video Question Answering | 2025-03-20 | Code |
| 4 | ENTER | 75.1 | No | - | ENTER: Event Based Interpretable Reasoning for VideoQA | 2025-01-24 | - |
| 5 | TS-LLaVA-34B | 73.6 | No | - | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | Code |
| 6 | VideoTree (GPT4) | 73.5 | No | SOTA | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | 2024-05-29 | Code |
| 7 | LVNet(GPT-4o) | 72.9 | No | - | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | 2024-06-13 | Code |
| 8 | VideoAgent (GPT-4) | 71.3 | No | SOTA | VideoAgent: Long-form Video Understanding with Large Language Model as Agent | 2024-03-15 | Code |
| 9 | IG-VLM(LLaVA v1.6) | 70.9 | No | - | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 10 | VidCtx (7B) | 70.7 | No | - | VidCtx: Context-aware Video Question Answering with Image Models | 2024-12-23 | Code |
| 11 | MoReVQA(PaLM-2) | 69.2 | No | - | MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | 2024-04-09 | - |
| 12 | IG-VLM (GPT-4) | 68.6 | No | - | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 13 | TraveLER (GPT-4) | 68.2 | No | - | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | 2024-04-01 | Code |
| 14 | LLoVi (GPT-4) | 67.7 | No | SOTA | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | Code |
| 15 | LongVA(32 frames) | 67.1 | No | - | Long Context Transfer from Language to Vision | 2024-06-24 | Code |
| 16 | Q-ViD | 66.3 | No | - | Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | 2024-02-16 | Code |
| 17 | ProViQ | 64.6 | No | SOTA | Zero-Shot Video Question Answering with Procedural Programs | 2023-12-01 | - |
| 18 | SlowFast-LLaVA-34B | 64.2 | No | - | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | Code |
| 19 | Sevila (4B) | 63.6 | No | SOTA | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 20 | VideoChat2 | 61.7 | No | - | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 21 | DeepStack-L(7B) | 61 | No | - | DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | 2024-06-06 | - |
| 22 | LangRepo (12B) | 60.9 | No | - | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 23 | ViperGPT (GPT-3.5) | 60 | No | SOTA | ViperGPT: Visual Inference via Python Execution for Reasoning | 2023-03-14 | Code |
| 24 | MVU (13B) | 55.2 | No | - | Understanding Long Videos with Multimodal Language Models | 2024-03-25 | Code |
| 25 | LLoVi (7B) | 54.3 | No | - | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | Code |
| 26 | VFC | 51.5 | No | - | Verbs in Action: Improving verb understanding in video-language models | 2023-04-13 | Code |
| 27 | Mistral (7B) | 51.1 | No | - | Mistral 7B | 2023-10-10 | Code |
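The page does not define its SOTA marker. The marked entries are, however, consistent with one reading: an entry is marked when its accuracy exceeds that of every entry with an earlier date. The sketch below reproduces that reading on a subset of the rows above. The function name `sota_frontier` and the tuple layout are hypothetical, and the rule itself is an assumption rather than the site's documented behavior.

```python
from datetime import date


def sota_frontier(entries):
    """Return names of entries that beat every earlier-dated entry.

    Each entry is a (name, accuracy, publication_date) tuple.
    """
    best = float("-inf")
    frontier = []
    # Walk the entries in chronological order and track the running best score.
    for name, score, _published in sorted(entries, key=lambda e: e[2]):
        if score > best:
            best = score
            frontier.append(name)
    return frontier


# Subset of the leaderboard rows (name, accuracy, date), copied from the table.
rows = [
    ("ViperGPT (GPT-3.5)", 60.0, date(2023, 3, 14)),
    ("VFC", 51.5, date(2023, 4, 13)),
    ("Sevila (4B)", 63.6, date(2023, 5, 11)),
    ("VideoChat2", 61.7, date(2023, 11, 28)),
    ("ProViQ", 64.6, date(2023, 12, 1)),
    ("LLoVi (GPT-4)", 67.7, date(2023, 12, 28)),
    ("Q-ViD", 66.3, date(2024, 2, 16)),
    ("VideoAgent (GPT-4)", 71.3, date(2024, 3, 15)),
]

for name in sota_frontier(rows):
    print(name)
```

On this subset the function returns ViperGPT (GPT-3.5), Sevila (4B), ProViQ, LLoVi (GPT-4), and VideoAgent (GPT-4), which are the entries carrying the SOTA marker among these eight rows.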