Video Question Answering on NExT-QA
Metric: Accuracy (higher is better)
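As a rough illustration of the metric, accuracy can be read as the percentage of questions answered correctly. The sketch below assumes a multiple-choice setting in which each prediction and each ground-truth answer is an option index. The function name `accuracy` and the toy data are hypothetical, and individual papers on this leaderboard may follow different evaluation protocols.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of questions whose predicted option matches the answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count exact matches between predicted and ground-truth option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices for four questions, three answered correctly.
print(accuracy([0, 3, 2, 1], [0, 3, 1, 1]))  # 75.0
```

Scores in the results are on this 0 to 100 scale, so a higher value means more questions answered correctly.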
Results
| # | Model | Accuracy | SOTA | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | LinVT-Qwen2-VL (7B) | 85.5 | SOTA | No | LinVT: Empower Your Image-level Large Language Model to Understand Videos | 2024-12-06 | Code |
| 2 | InternVL-2.5(8B) | 85.5 | | No | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | 2024-12-06 | Code |
| 3 | VideoLLaMA3(7B) | 84.5 | | No | VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding | 2025-01-22 | Code |
| 4 | PLM-8B | 84.1 | | No | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 2025-04-17 | Code |
| 5 | BIMBA-LLaVA-Qwen2-7B | 83.73 | | No | BIMBA: Selective-Scan Compression for Long-Range Video Question Answering | 2025-03-12 | Code |
| 6 | PLM-3B | 83.4 | | No | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 2025-04-17 | Code |
| 7 | LLaVA-Video | 83.2 | SOTA | No | Video Instruction Tuning With Synthetic Data | 2024-10-03 | - |
| 8 | NVILA(8B) | 82.2 | | No | NVILA: Efficient Frontier Visual Language Models | 2024-12-05 | Code |
| 9 | Oryx-1.5(7B) | 81.8 | SOTA | No | Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution | 2024-09-19 | Code |
| 10 | Qwen2-VL(7B) | 81.2 | SOTA | No | Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution | 2024-09-18 | Code |
| 11 | LongVILA(7B) | 80.7 | SOTA | No | LongVILA: Scaling Long-Context Visual Language Models for Long Videos | 2024-08-19 | Code |
| 12 | PLM-1B | 80.3 | | No | PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding | 2025-04-17 | Code |
| 13 | LLaVA-OV(72B) | 80.2 | SOTA | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 14 | VideoMultiAgent (GPT-4o) | 79.6 | | No | VideoMultiAgents: A Multi-Agent Framework for Video Question Answering | 2025-04-25 | Code |
| 15 | VideoChat2_HD_mistral | 79.5 | SOTA | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 16 | LLaVA-OV(7B) | 79.4 | | No | LLaVA-OneVision: Easy Visual Task Transfer | 2024-08-06 | Code |
| 17 | Tarsier (34B) | 79.2 | | No | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Code |
| 18 | LLaVA-NeXT-Interleave(14B) | 79.1 | | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | 2024-07-10 | Code |
| 19 | VideoChat2_mistral | 78.6 | | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 20 | mPLUG-Owl3(8B) | 78.6 | | No | mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models | 2024-08-09 | Code |
| 21 | LLaVA-NeXT-Interleave(7B) | 78.2 | | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | 2024-07-10 | Code |
| 22 | AKEYS | 78.1 | | No | Agentic Keyframe Search for Video Question Answering | 2025-03-20 | Code |
| 23 | LLaVA-NeXT-Interleave(DPO) | 77.9 | | No | LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models | 2024-07-10 | Code |
| 24 | Vamos | 77.3 | SOTA | No | Vamos: Versatile Action Models for Video Understanding | 2023-11-22 | Code |
| 25 | ViLA (3B) | 75.6 | | No | ViLA: Efficient Video-Language Alignment for Video Question Answering | 2023-12-13 | Code |
| 26 | VideoLLaMA2.1(7B) | 75.6 | | No | VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs | 2024-06-11 | Code |
| 27 | LLaMA-VQA (33B) | 75.5 | SOTA | No | Large Language Models are Temporal and Causal Reasoners for Video Question Answering | 2023-10-24 | Code |
| 28 | ENTER | 75.1 | | No | ENTER: Event Based Interpretable Reasoning for VideoQA | 2025-01-24 | - |
| 29 | ViLA (3B, 4 frames) | 74.4 | | No | ViLA: Efficient Video-Language Alignment for Video Question Answering | 2023-12-13 | Code |
| 30 | CREMA | 73.9 | | No | CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion | 2024-02-08 | Code |
| 31 | SeViLA | 73.8 | SOTA | No | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 32 | TS-LLaVA-34B | 73.6 | | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | Code |
| 33 | TCR | 73.5 | | No | Text-Conditioned Resampler For Long Form Video Understanding | 2023-12-19 | - |
| 34 | VideoTree (GPT4) | 73.5 | | No | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | 2024-05-29 | Code |
| 35 | LVNet(GPT-4o) | 72.9 | | No | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | 2024-06-13 | Code |
| 36 | LSTP | 72.1 | | No | Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge | 2024-02-25 | Code |
| 37 | Mirasol3B | 72 | | No | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | 2023-11-09 | - |
| 38 | VideoAgent (GPT-4) | 71.3 | | No | VideoAgent: Long-form Video Understanding with Large Language Model as Agent | 2024-03-15 | Code |
| 39 | IG-VLM(LLaVA v1.6) | 70.9 | | No | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 40 | VidCtx (7B) | 70.7 | | No | VidCtx: Context-aware Video Question Answering with Image Models | 2024-12-23 | Code |
| 41 | MoReVQA(PaLM-2) | 69.2 | | No | MoReVQA: Exploring Modular Reasoning Models for Video Question Answering | 2024-04-09 | - |
| 42 | VideoChat2 | 68.6 | | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 43 | IG-VLM (GPT-4) | 68.6 | | No | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 44 | TraveLER (GPT-4) | 68.2 | | No | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | 2024-04-01 | Code |
| 45 | LLoVi (GPT-4) | 67.7 | | No | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | Code |
| 46 | LongVA(32 frames) | 67.1 | | No | Long Context Transfer from Language to Vision | 2024-06-24 | Code |
| 47 | Q-ViD | 66.3 | | No | Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering | 2024-02-16 | Code |
| 48 | ProViQ | 64.6 | | No | Zero-Shot Video Question Answering with Procedural Programs | 2023-12-01 | - |
| 49 | SlowFast-LLaVA-34B | 64.2 | | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | Code |
| 50 | Sevila (4B) | 63.6 | | No | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 51 | RTQ | 63.2 | | No | RTQ: Rethinking Video-language Understanding Based on Image-text Model | 2023-12-01 | Code |
| 52 | HiTeA | 63.1 | SOTA | Yes | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | 2022-12-30 | - |
| 53 | VideoChat2 | 61.7 | | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 54 | DeepStack-L(7B) | 61 | | No | DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs | 2024-06-06 | - |
| 55 | LangRepo (12B) | 60.9 | | No | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 56 | CoVGT(PT) | 60.7 | | Yes | Contrastive Video Question Answering via Video Graph Transformer | 2023-02-27 | Code |
| 57 | SeViT | 60.6 | | No | Semi-Parametric Video-Grounded Text Generation | 2023-01-27 | - |
| 58 | ViperGPT(0-shot) | 60 | | No | ViperGPT: Visual Inference via Python Execution for Reasoning | 2023-03-14 | Code |
| 59 | CoVGT | 60 | | No | Contrastive Video Question Answering via Video Graph Transformer | 2023-02-27 | Code |
| 60 | ViperGPT (GPT-3.5) | 60 | | No | ViperGPT: Visual Inference via Python Execution for Reasoning | 2023-03-14 | Code |
| 61 | GF | 58.83 | | No | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | 2024-01-03 | Code |
| 62 | VFC | 58.6 | | Yes | Verbs in Action: Improving verb understanding in video-language models | 2023-04-13 | Code |
| 63 | ATM | 58.3 | | No | ATM: Action Temporality Modeling for Video Question Answering | 2023-09-05 | - |
| 64 | MIST | 57.2 | SOTA | No | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | 2022-12-19 | Code |
| 65 | VGT(PT) | 56.9 | SOTA | Yes | Video Graph Transformer for Video Question Answering | 2022-07-12 | Code |
| 66 | PAXION | 56.9 | | Yes | Paxion: Patching Action Knowledge in Video-Language Foundation Models | 2023-05-18 | Code |
| 67 | MVU (13B) | 55.2 | | No | Understanding Long Videos with Multimodal Language Models | 2024-03-25 | Code |
| 68 | VGT | 55 | | No | Video Graph Transformer for Video Question Answering | 2022-07-12 | Code |
| 69 | ATP | 54.3 | SOTA | No | Revisiting the "Video" in Video-Language Understanding | 2022-06-03 | Code |
| 70 | LLoVi (7B) | 54.3 | | No | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | Code |
| 71 | P3D-G | 53.4 | SOTA | No | (2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering | 2022-02-18 | - |
| 72 | VFC | 51.5 | | No | Verbs in Action: Improving verb understanding in video-language models | 2023-04-13 | Code |
| 73 | HQGA | 51.4 | SOTA | No | Video as Conditional Graph Hierarchy for Multi-Granular Question Answering | 2021-12-12 | Code |
| 74 | Mistral (7B) | 51.1 | | No | Mistral 7B | 2023-10-10 | Code |