Video Question Answering on NExT-QA

Metric: Accuracy (higher is better)

Results

#	Model	Accuracy	Extra Data	SOTA	Paper	Date	Code
1	LinVT-Qwen2-VL (7B)	85.5	No	SOTA	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06	Code
2	InternVL-2.5(8B)	85.5	No	-	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	2024-12-06	Code
3	VideoLLaMA3(7B)	84.5	No	-	VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding	2025-01-22	Code
4	PLM-8B	84.1	No	-	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025-04-17	Code
5	BIMBA-LLaVA-Qwen2-7B	83.73	No	-	BIMBA: Selective-Scan Compression for Long-Range Video Question Answering	2025-03-12	Code
6	PLM-3B	83.4	No	-	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025-04-17	Code
7	LLaVA-Video	83.2	No	SOTA	Video Instruction Tuning With Synthetic Data	2024-10-03	-
8	NVILA(8B)	82.2	No	-	NVILA: Efficient Frontier Visual Language Models	2024-12-05	Code
9	Oryx-1.5(7B)	81.8	No	SOTA	Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution	2024-09-19	Code
10	Qwen2-VL(7B)	81.2	No	SOTA	Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution	2024-09-18	Code
11	LongVILA(7B)	80.7	No	SOTA	LongVILA: Scaling Long-Context Visual Language Models for Long Videos	2024-08-19	Code
12	PLM-1B	80.3	No	-	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	2025-04-17	Code
13	LLaVA-OV(72B)	80.2	No	SOTA	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
14	VideoMultiAgent (GPT-4o)	79.6	No	-	VideoMultiAgents: A Multi-Agent Framework for Video Question Answering	2025-04-25	Code
15	VideoChat2_HD_mistral	79.5	No	SOTA	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
16	LLaVA-OV(7B)	79.4	No	-	LLaVA-OneVision: Easy Visual Task Transfer	2024-08-06	Code
17	Tarsier (34B)	79.2	No	-	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
18	LLaVA-NeXT-Interleave(14B)	79.1	No	-	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	2024-07-10	Code
19	VideoChat2_mistral	78.6	No	-	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
20	mPLUG-Owl3(8B)	78.6	No	-	mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models	2024-08-09	Code
21	LLaVA-NeXT-Interleave(7B)	78.2	No	-	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	2024-07-10	Code
22	AKEYS	78.1	No	-	Agentic Keyframe Search for Video Question Answering	2025-03-20	Code
23	LLaVA-NeXT-Interleave(DPO)	77.9	No	-	LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models	2024-07-10	Code
24	Vamos	77.3	No	SOTA	Vamos: Versatile Action Models for Video Understanding	2023-11-22	Code
25	ViLA (3B)	75.6	No	-	ViLA: Efficient Video-Language Alignment for Video Question Answering	2023-12-13	Code
26	VideoLLaMA2.1(7B)	75.6	No	-	VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	2024-06-11	Code
27	LLaMA-VQA (33B)	75.5	No	SOTA	Large Language Models are Temporal and Causal Reasoners for Video Question Answering	2023-10-24	Code
28	ENTER	75.1	No	-	ENTER: Event Based Interpretable Reasoning for VideoQA	2025-01-24	-
29	ViLA (3B, 4 frames)	74.4	No	-	ViLA: Efficient Video-Language Alignment for Video Question Answering	2023-12-13	Code
30	CREMA	73.9	No	-	CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion	2024-02-08	Code
31	SeViLA	73.8	No	SOTA	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11	Code
32	TS-LLaVA-34B	73.6	No	-	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	Code
33	TCR	73.5	No	-	Text-Conditioned Resampler For Long Form Video Understanding	2023-12-19	-
34	VideoTree (GPT4)	73.5	No	-	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	2024-05-29	Code
35	LVNet(GPT-4o)	72.9	No	-	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA	2024-06-13	Code
36	LSTP	72.1	No	-	Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge	2024-02-25	Code
37	Mirasol3B	72	No	-	Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	2023-11-09	-
38	VideoAgent (GPT-4)	71.3	No	-	VideoAgent: Long-form Video Understanding with Large Language Model as Agent	2024-03-15	Code
39	IG-VLM(LLaVA v1.6)	70.9	No	-	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	Code
40	VidCtx (7B)	70.7	No	-	VidCtx: Context-aware Video Question Answering with Image Models	2024-12-23	Code
41	MoReVQA(PaLM-2)	69.2	No	-	MoReVQA: Exploring Modular Reasoning Models for Video Question Answering	2024-04-09	-
42	VideoChat2	68.6	No	-	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
43	IG-VLM (GPT-4)	68.6	No	-	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	Code
44	TraveLER (GPT-4)	68.2	No	-	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering	2024-04-01	Code
45	LLoVi (GPT-4)	67.7	No	-	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28	Code
46	LongVA(32 frames)	67.1	No	-	Long Context Transfer from Language to Vision	2024-06-24	Code
47	Q-ViD	66.3	No	-	Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering	2024-02-16	Code
48	ProViQ	64.6	No	-	Zero-Shot Video Question Answering with Procedural Programs	2023-12-01	-
49	SlowFast-LLaVA-34B	64.2	No	-	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	Code
50	Sevila (4B)	63.6	No	-	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11	Code
51	RTQ	63.2	No	-	RTQ: Rethinking Video-language Understanding Based on Image-text Model	2023-12-01	Code
52	HiTeA	63.1	Yes	SOTA	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	2022-12-30	-
53	VideoChat2	61.7	No	-	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
54	DeepStack-L(7B)	61	No	-	DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs	2024-06-06	-
55	LangRepo (12B)	60.9	No	-	Language Repository for Long Video Understanding	2024-03-21	Code
56	CoVGT(PT)	60.7	Yes	-	Contrastive Video Question Answering via Video Graph Transformer	2023-02-27	Code
57	SeViT	60.6	No	-	Semi-Parametric Video-Grounded Text Generation	2023-01-27	-
58	ViperGPT(0-shot)	60	No	-	ViperGPT: Visual Inference via Python Execution for Reasoning	2023-03-14	Code
59	CoVGT	60	No	-	Contrastive Video Question Answering via Video Graph Transformer	2023-02-27	Code
60	ViperGPT (GPT-3.5)	60	No	-	ViperGPT: Visual Inference via Python Execution for Reasoning	2023-03-14	Code
61	GF	58.83	No	-	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering	2024-01-03	Code
62	VFC	58.6	Yes	-	Verbs in Action: Improving verb understanding in video-language models	2023-04-13	Code
63	ATM	58.3	No	-	ATM: Action Temporality Modeling for Video Question Answering	2023-09-05	-
64	MIST	57.2	No	SOTA	MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering	2022-12-19	Code
65	VGT(PT)	56.9	Yes	SOTA	Video Graph Transformer for Video Question Answering	2022-07-12	Code
66	PAXION	56.9	Yes	-	Paxion: Patching Action Knowledge in Video-Language Foundation Models	2023-05-18	Code
67	MVU (13B)	55.2	No	-	Understanding Long Videos with Multimodal Language Models	2024-03-25	Code
68	VGT	55	No	-	Video Graph Transformer for Video Question Answering	2022-07-12	Code
69	ATP	54.3	No	SOTA	Revisiting the "Video" in Video-Language Understanding	2022-06-03	Code
70	LLoVi (7B)	54.3	No	-	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28	Code
71	P3D-G	53.4	No	SOTA	(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering	2022-02-18	-
72	VFC	51.5	No	-	Verbs in Action: Improving verb understanding in video-language models	2023-04-13	Code
73	HQGA	51.4	No	SOTA	Video as Conditional Graph Hierarchy for Multi-Granular Question Answering	2021-12-12	Code
74	Mistral (7B)	51.1	No	-	Mistral 7B	2023-10-10	Code
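
The SOTA markers on the entries above appear consistent with a simple rule: an entry is marked if its accuracy is strictly higher than that of every entry listed before it in date order. The leaderboard does not document this rule, so treat it as an assumption. The sketch below applies it to the seven earliest-dated entries (all entries dated 2022-12-30 or earlier), with values copied from the results; the function name `sota_progression` and the tie-break by leaderboard rank for same-day entries are illustrative choices, not part of the source.

```python
from datetime import date

# (rank, model, accuracy, date) for the seven earliest-dated entries above.
ROWS = [
    (52, "HiTeA", 63.1, date(2022, 12, 30)),
    (64, "MIST", 57.2, date(2022, 12, 19)),
    (65, "VGT(PT)", 56.9, date(2022, 7, 12)),
    (68, "VGT", 55.0, date(2022, 7, 12)),
    (69, "ATP", 54.3, date(2022, 6, 3)),
    (71, "P3D-G", 53.4, date(2022, 2, 18)),
    (73, "HQGA", 51.4, date(2021, 12, 12)),
]


def sota_progression(rows):
    """Return models that beat every earlier entry (assumed SOTA rule).

    Entries are visited in date order; same-day entries are visited in
    leaderboard-rank order, which is an assumption made for this sketch.
    """
    ordered = sorted(rows, key=lambda r: (r[3], r[0]))
    best = float("-inf")
    flagged = []
    for _rank, model, accuracy, _day in ordered:
        if accuracy > best:  # strict improvement over the running best
            best = accuracy
            flagged.append(model)
    return flagged


print(sota_progression(ROWS))
```

On this subset the sketch flags HQGA, P3D-G, ATP, VGT(PT), MIST, and HiTeA and skips VGT, which agrees with the markers shown for those entries.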