Video Question Answering on IntentQA

Metric: Accuracy (higher is better)
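
As a minimal sketch of how an accuracy score like the ones listed here could be computed, assuming each question has exactly one correct option and the score is reported as a percentage: the function name `accuracy` and the toy data below are hypothetical and not taken from any of the listed papers. Under the further assumption of five answer options per question (the page does not state the number), uniform guessing would give 20, which matches the Random row.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Count positions where the predicted option equals the gold option.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: index of the chosen option for five questions.
preds = [0, 2, 1, 4, 3]
gold = [0, 2, 3, 4, 1]
print(accuracy(preds, gold))  # 3 of 5 correct, prints 60.0
```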

Results

#	Model	Accuracy	Extra Data	SOTA	Paper	Date	Code
1	ENTER	71.5	No	Yes	ENTER: Event Based Interpretable Reasoning for VideoQA	2025-01-24	-
2	LVNet	71.1	No	Yes	Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA	2024-06-13	Code
3	TS-LLaVA-34B	67.9	No	No	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	Code
4	VidCtx (7B)	67.1	No	No	VidCtx: Context-aware Video Question Answering with Image Models	2024-12-23	Code
5	VideoTree (GPT4)	66.9	No	Yes	VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos	2024-05-29	Code
6	IG-VLM	65.3	No	Yes	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	Code
7	LLoVi (GPT-4)	64	No	Yes	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28	Code
8	SeViLA (4B)	60.9	Yes	Yes	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11	Code
9	SlowFast-LLaVA-34B	60.1	No	No	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	Code
10	LangRepo (12B)	59.1	No	No	Language Repository for Long Video Understanding	2024-03-21	Code
11	LLoVi (7B)	53.6	No	No	A Simple LLM Framework for Long-Range Video Question-Answering	2023-12-28	Code
12	Mistral (7B)	50.4	No	No	Mistral 7B	2023-10-10	Code
13	Random	20	No	Yes	CREPE: Can Vision-Language Foundation Models Reason Compositionally?	2022-12-13	Code