Question Answering on IntentQA
Metric: Accuracy (higher is better)
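As a rough illustration of the metric, the sketch below computes accuracy as the percentage of questions whose predicted answer matches the reference answer. The function name `accuracy` and the toy data are hypothetical and are not taken from the IntentQA evaluation code; the page itself does not say how answers are encoded, so integer option indices are an assumption here.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count positions where the predicted option equals the reference option.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: index of the chosen option and of the correct option.
preds = [0, 3, 2, 4, 1]
gold = [0, 3, 1, 4, 2]
print(accuracy(preds, gold))  # 3 of 5 match
```

Under uniform guessing among k options the expected score is 100/k, so the Random entry at 20 would be consistent with five answer options per question, although the page does not state the option count.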
Results
| # | Model | Accuracy | SOTA | Extra data | Paper | Date | Code |
|---|-------|----------|------|------------|-------|------|------|
| 1 | ENTER | 71.5 | Yes | No | ENTER: Event Based Interpretable Reasoning for VideoQA | 2025-01-24 | - |
| 2 | LVNet | 71.1 | Yes | No | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | 2024-06-13 | Code |
| 3 | TS-LLaVA-34B | 67.9 | No | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | Code |
| 4 | VidCtx (7B) | 67.1 | No | No | VidCtx: Context-aware Video Question Answering with Image Models | 2024-12-23 | Code |
| 5 | VideoTree (GPT4) | 66.9 | Yes | No | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | 2024-05-29 | Code |
| 6 | IG-VLM | 65.3 | Yes | No | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 7 | LLoVi (GPT-4) | 64 | Yes | No | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | Code |
| 8 | SeViLA (4B) | 60.9 | Yes | Yes | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 9 | SlowFast-LLaVA-34B | 60.1 | No | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | Code |
| 10 | LangRepo (12B) | 59.1 | No | No | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 11 | LLoVi (7B) | 53.6 | No | No | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | Code |
| 12 | Mistral (7B) | 50.4 | No | No | Mistral 7B | 2023-10-10 | Code |
| 13 | Random | 20 | Yes | No | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | 2022-12-13 | Code |