Video Question Answering on TVQA
Metric: Accuracy (higher is better)
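As a rough illustration of the metric, the sketch below assumes accuracy is reported as the percentage of multiple-choice questions answered correctly. The function name, the answer-index encoding, and the sample data are hypothetical and are not taken from any of the listed papers or their evaluation code.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of questions whose predicted answer index matches the label."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # Count positions where the predicted index equals the gold index.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical example: five questions, candidate answers indexed 0-4.
predicted = [0, 3, 2, 4, 1]
gold = [0, 3, 1, 4, 1]
print(f"{accuracy(predicted, gold):.1f}")  # 4 of 5 correct, prints 80.0
```

Under this assumption, a score such as 82.2 would mean that roughly 82 of every 100 evaluated questions were answered correctly.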
Results
| # | Model | Accuracy | Extra Data | SOTA | Paper | Date | Code |
|---|-------|----------|------------|------|-------|------|------|
| 1 | LLaMA-VQA | 82.2 | No | SOTA | Large Language Models are Temporal and Causal Reasoners for Video Question Answering | 2023-10-24 | Code |
| 2 | FrozenBiLM | 82 | Yes | SOTA | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 3 | VindLU | 79 | Yes | - | VindLU: A Recipe for Effective Video-and-Language Pretraining | 2022-12-09 | Code |
| 4 | iPerceive (Chadha et al., 2020) | 76.96 | No | SOTA | iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering | 2020-11-16 | - |
| 5 | Hero w/ pre-training | 74.24 | No | SOTA | HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training | 2020-05-01 | Code |
| 6 | STAGE (Lei et al., 2019) | 70.5 | No | SOTA | TVQA+: Spatio-Temporal Grounding for Video Question Answering | 2019-04-25 | Code |
| 7 | FrozenBiLM (with speech) | 59.7 | No | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 8 | IG-VLM (no speech, GPT-4V) | 57.8 | No | - | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 9 | MiniGPT4-video-7B | 54.21 | No | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | Code |
| 10 | VideoChat_HD_mistral (no speech) | 50.6 | No | - | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 11 | VideoChat_mistral (no speech) | 46.4 | No | - | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 12 | VideoChat2 (no speech) | 40.6 | No | - | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 13 | SEVILA (no speech) | 38.2 | No | - | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 14 | InternVideo (no speech) | 35.9 | No | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 | Code |
| 15 | FrozenBiLM (no speech) | 29.7 | No | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |