Video Question Answering on ActivityNet-QA
Metric: Accuracy (higher is better)
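The page does not say how each submission scores an answer as correct, and individual papers may use different protocols. As a minimal sketch only, assuming case-insensitive exact match between a predicted answer and a single reference answer, an accuracy percentage like the ones listed here could be computed as follows. The function name and the sample answers are hypothetical and are not taken from ActivityNet-QA.

```python
def accuracy(predictions, references):
    """Percentage of predictions that match their reference answer.

    Assumption: a prediction counts as correct when it equals the
    reference after trimming whitespace and lowercasing.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical answers for illustration only.
preds = ["yes", "two", "Red", "outdoors"]
refs = ["yes", "three", "red", "indoors"]
score = accuracy(preds, refs)
print(f"{score:.1f}")
```

Because the matching rule is an assumption, scores computed this way are not necessarily comparable with the numbers reported by the listed papers.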
Results
| # | Model | Accuracy | Extra Data | SOTA | Paper | Date | Code |
|---|---|---|---|---|---|---|---|
| 1 | Tarsier (34B) | 61.6 | No | SOTA | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | Code |
| 2 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | No | SOTA | Composing Ensembles of Pre-trained Models via Iterative Consensus | 2022-10-20 | - |
| 3 | PLLaVA (34B) | 60.9 | No | - | PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning | 2024-04-25 | Code |
| 4 | PPLLaVA-7B | 60.7 | No | - | PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance | 2024-11-04 | Code |
| 5 | LinVT-Qwen2-VL(7B) | 60.1 | No | - | LinVT: Empower Your Image-level Large Language Model to Understand Videos | 2024-12-06 | Code |
| 6 | SlowFast-LLaVA-34B | 59.2 | No | - | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | Code |
| 7 | TS-LLaVA-34B | 58.9 | No | - | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | Code |
| 8 | GPT-2 + CLIP-32 (Zero-Shot) | 58.4 | No | - | Composing Ensembles of Pre-trained Models via Iterative Consensus | 2022-10-20 | - |
| 9 | IG-VLM | 58.4 | No | - | An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM | 2024-03-27 | Code |
| 10 | VideoCoCa | 56.1 | Yes | - | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | 2022-12-09 | - |
| 11 | LLaVA-Mini | 53.5 | No | - | LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token | 2025-01-07 | Code |
| 12 | Flash-VStream | 51.9 | No | - | Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams | 2024-06-12 | Code |
| 13 | Mirasol3B | 51.13 | No | - | Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities | 2023-11-09 | - |
| 14 | ST-LLM | 50.9 | No | - | ST-LLM: Large Language Models Are Effective Temporal Learners | 2024-03-30 | Code |
| 15 | VideoGPT+ | 50.6 | No | - | VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding | 2024-06-13 | Code |
| 16 | VAST | 50.4 | Yes | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023-05-29 | Code |
| 17 | CAT-7B | 50.2 | No | - | CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios | 2024-03-07 | Code |
| 18 | Video-LaVIT | 50.1 | No | - | Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | 2024-02-05 | Code |
| 19 | COSA | 49.9 | Yes | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | 2023-06-15 | Code |
| 20 | MA-LMM | 49.8 | No | - | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | 2024-04-08 | Code |
| 21 | VideoChat2 | 49.1 | No | - | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 22 | VideoChat2 | 49.1 | No | - | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | Code |
| 23 | VALOR | 48.6 | Yes | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023-04-17 | Code |
| 24 | UMT-L (ViT-L/16) | 47.9 | Yes | - | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | 2023-03-28 | Code |
| 25 | LLaMA-VID-13B (2 Token) | 47.5 | No | - | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | Code |
| 26 | LLaMA-VID-13B (2 Token) | 47.5 | No | - | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | Code |
| 27 | LLaMA-VID-7B (2 Token) | 47.4 | No | - | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | Code |
| 28 | LLaMA-VID-7B (2 Token) | 47.4 | No | - | LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | 2023-11-28 | Code |
| 29 | Chat-UniVi-13B | 46.4 | No | - | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | Code |
| 30 | Chat-UniVi-13B | 46.4 | No | - | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | Code |
| 31 | MiniGPT4-video-7B | 46.3 | No | - | MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | 2024-04-04 | Code |
| 32 | BT-Adapter (zero-shot) | 46.1 | No | - | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | Code |
| 33 | Chat-UniVi | 46.1 | No | - | Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | 2023-11-14 | Code |
| 34 | BT-Adapter (zero-shot) | 46.1 | No | - | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning | 2023-09-27 | Code |
| 35 | MovieChat | 45.7 | No | - | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | Code |
| 36 | MovieChat | 45.7 | No | - | MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | 2023-07-31 | Code |
| 37 | Video-LLaVA | 45.3 | No | - | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | 2023-11-16 | Code |
| 38 | Video-LLaVA | 45.3 | No | - | Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | 2023-11-16 | Code |
| 39 | TESTA (ViT-B/16) | 45 | Yes | - | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | 2023-10-29 | Code |
| 40 | FrozenBiLM+ | 44.8 | No | - | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 41 | VindLU | 44.7 | Yes | - | VindLU: A Recipe for Effective Video-and-Language Pretraining | 2022-12-09 | Code |
| 42 | Singularity-temporal | 44.1 | Yes | SOTA | Revealing Single Frame Bias for Video-and-Language Learning | 2022-06-07 | Code |
| 43 | Elysium | 43.4 | No | - | Elysium: Exploring Object-level Perception in Videos via MLLM | 2024-03-25 | Code |
| 44 | FrozenBiLM | 43.2 | Yes | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 45 | Singularity | 43.1 | Yes | - | Revealing Single Frame Bias for Video-and-Language Learning | 2022-06-07 | Code |
| 46 | Text + Text (no Multimodal Pretext Training) | 41.4 | No | SOTA | Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval | 2022-06-05 | Code |
| 47 | All-in-one+ | 40 | No | - | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 48 | VIOLET+ | 39.7 | No | - | Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models | 2023-08-18 | Code |
| 49 | Just Ask (fine-tune) | 38.9 | No | SOTA | Just Ask: Learning to Answer Questions from Millions of Narrated Videos | 2020-12-01 | Code |
| 50 | LocVLM-Vid-B+ | 38.2 | No | - | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | 2024-04-11 | Code |
| 51 | LocVLM-Vid-B | 37.4 | No | - | Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs | 2024-04-11 | Code |
| 52 | Video-ChatGPT | 35.2 | No | - | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | Code |
| 53 | Video-ChatGPT | 35.2 | No | - | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023-06-08 | Code |
| 54 | LLaMA Adapter V2 | 34.2 | No | - | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | Code |
| 55 | LLaMA Adapter | 34.2 | No | - | LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model | 2023-04-28 | Code |
| 56 | E-SA | 31.8 | No | SOTA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019-06-06 | Code |
| 57 | E-MN | 27.1 | No | - | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019-06-06 | Code |
| 58 | Video Chat | 26.5 | No | - | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 59 | Video Chat | 26.5 | No | - | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 60 | FrozenBiLM (0-shot) | 25.9 | No | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 61 | E-VQA | 25.1 | No | - | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019-06-06 | Code |
| 62 | FrozenBiLM | 24.7 | No | - | Zero-Shot Video Question Answering via Frozen Bidirectional Language Models | 2022-06-16 | Code |
| 63 | Video LLaMA | 12.4 | No | - | Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding | 2023-06-05 | Code |
| 64 | Just Ask (0-shot) | 12.2 | No | - | Just Ask: Learning to Answer Questions from Millions of Narrated Videos | 2020-12-01 | Code |