Question Answering on ActivityNet-QA
Metric: Accuracy (higher is better)
Results
| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | Tarsier (34B) | 61.6 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 2 | PLLaVA (34B) | 60.9 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 3 | PPLLaVA-7B | 60.7 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 4 | LinVT-Qwen2-VL(7B) | 60.1 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 5 | SlowFast-LLaVA-34B | 59.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 6 | TS-LLaVA-34B | 58.9 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 7 | IG-VLM | 58.4 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 8 | LLaVA-Mini | 53.5 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 9 | Flash-VStream | 51.9 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 10 | ST-LLM | 50.9 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 11 | VideoGPT+ | 50.6 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 12 | CAT-7B | 50.2 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 13 | Video-LaVIT | 50.1 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 14 | VideoChat2 | 49.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 15 | LLaMA-VID-13B (2 Token) | 47.5 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 16 | LLaMA-VID-7B (2 Token) | 47.4 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 17 | Chat-UniVi-13B | 46.4 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 18 | MiniGPT4-video-7B | 46.3 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 19 | Chat-UniVi | 46.1 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 20 | BT-Adapter (zero-shot) | 46.1 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 21 | MovieChat | 45.7 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 22 | Video-LLaVA | 45.3 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 23 | Elysium | 43.4 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 24 | Video-ChatGPT | 35.2 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 25 | LLaMA Adapter | 34.2 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 26 | Video Chat | 26.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 27 | FrozenBiLM | 24.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 28 | Video LLaMA | 12.4 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |