Video Question Answering on ActivityNet-QA

Metric: Accuracy (higher is better)
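
As a rough illustration of the metric, the sketch below computes accuracy as the percentage of predicted answers that match the reference answer. The function name `accuracy` and the case-insensitive exact-match rule are assumptions made for this example; the papers listed here may score answers under different protocols, so this is not the evaluation procedure of any particular entry.

```python
def accuracy(predictions, references):
    """Return the percentage of predictions that match their reference answer.

    Matching rule (an assumption for this sketch): case-insensitive exact
    match after stripping surrounding whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one example")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical answers, for illustration only.
preds = ["yes", "two", "Kitchen", "no"]
golds = ["yes", "three", "kitchen", "no"]
print(f"{accuracy(preds, golds):.1f}")  # 3 of 4 match -> 75.0
```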

Results

#	Model	Accuracy	Extra Data	SOTA	Paper	Date	Code
1	Tarsier (34B)	61.6	No	Yes	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
2	GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot)	61.2	No	Yes	Composing Ensembles of Pre-trained Models via Iterative Consensus	2022-10-20	-
3	PLLaVA (34B)	60.9	No	-	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
4	PPLLaVA-7B	60.7	No	-	PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance	2024-11-04	Code
5	LinVT-Qwen2-VL(7B)	60.1	No	-	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06	Code
6	SlowFast-LLaVA-34B	59.2	No	-	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	Code
7	TS-LLaVA-34B	58.9	No	-	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	Code
8	GPT-2 + CLIP-32 (Zero-Shot)	58.4	No	-	Composing Ensembles of Pre-trained Models via Iterative Consensus	2022-10-20	-
9	IG-VLM	58.4	No	-	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	Code
10	VideoCoCa	56.1	Yes	-	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	2022-12-09	-
11	LLaVA-Mini	53.5	No	-	LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token	2025-01-07	Code
12	Flash-VStream	51.9	No	-	Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams	2024-06-12	Code
13	Mirasol3B	51.13	No	-	Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities	2023-11-09	-
14	ST-LLM	50.9	No	-	ST-LLM: Large Language Models Are Effective Temporal Learners	2024-03-30	Code
15	VideoGPT+	50.6	No	-	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	Code
16	VAST	50.4	Yes	-	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	2023-05-29	Code
17	CAT-7B	50.2	No	-	CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios	2024-03-07	Code
18	Video-LaVIT	50.1	No	-	Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	2024-02-05	Code
19	COSA	49.9	Yes	-	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model	2023-06-15	Code
20	MA-LMM	49.8	No	-	MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	2024-04-08	Code
21	VideoChat2	49.1	No	-	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
22	VideoChat2	49.1	No	-	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
23	VALOR	48.6	Yes	-	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	2023-04-17	Code
24	UMT-L (ViT-L/16)	47.9	Yes	-	Unmasked Teacher: Towards Training-Efficient Video Foundation Models	2023-03-28	Code
25	LLaMA-VID-13B (2 Token)	47.5	No	-	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
26	LLaMA-VID-13B (2 Token)	47.5	No	-	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
27	LLaMA-VID-7B (2 Token)	47.4	No	-	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
28	LLaMA-VID-7B (2 Token)	47.4	No	-	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
29	Chat-UniVi-13B	46.4	No	-	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
30	Chat-UniVi-13B	46.4	No	-	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
31	MiniGPT4-video-7B	46.3	No	-	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	2024-04-04	Code
32	BT-Adapter (zero-shot)	46.1	No	-	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
33	Chat-UniVi	46.1	No	-	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
34	BT-Adapter (zero-shot)	46.1	No	-	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code
35	MovieChat	45.7	No	-	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023-07-31	Code
36	MovieChat	45.7	No	-	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023-07-31	Code
37	Video-LLaVA	45.3	No	-	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
38	Video-LLaVA	45.3	No	-	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
39	TESTA (ViT-B/16)	45	Yes	-	TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding	2023-10-29	Code
40	FrozenBiLM+	44.8	No	-	Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models	2023-08-18	Code
41	VindLU	44.7	Yes	-	VindLU: A Recipe for Effective Video-and-Language Pretraining	2022-12-09	Code
42	Singularity-temporal	44.1	Yes	Yes	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
43	Elysium	43.4	No	-	Elysium: Exploring Object-level Perception in Videos via MLLM	2024-03-25	Code
44	FrozenBiLM	43.2	Yes	-	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022-06-16	Code
45	Singularity	43.1	Yes	-	Revealing Single Frame Bias for Video-and-Language Learning	2022-06-07	Code
46	Text + Text (no Multimodal Pretext Training)	41.4	No	Yes	Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval	2022-06-05	Code
47	All-in-one+	40	No	-	Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models	2023-08-18	Code
48	VIOLET+	39.7	No	-	Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models	2023-08-18	Code
49	Just Ask (fine-tune)	38.9	No	Yes	Just Ask: Learning to Answer Questions from Millions of Narrated Videos	2020-12-01	Code
50	LocVLM-Vid-B+	38.2	No	-	Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs	2024-04-11	Code
51	LocVLM-Vid-B	37.4	No	-	Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs	2024-04-11	Code
52	Video-ChatGPT	35.2	No	-	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
53	Video-ChatGPT	35.2	No	-	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
54	LLaMA Adapter V2	34.2	No	-	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Code
55	LLaMA Adapter	34.2	No	-	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Code
56	E-SA	31.8	No	Yes	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	2019-06-06	Code
57	E-MN	27.1	No	-	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	2019-06-06	Code
58	Video Chat	26.5	No	-	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
59	Video Chat	26.5	No	-	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
60	FrozenBiLM (0-shot)	25.9	No	-	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022-06-16	Code
61	E-VQA	25.1	No	-	ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering	2019-06-06	Code
62	FrozenBiLM	24.7	No	-	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022-06-16	Code
63	Video LLaMA	12.4	No	-	Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding	2023-06-05	Code
64	Just Ask (0-shot)	12.2	No	-	Just Ask: Learning to Answer Questions from Millions of Narrated Videos	2020-12-01	Code