Video Question Answering on TGIF-QA

Metric: Accuracy (higher is better)

Results

#	Model	Accuracy (%)	Extra Data	SOTA at release	Paper	Date	Code
1	Tarsier (34B)	82.5	No	Yes	Tarsier: Recipes for Training and Evaluating Large Video Description Models	2024-06-30	Code
2	LinVT-Qwen2-VL (7B)	81.3	No	No	LinVT: Empower Your Image-level Large Language Model to Understand Videos	2024-12-06	Code
3	TS-LLaVA-34B	81	No	No	TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models	2024-11-17	Code
4	PLLaVA	80.6	No	Yes	PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning	2024-04-25	Code
5	SlowFast-LLaVA-34B	80.6	No	No	SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models	2024-07-22	Code
6	IG-VLM	79.1	No	Yes	An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM	2024-03-27	Code
7	VideoGPT+	74.6	No	No	VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	2024-06-13	Code
8	MiniGPT4-video-7B	72.22	No	No	MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	2024-04-04	Code
9	Video-LLaVA-7B	70	No	Yes	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
10	Chat-UniVi-7B	69	No	Yes	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
11	Elysium	66.6	No	No	Elysium: Exploring Object-level Perception in Videos via MLLM	2024-03-25	Code
12	LocVLM-Vid-B	51.8	No	No	Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs	2024-04-11	Code
13	Video-ChatGPT-7B	51.4	No	Yes	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
14	FrozenBiLM	41.9	No	Yes	Zero-Shot Video Question Answering via Frozen Bidirectional Language Models	2022-06-16	Code
15	Video Chat-7B	34.4	No	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
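
The SOTA markers in the results above appear consistent with a simple rule: an entry is marked when its accuracy strictly exceeds that of every entry with an earlier date. The page does not state this rule, so the sketch below is an assumption checked only against the 15 rows listed here; the names `RESULTS`, `sota_at_release`, and `SOTA_MODELS` are hypothetical and chosen for illustration.

```python
from datetime import date

# (model, accuracy, date) copied from the results listed above.
RESULTS = [
    ("Tarsier (34B)", 82.5, date(2024, 6, 30)),
    ("LinVT-Qwen2-VL (7B)", 81.3, date(2024, 12, 6)),
    ("TS-LLaVA-34B", 81.0, date(2024, 11, 17)),
    ("PLLaVA", 80.6, date(2024, 4, 25)),
    ("SlowFast-LLaVA-34B", 80.6, date(2024, 7, 22)),
    ("IG-VLM", 79.1, date(2024, 3, 27)),
    ("VideoGPT+", 74.6, date(2024, 6, 13)),
    ("MiniGPT4-video-7B", 72.22, date(2024, 4, 4)),
    ("Video-LLaVA-7B", 70.0, date(2023, 11, 16)),
    ("Chat-UniVi-7B", 69.0, date(2023, 11, 14)),
    ("Elysium", 66.6, date(2024, 3, 25)),
    ("LocVLM-Vid-B", 51.8, date(2024, 4, 11)),
    ("Video-ChatGPT-7B", 51.4, date(2023, 6, 8)),
    ("FrozenBiLM", 41.9, date(2022, 6, 16)),
    ("Video Chat-7B", 34.4, date(2023, 5, 10)),
]


def sota_at_release(results):
    """Return models, in date order, that beat every earlier-dated entry.

    Assumed rule (not stated by the source page): a model is SOTA at
    release if its accuracy is strictly higher than the running maximum
    over all entries with an earlier date.
    """
    best_so_far = float("-inf")
    winners = []
    for model, accuracy, _released in sorted(results, key=lambda row: row[2]):
        if accuracy > best_so_far:
            winners.append(model)
            best_so_far = accuracy
    return winners


SOTA_MODELS = sota_at_release(RESULTS)
print(SOTA_MODELS)
```

Under this rule, SlowFast-LLaVA-34B is not marked even though it ties PLLaVA at 80.6, because it is dated later and does not strictly improve on the earlier score.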