Video Question Answering on ActivityNet-QA

Metric: Confidence score (lower is better)

Results

#	Model	Confidence score	Extra Data	Paper	Date	Code
1	Video Chat	2.2	No	VideoChat: Chat-Centric Video Understanding	2023-05-10	Code
2	Video-ChatGPT	2.7	No	Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models	2023-06-08	Code
3	LLaMA Adapter V2	2.7	No	LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04-28	Code
4	MovieChat	3.1	No	MovieChat: From Dense Token to Sparse Memory for Long Video Understanding	2023-07-31	Code
5	VideoChat2	3.3	No	MVBench: A Comprehensive Multi-modal Video Understanding Benchmark	2023-11-28	Code
6	LLaMA-VID-13B (2 Token)	3.3	No	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
7	LLaMA-VID-7B (2 Token)	3.3	No	LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	2023-11-28	Code
8	Chat-UniVi-13B	3.3	No	Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding	2023-11-14	Code
9	Video-LLaVA	3.3	No	Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	2023-11-16	Code
10	BT-Adapter (zero-shot)	3.6	No	BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning	2023-09-27	Code

SOTA badge: Video Chat (#1), LLaMA Adapter V2 (#3).