Video Question Answering on STAR Benchmark

Metric: Average Accuracy (higher is better)
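
The page does not say how Average Accuracy is aggregated. A minimal sketch, assuming it is the unweighted mean of per-category accuracies over STAR's four question types (Interaction, Sequence, Prediction, Feasibility): the function name `average_accuracy` and the toy records below are hypothetical and are not taken from any listed paper or codebase.

```python
from collections import defaultdict


def average_accuracy(records):
    """Compute per-category accuracy and their unweighted mean, in percent.

    `records` is an iterable of (category, predicted, gold) tuples.
    Assumption: the headline score is the plain mean over categories,
    not a mean weighted by the number of questions per category.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, gold in records:
        total[category] += 1
        correct[category] += int(predicted == gold)
    if not total:
        raise ValueError("no records to score")
    per_category = {c: 100.0 * correct[c] / total[c] for c in total}
    mean = sum(per_category.values()) / len(per_category)
    return per_category, mean


# Toy, made-up records (answer indices are illustrative only).
records = [
    ("Interaction", 0, 0),
    ("Interaction", 1, 2),
    ("Sequence", 3, 3),
    ("Sequence", 1, 1),
    ("Prediction", 2, 2),
    ("Feasibility", 0, 1),
]
per_category, mean = average_accuracy(records)
print(per_category, mean)
```

Under this assumption every category counts equally, regardless of how many questions it contains. A per-question (micro) average would generally give a different number whenever category sizes differ.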

Results

#	Model	Average Accuracy (%)	Extra Data	SOTA	Paper	Date	Code
1	VLAP (4 frames)	67.1	No	Yes	ViLA: Efficient Video-Language Alignment for Video Question Answering	2023-12-13	Code
2	LLaMA-VQA	65.4	No	Yes	Large Language Models are Temporal and Causal Reasoners for Video Question Answering	2023-10-24	Code
3	SeViLA	64.9	No	Yes	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11	Code
4	InternVideo	58.7	No	Yes	InternVideo: General Video Foundation Models via Generative and Discriminative Learning	2022-12-06	Code
5	GF(sup)	53.94	No	-	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering	2024-01-03	Code
6	GF(uns)	53.86	No	-	Glance and Focus: Memory Prompting for Multi-Event Video Question Answering	2024-01-03	Code
7	MIST	51.13	No	-	MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering	2022-12-19	Code
8	Temp[ATP]	48.37	No	Yes	Revisiting the "Video" in Video-Language Understanding	2022-06-03	Code
9	AnyMAL-70B (0-shot)	48.2	No	-	AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model	2023-09-27	Code
10	All-in-one	47.5	No	Yes	All in One: Exploring Unified Video-Language Pre-training	2022-03-14	Code
11	TraveLER (0-shot)	44.9	No	-	TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering	2024-04-01	Code
12	SeViLA (0-shot)	44.6	No	-	Self-Chained Image-Language Model for Video Localization and Question Answering	2023-05-11	Code
13	Flamingo-9B (4-shot)	42.8	No	-	Flamingo: a Visual Language Model for Few-Shot Learning	2022-04-29	Code
14	Flamingo-80B (4-shot)	42.4	No	-	Flamingo: a Visual Language Model for Few-Shot Learning	2022-04-29	Code
15	Flamingo-9B (0-shot)	41.8	No	-	Flamingo: a Visual Language Model for Few-Shot Learning	2022-04-29	Code
16	Flamingo-80B (0-shot)	39.7	No	-	Flamingo: a Visual Language Model for Few-Shot Learning	2022-04-29	Code
17	SHG-VQA (trained from scratch)	39.47	No	-	Learning Situation Hyper-Graphs for Video Question Answering	2023-04-18	Code