Video Question Answering on STAR Benchmark
Metric: Average Accuracy (higher is better)
Results
| # | Model | Average Accuracy | SOTA | Extra Data | Paper | Date | Code |
|---|-------|------------------|------|------------|-------|------|------|
| 1 | VLAP (4 frames) | 67.1 | SOTA | No | ViLA: Efficient Video-Language Alignment for Video Question Answering | 2023-12-13 | Code |
| 2 | LLaMA-VQA | 65.4 | SOTA | No | Large Language Models are Temporal and Causal Reasoners for Video Question Answering | 2023-10-24 | Code |
| 3 | SeViLA | 64.9 | SOTA | No | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 4 | InternVideo | 58.7 | SOTA | No | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | 2022-12-06 | Code |
| 5 | GF(sup) | 53.94 | | No | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | 2024-01-03 | Code |
| 6 | GF(uns) | 53.86 | | No | Glance and Focus: Memory Prompting for Multi-Event Video Question Answering | 2024-01-03 | Code |
| 7 | MIST | 51.13 | | No | MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering | 2022-12-19 | Code |
| 8 | Temp[ATP] | 48.37 | SOTA | No | Revisiting the "Video" in Video-Language Understanding | 2022-06-03 | Code |
| 9 | AnyMAL-70B (0-shot) | 48.2 | | No | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | 2023-09-27 | Code |
| 10 | All-in-one | 47.5 | SOTA | No | All in One: Exploring Unified Video-Language Pre-training | 2022-03-14 | Code |
| 11 | TraveLER (0-shot) | 44.9 | | No | TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering | 2024-04-01 | Code |
| 12 | SeViLA (0-shot) | 44.6 | | No | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | Code |
| 13 | Flamingo-9B (4-shot) | 42.8 | | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |
| 14 | Flamingo-80B (4-shot) | 42.4 | | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |
| 15 | Flamingo-9B (0-shot) | 41.8 | | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |
| 16 | Flamingo-80B (0-shot) | 39.7 | | No | Flamingo: a Visual Language Model for Few-Shot Learning | 2022-04-29 | Code |
| 17 | SHG-VQA (trained from scratch) | 39.47 | | No | Learning Situation Hyper-Graphs for Video Question Answering | 2023-04-18 | Code |
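
The page does not say how the SOTA tags are assigned. One plausible reading, offered here only as an assumption, is that an entry is tagged when its Average Accuracy exceeded every result with an earlier date. The Python sketch below checks that reading against the scores and dates listed above; the function name `record_setters` and the tie-breaking rule for entries sharing a date are illustrative choices, not part of the leaderboard.

```python
from datetime import date

# (model, average accuracy, date) copied from the leaderboard entries above.
ENTRIES = [
    ("VLAP (4 frames)", 67.1, date(2023, 12, 13)),
    ("LLaMA-VQA", 65.4, date(2023, 10, 24)),
    ("SeViLA", 64.9, date(2023, 5, 11)),
    ("InternVideo", 58.7, date(2022, 12, 6)),
    ("GF(sup)", 53.94, date(2024, 1, 3)),
    ("GF(uns)", 53.86, date(2024, 1, 3)),
    ("MIST", 51.13, date(2022, 12, 19)),
    ("Temp[ATP]", 48.37, date(2022, 6, 3)),
    ("AnyMAL-70B (0-shot)", 48.2, date(2023, 9, 27)),
    ("All-in-one", 47.5, date(2022, 3, 14)),
    ("TraveLER (0-shot)", 44.9, date(2024, 4, 1)),
    ("SeViLA (0-shot)", 44.6, date(2023, 5, 11)),
    ("Flamingo-9B (4-shot)", 42.8, date(2022, 4, 29)),
    ("Flamingo-80B (4-shot)", 42.4, date(2022, 4, 29)),
    ("Flamingo-9B (0-shot)", 41.8, date(2022, 4, 29)),
    ("Flamingo-80B (0-shot)", 39.7, date(2022, 4, 29)),
    ("SHG-VQA (trained from scratch)", 39.47, date(2023, 4, 18)),
]


def record_setters(entries):
    """Return models whose score beat every entry processed before them.

    Assumption: entries are ordered by date, and on a shared date the
    highest score is considered first, so at most one entry per date
    can set a new record.
    """
    best = float("-inf")
    records = []
    for model, score, _ in sorted(entries, key=lambda e: (e[2], -e[1])):
        if score > best:
            best = score
            records.append(model)
    return records


if __name__ == "__main__":
    for model in record_setters(ENTRIES):
        print(model)
```

Under this assumption the function returns All-in-one, Temp[ATP], InternVideo, SeViLA, LLaMA-VQA, and VLAP (4 frames), in that order, which matches the six entries carrying a SOTA tag.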