Video Question Answering on EgoSchema (subset)
Metric: Accuracy (higher is better)
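As a minimal sketch of how an accuracy score of this kind could be computed, assume a multiple-choice setup in which each question has exactly one correct option. The five-option encoding is an assumption, consistent with the Random baseline of 20 in the results. The function name `accuracy` and the toy data are hypothetical and do not come from any benchmark toolkit.

```python
def accuracy(predictions, answers):
    """Return the percentage of questions whose predicted option matches the key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices 0-4 for five questions (assumed format).
answers = [0, 3, 2, 4, 1]
predictions = [0, 3, 1, 4, 1]
print(accuracy(predictions, answers))  # 80.0
```

Under the same assumption, uniform guessing over five options has an expected accuracy of 100 / 5 = 20.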
Results
| # | Model | Accuracy | Extra Data | Paper | Date | SOTA |
|---|-------|----------|------------|-------|------|------|
| 1 | Tarsier (34B) | 68.6 | No | Tarsier: Recipes for Training and Evaluating Large Video Description Models | 2024-06-30 | SOTA |
| 2 | VideoChat-T (7B) | 68.4 | No | TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning | 2024-10-25 | |
| 3 | LangRepo (12B) | 66.2 | No | Language Repository for Long Video Understanding | 2024-03-21 | SOTA |
| 4 | VideoTree (GPT4) | 66.2 | No | VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos | 2024-05-29 | |
| 5 | LVNet | 66 | No | Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA | 2024-06-13 | |
| 6 | VideoChat2_HD_mistral | 65.6 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | SOTA |
| 7 | VideoChat2_mistral | 63.6 | No | MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 2023-11-28 | |
| 8 | MVU (13B) | 60.3 | No | Understanding Long Videos with Multimodal Language Models | 2024-03-25 | |
| 9 | TS-LLaVA-34B | 57.8 | No | TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models | 2024-11-17 | |
| 10 | LLoVi (GPT-3.5) | 57.6 | No | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | |
| 11 | LLoVi (7B) | 50.8 | No | A Simple LLM Framework for Long-Range Video Question-Answering | 2023-12-28 | |
| 12 | SlowFast-LLaVA-34B | 47.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models | 2024-07-22 | |
| 13 | SeViLA (4B) | 25.7 | No | Self-Chained Image-Language Model for Video Localization and Question Answering | 2023-05-11 | SOTA |
| 14 | Random | 20 | No | CREPE: Can Vision-Language Foundation Models Reason Compositionally? | 2022-12-13 | SOTA |