Question Answering on EgoSchema (fullset)
Metric: Accuracy (higher is better)
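As a rough illustration of the metric, the sketch below computes accuracy as the percentage of multiple-choice questions answered correctly. It assumes each question has five answer options indexed 0 to 4, which is consistent with the Random baseline of 20 in the results. The function name `accuracy_percent` and the toy data are hypothetical and not taken from any official evaluation script.

```python
from typing import Sequence


def accuracy_percent(predictions: Sequence[int], answers: Sequence[int]) -> float:
    """Return the percentage of questions whose predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Count exact matches between predicted and gold option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices (0-4) for five questions.
preds = [0, 3, 2, 4, 1]
gold = [0, 3, 1, 4, 2]
print(accuracy_percent(preds, gold))  # 3 of 5 correct -> 60.0
```

Under the five-option assumption, uniform random guessing is expected to score 100 / 5 = 20, so scores should be read relative to that floor rather than to zero.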
Results
| # | Model | Accuracy | Extra Data | Paper | Date |
|---|-------|----------|------------|-------|------|
| 1 | BIMBA-LLaVA-Qwen2-7B | 71.14 | No | BIMBA: Selective-Scan Compression for Long-Range... | 2025-03-12 |
| 2 | LinVT-Qwen2-VL(7B) | 69.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 |
| 3 | Qwen2.5-Omni | 68.6 | Yes | Qwen2.5-Omni Technical Report | 2025-03-26 |
| 4 | LongVU (7B) | 67.6 | No | LongVU: Spatiotemporal Adaptive Compression for ... | 2024-10-22 |
| 5 | Video-RAG (Based on LLaVA-Video) | 66.7 | No | Video-RAG: Visually-aligned Retrieval-Augmented ... | 2024-11-20 |
| 6 | VideoLLaMA2 (72B) | 63.9 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 |
| 7 | Tarsier (34B) | 61.7 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 |
| 8 | LVNet | 61.1 | No | Too Many Frames, Not All Useful: Efficient Strat... | 2024-06-13 |
| 9 | VideoTree (GPT4) | 61.1 | No | VideoTree: Adaptive Tree-based Video Representat... | 2024-05-29 |
| 10 | InternVideo2-6B | 60 .2 | No | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 |
| 11 | VideoChat-T (7B) | 60 | No | TimeSuite: Improving MLLMs for Long Video Unders... | 2024-10-25 |
| 12 | VideoChat2_phi3 | 56.7 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 |
| 13 | VideoChat2_HD_mistral | 55.8 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 |
| 14 | VideoChat2_mistral | 54.4 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 |
| 15 | Vamos (GPT-4o) | 53.6 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 |
| 16 | TraveLER | 53.3 | No | TraveLER: A Modular Multi-LMM Agent Framework fo... | 2024-04-01 |
| 17 | LLoVi (GPT-3.5) | 50.3 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 |
| 18 | Video ReCap | 50.23 | No | Video ReCap: Recursive Captioning of Hour-Long V... | 2024-02-20 |
| 19 | Vamos (GPT-4) | 48.3 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 |
| 20 | LangRepo (12B) | 41.2 | No | Language Repository for Long Video Understanding | 2024-03-21 |
| 21 | MVU (13B) | 37.6 | No | Understanding Long Videos with Multimodal Langua... | 2024-03-25 |
| 22 | Vamos (13B) | 36.7 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 |
| 23 | LLoVi (7B) | 33.5 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 |
| 24 | TimeChat (7B) | 33 | No | TimeChat: A Time-sensitive Multimodal Large Lang... | 2023-12-04 |
| 25 | InternVideo | 32.1 | No | InternVideo: General Video Foundation Models via... | 2022-12-06 |
| 26 | mPLUG-Owl (7B) | 31.1 | No | mPLUG-Owl: Modularization Empowers Large Languag... | 2023-04-27 |
| 27 | FrozenBiLM | 26.9 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 |
| 28 | SeViLA (4B) | 22.7 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 |
| 29 | Random | 20 | No | CREPE: Can Vision-Language Foundation Models Rea... | 2022-12-13 |