Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Question Answering on EgoSchema (fullset)

Metric: Accuracy (higher is better)
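As a rough illustration of the metric, the sketch below computes accuracy as the percentage of multiple-choice questions answered correctly. This is an assumption about how the scores are produced, since the page does not spell out the evaluation protocol; the function name `accuracy` and the toy data are hypothetical.

```python
def accuracy(predictions, answers):
    """Percentage of questions whose predicted option matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Count exact matches between predicted and gold option indices.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: option indices for four questions (not from the benchmark).
preds = [0, 3, 2, 4]
gold = [0, 1, 2, 4]
print(accuracy(preds, gold))  # 75.0
```

Under this reading, the Random entry at 20 is consistent with uniform guessing among five answer options per question.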


Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|-------|----------|------------|-------|------|------|
| 1 | BIMBA-LLaVA-Qwen2-7B | 71.14 | No | BIMBA: Selective-Scan Compression for Long-Range... | 2025-03-12 | Code |
| 2 | LinVT-Qwen2-VL(7B) | 69.5 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 3 | Qwen2.5-Omni | 68.6 | Yes | Qwen2.5-Omni Technical Report | 2025-03-26 | Code |
| 4 | LongVU (7B) | 67.6 | No | LongVU: Spatiotemporal Adaptive Compression for ... | 2024-10-22 | Code |
| 5 | Video-RAG (Based on LLaVA-Video) | 66.7 | No | Video-RAG: Visually-aligned Retrieval-Augmented ... | 2024-11-20 | Code |
| 6 | VideoLLaMA2 (72B) | 63.9 | No | VideoLLaMA 2: Advancing Spatial-Temporal Modelin... | 2024-06-11 | Code |
| 7 | Tarsier (34B) | 61.7 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 8 | LVNet | 61.1 | No | Too Many Frames, Not All Useful: Efficient Strat... | 2024-06-13 | Code |
| 9 | VideoTree (GPT4) | 61.1 | No | VideoTree: Adaptive Tree-based Video Representat... | 2024-05-29 | Code |
| 10 | InternVideo2-6B | 60.2 | No | InternVideo2: Scaling Foundation Models for Mult... | 2024-03-22 | Code |
| 11 | VideoChat-T (7B) | 60 | No | TimeSuite: Improving MLLMs for Long Video Unders... | 2024-10-25 | Code |
| 12 | VideoChat2_phi3 | 56.7 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 13 | VideoChat2_HD_mistral | 55.8 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 14 | VideoChat2_mistral | 54.4 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 15 | Vamos (GPT-4o) | 53.6 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 | Code |
| 16 | TraveLER | 53.3 | No | TraveLER: A Modular Multi-LMM Agent Framework fo... | 2024-04-01 | Code |
| 17 | LLoVi (GPT-3.5) | 50.3 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 18 | Video ReCap | 50.23 | No | Video ReCap: Recursive Captioning of Hour-Long V... | 2024-02-20 | Code |
| 19 | Vamos (GPT-4) | 48.3 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 | Code |
| 20 | LangRepo (12B) | 41.2 | No | Language Repository for Long Video Understanding | 2024-03-21 | Code |
| 21 | MVU (13B) | 37.6 | No | Understanding Long Videos with Multimodal Langua... | 2024-03-25 | Code |
| 22 | Vamos (13B) | 36.7 | No | Vamos: Versatile Action Models for Video Underst... | 2023-11-22 | Code |
| 23 | LLoVi (7B) | 33.5 | No | A Simple LLM Framework for Long-Range Video Ques... | 2023-12-28 | Code |
| 24 | TimeChat (7B) | 33 | No | TimeChat: A Time-sensitive Multimodal Large Lang... | 2023-12-04 | Code |
| 25 | InternVideo | 32.1 | No | InternVideo: General Video Foundation Models via... | 2022-12-06 | Code |
| 26 | mPLUG-Owl (7B) | 31.1 | No | mPLUG-Owl: Modularization Empowers Large Languag... | 2023-04-27 | Code |
| 27 | FrozenBiLM | 26.9 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 28 | SeViLA (4B) | 22.7 | No | Self-Chained Image-Language Model for Video Loca... | 2023-05-11 | Code |
| 29 | Random | 20 | No | CREPE: Can Vision-Language Foundation Models Rea... | 2022-12-13 | Code |