Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Video Question Answering on ActivityNet-QA

Metric: Accuracy (higher is better)

Results

| # | Model | Accuracy | Extra Data | Paper | Date | Code |
|---|---|---|---|---|---|---|
| 1 | Tarsier (34B) | 61.6 | No | Tarsier: Recipes for Training and Evaluating Lar... | 2024-06-30 | Code |
| 2 | GPT-2 + CLIP-14 + CLIP-multilingual (Zero-Shot) | 61.2 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 3 | PLLaVA (34B) | 60.9 | No | PLLaVA : Parameter-free LLaVA Extension from Ima... | 2024-04-25 | Code |
| 4 | PPLLaVA-7B | 60.7 | No | PPLLaVA: Varied Video Sequence Understanding Wit... | 2024-11-04 | Code |
| 5 | LinVT-Qwen2-VL(7B) | 60.1 | No | LinVT: Empower Your Image-level Large Language M... | 2024-12-06 | Code |
| 6 | SlowFast-LLaVA-34B | 59.2 | No | SlowFast-LLaVA: A Strong Training-Free Baseline ... | 2024-07-22 | Code |
| 7 | TS-LLaVA-34B | 58.9 | No | TS-LLaVA: Constructing Visual Tokens through Thu... | 2024-11-17 | Code |
| 8 | GPT-2 + CLIP-32 (Zero-Shot) | 58.4 | No | Composing Ensembles of Pre-trained Models via It... | 2022-10-20 | - |
| 9 | IG-VLM | 58.4 | No | An Image Grid Can Be Worth a Video: Zero-shot Vi... | 2024-03-27 | Code |
| 10 | VideoCoCa | 56.1 | Yes | VideoCoCa: Video-Text Modeling with Zero-Shot Tr... | 2022-12-09 | - |
| 11 | LLaVA-Mini | 53.5 | No | LLaVA-Mini: Efficient Image and Video Large Mult... | 2025-01-07 | Code |
| 12 | Flash-VStream | 51.9 | No | Flash-VStream: Memory-Based Real-Time Understand... | 2024-06-12 | Code |
| 13 | Mirasol3B | 51.13 | No | Mirasol3B: A Multimodal Autoregressive model for... | 2023-11-09 | - |
| 14 | ST-LLM | 50.9 | No | ST-LLM: Large Language Models Are Effective Temp... | 2024-03-30 | Code |
| 15 | VideoGPT+ | 50.6 | No | VideoGPT+: Integrating Image and Video Encoders ... | 2024-06-13 | Code |
| 16 | VAST | 50.4 | Yes | VAST: A Vision-Audio-Subtitle-Text Omni-Modality... | 2023-05-29 | Code |
| 17 | CAT-7B | 50.2 | No | CAT: Enhancing Multimodal Large Language Model t... | 2024-03-07 | Code |
| 18 | Video-LaVIT | 50.1 | No | Video-LaVIT: Unified Video-Language Pre-training... | 2024-02-05 | Code |
| 19 | COSA | 49.9 | Yes | COSA: Concatenated Sample Pretrained Vision-Lang... | 2023-06-15 | Code |
| 20 | MA-LMM | 49.8 | No | MA-LMM: Memory-Augmented Large Multimodal Model ... | 2024-04-08 | Code |
| 21 | VideoChat2 | 49.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 22 | VideoChat2 | 49.1 | No | MVBench: A Comprehensive Multi-modal Video Under... | 2023-11-28 | Code |
| 23 | VALOR | 48.6 | Yes | VALOR: Vision-Audio-Language Omni-Perception Pre... | 2023-04-17 | Code |
| 24 | UMT-L (ViT-L/16) | 47.9 | Yes | Unmasked Teacher: Towards Training-Efficient Vid... | 2023-03-28 | Code |
| 25 | LLaMA-VID-13B (2 Token) | 47.5 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 26 | LLaMA-VID-13B (2 Token) | 47.5 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 27 | LLaMA-VID-7B (2 Token) | 47.4 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 28 | LLaMA-VID-7B (2 Token) | 47.4 | No | LLaMA-VID: An Image is Worth 2 Tokens in Large L... | 2023-11-28 | Code |
| 29 | Chat-UniVi-13B | 46.4 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 30 | Chat-UniVi-13B | 46.4 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 31 | MiniGPT4-video-7B | 46.3 | No | MiniGPT4-Video: Advancing Multimodal LLMs for Vi... | 2024-04-04 | Code |
| 32 | BT-Adapter (zero-shot) | 46.1 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 33 | Chat-UniVi | 46.1 | No | Chat-UniVi: Unified Visual Representation Empowe... | 2023-11-14 | Code |
| 34 | BT-Adapter (zero-shot) | 46.1 | No | BT-Adapter: Video Conversation is Feasible Witho... | 2023-09-27 | Code |
| 35 | MovieChat | 45.7 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 36 | MovieChat | 45.7 | No | MovieChat: From Dense Token to Sparse Memory for... | 2023-07-31 | Code |
| 37 | Video-LLaVA | 45.3 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 38 | Video-LLaVA | 45.3 | No | Video-LLaVA: Learning United Visual Representati... | 2023-11-16 | Code |
| 39 | TESTA (ViT-B/16) | 45 | Yes | TESTA: Temporal-Spatial Token Aggregation for Lo... | 2023-10-29 | Code |
| 40 | FrozenBiLM+ | 44.8 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 41 | VindLU | 44.7 | Yes | VindLU: A Recipe for Effective Video-and-Languag... | 2022-12-09 | Code |
| 42 | Singularity-temporal | 44.1 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 43 | Elysium | 43.4 | No | Elysium: Exploring Object-level Perception in Vi... | 2024-03-25 | Code |
| 44 | FrozenBiLM | 43.2 | Yes | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 45 | Singularity | 43.1 | Yes | Revealing Single Frame Bias for Video-and-Langua... | 2022-06-07 | Code |
| 46 | Text + Text (no Multimodal Pretext Training) | 41.4 | No | Towards Fast Adaptation of Pretrained Contrastiv... | 2022-06-05 | Code |
| 47 | All-in-one+ | 40 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 48 | VIOLET+ | 39.7 | No | Open-vocabulary Video Question Answering: A New ... | 2023-08-18 | Code |
| 49 | Just Ask (fine-tune) | 38.9 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |
| 50 | LocVLM-Vid-B+ | 38.2 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |
| 51 | LocVLM-Vid-B | 37.4 | No | Learning to Localize Objects Improves Spatial Re... | 2024-04-11 | Code |
| 52 | Video-ChatGPT | 35.2 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 53 | Video-ChatGPT | 35.2 | No | Video-ChatGPT: Towards Detailed Video Understand... | 2023-06-08 | Code |
| 54 | LLaMA Adapter V2 | 34.2 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 55 | LLaMA Adapter | 34.2 | No | LLaMA-Adapter V2: Parameter-Efficient Visual Ins... | 2023-04-28 | Code |
| 56 | E-SA | 31.8 | No | ActivityNet-QA: A Dataset for Understanding Comp... | 2019-06-06 | Code |
| 57 | E-MN | 27.1 | No | ActivityNet-QA: A Dataset for Understanding Comp... | 2019-06-06 | Code |
| 58 | Video Chat | 26.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 59 | Video Chat | 26.5 | No | VideoChat: Chat-Centric Video Understanding | 2023-05-10 | Code |
| 60 | FrozenBiLM (0-shot) | 25.9 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 61 | E-VQA | 25.1 | No | ActivityNet-QA: A Dataset for Understanding Comp... | 2019-06-06 | Code |
| 62 | FrozenBiLM | 24.7 | No | Zero-Shot Video Question Answering via Frozen Bi... | 2022-06-16 | Code |
| 63 | Video LLaMA | 12.4 | No | Video-LLaMA: An Instruction-tuned Audio-Visual L... | 2023-06-05 | Code |
| 64 | Just Ask (0-shot) | 12.2 | No | Just Ask: Learning to Answer Questions from Mill... | 2020-12-01 | Code |