Question-Answering Dense Video Events

Hangyu Qin, Junbin Xiao, Angela Yao

2024-09-06

Zero-Shot Video Question Answer, Question Answering, Benchmarking

Abstract

This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging multimodal large language models (MLLMs) to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach built on three modules: a hierarchical captioning module that detects dense events, a temporal event memory module that contextualizes and memorizes them, and a self-consistency checking module that grounds them in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% for G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.

Results

Task	Dataset	Metric	Value	Model
Question Answering	NExT-GQA	Acc@GQA	28.9	DeVi (Gemini 2.0)
Question Answering	NExT-GQA	Acc@GQA	28	DeVi (GPT-4)
Video Question Answering	NExT-GQA	Acc@GQA	28.9	DeVi (Gemini 2.0)
Video Question Answering	NExT-GQA	Acc@GQA	28	DeVi (GPT-4)
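
The table reports Acc@GQA without defining it. The sketch below shows one way such a grounded-QA accuracy could be computed. It assumes the convention commonly used for NExT-GQA, where a question counts only if the answer is correct and the predicted time span has an intersection-over-prediction (IoP) of at least 0.5 with the annotated span. The names `Prediction`, `iop`, and `acc_at_gqa`, the 0.5 threshold, and the sample values are illustrative assumptions, not taken from the DeVE-QA code or results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Prediction:
    """One answered question (hypothetical record layout)."""
    answer_correct: bool              # did the model pick the right answer?
    pred_span: tuple[float, float]    # predicted (start, end) in seconds
    gt_span: tuple[float, float]      # annotated (start, end) in seconds


def iop(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection over prediction: overlap length / predicted span length."""
    overlap = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    pred_len = pred[1] - pred[0]
    return overlap / pred_len if pred_len > 0 else 0.0


def acc_at_gqa(preds: list[Prediction], threshold: float = 0.5) -> float:
    """Percentage of questions answered correctly AND grounded with IoP >= threshold."""
    if not preds:
        return 0.0
    hits = sum(
        1 for p in preds
        if p.answer_correct and iop(p.pred_span, p.gt_span) >= threshold
    )
    return 100.0 * hits / len(preds)


# Made-up examples, for illustration only.
samples = [
    Prediction(True, (0.0, 10.0), (0.0, 6.0)),    # correct, IoP 0.6: counts
    Prediction(True, (0.0, 10.0), (8.0, 20.0)),   # correct, IoP 0.2: poorly grounded
    Prediction(False, (0.0, 5.0), (0.0, 5.0)),    # well grounded but wrong answer
    Prediction(True, (2.0, 4.0), (0.0, 10.0)),    # correct, IoP 1.0: counts
]
score = acc_at_gqa(samples)
```

Under this definition a model must both answer and localize correctly to get credit, which explains why Acc@GQA values are typically far lower than plain QA accuracy.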
