


A Simple LLM Framework for Long-Range Video Question-Answering

Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

2023-12-28 · Zero-Shot Video Question Answer · Question Answering · Long-range modeling · Video Question Answering · Large Language Model · Video Understanding · Language Modelling
Paper · PDF · Code (official)

Abstract

We present LLoVi, a language-based framework for long-range video question-answering (LVQA). Unlike prior long-range video understanding methods, which are often costly and require specialized long-range video modeling designs (e.g., memory queues, state-space layers, etc.), our approach uses a frame/clip-level visual captioner (e.g., BLIP2, LaViLa, LLaVA) coupled with a Large Language Model (GPT-3.5, GPT-4), leading to a simple yet surprisingly effective LVQA framework. Specifically, we decompose the short- and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to generate textual descriptions of short video clips (0.5-8s in length) densely sampled from a long input video. Afterward, an LLM aggregates the densely extracted short-term captions to perform the long-range temporal reasoning needed to understand the whole video and answer a question. To analyze what makes our simple framework so effective, we thoroughly evaluate various components of our system. Our empirical analysis reveals that the choice of the visual captioner and LLM is critical for good LVQA performance. Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost. On EgoSchema, which is best known as a very long-form video question-answering benchmark, our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain). In addition, our approach outperforms the previous state-of-the-art by 4.1% and 3.1% on NExT-QA and IntentQA, respectively. We also extend LLoVi to grounded LVQA and show that it outperforms all prior methods on the NExT-GQA dataset. We will release our code at https://github.com/CeeZh/LLoVi.
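The two-stage recipe in the abstract is simple enough to sketch in a few lines. Below is a minimal, hypothetical Python sketch of the pipeline: `caption_clip` stands in for a visual captioner (e.g., LaViLa or BLIP2) and `chat` for an OpenAI-style LLM call; both names, and the exact prompt wording, are assumptions of this sketch, not the paper's released code.

```python
# Minimal sketch of the two-stage LLoVi pipeline (hypothetical helpers).

def llovi_answer(video_clips, question, caption_clip, chat):
    # Stage 1: short-term captioning of densely sampled 0.5-8s clips.
    captions = [f"[{i}] {caption_clip(clip)}" for i, clip in enumerate(video_clips)]

    # Stage 2: summarize-then-answer prompting, which the paper reports
    # gives a significant boost over answering from raw captions directly.
    summary = chat(
        "Here are short-term captions of one long video, in temporal order:\n"
        + "\n".join(captions)
        + "\nSummarize what happens in the video."
    )
    return chat(
        f"Video summary: {summary}\n"
        f"Question: {question}\n"
        "Answer the question using the summary."
    )
```

For a multiple-choice benchmark like EgoSchema, the final prompt would also list the candidate answers; that detail is omitted here for brevity.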

Results

| Task | Dataset | Metric | Value | Model |
| --- | --- | --- | --- | --- |
| Question Answering | NExT-QA | Accuracy | 67.7 | LLoVi (GPT-4) |
| Question Answering | NExT-QA | Accuracy | 54.3 | LLoVi (7B) |
| Question Answering | NExT-GQA | Acc@GQA | 26.8 | LLoVi (GPT-4) |
| Question Answering | NExT-GQA | Acc@GQA | 11.2 | LLoVi (7B) |
| Question Answering | IntentQA | Accuracy | 64 | LLoVi (GPT-4) |
| Question Answering | IntentQA | Accuracy | 53.6 | LLoVi (7B) |
| Question Answering | EgoSchema (fullset) | Accuracy | 50.3 | LLoVi (GPT-3.5) |
| Question Answering | EgoSchema (fullset) | Accuracy | 33.5 | LLoVi (7B) |
| Question Answering | EgoSchema (subset) | Accuracy | 57.6 | LLoVi (GPT-3.5) |
| Question Answering | EgoSchema (subset) | Accuracy | 50.8 | LLoVi (7B) |
| Video Question Answering | NExT-QA | Accuracy | 67.7 | LLoVi (GPT-4) |
| Video Question Answering | NExT-QA | Accuracy | 54.3 | LLoVi (7B) |
| Video Question Answering | NExT-GQA | Acc@GQA | 26.8 | LLoVi (GPT-4) |
| Video Question Answering | NExT-GQA | Acc@GQA | 11.2 | LLoVi (7B) |
| Video Question Answering | IntentQA | Accuracy | 64 | LLoVi (GPT-4) |
| Video Question Answering | IntentQA | Accuracy | 53.6 | LLoVi (7B) |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 50.3 | LLoVi (GPT-3.5) |
| Video Question Answering | EgoSchema (fullset) | Accuracy | 33.5 | LLoVi (7B) |
| Video Question Answering | EgoSchema (subset) | Accuracy | 57.6 | LLoVi (GPT-3.5) |
| Video Question Answering | EgoSchema (subset) | Accuracy | 50.8 | LLoVi (7B) |
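For context on the NExT-GQA rows, Acc@GQA scores a question as correct only when the answer is right and the prediction is temporally grounded, which in the NExT-GQA benchmark means an Intersection-over-Prediction (IoP) of at least 0.5 between the predicted and annotated time windows. A minimal sketch, assuming (start, end) windows in seconds:

```python
# Sketch of the Acc@GQA metric (grounded QA accuracy). Assumes windows are
# (start_sec, end_sec) tuples; the IoP >= 0.5 threshold follows NExT-GQA.

def iop(pred, gt):
    # Intersection-over-Prediction: overlap length / predicted window length.
    overlap = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    return overlap / max(pred[1] - pred[0], 1e-9)

def acc_at_gqa(samples):
    # samples: iterable of (answer_correct, pred_window, gt_window) triples.
    samples = list(samples)
    hits = sum(1 for correct, pred, gt in samples
               if correct and iop(pred, gt) >= 0.5)
    return 100.0 * hits / len(samples)
```

This explains why Acc@GQA is much lower than plain accuracy on NExT-QA: it is gated on grounding as well as answer correctness.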

Related Papers

- Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
- DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
- From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
- Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
- Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
- City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
- GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
- The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)