Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Language Repository for Long Video Understanding

Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, Michael S. Ryoo

Date: 2024-03-21
Tasks: Zero-Shot Video Question Answer, Question Answering, Video Understanding, Visual Question Answering

Abstract

Language has become a prominent modality in computer vision with the rise of LLMs. Although LLMs support long context lengths, their effectiveness in handling long-term information gradually declines as input length grows. This becomes critical in applications such as long-form video understanding. In this paper, we introduce a Language Repository (LangRepo) for LLMs that maintains concise and structured information as an interpretable (i.e., all-textual) representation. Our repository is updated iteratively based on multi-scale video chunks. We introduce write and read operations that focus on pruning redundancies in text and extracting information at various temporal scales. The proposed framework is evaluated on zero-shot visual question-answering benchmarks including EgoSchema, NExT-QA, IntentQA and NExT-GQA, showing state-of-the-art performance at its scale. Our code is available at https://github.com/kkahatapitiya/LangRepo.
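
The write and read operations described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the official repository for that): the class `ToyLanguageRepository`, the helper `prune_redundant`, the 0.9 similarity threshold, and the occurrence-count annotation are all hypothetical names and choices made for illustration. Redundancy is detected here with plain string similarity from Python's standard library, and no LLM is called.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


def prune_redundant(captions, threshold=0.9):
    """Merge near-duplicate captions; return (caption, count) pairs in first-seen order.

    Assumption: string similarity stands in for whatever redundancy
    measure a real system would use.
    """
    kept = []  # each item is [caption, occurrence_count]
    for cap in captions:
        for entry in kept:
            ratio = SequenceMatcher(None, cap.lower(), entry[0].lower()).ratio()
            if ratio >= threshold:
                entry[1] += 1  # redundant: count it instead of storing it again
                break
        else:
            kept.append([cap, 1])
    return [(c, n) for c, n in kept]


@dataclass
class ToyLanguageRepository:
    """All-textual store keyed by temporal scale (captions per chunk)."""
    entries: dict = field(default_factory=dict)

    def write(self, captions, chunk_size):
        """Split captions into chunks of `chunk_size` and store the pruned text."""
        chunks = [captions[i:i + chunk_size]
                  for i in range(0, len(captions), chunk_size)]
        self.entries[chunk_size] = [prune_redundant(chunk) for chunk in chunks]

    def read(self):
        """Return one text description per scale, finest scale first."""
        out = {}
        for scale, chunks in sorted(self.entries.items()):
            lines = []
            for idx, chunk in enumerate(chunks):
                text = "; ".join(f"{c} (x{n})" if n > 1 else c for c, n in chunk)
                lines.append(f"[chunk {idx}] {text}")
            out[scale] = "\n".join(lines)
        return out


if __name__ == "__main__":
    captions = [
        "a person opens the fridge",
        "a person opens the fridge",
        "a dog runs outside",
        "a dog runs outside.",
    ]
    repo = ToyLanguageRepository()
    for size in (2, 4):  # write the same captions at two temporal scales
        repo.write(captions, chunk_size=size)
    for scale, text in repo.read().items():
        print(f"scale={scale}\n{text}\n")
```

In this sketch the text returned by `read` would be placed in an LLM prompt alongside the question. Keeping every scale as plain text is what makes the stored representation inspectable.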

Results

Task                      Dataset              Metric    Value  Model
Question Answering        NExT-QA              Accuracy  60.9   LangRepo (12B)
Question Answering        NExT-GQA             Acc@GQA   17.1   LangRepo (12B)
Question Answering        IntentQA             Accuracy  59.1   LangRepo (12B)
Question Answering        EgoSchema (fullset)  Accuracy  41.2   LangRepo (12B)
Question Answering        EgoSchema (subset)   Accuracy  66.2   LangRepo (12B)
Video Question Answering  NExT-QA              Accuracy  60.9   LangRepo (12B)
Video Question Answering  NExT-GQA             Acc@GQA   17.1   LangRepo (12B)
Video Question Answering  IntentQA             Accuracy  59.1   LangRepo (12B)
Video Question Answering  EgoSchema (fullset)  Accuracy  41.2   LangRepo (12B)
Video Question Answering  EgoSchema (subset)   Accuracy  66.2   LangRepo (12B)

Related Papers

From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
Enter the Mind Palace: Reasoning and Planning for Long-term Active Embodied Question Answering (2025-07-17)
Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It (2025-07-17)
City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility (2025-07-16)
UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks (2025-07-15)