Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Understanding Long Videos with Multimodal Language Models

Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo

Published: 2024-03-25

Tasks: Zero-Shot Video Question Answer, Fine-grained Action Recognition, Zero-Shot Long Video Question Answering, World Knowledge, Video Understanding, Action Recognition, Language Modelling, Multiple-choice

Links: Paper · PDF · Code (official)

Abstract

Large Language Models (LLMs) have enabled recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of the underlying LLMs influence this performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information at all. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks also establishes its generality. Our code will be released publicly.
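The abstract describes fusing object-centric information modalities through natural language before passing them to an LLM. The sketch below illustrates what such a fusion step could look like; the function and field names are illustrative assumptions, not the authors' actual MVU API, and the three modalities shown (object labels, spatial locations, interactions) are a plausible reading of "object-centric information modalities".

```python
# Hedged sketch of a natural-language fusion step, assuming object-centric
# information has already been extracted by off-the-shelf vision tools.
# All names here are hypothetical, not taken from the MVU codebase.

def fuse_modalities_to_prompt(question, choices, objects, locations, interactions):
    """Serialize object-centric video information into one text prompt
    so an off-the-shelf LLM can answer the question in a single pass."""
    lines = [
        "Objects seen in the video: " + ", ".join(objects),
        "Approximate object locations: "
        + "; ".join(f"{obj} at {loc}" for obj, loc in locations.items()),
        "Observed interactions: " + "; ".join(interactions),
        "Question: " + question,
        "Choices: " + " | ".join(choices),
        "Answer with the single best choice.",
    ]
    return "\n".join(lines)

prompt = fuse_modalities_to_prompt(
    question="What is the person doing?",
    choices=["cooking", "cleaning", "reading"],
    objects=["person", "pan", "stove"],
    locations={"pan": "center", "stove": "bottom"},
    interactions=["person holds pan", "pan placed on stove"],
)
print(prompt)
```

Because the fused representation is plain text, the same prompt can be sent to any instruction-tuned LLM without architectural changes, which is presumably what lets the framework swap vision tools and language models independently.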

Results

Task                       Dataset               Metric               Value   Model
Question Answering         NExT-QA               Accuracy             55.2    MVU (13B)
Question Answering         EgoSchema (fullset)   Accuracy             37.6    MVU (13B)
Question Answering         EgoSchema (subset)    Accuracy             60.3    MVU (13B)
Question Answering         EgoSchema (subset)    Inference Speed (s)  2.42    MVU (13B)
Video Question Answering   NExT-QA               Accuracy             55.2    MVU (13B)
Video Question Answering   EgoSchema (fullset)   Accuracy             37.6    MVU (13B)
Video Question Answering   EgoSchema (subset)    Accuracy             60.3    MVU (13B)
Video Question Answering   EgoSchema (subset)    Inference Speed (s)  2.42    MVU (13B)

Related Papers

Visual-Language Model Knowledge Distillation Method for Image Quality Assessment (2025-07-21)
HRSeg: High-Resolution Visual Perception and Enhancement for Reasoning Segmentation (2025-07-17)
Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes (2025-07-17)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
A Real-Time System for Egocentric Hand-Object Interaction Detection in Industrial Domains (2025-07-17)
Making Language Model a Hierarchical Classifier and Generator (2025-07-17)
VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)