Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael S. Ryoo
Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how the extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. Surprisingly, we discover that LLM-based approaches can yield good accuracy on long-video tasks with limited video information, sometimes even with no video-specific information. Building on this, we explore injecting video-specific information into an LLM-based framework. We utilize off-the-shelf vision tools to extract three object-centric information modalities from videos, and then leverage natural language as a medium for fusing this information. Our resulting Multimodal Video Understanding (MVU) framework demonstrates state-of-the-art performance across multiple video understanding benchmarks. Strong performance on robotics-domain tasks also establishes its generality. Our code will be released publicly.
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Question Answering | NExT-QA | Accuracy | 55.2 | MVU (13B) |
| Question Answering | EgoSchema (fullset) | Accuracy | 37.6 | MVU (13B) |
| Question Answering | EgoSchema (subset) | Accuracy | 60.3 | MVU (13B) |
| Question Answering | EgoSchema (subset) | Inference Speed (s) | 2.42 | MVU (13B) |