Towards Long-Form Video Understanding

Chao-yuan Wu, Philipp Krähenbühl

2021-06-21 | CVPR 2021 | Video Recognition, Form Video Understanding, Action Recognition

Abstract

Our world offers a never-ending stream of visual stimuli, yet today's vision systems only accurately recognize patterns within a few seconds. These systems understand the present, but fail to contextualize it in past or future events. In this paper, we study long-form video understanding. We introduce a framework for modeling long-form videos and develop evaluation protocols on large-scale datasets. We show that existing state-of-the-art short-term models are limited for long-form tasks. A novel object-centric transformer-based video recognition architecture performs significantly better on 7 diverse tasks. It also outperforms comparable state-of-the-art on the AVA dataset.

Results

Task	Dataset	Metric	Value	Model
Activity Recognition	AVA v2.2	mAP	31	Object Transformer
Action Recognition	AVA v2.2	mAP	31	Object Transformer
