Number it: Temporal Grounding Videos like Flipping Manga

Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang

2024-11-15 | CVPR 2025 | Highlight Detection, Moment Retrieval, Temporal Localization

Abstract

Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
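To make the link between frame numbers and temporal localization concrete, the following is a minimal sketch, not the authors' implementation. It assumes frames are sampled at a fixed rate and numbered from 0, converts a predicted span of frame numbers into seconds, and scores it against a ground-truth moment with temporal IoU, the quantity averaged in the mIoU metric. The function names, the `fps` parameter, and the 0-based numbering are illustrative assumptions and do not come from the paper or its repository.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds)


def frames_to_seconds(start_frame: int, end_frame: int, fps: float) -> Span:
    """Map an inclusive span of 0-based frame numbers to a time span.

    Assumption: frames are sampled uniformly at `fps` frames per second,
    so frame k covers the interval [k / fps, (k + 1) / fps).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    if start_frame < 0 or end_frame < start_frame:
        raise ValueError("expected 0 <= start_frame <= end_frame")
    return start_frame / fps, (end_frame + 1) / fps


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(preds: List[Span], gts: List[Span]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and the same length")
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Hypothetical case: the model reports frames 2 through 5 of a video
    # sampled at 0.5 frames per second; the annotated moment is 6 s to 14 s.
    predicted = frames_to_seconds(2, 5, fps=0.5)
    print(predicted, temporal_iou(predicted, (6.0, 14.0)))
```

Treating each frame as covering a full sampling interval is one possible convention; mapping a frame number to a single timestamp (for example `k / fps`) is an equally plausible alternative and would shift the resulting spans slightly.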

Results

Task	Dataset	Metric	Value	Model
Highlight Detection	QVHighlights	Hit@1	70.71	NumPro
Highlight Detection	QVHighlights	mAP	40.54	NumPro
