

Number it: Temporal Grounding Videos like Flipping Manga

Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang

2024-11-15 | CVPR 2025 | Highlight Detection, Moment Retrieval, Temporal Localization

Abstract

Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
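
As a rough illustration of the idea in the abstract, the sketch below shows how a frame-number answer from a Vid-LLM could be turned back into timestamps and scored with temporal IoU, the per-sample quantity behind mIoU. It is not taken from the NumPro repository: the function names, the 0-based frame numbering, the uniform sampling rate, and the reply format ("From frame 12 to frame 30.") are all assumptions made for this example.

```python
import re
from typing import Tuple


def frame_to_seconds(frame_number: int, sample_fps: float) -> float:
    """Map a frame number to seconds, assuming 0-based numbering and uniform sampling."""
    return frame_number / sample_fps


def parse_frame_span(answer: str) -> Tuple[int, int]:
    """Take the first two integers in a reply as (start_frame, end_frame).

    The reply format is hypothetical; a real model may phrase its answer differently.
    """
    numbers = [int(n) for n in re.findall(r"\d+", answer)]
    if len(numbers) < 2:
        raise ValueError(f"expected two frame numbers in: {answer!r}")
    start, end = numbers[0], numbers[1]
    return (start, end) if start <= end else (end, start)


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection over union of two [start, end] intervals given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: frames sampled at 0.5 fps, model replies with frame numbers.
start_f, end_f = parse_frame_span("From frame 12 to frame 30.")
pred_span = (frame_to_seconds(start_f, 0.5), frame_to_seconds(end_f, 0.5))
score = temporal_iou(pred_span, (20.0, 60.0))
```

Averaging `temporal_iou` over all query-video pairs in a benchmark would give an mIoU figure; the exact evaluation protocol used in the paper may differ from this sketch.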

Results

Task                 Dataset       Metric  Value  Model
Highlight Detection  QVHighlights  Hit@1   70.71  NumPro
Highlight Detection  QVHighlights  mAP     40.54  NumPro
16k                  QVHighlights  Hit@1   70.71  NumPro
16k                  QVHighlights  mAP     40.54  NumPro

Related Papers

DeSPITE: Exploring Contrastive Deep Skeleton-Pointcloud-IMU-Text Embeddings for Advanced Point Cloud Human Activity Understanding (2025-06-16)
Fine-Tuning Large Audio-Language Models with LoRA for Precise Temporal Localization of Prolonged Exposure Therapy Elements (2025-06-11)
VideoMolmo: Spatio-Temporal Grounding Meets Pointing (2025-06-05)
DisTime: Distribution-based Time Representation for Video Large Language Models (2025-05-30)
Unsupervised Transcript-assisted Video Summarization and Highlight Detection (2025-05-29)
Rhapsody: A Dataset for Highlight Detection in Podcasts (2025-05-26)
Gameplay Highlights Generation (2025-05-12)
Retrieval Augmented Generation Evaluation for Health Documents (2025-05-07)