Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

UniVTG: Towards Unified Video-Language Temporal Grounding

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

2023-07-31 · ICCV 2023 · Tasks: Video Summarization, Highlight Detection, Moment Retrieval, Retrieval, Natural Language Moment Retrieval
Paper · PDF · Code (official)

Abstract

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The codes are available at https://github.com/showlab/UniVTG.
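
The unified formulation sketched in the abstract amounts to giving every video clip the same three-part label regardless of task: a foreground indicator (is the clip inside the queried moment?), offsets to the moment's start and end boundaries, and a continuous saliency score. Moment retrieval, highlight detection, and video summarization then become different readouts of the same per-clip labels. The following Python sketch illustrates that idea only; the field names and decoding heuristics are our illustrative assumptions, not the authors' implementation.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ClipLabel:
        """Illustrative unified per-clip annotation shared by all three VTG tasks."""
        foreground: bool   # is this clip inside a query-relevant moment?
        to_start: float    # offset (s) from clip center back to the moment start
        to_end: float      # offset (s) from clip center forward to the moment end
        saliency: float    # query relevance / "worthiness" score in [0, 1]

    def decode_moments(labels: List[ClipLabel],
                       centers: List[float]) -> List[Tuple[float, float]]:
        """Moment retrieval: turn per-clip boundary offsets into (start, end) intervals."""
        return [(c - lab.to_start, c + lab.to_end)
                for c, lab in zip(centers, labels) if lab.foreground]

    def decode_highlights(labels: List[ClipLabel], top_k: int = 1) -> List[int]:
        """Highlight detection: rank clip indices by the saliency curve."""
        order = sorted(range(len(labels)),
                       key=lambda i: labels[i].saliency, reverse=True)
        return order[:top_k]

Under this view, a moment-retrieval label populates the boundary offsets, a highlight-detection label populates the saliency curve, and a single model can be trained on both.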

Results

Task                 Dataset       Metric         Value  Model
Video                TACoS         R@1, IoU=0.3   51.44  UniVTG
Video                TACoS         R@1, IoU=0.5   34.97  UniVTG
Video                TACoS         R@1, IoU=0.7   21.07  UniVTG
Video                TACoS         mIoU           35.76  UniVTG
Moment Retrieval     QVHighlights  R@1, IoU=0.5   65.43  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  R@1, IoU=0.7   50.06  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  mAP            43.63  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  mAP@0.5        64.06  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  mAP@0.75       45.02  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  R@1, IoU=0.5   58.86  UniVTG
Moment Retrieval     QVHighlights  R@1, IoU=0.7   40.86  UniVTG
Moment Retrieval     QVHighlights  mAP            35.47  UniVTG
Moment Retrieval     QVHighlights  mAP@0.5        57.6   UniVTG
Moment Retrieval     QVHighlights  mAP@0.75       35.59  UniVTG
Highlight Detection  QVHighlights  Hit@1          66.28  UniVTG (w/ PT)
Highlight Detection  QVHighlights  mAP            40.54  UniVTG (w/ PT)
Highlight Detection  QVHighlights  Hit@1          60.96  UniVTG
Highlight Detection  QVHighlights  mAP            38.2   UniVTG
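
For reference, the moment-retrieval metrics above are the standard ones: R@1, IoU=θ is the fraction of queries whose top-1 predicted interval overlaps the ground-truth interval with temporal IoU of at least θ; mIoU is the average top-1 IoU; and mAP@θ averages precision over ranked predictions at IoU threshold θ. A minimal Python sketch of the first two (helper names are ours):

    from typing import List, Tuple

    Interval = Tuple[float, float]  # (start, end) in seconds

    def temporal_iou(a: Interval, b: Interval) -> float:
        """Intersection-over-union of two temporal intervals."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def recall_at_1(preds: List[Interval], gts: List[Interval],
                    thresh: float) -> float:
        """R@1, IoU=thresh: share of queries whose top-1 prediction clears the threshold."""
        hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
        return 100.0 * hits / len(gts)

    def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
        """mIoU: average top-1 IoU over all queries."""
        return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

Raising the IoU threshold (0.3 → 0.5 → 0.7) demands tighter localization, which is why the TACoS numbers above fall monotonically across those columns.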
