Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

UniVTG: Towards Unified Video-Language Temporal Grounding

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

2023-07-31 · ICCV 2023 · Tasks: Video Summarization, Highlight Detection, Moment Retrieval, Retrieval, Natural Language Moment Retrieval
Paper · PDF · Code (official)

Abstract

Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities, e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection, and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TVSum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The codes are available at https://github.com/showlab/UniVTG.
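
The unified formulation sketched in the abstract amounts to giving every video clip the same three-part label regardless of task: a foreground indicator (is the clip inside the queried moment?), offsets to the moment's start and end boundaries, and a continuous saliency score. Moment retrieval, highlight detection, and video summarization then become different readouts of the same per-clip labels. The following Python sketch illustrates that idea only; the field names and decoding heuristics are our illustrative assumptions, not the authors' implementation.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ClipLabel:
        """Illustrative unified per-clip annotation shared by all three VTG tasks."""
        foreground: bool   # is this clip inside a query-relevant moment?
        to_start: float    # offset (s) from clip center back to the moment start
        to_end: float      # offset (s) from clip center forward to the moment end
        saliency: float    # query relevance / "worthiness" score in [0, 1]

    def decode_moments(labels: List[ClipLabel],
                       centers: List[float]) -> List[Tuple[float, float]]:
        """Moment retrieval: turn per-clip boundary offsets into (start, end) intervals."""
        return [(c - lab.to_start, c + lab.to_end)
                for c, lab in zip(centers, labels) if lab.foreground]

    def decode_highlights(labels: List[ClipLabel], top_k: int = 1) -> List[int]:
        """Highlight detection: rank clip indices by the saliency curve."""
        order = sorted(range(len(labels)),
                       key=lambda i: labels[i].saliency, reverse=True)
        return order[:top_k]

Under this view, a moment-retrieval label populates the boundary offsets, a highlight-detection label populates the saliency curve, and a single model can be trained on both.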

Results

Task                 Dataset       Metric         Value  Model
Video                TACoS         R@1, IoU=0.3   51.44  UniVTG
Video                TACoS         R@1, IoU=0.5   34.97  UniVTG
Video                TACoS         R@1, IoU=0.7   21.07  UniVTG
Video                TACoS         mIoU           35.76  UniVTG
Moment Retrieval     QVHighlights  R@1, IoU=0.5   65.43  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  R@1, IoU=0.7   50.06  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  mAP            43.63  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  mAP@0.5        64.06  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  mAP@0.75       45.02  UniVTG (w/ PT)
Moment Retrieval     QVHighlights  R@1, IoU=0.5   58.86  UniVTG
Moment Retrieval     QVHighlights  R@1, IoU=0.7   40.86  UniVTG
Moment Retrieval     QVHighlights  mAP            35.47  UniVTG
Moment Retrieval     QVHighlights  mAP@0.5        57.6   UniVTG
Moment Retrieval     QVHighlights  mAP@0.75       35.59  UniVTG
Highlight Detection  QVHighlights  Hit@1          66.28  UniVTG (w/ PT)
Highlight Detection  QVHighlights  mAP            40.54  UniVTG (w/ PT)
Highlight Detection  QVHighlights  Hit@1          60.96  UniVTG
Highlight Detection  QVHighlights  mAP            38.2   UniVTG
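
For reference, the moment-retrieval metrics above are the standard ones: R@1, IoU=θ is the fraction of queries whose top-1 predicted interval overlaps the ground-truth interval with temporal IoU of at least θ; mIoU is the average top-1 IoU; and mAP@θ averages precision over ranked predictions at IoU threshold θ. A minimal Python sketch of the first two (helper names are ours):

    from typing import List, Tuple

    Interval = Tuple[float, float]  # (start, end) in seconds

    def temporal_iou(a: Interval, b: Interval) -> float:
        """Intersection-over-union of two temporal intervals."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def recall_at_1(preds: List[Interval], gts: List[Interval],
                    thresh: float) -> float:
        """R@1, IoU=thresh: share of queries whose top-1 prediction clears the threshold."""
        hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
        return 100.0 * hits / len(gts)

    def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
        """mIoU: average top-1 IoU over all queries."""
        return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

Raising the IoU threshold (0.3 → 0.5 → 0.7) demands tighter localization, which is why the TACoS numbers above fall monotonically across those columns.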
