Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection

Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Hengwei Bian, Yatai Ji, Yujiu Yang, Xiu Li

2023-11-28 · CVPR 2024
Tasks: Video Grounding, Highlight Detection, Contrastive Learning, Moment Retrieval, Retrieval, Temporal Action Localization, Natural Language Moment Retrieval
Links: Paper · PDF · Code (official)

Abstract

Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architectures. However, we observe that the emphases of MR and HD differ: one necessitates the perception of local relationships, while the other prioritizes the understanding of global contexts. Consequently, the lack of task-specific design inevitably limits the ability to capture the intrinsic specialty of the two tasks. To tackle this issue, we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra- and inter-modality interactions across multiple granularities, UVCOM achieves a comprehensive understanding of a video. Moreover, we present multi-aspect contrastive learning to consolidate local relation modeling and global knowledge accumulation via a well-aligned multi-modal space. Extensive experiments on the QVHighlights, Charades-STA, TACoS, YouTube Highlights, and TVSum datasets demonstrate the effectiveness and rationality of UVCOM, which outperforms state-of-the-art methods by a remarkable margin.
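The multi-aspect contrastive learning described in the abstract builds on standard contrastive objectives that pull paired video and text embeddings together in a shared space. As an illustration only, here is a generic symmetric InfoNCE loss in NumPy; this is not the paper's actual loss formulation, and all function and variable names are assumptions:

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings.

    A generic contrastive objective: matched (video, text) pairs sit on the
    diagonal of the similarity matrix and serve as positives; all other
    pairs in the batch are negatives. UVCOM's multi-aspect contrastive
    learning builds on this idea but its exact formulation differs.
    """
    # L2-normalize so dot products become cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # positives on the diagonal

    def xent(l):
        # numerically stable log-softmax cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned embeddings the loss approaches zero; shuffling the text side against the video side drives it up, which is the signal that aligns the multi-modal space.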

Results

Task                | Dataset            | Metric       | Value | Model
Video               | TACoS              | R@1, IoU=0.5 | 36.39 | UVCOM
Video               | TACoS              | R@1, IoU=0.7 | 23.32 | UVCOM
Moment Retrieval    | Charades-STA       | R@1, IoU=0.5 | 59.25 | UVCOM
Moment Retrieval    | Charades-STA       | R@1, IoU=0.7 | 36.64 | UVCOM
Moment Retrieval    | QVHighlights       | R@1, IoU=0.5 | 64.53 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | R@1, IoU=0.7 | 48.31 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | mAP          | 43.8  | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | mAP@0.5      | 64.78 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | mAP@0.75     | 43.65 | UVCOM (w/ PT ASR Captions)
Moment Retrieval    | QVHighlights       | R@1, IoU=0.5 | 63.55 | UVCOM
Moment Retrieval    | QVHighlights       | R@1, IoU=0.7 | 47.47 | UVCOM
Moment Retrieval    | QVHighlights       | mAP          | 43.18 | UVCOM
Moment Retrieval    | QVHighlights       | mAP@0.5      | 63.37 | UVCOM
Moment Retrieval    | QVHighlights       | mAP@0.75     | 42.67 | UVCOM
Highlight Detection | TVSum              | mAP          | 86.3  | UVCOM (trained from scratch)
Highlight Detection | YouTube Highlights | mAP          | 77.4  | UVCOM
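For reference, the R@1 IoU=t numbers above count a query as correct when the model's top-ranked moment overlaps the ground-truth span with temporal IoU at least t. A minimal sketch of that computation (function names are illustrative, not taken from the UVCOM codebase):

```python
def temporal_iou(pred, gt):
    """IoU between two [start, end] moments (e.g., in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, iou_thresh=0.5):
    """Percentage of queries whose top-1 predicted moment reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(top1_preds)
```

For example, a prediction of [0, 10] against a ground truth of [5, 15] has IoU 5/15 ≈ 0.33, so it counts at the 0.3 threshold but misses at 0.5; mAP@t averages precision over ranked moment lists in the same IoU-thresholded way.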

Related Papers

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding (2025-07-17)
SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts (2025-07-17)
HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals (2025-07-17)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management (2025-07-17)
SGCL: Unifying Self-Supervised and Supervised Learning for Graph Recommendation (2025-07-17)
From Roots to Rewards: Dynamic Tree Reasoning with RL (2025-07-17)
A Survey of Context Engineering for Large Language Models (2025-07-17)
MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval (2025-07-17)